A powerful, user-friendly web scraping application built with Flask that allows you to extract data from websites with ease. Features both specialized Hacker News scraping and generic website scraping with customizable CSS selectors.
## Features

- Hacker News Scraper: enhanced scraper based on the original code, with vote filtering
- Generic Website Scraper: configurable CSS selectors for any website
- Real-time Data Preview: interactive table format with pagination
- Pandas Integration: advanced data processing capabilities
- Export Options: CSV and JSON export functionality
- Modern UI: glass-morphism design with a responsive layout
- Real-time Feedback: loading states and progress indicators
- Dynamic Configuration: adapts based on scraping type
- CSS Selector Validation: built-in helpers and validation
- Data Analysis: automatic summarization and statistics
- Mobile Responsive: works seamlessly on all devices
- Session Management: persistent data for exports
- Error Handling: robust error recovery and user feedback
- Interactive Elements: hover effects and smooth animations
## Prerequisites

- Python 3.8 or higher
- pip package manager
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/devmab24/flask-web-scraper.git
   cd flask-web-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   # Windows
   python -m venv venv
   venv\Scripts\activate

   # Linux/macOS
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create the templates folder (if not present):

   ```bash
   mkdir templates
   ```

5. Run the application:

   ```bash
   python app.py
   ```

6. Open your browser and navigate to http://localhost:5000.
## Requirements

Create a `requirements.txt` file with the following dependencies:

```
Flask==2.3.0
requests==2.31.0
beautifulsoup4==4.12.0
pandas==2.0.0
lxml==4.9.0
gunicorn==20.1.0
```

## Usage

### Hacker News Scraper

- Select Hacker News mode (default)
- Configure parameters:
  - Pages: number of pages to scrape (1-10)
  - Minimum Votes: filter out stories below the vote threshold
- Click "Start Scraping"
- View results in the interactive table
- Export data as CSV or JSON
Example Output:

```json
{
  "title": "Revolutionary AI Breakthrough",
  "link": "https://example.com/article",
  "votes": 250,
  "comments": 45,
  "author": "tech_user",
  "scraped_at": "2024-01-15 10:30:45"
}
```
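Under the hood, this mode fetches the Hacker News front pages and filters stories by score. The sketch below shows the general approach; it is illustrative rather than the app's actual code, and it assumes the current news.ycombinator.com markup (`tr.athing` story rows, with a `span.titleline` title and a `span.score` in the row that follows):

```python
import requests
from bs4 import BeautifulSoup

def scrape_hackernews(pages=2, min_votes=99, timeout=15):
    """Collect front-page stories with at least `min_votes` points (sketch)."""
    stories = []
    for page in range(1, pages + 1):
        resp = requests.get(f"https://news.ycombinator.com/news?p={page}",
                            timeout=timeout)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        for row in soup.select("tr.athing"):
            title_link = row.select_one("span.titleline > a")
            # Score and author live in the <tr> right after the story row.
            subtext = row.find_next_sibling("tr")
            score = subtext.select_one("span.score") if subtext else None
            votes = int(score.get_text().split()[0]) if score else 0
            if title_link and votes >= min_votes:
                stories.append({"title": title_link.get_text(),
                                "link": title_link["href"],
                                "votes": votes})
    return stories
```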
Select "Custom Website" mode
-
Enter target URL
-
Configure CSS selectors:
- Container: Main wrapper for each item (e.g.,
.article,.post) - Title: Article title selector (e.g.,
h2,.headline) - Link: URL selector (href will be extracted)
- Description: Summary/excerpt selector
- Custom Fields: Add any additional fields
- Container: Main wrapper for each item (e.g.,
-
Set maximum items to scrape
-
Click "Start Scraping"
-
Review and export results
CSS Selector Examples:

```css
/* Container */
.article, .post, div.item

/* Title */
h1, h2, .title, .headline

/* Links */
a.read-more, .post-link

/* Content */
.summary, .excerpt, p.description
```
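The generic scraper applies these selectors with BeautifulSoup. A minimal sketch of that extraction loop is below; the function name `scrape_generic` and its signature are illustrative, not necessarily what `app.py` uses:

```python
import requests
from bs4 import BeautifulSoup

def scrape_generic(url, css_selectors, max_items=50, timeout=15):
    """Extract items from a page using user-supplied CSS selectors (sketch)."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    items = []
    for container in soup.select(css_selectors["container"])[:max_items]:
        item = {}
        # Plain-text fields: everything except the container and link selectors.
        for field, selector in css_selectors.items():
            if field in ("container", "link_href"):
                continue
            node = container.select_one(selector)
            if node:
                item[field] = node.get_text(strip=True)
        # The link field extracts the href attribute rather than the text.
        link = container.select_one(css_selectors.get("link_href", "a"))
        if link and link.has_attr("href"):
            item["link"] = link["href"]
        items.append(item)
    return items
```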
## Project Structure

```
flask-web-scraper/
│
├── app.py               # Main Flask application
├── requirements.txt     # Python dependencies
├── README.md            # This file
├── .gitignore           # Git ignore rules
├── Procfile             # Heroku deployment config
├── Dockerfile           # Docker configuration
│
├── templates/
│   └── index.html       # Main HTML template
│
└── static/              # Static assets (if any)
    ├── css/
    ├── js/
    └── images/
```
## Configuration

Create a `.env` file for configuration:

```
SECRET_KEY=your-super-secret-key-here
FLASK_ENV=development
FLASK_DEBUG=True
```

Modify these constants in `app.py`:
```python
# Default settings
DEFAULT_PAGES = 2
DEFAULT_MIN_VOTES = 99
DEFAULT_MAX_ITEMS = 50
REQUEST_TIMEOUT = 15
MAX_PAGES = 10
```
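One way for `app.py` to pick up the `.env` values is python-dotenv (a sketch; python-dotenv is an assumed extra dependency, not listed in `requirements.txt`):

```python
import os
from dotenv import load_dotenv  # assumed dependency: pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

SECRET_KEY = os.environ.get("SECRET_KEY", "dev-key-change-me")
DEBUG = os.environ.get("FLASK_DEBUG", "False").lower() == "true"
```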
## Deployment

### Heroku

1. Create a `Procfile`:

   ```
   web: gunicorn app:app
   ```

2. Deploy:

   ```bash
   heroku create your-app-name
   git push heroku main
   ```

### Docker

1. Build the image:

   ```bash
   docker build -t flask-scraper .
   ```

2. Run the container:

   ```bash
   docker run -p 5000:5000 flask-scraper
   ```

### Other WSGI Servers

Use any WSGI server, such as Gunicorn:
```bash
gunicorn --bind 0.0.0.0:5000 app:app
```

## API Reference

### POST /scrape
Request Body:

```json
{
  "scrape_type": "hackernews|generic",
  "pages": 2,
  "min_votes": 99,
  "url": "https://example.com",
  "max_items": 50,
  "css_selectors": {
    "container": ".article",
    "title": "h2",
    "link_href": "a"
  }
}
```

Response:
```json
{
  "success": true,
  "data": [...],
  "total_items": 25,
  "columns": ["title", "link", "votes"],
  "summary": {
    "scraped_at": "2024-01-15 10:30:45",
    "total_items": 25,
    "scrape_type": "hackernews"
  }
}
```

### Export Endpoints

- GET /export/csv: download the scraped data in CSV format
- GET /export/json: download the scraped data in JSON format
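For a quick smoke test, the endpoint can be exercised from Python (assuming the app is running locally on port 5000):

```python
import requests

payload = {
    "scrape_type": "generic",
    "url": "https://example.com",
    "max_items": 10,
    "css_selectors": {"container": ".article", "title": "h2", "link_href": "a"},
}

resp = requests.post("http://localhost:5000/scrape", json=payload, timeout=60)
resp.raise_for_status()
result = resp.json()
print(f"Scraped {result['total_items']} items")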
## Testing

Scenarios to verify (an automated starting point follows this list):

- Hacker News scraping with different vote thresholds
- Generic website scraping with various selectors
- CSV export functionality
- JSON export functionality
- Mobile responsiveness
- Error handling (invalid URLs, network issues)
- Large dataset handling
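Unit tests with pytest are still on the roadmap; a possible starting point using Flask's built-in test client is sketched below (it assumes `app.py` exposes the Flask instance as `app`, and the exact error behavior of `/scrape` is an assumption):

```python
# test_app.py -- run with: pytest
import pytest
from app import app  # assumes app.py exposes the Flask instance as `app`

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_scrape_rejects_missing_url(client):
    """Generic scraping without a URL should not report success."""
    resp = client.post("/scrape", json={"scrape_type": "generic"})
    data = resp.get_json(silent=True) or {}
    assert resp.status_code >= 400 or data.get("success") is False
```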
## Example Sites

- News Sites: BBC, Reuters, TechCrunch
- Blogs: Medium articles, WordPress sites
- E-commerce: Product listings (respect robots.txt)
- Forums: Reddit, Stack Overflow
## Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. Make changes and test thoroughly
4. Commit with clear messages:

   ```bash
   git commit -m "Add: New CSS selector validation"
   ```

5. Push to the branch:

   ```bash
   git push origin feature/amazing-feature
   ```

6. Create a Pull Request
### Contribution Guidelines

- Code Style: Follow PEP 8 for Python, use meaningful variable names
- Documentation: Update README and add docstrings
- Testing: Test your changes thoroughly
- Security: Never commit secrets or API keys
## Planned Features

- Unit tests with pytest
- Rate limiting and request throttling
- Database storage for scraped data
- User authentication and personal dashboards
- Scheduled scraping with cron jobs
- Support for JavaScript-rendered sites (Selenium)
- Advanced filtering and search functionality
- Data visualization with charts
- API documentation with Swagger
- Webhook notifications
- Dark/light theme toggle
- Scraping templates for popular sites
- Bulk URL processing
- Data comparison between scrapes
- Browser extension for quick scraping
## Bug Reports

When reporting bugs, include:
- Environment details (OS, Python version, browser)
- Steps to reproduce
- Expected vs actual behavior
- Screenshots (if applicable)
- Error logs/console output
## CSS Selector Reference

Basic selectors:

```css
/* Element */
div, p, h1

/* Class */
.article, .post, .content

/* ID */
#main-content

/* Attribute */
[data-testid="article"]
```
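Before configuring the scraper, you can sanity-check a selector against a live page with the same libraries the app depends on (a quick sketch; the URL and selectors are placeholders):

```python
import requests
from bs4 import BeautifulSoup

# Count how many nodes each candidate selector matches on the target page.
url = "https://example.com"  # placeholder target
soup = BeautifulSoup(requests.get(url, timeout=15).text, "lxml")

for selector in [".article", "h1, h2", "a[href*='article']"]:
    print(f"{selector!r}: {len(soup.select(selector))} match(es)")
```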
Advanced selectors:

```css
/* Descendant */
.article p

/* Child */
.article > h2

/* Nth-child */
.post:nth-child(2)

/* Contains text */
a[href*="article"]
```

For sites with JavaScript-rendered content, consider:
- Selenium WebDriver integration (see the sketch after this list)
- Requests-HTML for basic JS support
- API endpoints if available
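If you go the Selenium route, the rendered page source can be handed to the same BeautifulSoup pipeline. A sketch (requires the `selenium` package; Selenium 4 fetches a matching browser driver automatically):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder target
    # page_source contains the DOM *after* JavaScript has run
    soup = BeautifulSoup(driver.page_source, "lxml")
    titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
    print(titles)
finally:
    driver.quit()
```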
## Performance Tips

- Use session reuse for multiple requests (see the sketch after this list)
- Implement caching for repeated scrapes
- Add delays to respect server load
- Use connection pooling
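Session reuse and polite delays in practice (a sketch; the URL list is a placeholder):

```python
import time
import requests

session = requests.Session()  # reuses TCP connections across requests
session.headers.update({"User-Agent": "Mozilla/5.0"})

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder
for url in urls:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(1)  # small delay to respect server load
```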
Issue: "Connection timeout" Solution: Check internet connection, try different user agent
Issue: "No data found" Solution: Verify CSS selectors with browser dev tools
Issue: "Template error with integers" Solution: This is fixed in the latest version
Issue: "Export not working" Solution: Ensure you've scraped data first
Enable debug mode for detailed error messages:

```python
app.run(debug=True)
```

## License

This project is licensed under the MIT License; see below for details:
MIT License
Copyright (c) 2024 Flask Web Scraper
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
## Acknowledgments

- Beautiful Soup for HTML parsing
- Flask for the web framework
- Pandas for data manipulation
- Bootstrap for responsive design
- Font Awesome for icons
- Hacker News for inspiration
## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
## Future Enhancements

- User authentication system
- Database integration (PostgreSQL/MongoDB)
- RESTful API with authentication
- Scheduling system for automated scraping
- Advanced data visualization
- Machine learning for content extraction
- Multi-language support
- Cloud deployment templates (AWS, GCP, Azure)
- Performance monitoring dashboard
⭐ Star this repository if it helped you!

Fork it to contribute or customize it for your needs.

Built with ❤️ by the community