A powerful, user-friendly web scraping application built with Flask that allows you to extract data from websites with ease. Features both specialized Hacker News scraping and generic website scraping with customizable CSS selectors.
## Features

- Hacker News Scraper: enhanced scraper based on the original code, with vote filtering
- Generic Website Scraper: configurable CSS selectors for any website
- Real-time Data Preview: interactive table format with pagination
- Pandas Integration: advanced data processing capabilities
- Export Options: CSV and JSON export functionality
- Modern UI: glass-morphism design with a responsive layout
- Real-time Feedback: loading states and progress indicators
- Dynamic Configuration: adapts based on scraping type
- CSS Selector Validation: built-in helpers and validation
- Data Analysis: automatic summarization and statistics
- Mobile Responsive: works seamlessly on all devices
- Session Management: persistent data for exports
- Error Handling: robust error recovery and user feedback
- Interactive Elements: hover effects and smooth animations
## Prerequisites

- Python 3.8 or higher
- pip package manager
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/devmab24/flask-web-scraper.git
   cd flask-web-scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   # Windows
   python -m venv venv
   venv\Scripts\activate

   # Linux/macOS
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create the templates folder (if not present):

   ```bash
   mkdir templates
   ```

5. Run the application:

   ```bash
   python app.py
   ```

6. Open your browser and navigate to http://localhost:5000.
## Requirements

Create a `requirements.txt` file with the following dependencies:

```
Flask==2.3.0
requests==2.31.0
beautifulsoup4==4.12.0
pandas==2.0.0
lxml==4.9.0
gunicorn==20.1.0
```

## Usage

### Hacker News Scraper

- Select Hacker News mode (default)
- Configure parameters:
  - Pages: number of pages to scrape (1-10)
  - Minimum Votes: filter out stories below the vote threshold
- Click "Start Scraping"
- View results in the interactive table
- Export data as CSV or JSON
Example Output:

```json
{
  "title": "Revolutionary AI Breakthrough",
  "link": "https://example.com/article",
  "votes": 250,
  "comments": 45,
  "author": "tech_user",
  "scraped_at": "2024-01-15 10:30:45"
}
```
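Under the hood, this mode fetches the Hacker News front pages and filters stories by score. The sketch below shows the general approach; it is illustrative rather than the app's actual code, and it assumes the current news.ycombinator.com markup (`tr.athing` story rows, with a `span.titleline` title and a `span.score` in the row that follows):

```python
import requests
from bs4 import BeautifulSoup

def scrape_hackernews(pages=2, min_votes=99, timeout=15):
    """Collect front-page stories with at least `min_votes` points (sketch)."""
    stories = []
    for page in range(1, pages + 1):
        resp = requests.get(f"https://news.ycombinator.com/news?p={page}",
                            timeout=timeout)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        for row in soup.select("tr.athing"):
            title_link = row.select_one("span.titleline > a")
            # Score and author live in the <tr> right after the story row.
            subtext = row.find_next_sibling("tr")
            score = subtext.select_one("span.score") if subtext else None
            votes = int(score.get_text().split()[0]) if score else 0
            if title_link and votes >= min_votes:
                stories.append({"title": title_link.get_text(),
                                "link": title_link["href"],
                                "votes": votes})
    return stories
```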
Select "Custom Website" mode
-
Enter target URL
-
Configure CSS selectors:
- Container: Main wrapper for each item (e.g.,
.article,.post) - Title: Article title selector (e.g.,
h2,.headline) - Link: URL selector (href will be extracted)
- Description: Summary/excerpt selector
- Custom Fields: Add any additional fields
- Container: Main wrapper for each item (e.g.,
-
Set maximum items to scrape
-
Click "Start Scraping"
-
Review and export results
CSS Selector Examples:

```css
/* Container */
.article, .post, div.item

/* Title */
h1, h2, .title, .headline

/* Links */
a.read-more, .post-link

/* Content */
.summary, .excerpt, p.description
```
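The generic scraper applies these selectors with BeautifulSoup. A minimal sketch of that extraction loop is below; the function name `scrape_generic` and its signature are illustrative, not necessarily what `app.py` uses:

```python
import requests
from bs4 import BeautifulSoup

def scrape_generic(url, css_selectors, max_items=50, timeout=15):
    """Extract items from a page using user-supplied CSS selectors (sketch)."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    items = []
    for container in soup.select(css_selectors["container"])[:max_items]:
        item = {}
        # Plain-text fields: everything except the container and link selectors.
        for field, selector in css_selectors.items():
            if field in ("container", "link_href"):
                continue
            node = container.select_one(selector)
            if node:
                item[field] = node.get_text(strip=True)
        # The link field extracts the href attribute rather than the text.
        link = container.select_one(css_selectors.get("link_href", "a"))
        if link and link.has_attr("href"):
            item["link"] = link["href"]
        items.append(item)
    return items
```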
## Project Structure

```
flask-web-scraper/
│
├── app.py               # Main Flask application
├── requirements.txt     # Python dependencies
├── README.md            # This file
├── .gitignore           # Git ignore rules
├── Procfile             # Heroku deployment config
├── Dockerfile           # Docker configuration
│
├── templates/
│   └── index.html       # Main HTML template
│
└── static/              # Static assets (if any)
    ├── css/
    ├── js/
    └── images/
```
## Configuration

Create a `.env` file for configuration:

```
SECRET_KEY=your-super-secret-key-here
FLASK_ENV=development
FLASK_DEBUG=True
```

Modify these constants in `app.py`:
```python
# Default settings
DEFAULT_PAGES = 2
DEFAULT_MIN_VOTES = 99
DEFAULT_MAX_ITEMS = 50
REQUEST_TIMEOUT = 15
MAX_PAGES = 10
```
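One way for `app.py` to pick up the `.env` values is python-dotenv (a sketch; python-dotenv is an assumed extra dependency, not listed in `requirements.txt`):

```python
import os
from dotenv import load_dotenv  # assumed dependency: pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

SECRET_KEY = os.environ.get("SECRET_KEY", "dev-key-change-me")
DEBUG = os.environ.get("FLASK_DEBUG", "False").lower() == "true"
```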
## Deployment

### Heroku

1. Create a `Procfile`:

   ```
   web: gunicorn app:app
   ```

2. Deploy:

   ```bash
   heroku create your-app-name
   git push heroku main
   ```

### Docker

1. Build the image:

   ```bash
   docker build -t flask-scraper .
   ```

2. Run the container:

   ```bash
   docker run -p 5000:5000 flask-scraper
   ```

### Other WSGI Servers

Use any WSGI server, such as Gunicorn:
```bash
gunicorn --bind 0.0.0.0:5000 app:app
```

## API Reference

### POST /scrape
Request Body:

```json
{
  "scrape_type": "hackernews|generic",
  "pages": 2,
  "min_votes": 99,
  "url": "https://example.com",
  "max_items": 50,
  "css_selectors": {
    "container": ".article",
    "title": "h2",
    "link_href": "a"
  }
}
```

Response:
```json
{
  "success": true,
  "data": [...],
  "total_items": 25,
  "columns": ["title", "link", "votes"],
  "summary": {
    "scraped_at": "2024-01-15 10:30:45",
    "total_items": 25,
    "scrape_type": "hackernews"
  }
}
```

### Export Endpoints

- GET /export/csv: download the scraped data in CSV format
- GET /export/json: download the scraped data in JSON format
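For a quick smoke test, the endpoint can be exercised from Python (assuming the app is running locally on port 5000):

```python
import requests

payload = {
    "scrape_type": "generic",
    "url": "https://example.com",
    "max_items": 10,
    "css_selectors": {"container": ".article", "title": "h2", "link_href": "a"},
}

resp = requests.post("http://localhost:5000/scrape", json=payload, timeout=60)
resp.raise_for_status()
result = resp.json()
print(f"Scraped {result['total_items']} items")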
## Testing

Scenarios to verify (an automated starting point follows this list):

- Hacker News scraping with different vote thresholds
- Generic website scraping with various selectors
- CSV export functionality
- JSON export functionality
- Mobile responsiveness
- Error handling (invalid URLs, network issues)
- Large dataset handling
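Unit tests with pytest are still on the roadmap; a possible starting point using Flask's built-in test client is sketched below (it assumes `app.py` exposes the Flask instance as `app`, and the exact error behavior of `/scrape` is an assumption):

```python
# test_app.py -- run with: pytest
import pytest
from app import app  # assumes app.py exposes the Flask instance as `app`

@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as client:
        yield client

def test_scrape_rejects_missing_url(client):
    """Generic scraping without a URL should not report success."""
    resp = client.post("/scrape", json={"scrape_type": "generic"})
    data = resp.get_json(silent=True) or {}
    assert resp.status_code >= 400 or data.get("success") is False
```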
## Example Sites

- News Sites: BBC, Reuters, TechCrunch
- Blogs: Medium articles, WordPress sites
- E-commerce: Product listings (respect robots.txt)
- Forums: Reddit, Stack Overflow
## Contributing

We welcome contributions! Here's how to get started:

1. Fork the repository
2. Create a feature branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. Make changes and test thoroughly
4. Commit with clear messages:

   ```bash
   git commit -m "Add: New CSS selector validation"
   ```

5. Push to the branch:

   ```bash
   git push origin feature/amazing-feature
   ```

6. Create a Pull Request
### Contribution Guidelines

- Code Style: Follow PEP 8 for Python, use meaningful variable names
- Documentation: Update README and add docstrings
- Testing: Test your changes thoroughly
- Security: Never commit secrets or API keys
## Planned Features

- Unit tests with pytest
- Rate limiting and request throttling
- Database storage for scraped data
- User authentication and personal dashboards
- Scheduled scraping with cron jobs
- Support for JavaScript-rendered sites (Selenium)
- Advanced filtering and search functionality
- Data visualization with charts
- API documentation with Swagger
- Webhook notifications
- Dark/light theme toggle
- Scraping templates for popular sites
- Bulk URL processing
- Data comparison between scrapes
- Browser extension for quick scraping
## Bug Reports

When reporting bugs, include:
- Environment details (OS, Python version, browser)
- Steps to reproduce
- Expected vs actual behavior
- Screenshots (if applicable)
- Error logs/console output
## CSS Selector Reference

Basic selectors:

```css
/* Element */
div, p, h1

/* Class */
.article, .post, .content

/* ID */
#main-content

/* Attribute */
[data-testid="article"]
```
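Before configuring the scraper, you can sanity-check a selector against a live page with the same libraries the app depends on (a quick sketch; the URL and selectors are placeholders):

```python
import requests
from bs4 import BeautifulSoup

# Count how many nodes each candidate selector matches on the target page.
url = "https://example.com"  # placeholder target
soup = BeautifulSoup(requests.get(url, timeout=15).text, "lxml")

for selector in [".article", "h1, h2", "a[href*='article']"]:
    print(f"{selector!r}: {len(soup.select(selector))} match(es)")
```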
Advanced selectors:

```css
/* Descendant */
.article p

/* Child */
.article > h2

/* Nth-child */
.post:nth-child(2)

/* Contains text */
a[href*="article"]
```

For sites with JavaScript-rendered content, consider:
- Selenium WebDriver integration (see the sketch after this list)
- Requests-HTML for basic JS support
- API endpoints if available
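If you go the Selenium route, the rendered page source can be handed to the same BeautifulSoup pipeline. A sketch (requires the `selenium` package; Selenium 4 fetches a matching browser driver automatically):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder target
    # page_source contains the DOM *after* JavaScript has run
    soup = BeautifulSoup(driver.page_source, "lxml")
    titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
    print(titles)
finally:
    driver.quit()
```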
## Performance Tips

- Use session reuse for multiple requests (see the sketch after this list)
- Implement caching for repeated scrapes
- Add delays to respect server load
- Use connection pooling
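Session reuse and polite delays in practice (a sketch; the URL list is a placeholder):

```python
import time
import requests

session = requests.Session()  # reuses TCP connections across requests
session.headers.update({"User-Agent": "Mozilla/5.0"})

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder
for url in urls:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()
    # ... parse resp.text here ...
    time.sleep(1)  # small delay to respect server load
```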
Issue: "Connection timeout" Solution: Check internet connection, try different user agent
Issue: "No data found" Solution: Verify CSS selectors with browser dev tools
Issue: "Template error with integers" Solution: This is fixed in the latest version
Issue: "Export not working" Solution: Ensure you've scraped data first
Enable debug mode for detailed error messages:

```python
app.run(debug=True)
```

## License

This project is licensed under the MIT License; see below for details:
MIT License
Copyright (c) 2024 Flask Web Scraper
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
## Acknowledgments

- Beautiful Soup for HTML parsing
- Flask for the web framework
- Pandas for data manipulation
- Bootstrap for responsive design
- Font Awesome for icons
- Hacker News for inspiration
## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
## Future Enhancements

- User authentication system
- Database integration (PostgreSQL/MongoDB)
- RESTful API with authentication
- Scheduling system for automated scraping
- Advanced data visualization
- Machine learning for content extraction
- Multi-language support
- Cloud deployment templates (AWS, GCP, Azure)
- Performance monitoring dashboard
⭐ Star this repository if it helped you!

Fork it to contribute or customize it for your needs.

Built with ❤️ by the community