Larger and more complex sites pose a greater challenge, but that is not a problem. If a site cannot be scraped with plain HTTPS requests in Scrapy, we can configure Playwright within the project; Splash and Selenium are further alternatives. For particularly intricate and well-protected websites, we may build a project exclusively with Selenium, as demonstrated in the project below.
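As a quick illustration of the Playwright route, here is a minimal sketch, assuming the scrapy-playwright package; the spider name, file split, and URL are placeholders and not taken from any of the projects below:

```python
# settings.py: route requests through Playwright instead of plain HTTP
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# spider: only requests flagged with meta={"playwright": True} use the browser
import scrapy

class RenderedSpider(scrapy.Spider):
    name = "rendered_example"  # placeholder name

    def start_requests(self):
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        # the response already contains the JavaScript-rendered HTML
        yield {"title": response.css("title::text").get()}
```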
Selenium app for SEO Analysis: this project is built around a Selenium web driver. I implemented more complex logic here, including handling anti-bot detection and configuring proxies (see Browser Config for Selenium).
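As an illustration of that kind of setup, here is a minimal sketch of a headless Chrome configuration with an optional proxy; the flags, proxy value, and URL are placeholders, not the exact configuration from the repository:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def build_driver(proxy=None):
    """Build a headless Chrome driver; 'proxy' is e.g. 'http://host:port'."""
    options = Options()
    options.add_argument("--headless=new")                                 # no visible window
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--disable-blink-features=AutomationControlled")  # basic anti-bot hygiene
    if proxy:
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)

driver = build_driver()
driver.get("https://example.com")   # placeholder URL
print(driver.title)
driver.quit()
```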
Video presentation of the Selenium headless app:
The extensive Scrapy project I created for peviitor.ro can be found at: Scrapy full automated project. In this project, I configured Scrapy for crawling, processing items, and updating the API data on peviitor.ro. Scrapy is a strong choice for large-scale scraping projects because its project layout is standardized and easy to follow: even if someone else takes over your project, they will quickly understand the configuration.
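As a rough illustration of that structure, here is a minimal sketch of a spider plus an item pipeline that pushes results to an API. The spider name, selectors, and endpoint are placeholders rather than the actual peviitor.ro configuration, and the pipeline would still need to be registered in ITEM_PIPELINES:

```python
import scrapy
import requests

class JobsSpider(scrapy.Spider):
    name = "jobs_example"                                  # placeholder spider name
    start_urls = ["https://example-company.com/careers"]   # placeholder URL

    def parse(self, response):
        for job in response.css("div.job-listing"):        # placeholder selectors
            yield {
                "job_title": job.css("h2::text").get(),
                "job_link": response.urljoin(job.css("a::attr(href)").get()),
                "city": job.css(".location::text").get(),
            }

class ApiUploadPipeline:
    """Item pipeline that collects items and pushes them to an API in one batch."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # hypothetical endpoint; the real peviitor.ro API route differs
        requests.post("https://api.example.com/jobs", json=self.items, timeout=30)
```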
Scrapy project video presentation:
Another of my projects is for Peviitor.ro, a job search engine. You can find the custom build for scraping at: Peviitor Project - Custom build for Scraping. This section provides insights into my custom scraping solution and is designed for new contributors. It is straightforward to create a new scraping template within this project, so users will find it easy to add new scrapers:
You can create new scraper files from -> ... your terminal. For example: python3 __create_scraper.py "name_scraper" "link" "type_scraper". It's really useful when you have a lot of scrapers; a sketch of such a generator script appears after the list of scraper types below.
You can create your own scraper:
... static
... dynamic_json_get
... dynamic_json_post
... dynamic_render
... custom
File for creating the scraping template. Folder with the __utils files that automate the scraping process.
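For illustration, here is a hedged sketch of what such a generator script could look like; the template body, output folder, and helper layout are assumptions, not the actual __create_scraper.py from the repository:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a scraper-template generator.

Usage: python3 __create_scraper.py "name_scraper" "link" "type_scraper"
"""
import sys
from pathlib import Path

TEMPLATE = '''\
# Auto-generated scraper skeleton for {name} ({type})
LINK = "{link}"

def scrape():
    # TODO: add the parsing logic for {name}
    ...
'''

def main():
    if len(sys.argv) != 4:
        sys.exit('Usage: python3 __create_scraper.py "name" "link" "type"')
    name, link, scraper_type = sys.argv[1:4]
    out = Path("sites") / f"{name}.py"     # output folder is an assumption
    out.parent.mkdir(exist_ok=True)
    out.write_text(TEMPLATE.format(name=name, link=link, type=scraper_type))
    print(f"Created {out}")

if __name__ == "__main__":
    main()
```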
Custom scraping project video presentation:
In this project, I opted not to use Selenium because of the limitations of GitHub Actions, which provides only 4 GB of RAM per virtual machine. Instead, I used requests_html, a lightweight browser emulator that proved sufficient for our requirements.
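A minimal sketch of how requests_html renders a JavaScript-driven page; the URL and selector are placeholders:

```python
from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://example.com/jobs")   # placeholder URL
response.html.render(timeout=20)                     # execute the page's JavaScript
for link in response.html.find("a.job-link"):        # placeholder selector
    print(link.text, link.attrs.get("href"))
```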
I implemented the Singleton pattern in this project to streamline the workflow for scrapers. The feedback from the five students who have used it has been very positive: they only need to create an instance, and the results are conveniently available on that instance, with no additional methods to invoke.
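A minimal sketch of that idea, assuming it is the requests_html session that gets shared; the class name, attributes, and URL are placeholders rather than the actual implementation:

```python
from requests_html import HTMLSession

class Scraper:
    """Sketch of a singleton scraper: every instantiation returns the same
    shared object, so one HTMLSession (and its results) is reused everywhere."""

    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, url=None):
        if not hasattr(self, "session"):
            self.session = HTMLSession()
        if url:
            # results live on the instance; no extra method calls are needed
            self.response = self.session.get(url)

# usage: contributors just create an instance and read its attributes
first = Scraper("https://example.com")   # placeholder URL
second = Scraper()
assert first is second                   # both names refer to the same object
print(first.response.status_code)
```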
In the web scraping process, the first step is to look for a REST API or SOAP service exposed by the site. If none is available, the next step is to parse the HTML. For dynamic pages rendered with JavaScript, we then look for an XML file associated with the site. It is crucial to compare that .xml file with the site's category pages; only if they match do we scrape from it, because the XML file may not be updated instantly when the site's categories change.
If none of these methods proves effective, we resort to using Playwright or Selenium to interact directly with the web page.
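Put together, the whole decision flow looks roughly like the sketch below; the endpoints, selectors, and sitemap path are assumptions used only to illustrate the order of fallbacks:

```python
import requests
from bs4 import BeautifulSoup

def scrape(site):
    """Illustrative fallback chain; endpoints and selectors are assumptions."""
    # 1. Prefer a REST API if the site exposes one (often visible in the browser's network tab).
    api = requests.get(f"{site}/api/jobs", timeout=10)           # hypothetical endpoint
    if api.ok and "json" in api.headers.get("Content-Type", ""):
        return api.json()

    # 2. Otherwise parse the static HTML.
    page = requests.get(site, timeout=10)
    jobs = BeautifulSoup(page.text, "html.parser").select("div.job")   # placeholder selector
    if jobs:
        return [job.get_text(strip=True) for job in jobs]

    # 3. For JavaScript-rendered pages, check the XML sitemap, but compare it
    #    against the live category pages before trusting it.
    sitemap = requests.get(f"{site}/sitemap.xml", timeout=10)
    if sitemap.ok:
        return [loc.get_text() for loc in BeautifulSoup(sitemap.text, "html.parser").find_all("loc")]

    # 4. Last resort: drive a real browser with Playwright or Selenium.
    raise RuntimeError(f"{site} needs browser automation (Playwright/Selenium)")
```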
I have worked extensively with Scrapy, a Python web scraping framework, to extract and analyze data from various websites. Leveraging its flexible architecture and powerful features, I have developed efficient web scraping solutions for a range of projects.





