@ivanprado ivanprado commented Dec 10, 2021

WARNING: This was developed on top of #56. Merge it only after #56 has been merged.

  • scrapy startproject is modified so that the generated project is ready for scrapy-poet. It also creates folders for the Page Objects and for their tests and fixtures.
  • scrapy override creates a Page Object and a test case for a given web page, which speeds up development.
  • A template system allows customizing the generated Page Object and test code.
  • The override command can also be used to refresh the fixture data with fresh web data. It is also needed when the dependencies of a Page Object have changed: in that case, running the command again is required to fetch the fixtures for the new dependencies.
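The fixture workflow described above can be sketched roughly as follows. This is only an illustration using the standard library: BookPage, save_fixture, load_fixture and check_book_page are hypothetical names, not the code that the override command generates, and the fixture layout is an assumption.

```python
import re
import tempfile
from pathlib import Path


class BookPage:
    """Hypothetical Page Object: turns raw HTML into an item dict."""

    def __init__(self, html: str) -> None:
        self.html = html

    def to_item(self) -> dict:
        # Naive extraction, good enough for the illustration.
        match = re.search(r"<h1>(.*?)</h1>", self.html, re.S)
        return {"title": match.group(1).strip() if match else None}


def save_fixture(path: Path, html: str) -> None:
    """Store a page body on disk. Calling it again refreshes the fixture."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")


def load_fixture(path: Path) -> str:
    """Read a previously saved page body."""
    return path.read_text(encoding="utf-8")


def check_book_page(fixture: Path) -> None:
    """Run the Page Object offline against the saved fixture."""
    item = BookPage(load_fixture(fixture)).to_item()
    assert item == {"title": "Dune"}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fixture = Path(tmp) / "fixtures" / "book.html"
        save_fixture(fixture, "<html><h1> Dune </h1></html>")
        check_book_page(fixture)
```

Because the check only reads the saved file, it runs without network access. Refreshing the data amounts to overwriting the fixture and rerunning the same check.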

TODO

  • Improve the code structure
  • Improve error messages
  • Documentation

Remaining work for the future:

  • There is no way to do garbage collection over the unused fixtures.

How the documentation could be structured

Rewrite the tutorial using the new startproject and override commands. The goal should be to create a generic spider with common crawling logic and then integrate different sites. The spider could, for example, extract books from categories in book review pages. The structure could be:

  1. Explanation of what we are going to show: extracting books from different sites with different layouts while keeping a common crawling logic. Enumerate the different steps.
  2. Creating a spider
    2.1. Create a new project using startproject
    2.2. Write a spider that relies on Page Objects (empty implementation)
    2.3. Create the first override using the tool
      2.3.1. Explain the handle_url decorator and link to the web_poet and url-matcher documentation
    2.4. Implement the extraction logic in the PO
    2.5. Use the unit test to check that the logic is right
    2.6. Do the same for the rest of the POs for the site
    2.7. Run the spider
    2.8. Integrate the second site
    2.9. Summary of what happened
  3. Rerunning the override command over the same PO and URL. When and why:
    3.1. To get fresh data, e.g. because the layout of the site changed and we need to update the extraction code
    3.2. When the PO has new dependencies: the command must be rerun to fetch the new resources
  4. Templates. What they are and how to modify them.
    4.1. Default templates vs. specific ones
  5. Listing the Page Objects using python -m web_poet
  6. Existing pages/classes that can be used:
    6.1. ItemPage
    6.2. ItemWebPage
    6.3. RequestData
    6.4. Injectable

Keep in mind that the tutorial will be the entry point for many people. It is really important to have a tutorial that is good, simple, and convinces readers of the value of the approach.
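The core idea of the outline above (one generic spider, site-specific Page Objects chosen by URL) could be illustrated in the tutorial with something like the sketch below. It uses only the standard library. handle_domain, OVERRIDES, page_for and the books.example domain are made up for the illustration and are not the web_poet or url-matcher API; they only mimic the role the outline assigns to the handle_url decorator.

```python
import re
from typing import Dict, List, Type
from urllib.parse import urlparse


class BookListPage:
    """Default (empty) Page Object that the generic spider asks for."""

    def __init__(self, html: str) -> None:
        self.html = html

    def book_urls(self) -> List[str]:
        return []


# Hypothetical registry: domain -> Page Object class overriding the default.
OVERRIDES: Dict[str, Type[BookListPage]] = {}


def handle_domain(domain: str):
    """Hypothetical decorator registering a site-specific override."""

    def register(cls: Type[BookListPage]) -> Type[BookListPage]:
        OVERRIDES[domain] = cls
        return cls

    return register


@handle_domain("books.example")
class ExampleBookListPage(BookListPage):
    """Extraction logic for one site's layout."""

    def book_urls(self) -> List[str]:
        return re.findall(r'href="(/book/[^"]+)"', self.html)


def page_for(url: str, html: str) -> BookListPage:
    """Pick the override registered for the URL's domain, if any."""
    page_cls = OVERRIDES.get(urlparse(url).netloc, BookListPage)
    return page_cls(html)


def crawl(url: str, html: str) -> List[str]:
    """Common crawling logic: it never mentions a concrete site."""
    return page_for(url, html).book_urls()
```

Integrating a second site then means adding one more decorated class, while crawl stays untouched.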

@ivanprado ivanprado changed the title from "[WIP] startproject and override command line tool for Page Objects development" to "startproject and override command line tool for Page Objects development" Dec 13, 2021
@sortafreel sortafreel left a comment

Great job @ivanprado 👍 I've left a couple of comments here and there :)

po_path=po_path,
test_path=test_path,
)
self.context = context

I'm a bit confused here. Should we maybe init self.context and self.po_path with all the typing before assigning any values to them inside the methods?
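A minimal sketch of that suggestion, assuming a command class shaped roughly like the one under review. OverrideCommand, prepare and the dict-based context are hypothetical stand-ins, since the real class and the real context type are not shown in this thread.

```python
from pathlib import Path
from typing import Any, Dict, Optional


class OverrideCommand:
    """Hypothetical stand-in for the command class under review."""

    def __init__(self) -> None:
        # Declare every attribute up front, with its type, so readers and
        # type checkers see the full state of the object in one place.
        self.context: Optional[Dict[str, Any]] = None
        self.po_path: Optional[Path] = None

    def prepare(self, po_path: Path, test_path: Path) -> None:
        # Methods only assign to attributes that __init__ already declared.
        self.po_path = po_path
        self.context = {"po_path": po_path, "test_path": test_path}
```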

print("Fixture saved successfully")

self.po_test_path = generate_test(self.context)
print()

Why do we stick with print instead of logging?
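For comparison, the same message routed through the standard logging module might look like the sketch below. report_fixture_saved is a hypothetical helper, not a function from this PR.

```python
import logging

logger = logging.getLogger(__name__)


def report_fixture_saved(path: str) -> None:
    """Emit the status message through logging instead of print."""
    logger.info("Fixture saved successfully: %s", path)


if __name__ == "__main__":
    # A command line tool can still show the messages on the console.
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    report_fixture_saved("fixtures/book.html")
```

With logging, callers can silence, redirect, or filter the output by level, which print does not allow. For a command line tool whose output is meant for the user, print remains a defensible choice.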

Base automatically changed from url-matcher-integration to master May 19, 2022 05:48