|
| 1 | +.. _`intro-overrides`: |
| 2 | + |
| 3 | +Overrides |
| 4 | +========= |
| 5 | + |
| 6 | +Overrides contain mapping rules that associate URLs with the Page Object |
| 7 | +used to parse them. The URL matching itself is handled by another library |
| 8 | +called `url-matcher <https://url-matcher.readthedocs.io>`_. |
| 9 | + |
| 10 | +These matching rules establish the core concept of Overrides: a specific |
| 11 | +Page Object can be used in lieu of the original one. |
| 12 | + |
| 13 | +This enables ``web-poet`` to be used effectively by other frameworks like |
| 14 | +`scrapy-poet <https://scrapy-poet.readthedocs.io>`_. |
| 15 | + |
| 16 | +Example Use Case |
| 17 | +---------------- |
| 18 | + |
| 19 | +Let's explore an example use case for the Overrides concept. |
| 20 | + |
| 21 | +Suppose we're using Page Objects for our broadcrawl project which explores |
| 22 | +eCommerce websites to discover product pages. It wouldn't be feasible |
| 23 | +for us to create parsers for all websites since we don't know which sites we're |
| 24 | +going to crawl beforehand. |
| 25 | + |
| 26 | +However, we could at least create a generic Page Object to support parsing of |
| 27 | +some fields in well-known locations of product information like ``<title>``. |
| 28 | +This enables our broadcrawler to at least parse some useful information. Let's |
| 29 | +call this Page Object ``GenericProductPage``. |
| 30 | + |
| 31 | +Assuming that one of our project requirements is to fully support parsing of the |
| 32 | +`top 3 eCommerce websites`, then we'd need to create a Page Object for each one |
| 33 | +to parse more specific fields. |
| 34 | + |
| 35 | +Here's where the Overrides concept comes in: |
| 36 | + |
| 37 | + 1. The ``GenericProductPage`` is used to parse all eCommerce product pages |
| 38 | + `by default`. |
| 39 | + 2. Whenever one of our declared URL rules matches with a given page URL, |
| 40 | + then the Page Object associated with that rule `overrides (or replaces)` |
| 41 | + the default ``GenericProductPage``. |
| 42 | + |
| 43 | +This enables us to fine-tune our parsing logic `(which is abstracted away in |
| 44 | +each Page Object)` depending on the page we're parsing. |
| 45 | + |
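The two steps above can be sketched in plain Python. This is only an illustration of the idea, not how ``web-poet`` implements it: the ``OVERRIDES`` mapping, the ``pick_page_object`` helper, and the empty placeholder classes are all hypothetical, and the match here is a simple domain comparison rather than a real ``url-matcher`` rule.

```python
from urllib.parse import urlparse


class GenericProductPage:
    """Placeholder for the default Page Object."""


class ExampleProductPage:
    """Placeholder for a site-specific Page Object."""


# Hypothetical rule table: domain -> Page Object that replaces the default.
OVERRIDES = {"example.com": ExampleProductPage}


def pick_page_object(url, default=GenericProductPage):
    """Return the override registered for the URL's domain, else the default."""
    host = urlparse(url).hostname or ""
    for domain, page_object in OVERRIDES.items():
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return page_object
    return default
```

With this sketch, a URL on ``example.com`` resolves to ``ExampleProductPage``, while any other URL falls back to ``GenericProductPage``.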
| 46 | +Let's see this in action by creating Page Objects below. |
| 47 | + |
| 48 | + |
| 49 | +Creating Overrides |
| 50 | +------------------ |
| 51 | + |
| 52 | +Let's take a look at how the following code is structured: |
| 53 | + |
| 54 | +.. code-block:: python |
| 55 | +
|
| 56 | +    from web_poet import handle_urls |
| 57 | +    from web_poet.pages import ItemWebPage |
| 58 | +
|
| 59 | +    class GenericProductPage(ItemWebPage): |
| 60 | +        def to_item(self): |
| 61 | +            return {"product title": self.css("title::text").get()} |
| 62 | +
|
| 63 | +    @handle_urls("example.com", overrides=GenericProductPage) |
| 64 | +    class ExampleProductPage(ItemWebPage): |
| 65 | +        def to_item(self): |
| 66 | +            ...  # more specific parsing |
| 67 | +
|
| 68 | +    @handle_urls("anotherexample.com", overrides=GenericProductPage, exclude="/digital-goods/") |
| 69 | +    class AnotherExampleProductPage(ItemWebPage): |
| 70 | +        def to_item(self): |
| 71 | +            ...  # more specific parsing |
| 72 | +
|
| 73 | +    @handle_urls(["dualexample.com", "dualexample.net"], overrides=GenericProductPage) |
| 74 | +    class DualExampleProductPage(ItemWebPage): |
| 75 | +        def to_item(self): |
| 76 | +            ...  # more specific parsing |
| 77 | +
|
| 78 | +The code above declares that: |
| 79 | + |
| 80 | + - For sites that match the ``example.com`` pattern, ``ExampleProductPage`` |
| 81 | + would be used instead of ``GenericProductPage``. |
| 82 | + - The same is true for ``DualExampleProductPage``, which is used |
| 83 | + instead of ``GenericProductPage`` for two URL patterns: ``dualexample.com`` and |
| 84 | + ``dualexample.net``. |
| 85 | + - However, ``AnotherExampleProductPage`` is only used instead of ``GenericProductPage`` |
| 86 | + when we're parsing pages from ``anotherexample.com`` that don't contain |
| 87 | + ``/digital-goods/`` in their URL path. |
| 88 | + |
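To make the effect of the three rules concrete, here is a small stand-alone sketch that mimics them with the standard library only. It is a simplification under stated assumptions: the ``Rule`` class and ``resolve`` function are hypothetical, Page Objects are represented by their names, and exclusion is treated as a plain substring check on the URL path, which is looser than what ``url-matcher`` actually does.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for one override rule (not a web-poet class)."""

    use: str
    include: tuple
    exclude: tuple = ()


# The three rules declared above, with Page Objects referred to by name.
RULES = [
    Rule("ExampleProductPage", ("example.com",)),
    Rule("AnotherExampleProductPage", ("anotherexample.com",), ("/digital-goods/",)),
    Rule("DualExampleProductPage", ("dualexample.com", "dualexample.net")),
]


def resolve(url, default="GenericProductPage"):
    """Return the name of the Page Object that would handle ``url``."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    for rule in RULES:
        # Assumption: a domain pattern covers the domain and its subdomains.
        included = any(host == d or host.endswith("." + d) for d in rule.include)
        # Assumption: an exclude pattern is a substring of the URL path.
        excluded = any(fragment in parsed.path for fragment in rule.exclude)
        if included and not excluded:
            return rule.use
    return default
```

Here a page under ``anotherexample.com/digital-goods/`` falls back to ``GenericProductPage``, while other pages on that site resolve to ``AnotherExampleProductPage``.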
| 89 | +The override mechanism that ``web-poet`` offers can be customized further. |
| 90 | +You can read about the specific parameters and alternative ways |
| 91 | +to organize the rules in the :ref:`Overrides API section <api-overrides>`. |
| 92 | + |
| 93 | + |
| 94 | +Viewing all available Overrides |
| 95 | +------------------------------- |
| 96 | + |
| 97 | +A convenience function is available to discover and retrieve all rules from your |
| 98 | +project. Make sure to check out the :ref:`Overrides API section <api-overrides>` |
| 99 | +to see what else ``find_page_object_overrides`` can do. |
| 100 | + |
| 101 | +.. code-block:: python |
| 102 | +
|
| 103 | +    from web_poet import find_page_object_overrides |
| 104 | +
|
| 105 | +    rules = find_page_object_overrides("my_project.page_objects") |
| 106 | +
|
| 107 | +    print(len(rules))  # 3 |
| 108 | +
|
| 109 | +    print(rules[0])  # OverrideRule(for_patterns=Patterns(include=['example.com'], exclude=[], priority=500), use=<class 'my_project.page_objects.ExampleProductPage'>, instead_of=<class 'my_project.page_objects.GenericProductPage'>, meta={}) |
| 110 | +
|
| 111 | +
|
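Since the result is a plain list, it can be filtered or grouped with ordinary Python. The sketch below assumes each rule exposes the ``use``, ``instead_of``, and ``for_patterns`` attributes visible in the repr above. ``FakePatterns`` and ``FakeRule`` are hypothetical stand-ins so that the snippet runs without a real project, and ``overrides_for`` is an illustrative helper, not part of ``web-poet``.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the fields visible in the repr above.
FakePatterns = namedtuple("FakePatterns", "include exclude priority")
FakeRule = namedtuple("FakeRule", "for_patterns use instead_of meta")

rules = [
    FakeRule(FakePatterns(["example.com"], [], 500),
             "ExampleProductPage", "GenericProductPage", {}),
    FakeRule(FakePatterns(["anotherexample.com"], ["/digital-goods/"], 500),
             "AnotherExampleProductPage", "GenericProductPage", {}),
]


def overrides_for(rules, page_object):
    """Map each include pattern to the Page Object replacing ``page_object``."""
    return {
        pattern: rule.use
        for rule in rules
        if rule.instead_of == page_object
        for pattern in rule.for_patterns.include
    }
```

Calling ``overrides_for(rules, "GenericProductPage")`` on the stand-in data gives a pattern-to-replacement mapping that is easy to scan when reviewing a project's rules.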
| 112 | +A handy CLI tool is also available to quickly list the |
| 113 | +Override rules in a given module of your project. For example, invoking something |
| 114 | +like ``web_poet my_project.page_objects`` would produce the following: |
| 115 | + |
| 116 | +.. code-block:: |
| 117 | +
|
| 118 | +    Use this                                           instead of                                  for the URL patterns                    except for the patterns  with priority  meta |
| 119 | +    -------------------------------------------------  ------------------------------------------  --------------------------------------  -----------------------  -------------  ---- |
| 120 | +    my_project.page_objects.ExampleProductPage         my_project.page_objects.GenericProductPage  ['example.com']                         []                                 500  {} |
| 121 | +    my_project.page_objects.AnotherExampleProductPage  my_project.page_objects.GenericProductPage  ['anotherexample.com']                  ['/digital-goods/']                500  {} |
| 122 | +    my_project.page_objects.DualExampleProductPage     my_project.page_objects.GenericProductPage  ['dualexample.com', 'dualexample.net']  []                                 500  {} |