Commit 0a0ee12

add docs for overrides
1 parent 7495b58 commit 0a0ee12

4 files changed

+129
-2
lines changed

docs/api_reference.rst

Lines changed: 5 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
1+
.. _`api-reference`:
2+
13
=============
24
API Reference
35
=============
@@ -47,11 +49,13 @@ Mixins
4749
:no-special-members:
4850

4951

52+
.. _`api-overrides`:
53+
5054
Overrides
5155
=========
5256

5357
.. autofunction:: web_poet.handle_urls
5458

5559
.. automodule:: web_poet.overrides
5660
:members:
57-
:exclude-members: handle_urls
61+
:exclude-members: handle_urls

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
3333

3434
intro/tutorial
3535
intro/from-ground-up
36+
intro/overrides
3637

3738
.. toctree::
3839
:caption: Reference

docs/intro/overrides.rst

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
1+
.. _`intro-overrides`:
2+
3+
Overrides
4+
=========
5+
6+
Overrides contain mapping rules that declare which Page Object should be used
7+
for which URLs. URL matching is handled by a separate library
8+
called `url-matcher <https://url-matcher.readthedocs.io>`_.
9+
10+
These matching rules are the core of the Overrides concept: for matching URLs,
11+
a more specific Page Object is used in lieu of the original one.
12+
13+
This enables ``web-poet`` to be used effectively by other frameworks like
14+
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
15+
16+
Example Use Case
17+
----------------
18+
19+
Let's explore an example use case for the Overrides concept.
20+
21+
Suppose we're using Page Objects in a broad-crawl project that explores
22+
eCommerce websites to discover product pages. It isn't feasible
23+
for us to create parsers for every website, since we don't know in advance
24+
which sites we're going to crawl.
25+
26+
However, we could at least create a generic Page Object to support parsing of
27+
some fields in well-known locations of product information like ``<title>``.
28+
This enables our broad crawler to at least parse some useful information. Let's
29+
call this Page Object ``GenericProductPage``.
30+
31+
Assuming that one of our project requirements is to fully support parsing of the
32+
`top 3 eCommerce websites`, then we'd need to create a Page Object for each one
33+
to parse more specific fields.
34+
35+
Here's where the Overrides concept comes in:
36+
37+
1. The ``GenericProductPage`` is used to parse all eCommerce product pages
38+
`by default`.
39+
2. Whenever one of our declared URL rules matches with a given page URL,
40+
then the Page Object associated with that rule `overrides (or replaces)`
41+
the default ``GenericProductPage``.
42+
43+
This enables us to fine-tune our parsing logic (which is encapsulated in
44+
each Page Object) depending on the page we're parsing.
45+
46+
Let's see this in action by creating Page Objects below.
47+
48+
49+
Creating Overrides
50+
------------------
51+
52+
Let's take a look at how the following code is structured:
53+
54+
.. code-block:: python
55+
56+
    from web_poet import handle_urls
57+
    from web_poet.pages import ItemWebPage
58+
59+
    class GenericProductPage(ItemWebPage):
60+
        def to_item(self):
61+
            return {"product title": self.css("title::text").get()}
62+
63+
    @handle_urls("example.com", overrides=GenericProductPage)
64+
    class ExampleProductPage(ItemWebPage):
65+
        def to_item(self):
66+
            ...  # more specific parsing
67+
68+
    @handle_urls("anotherexample.com", overrides=GenericProductPage, exclude="/digital-goods/")
69+
    class AnotherExampleProductPage(ItemWebPage):
70+
        def to_item(self):
71+
            ...  # more specific parsing
72+
73+
    @handle_urls(["dualexample.com", "dualexample.net"], overrides=GenericProductPage)
74+
    class DualExampleProductPage(ItemWebPage):
75+
        def to_item(self):
76+
            ...  # more specific parsing
78+
The code above declares that:
79+
80+
- For sites that match the ``example.com`` pattern, ``ExampleProductPage``
81+
would be used instead of ``GenericProductPage``.
82+
- The same is true for ``DualExampleProductPage``, which is used
83+
instead of ``GenericProductPage`` for two URL patterns: ``dualexample.com`` and
84+
``dualexample.net``.
85+
- However, ``AnotherExampleProductPage`` is used instead of ``GenericProductPage``
86+
only when we're parsing pages from ``anotherexample.com`` whose URL path doesn't contain
87+
``/digital-goods/``.
88+
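The selection logic described in the list above can be sketched in plain Python. This is a simplified illustration only: the ``Rule`` class and the ``matches`` and ``resolve`` helpers are hypothetical names invented for this sketch, class names are represented as strings, and the matching is a naive domain and substring check rather than the actual ``url-matcher`` algorithm.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Rule:
    """Hypothetical stand-in for an override rule (not web-poet's own class)."""
    domains: list
    use: str
    instead_of: str
    exclude: list = field(default_factory=list)


def matches(rule, url):
    """Naive matching: domain (or subdomain) must match, excluded paths veto."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    domain_ok = any(host == d or host.endswith("." + d) for d in rule.domains)
    excluded = any(fragment in parsed.path for fragment in rule.exclude)
    return domain_ok and not excluded


def resolve(rules, default, url):
    """Return the overriding Page Object name, or the default if no rule matches."""
    for rule in rules:
        if rule.instead_of == default and matches(rule, url):
            return rule.use
    return default


RULES = [
    Rule(["example.com"], "ExampleProductPage", "GenericProductPage"),
    Rule(["anotherexample.com"], "AnotherExampleProductPage",
         "GenericProductPage", exclude=["/digital-goods/"]),
    Rule(["dualexample.com", "dualexample.net"], "DualExampleProductPage",
         "GenericProductPage"),
]

for url in (
    "https://example.com/item/1",
    "https://anotherexample.com/digital-goods/ebook",
    "https://unknown-shop.org/p/42",
):
    print(url, "->", resolve(RULES, "GenericProductPage", url))
```

In this sketch the excluded ``/digital-goods/`` URL and the unknown site both fall back to ``GenericProductPage``, which mirrors the "by default" behaviour described earlier.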
89+
The override mechanism that ``web-poet`` offers can be customized
90+
further. The specific parameters and alternative ways
91+
to organize the rules are described in the :ref:`Overrides API section <api-overrides>`.
92+
93+
94+
Viewing all available Overrides
95+
-------------------------------
96+
97+
A convenience function is available to discover and retrieve all rules from your
98+
project. Check out the :ref:`Overrides API section <api-overrides>`
99+
to see what else ``find_page_object_overrides`` can do.
100+
101+
.. code-block:: python
102+
103+
    from web_poet import find_page_object_overrides
104+
105+
    rules = find_page_object_overrides("my_project.page_objects")
106+
107+
    print(len(rules))  # 3
108+
109+
    print(rules[0])  # OverrideRule(for_patterns=Patterns(include=['example.com'], exclude=[], priority=500), use=<class 'my_project.page_objects.ExampleProductPage'>, instead_of=<class 'my_project.page_objects.GenericProductPage'>, meta={})
110+
111+
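Once the rules are retrieved, they can be inspected like ordinary objects. The sketch below assumes only the field names visible in the printed rule above (``for_patterns`` with ``include``, ``exclude`` and ``priority``, plus ``use``, ``instead_of`` and ``meta``). The ``Patterns`` and ``OverrideRule`` tuples here are hypothetical stand-ins so that the example runs without ``web-poet`` installed, and ``summarize`` is an illustrative helper, not part of the library.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring only the fields shown in the printed rule.
Patterns = namedtuple("Patterns", ["include", "exclude", "priority"])
OverrideRule = namedtuple("OverrideRule", ["for_patterns", "use", "instead_of", "meta"])


def summarize(rules):
    """Return one human-readable line per rule."""
    lines = []
    for rule in rules:
        line = f"{rule.use} replaces {rule.instead_of} for {rule.for_patterns.include}"
        if rule.for_patterns.exclude:
            line += f" except {rule.for_patterns.exclude}"
        lines.append(line)
    return lines


# Class names are plain strings here; real rules would hold the classes themselves.
rules = [
    OverrideRule(Patterns(["example.com"], [], 500),
                 "ExampleProductPage", "GenericProductPage", {}),
    OverrideRule(Patterns(["anotherexample.com"], ["/digital-goods/"], 500),
                 "AnotherExampleProductPage", "GenericProductPage", {}),
]

for line in summarize(rules):
    print(line)
```

With real rules, ``use`` and ``instead_of`` are classes, so the printed text would show their class representations instead of bare names.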
112+
A handy CLI tool is also available at your disposal to quickly see the available
113+
Override rules in a given module in your project. For example, invoking something
114+
like ``web_poet my_project.page_objects`` would produce the following:
115+
116+
.. code-block::
117+
118+
    Use this                                           instead of                                  for the URL patterns                    except for the patterns  with priority  meta
119+
    -------------------------------------------------  ------------------------------------------  --------------------------------------  -----------------------  -------------  ----
120+
    my_project.page_objects.ExampleProductPage         my_project.page_objects.GenericProductPage  ['example.com']                         []                                 500  {}
121+
    my_project.page_objects.AnotherExampleProductPage  my_project.page_objects.GenericProductPage  ['anotherexample.com']                  ['/digital-goods/']                500  {}
122+
    my_project.page_objects.DualExampleProductPage     my_project.page_objects.GenericProductPage  ['dualexample.com', 'dualexample.net']  []                                 500  {}

docs/intro/tutorial.rst

Lines changed: 1 addition & 1 deletion
@@ -131,4 +131,4 @@ As you can see, it's possible to use web-poet with built-in libraries such as
131131
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
132132

133133
If you want to understand the idea behind web-poet better,
134-
check the :ref:`from-ground-up` tutorial.
134+
check the :ref:`from-ground-up` tutorial.
