Commit bd3a88e

enable ease of combining external Page Object packages
1 parent f626efc commit bd3a88e

3 files changed

+199
-7
lines changed

docs/intro/overrides.rst

Lines changed: 171 additions & 1 deletion
@@ -316,7 +316,7 @@ instances of the :class:`~.PageObjectRegistry` instead:
316316
cool_gadget_fr_registry = PageObjectRegistry()
317317
furniture_shop_registry = PageObjectRegistry()
318318
319-
After declaring the :class:`~.PageObjectRegistry` instances, they can be imported
319+
After declaring the :class:`~.PageObjectRegistry` instances, they can be used
320320
in each of the Page Object packages like so:
321321

322322
.. code-block:: python
@@ -432,3 +432,173 @@ Retrieving all of the Product Listing Override rules would simply be:
432432
433433
# We can also filter it down further on a per site basis if needed.
434434
rules = product_listings_registry.get_overrides_from("my_page_obj_project.cool_gadget_site")
435+
436+
Using Overrides from External Packages
437+
--------------------------------------
438+
439+
Developers have the option to import existing Page Objects alongside the Override
440+
Rules attached to them. This section aims to showcase different ways you can
441+
play with the Registries to manipulate the Override Rules according to your needs.
442+
443+
Let's suppose we have the following use case before us:
444+
445+
- An external Python package named ``ecommerce_page_objects`` is available
446+
which contains Page Objects for common websites. It's using the
447+
``default_registry`` from **web-poet**.
448+
- Another similar package named ``gadget_sites_page_objects`` is available
449+
for more specific websites. It's using its own registry named
450+
``gadget_registry``.
451+
- Your project's objective is to handle as many eCommerce websites as you
452+
can. Thus, you'd want to use the already available packages above and
453+
perhaps improve on them or create new Page Objects for new websites.
454+
455+
Assuming that you'd want to **use all existing Override rules from the external
456+
packages** in your project, you can do it like:
457+
458+
.. code-block:: python
459+
460+
import ecommerce_page_objects
461+
import gadget_sites_page_objects
462+
from web_poet import PageObjectRegistry, consume_modules, default_registry
463+
464+
consume_modules("ecommerce_page_objects", "gadget_sites_page_objects")
465+
466+
combined_registry = PageObjectRegistry()
467+
combined_registry.data = {
468+
# Since ecommerce_page_objects is using web_poet.default_registry, then
469+
# it functions like a global registry which we can access as:
470+
**default_registry.data,
471+
472+
**gadget_sites_page_objects.gadget_registry.data,
473+
}
474+
475+
combined_rules = combined_registry.get_overrides()
476+
477+
# The combined_rules would be as follows:
478+
# 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={})
479+
# 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={})
480+
# 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={})
481+
# 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={})
482+
483+
.. note::
484+
485+
Note that ``registry.get_overrides() == list(registry.data.values())``. We're
486+
using ``registry.data`` for these cases so that we can easily look up specific
487+
Page Objects using the ``dict``'s key. Otherwise, it may become a problem on
488+
large cases with lots of Override rules.
489+
490+
.. note::
491+
492+
If you don't need the entire data contents of Registries, then you can opt
493+
to use :meth:`~.PageObjectRegistry.data_from` to easily filter them out
494+
per package/module.
495+
496+
Here's an example:
497+
498+
.. code-block:: python
499+
500+
default_registry.data_from("ecommerce_page_objects.site_1", "ecommerce_page_objects.site_2")
501+
502+
As you can see in the example above, we can easily combine the data from multiple
503+
different registries as it simply follows a ``Dict[Callable, OverrideRule]``
504+
structure. There won't be any duplication or clashes of ``dict`` keys between
505+
registries of different external packages since the keys are the Page Object
506+
classes intended to be used. From our example above, the ``dict`` keys from a
507+
given ``data`` registry attribute would be:
508+
509+
1. ``<class 'ecommerce_page_objects.site_1.EcomSite1'>``
510+
2. ``<class 'ecommerce_page_objects.site_2.EcomSite2'>``
511+
3. ``<class 'gadget_sites_page_objects.site_2.GadgetSite2'>``
512+
4. ``<class 'gadget_sites_page_objects.site_3.GadgetSite3'>``
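
The no-clash claim can be checked with plain ``dict`` objects. The following is a minimal sketch, not web-poet code: ``Rule`` is a hypothetical stand-in for ``OverrideRule``, and ``make_page`` is a helper invented here to fabricate classes that report a chosen module name. It assumes only standard Python behavior: class objects hash by identity, so two classes sharing a name remain distinct keys.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Rule:
    """Hypothetical stand-in for web-poet's OverrideRule."""

    pattern: str
    use: type
    instead_of: type


def make_page(name: str, module: str) -> type:
    """Create an empty class that reports the given module name."""
    return type(name, (), {"__module__": module})


# Both packages happen to name their class "Site2Page".
ecom_page = make_page("Site2Page", "ecommerce_page_objects.site_2")
gadget_page = make_page("Site2Page", "gadget_sites_page_objects.site_2")
generic = make_page("GenericPage", "my_project")

ecom_data: Dict[type, Rule] = {ecom_page: Rule("site_2.com", ecom_page, generic)}
gadget_data: Dict[type, Rule] = {gadget_page: Rule("site_2.com", gadget_page, generic)}

# Class objects hash by identity, so the two entries stay separate.
combined = {**ecom_data, **gadget_data}
print(len(combined))  # 2

# A key present in both dicts is not duplicated: the right-most value wins.
override = Rule("site_2.com", ecom_page, object)
merged = {**ecom_data, ecom_page: override}
print(merged[ecom_page] is override)  # True
```

One consequence worth keeping in mind: if the very same class were registered in two registries, the merge would silently keep only the rule from the registry unpacked last.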
513+
514+
As you might've observed, combining the two Registries above may result in a
515+
conflict for the Override rules for **#2** and **#3**:
516+
517+
.. code-block:: python
518+
519+
# 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={})
520+
# 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={})
521+
522+
The `url-matcher`_ library is the one responsible for resolving such conflicts. It's
523+
specifically discussed in this section: `rules-conflict-resolution
524+
<https://url-matcher.readthedocs.io/en/stable/intro.html#rules-conflict-resolution>`_.
525+
526+
However, it's technically **NOT** a conflict, **yet**, since:
527+
528+
- ``ecommerce_page_objects.site_2.EcomSite2`` would only be used in **site_2.com**
529+
if ``ecommerce_page_objects.EcomGenericPage`` is to be replaced.
530+
- The same case with ``gadget_sites_page_objects.site_2.GadgetSite2`` wherein
531+
it's only going to be utilized for **site_2.com** if the following is to be
532+
replaced: ``gadget_sites_page_objects.GadgetGenericPage``.
533+
534+
It would only become a conflict if the **#2** and **#3** Override Rules for
535+
**site_2.com** both intend to replace the same Page Object. In fact, none of the
536+
Override Rules above would ever be used if your project never intends to use the
537+
following Page Objects *(since there's nothing to override)*. You can import
538+
these Page Objects into your project and use them so they can be overridden:
539+
540+
- ``ecommerce_page_objects.EcomGenericPage``
541+
- ``gadget_sites_page_objects.GadgetGenericPage``
542+
543+
However, let's assume that you want to create your own generic Page Object and
544+
only intend to use it instead of the ones above. We can easily replace them like:
545+
546+
.. code-block:: python
547+
548+
class ImprovedEcommerceGenericPage:
549+
def to_item(self):
550+
... # different type of generic parsers
551+
552+
for _, rule in combined_registry.data.items():
553+
rule.instead_of = ImprovedEcommerceGenericPage
554+
555+
updated_rules = combined_registry.get_overrides()
556+
557+
# The updated_rules would be as follows:
558+
# 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
559+
# 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
560+
# 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
561+
# 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
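
Assigning to ``rule.instead_of`` in place only works when the rule is a mutable dataclass, which is why this commit drops ``frozen=True`` from ``OverrideRule``. The sketch below illustrates the difference with two hypothetical stand-in classes (``FrozenRule`` and ``MutableRule``); it does not import web-poet. It also shows a side effect of plain Python semantics: unpacking a ``dict`` copies references, so mutating a rule through a combined ``dict`` also changes it in the source ``dict``. ``dataclasses.replace`` is one way to get an independent copy, offered here as an option rather than as the project's recommended approach.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenRule:
    use: type
    instead_of: type


@dataclass
class MutableRule:
    use: type
    instead_of: type


class OldGenericPage: ...
class NewGenericPage: ...
class SitePage: ...


# A frozen dataclass rejects attribute assignment.
frozen = FrozenRule(use=SitePage, instead_of=OldGenericPage)
try:
    frozen.instead_of = NewGenericPage
    frozen_failed = False
except dataclasses.FrozenInstanceError:
    frozen_failed = True

# Unpacking copies the dict, not the rules, so both dicts share one rule object.
source_data = {SitePage: MutableRule(use=SitePage, instead_of=OldGenericPage)}
combined_data = {**source_data}
combined_data[SitePage].instead_of = NewGenericPage
shared = source_data[SitePage].instead_of is NewGenericPage

# dataclasses.replace() builds a modified copy and leaves the original alone.
original = MutableRule(use=SitePage, instead_of=OldGenericPage)
copied = dataclasses.replace(original, instead_of=NewGenericPage)
```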
562+
563+
Now, **#2** and **#3** have a conflict since they now both intend to replace
564+
``ImprovedEcommerceGenericPage``. As mentioned earlier, the `url-matcher`_
565+
would be the one to resolve such conflicts.
566+
567+
However, it would help prevent future confusion if we could remove the source of
568+
ambiguity in our Override Rules.
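
One way to spot such ambiguity before it reaches the URL matcher is to group rules by what they replace and where. The sketch below is a simplification under stated assumptions: ``Rule`` is a hypothetical stand-in for ``OverrideRule`` with the patterns flattened into a tuple, ``find_ambiguous`` is a helper invented for this example, and only rules with *identical* include patterns and priority are flagged. Overlapping but non-identical patterns are not detected.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Rule:
    """Hypothetical, simplified stand-in for web-poet's OverrideRule."""

    include: Tuple[str, ...]
    priority: int
    use: type
    instead_of: type


def find_ambiguous(rules: Iterable[Rule]) -> List[List[Rule]]:
    """Return groups of rules sharing include patterns, priority, and target."""
    groups: Dict[tuple, List[Rule]] = defaultdict(list)
    for rule in rules:
        groups[(rule.include, rule.priority, rule.instead_of)].append(rule)
    return [group for group in groups.values() if len(group) > 1]


class GenericPage: ...
class EcomSite1: ...
class EcomSite2: ...
class GadgetSite2: ...
class GadgetSite3: ...


rules = [
    Rule(("site_1.com",), 500, EcomSite1, GenericPage),
    Rule(("site_2.com",), 500, EcomSite2, GenericPage),
    Rule(("site_2.com",), 500, GadgetSite2, GenericPage),
    Rule(("site_3.com",), 500, GadgetSite3, GenericPage),
]

ambiguous = find_ambiguous(rules)
for group in ambiguous:
    print([rule.use.__name__ for rule in group])
```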
569+
570+
Suppose we prefer ``gadget_sites_page_objects.site_2.GadgetSite2`` over
571+
``ecommerce_page_objects.site_2.EcomSite2``. As such, we could remove the latter:
572+
573+
.. code-block:: python
574+
575+
del combined_registry.data[ecommerce_page_objects.site_2.EcomSite2]
576+
577+
updated_rules = combined_registry.get_overrides()
578+
579+
# The newly updated_rules would be as follows:
580+
# 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
581+
# 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
582+
# 3. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
583+
584+
As discussed before, the Registry's data is structured simply as
585+
``Dict[Callable, OverrideRule]``, which we can easily manipulate via ``dict``
586+
operations.
587+
588+
Now, suppose we want to improve ``ecommerce_page_objects.site_1.EcomSite1``
589+
from **#1** above by perhaps adding/fixing fields. We can do that by:
590+
591+
.. code-block:: python
592+
593+
class ImprovedEcomSite1(ecommerce_page_objects.site_1.EcomSite1):
594+
def to_item(self):
595+
... # replace and improve some of the parsers here
596+
597+
combined_registry.data[ecommerce_page_objects.site_1.EcomSite1].use = ImprovedEcomSite1
598+
599+
updated_rules = combined_registry.get_overrides()
600+
601+
# The newly updated_rules would be as follows:
602+
# 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'my_project.ImprovedEcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
603+
# 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
604+
# 3. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={})
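
After the assignment above, the rule's ``use`` points at ``ImprovedEcomSite1`` while its ``dict`` key is still ``EcomSite1``. If you would rather keep keys and ``use`` values in sync, the entry can be re-keyed. This is an optional sketch using a hypothetical ``Rule`` stand-in and toy page classes; it is not something the registry requires.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Rule:
    """Hypothetical stand-in for web-poet's OverrideRule."""

    use: type
    instead_of: type


class GenericPage: ...


class EcomSite1:
    def to_item(self):
        return {"name": "raw"}


class ImprovedEcomSite1(EcomSite1):
    def to_item(self):
        # Reuse the parent's parsing and fix up a single field.
        item = super().to_item()
        item["name"] = item["name"].upper()
        return item


data: Dict[type, Rule] = {EcomSite1: Rule(use=EcomSite1, instead_of=GenericPage)}

# Remove the old entry, point the rule at the subclass, and store it under
# the new key so that the key always equals rule.use.
rule = data.pop(EcomSite1)
rule.use = ImprovedEcomSite1
data[ImprovedEcomSite1] = rule
```

Since the module filter looks at the module of the key class, re-keying also means a lookup filtered by your own project's package would find the improved Page Object.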

tests/test_overrides.py

Lines changed: 6 additions & 0 deletions
@@ -119,6 +119,12 @@ def test_list_page_objects_from_imported_registry():
119119
assert pones.for_patterns == Patterns(["example.com"])
120120
assert pones.instead_of == PONestedModuleOverridenSecondary
121121

122+
def test_registry_data_from():
123+
data = default_registry.data_from("tests.po_lib.nested_package")
124+
125+
assert len(data) == 2
126+
assert PONestedModule in data
127+
assert PONestedPkg in data
122128

123129
def test_cmd():
124130
from web_poet.__main__ import main

web_poet/overrides.py

Lines changed: 22 additions & 6 deletions
@@ -10,7 +10,7 @@
1010
from url_matcher import Patterns
1111

1212

13-
@dataclass(frozen=True)
13+
@dataclass
1414
class OverrideRule:
1515
"""A single override rule that specifies when a page object should be used
1616
instead of another."""
@@ -97,7 +97,7 @@ class ExampleComProductPage(ItemPage):
9797
"""
9898

9999
def __init__(self):
100-
self.data: Dict[Callable, OverrideRule] = {}
100+
self._data: Dict[Callable, OverrideRule] = {}
101101

102102
def handle_urls(
103103
self,
@@ -140,8 +140,8 @@ def wrapper(cls):
140140
meta=kwargs,
141141
)
142142
# If it was already defined, we don't want to override it
143-
if cls not in self.data:
144-
self.data[cls] = rule
143+
if cls not in self._data:
144+
self._data[cls] = rule
145145
else:
146146
warnings.warn(
147147
f"Multiple @handle_urls annotations with the same 'overrides' "
@@ -165,7 +165,7 @@ def get_overrides(self) -> List[OverrideRule]:
165165
This enables the :meth:`~.PageObjectRegistry.handle_urls` that annotates
166166
the external Page Objects to be properly loaded.
167167
"""
168-
return list(self.data.values())
168+
return list(self._data.values())
169169

170170
def get_overrides_from(self, *pkgs_or_modules: str) -> List[OverrideRule]:
171171
"""Returns the override rules that were declared using ``@handle_urls``
@@ -186,7 +186,7 @@ def get_overrides_from(self, *pkgs_or_modules: str) -> List[OverrideRule]:
186186
def _filter_from_module(self, module: str) -> Dict[Callable, OverrideRule]:
187187
return {
188188
cls: rule
189-
for cls, rule in self.data.items()
189+
for cls, rule in self._data.items()
190190

191191
# A "." is added at the end to prevent incorrect matching on cases
192192
# where package names are substrings of one another. For example,
@@ -197,6 +197,22 @@ def _filter_from_module(self, module: str) -> Dict[Callable, OverrideRule]:
197197
if cls.__module__.startswith(module + ".") or cls.__module__ == module
198198
}
199199

200+
@property
201+
def data(self) -> Dict[Callable, OverrideRule]:
202+
return self._data # pragma: no cover
203+
204+
@data.setter
205+
def data(self, value: Dict[Callable, OverrideRule]) -> None:
206+
self._data = value # pragma: no cover
207+
208+
def data_from(self, *pkgs_or_modules: str) -> Dict[Callable, OverrideRule]:
209+
"""Return ``data`` values that are filtered by package/module."""
210+
211+
results = {}
212+
for item in pkgs_or_modules:
213+
results.update(self._filter_from_module(item))
214+
return results
215+
200216

201217
# For ease of use, we'll create a default registry so that users can simply
202218
# use its `handle_urls()` method directly by `from web_poet import handle_urls`
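
The module filter that ``data_from`` relies on can be tried out in isolation. The sketch below re-implements the same condition shown in the diff (exact match, or prefix match with a trailing ``"."``) as a free function. ``filter_by_module`` and the ``shop`` / ``shop_extra`` package names are hypothetical, and plain strings stand in for the rule values.

```python
from typing import Any, Dict


def filter_by_module(data: Dict[type, Any], *pkgs_or_modules: str) -> Dict[type, Any]:
    """Keep entries whose key class lives in one of the given packages/modules."""
    results: Dict[type, Any] = {}
    for name in pkgs_or_modules:
        results.update(
            {
                cls: rule
                for cls, rule in data.items()
                # The trailing "." stops "shop" from matching "shop_extra".
                if cls.__module__ == name or cls.__module__.startswith(name + ".")
            }
        )
    return results


def make_page(name: str, module: str) -> type:
    """Create an empty class that reports the given module name."""
    return type(name, (), {"__module__": module})


site_1 = make_page("Site1", "shop.site_1")
site_2 = make_page("Site2", "shop.nested.site_2")
other = make_page("Other", "shop_extra.site_1")

data = {site_1: "rule 1", site_2: "rule 2", other: "rule 3"}

from_shop = filter_by_module(data, "shop")
from_both = filter_by_module(data, "shop.site_1", "shop_extra")
```

Here ``from_shop`` holds ``site_1`` and ``site_2`` but not ``other``, even though ``"shop_extra"`` starts with ``"shop"``.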
