@@ -22,7 +22,7 @@ to integrate `asyncio`-based projects such as `Playwright`.

 ### Minimum required versions

-* Python >= 3.8
+* Python >= 3.9
 * Scrapy >= 2.0 (!= 2.4.0)
 * Playwright >= 1.15

@@ -76,15 +76,14 @@ requests will be processed by the regular Scrapy download handler.

 ### Twisted reactor

-When running on GNU/Linux or macOS you'll need to
-[install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):
+[Install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):

 ```python
 # settings.py
 TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
 ```

-This is not a requirement on Windows (see [Windows support](#windows-support))
+This is the default in new projects since [Scrapy 2.7](https://github.com/scrapy/scrapy/releases/tag/2.7.0).


 ## Basic usage
@@ -282,21 +281,24 @@ PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000 # 10 seconds
 ### `PLAYWRIGHT_PROCESS_REQUEST_HEADERS`
 Type `Optional[Union[Callable, str]]`, default `scrapy_playwright.headers.use_scrapy_headers`

-A function (or the path to a function) that processes headers for a given request
-and returns a dictionary with the headers to be used (note that, depending on the browser,
-additional default headers could be sent as well). Coroutine functions (`async def`) are
-supported.
+A function (or the path to a function) that processes a Playwright request and returns a
+dictionary with headers to be overridden (note that, depending on the browser, additional
+default headers could be sent as well). Coroutine functions (`async def`) are supported.

-This will be called at least once for each Scrapy request (receiving said request and the
-corresponding Playwright request), but it could be called additional times if the given
-resource generates more requests (e.g. to retrieve assets like images or scripts).
+This will be called at least once for each Scrapy request, but it could be called additional times
+if Playwright generates more requests (e.g. to retrieve assets like images or scripts).

-The function must return a `dict` object, and receives the following positional arguments:
+The function must return a `Dict[str, str]` object, and receives the following three **keyword** arguments:

 ```python
-- browser_type: str
+- browser_type_name: str
 - playwright_request: playwright.async_api.Request
-- scrapy_headers: scrapy.http.headers.Headers
+- scrapy_request_data: dict
+    * method: str
+    * url: str
+    * headers: scrapy.http.headers.Headers
+    * body: Optional[bytes]
+    * encoding: str
 ```

 The default function (`scrapy_playwright.headers.use_scrapy_headers`) tries to
@@ -311,6 +313,38 @@ set by Playwright will be sent. Keep in mind that in this case, headers passed
 via the `Request.headers` attribute or set by Scrapy components are ignored
 (including cookies set via the `Request.cookies` attribute).

+Example:
+```python
+async def custom_headers(
+    *,
+    browser_type_name: str,
+    playwright_request: playwright.async_api.Request,
+    scrapy_request_data: dict,
+) -> Dict[str, str]:
+    headers = await playwright_request.all_headers()
+    scrapy_headers = scrapy_request_data["headers"].to_unicode_dict()
+    headers["Cookie"] = scrapy_headers.get("Cookie")
+    return headers
+
+PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
+```
+
+#### Deprecated argument handling
+
+In version 0.0.40 and earlier, arguments were passed to the function positionally,
+and only the Scrapy headers were passed instead of a dictionary with data about the
+Scrapy request.
+This is deprecated since version 0.0.41, and support for this way of handling arguments
+will eventually be removed in accordance with the [Deprecation policy](#deprecation-policy).
+
+Passed arguments:
+```python
+- browser_type: str
+- playwright_request: playwright.async_api.Request
+- scrapy_headers: scrapy.http.headers.Headers
+```
+
+Example:
 ```python
 def custom_headers(
     browser_type: str,
@@ -827,10 +861,12 @@ down or clicking links) and you want to handle only the final result in your cal

 ### `PageMethod` class

-#### `scrapy_playwright.page.PageMethod(method: str, *args, **kwargs)`:
+#### `scrapy_playwright.page.PageMethod(method: str | callable, *args, **kwargs)`:

 Represents a method to be called (and awaited if necessary) on a
 `playwright.page.Page` object (e.g. "click", "screenshot", "evaluate", etc).
+It's also possible to pass callable objects that will be invoked as callbacks
+and receive Playwright Page as argument.
 `method` is the name of the method, `*args` and `**kwargs`
 are passed when calling such method. The return value
 will be stored in the `PageMethod.result` attribute.
@@ -868,8 +904,34 @@ async def parse(self, response, **kwargs):
         await page.close()
 ```

+### Passing callable objects
+
+If a `PageMethod` receives a callable object as its first argument, it will be
+called with the page as its first argument. Any additional arguments are passed
+to the callable after the page.
+
+```python
+async def scroll_page(page: Page) -> str:
+    await page.wait_for_selector(selector="div.quote")
+    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
+    await page.wait_for_selector(selector="div.quote:nth-child(11)")
+    return page.url
+
+
+class MySpyder(scrapy.Spider):
+    name = "scroll"
+
+    def start_requests(self):
+        yield Request(
+            url="https://quotes.toscrape.com/scroll",
+            meta={
+                "playwright": True,
+                "playwright_page_methods": [PageMethod(scroll_page)],
+            },
+        )
+```

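
Editorial sketch (not part of the commit): the calling convention described above, page first and any additional arguments after it, can be illustrated without a browser. Everything named here is hypothetical: `FakePage` is a stand-in that only records scripts, `scroll_by` is an example callable, and `call_with_page` merely mimics the described behavior. It is not scrapy-playwright's implementation.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


class FakePage:
    """Hypothetical stand-in for a Playwright page: it only records scripts."""

    def __init__(self) -> None:
        self.scripts: List[str] = []

    async def evaluate(self, expression: str) -> None:
        self.scripts.append(expression)


async def scroll_by(page: Any, pixels: int, times: int = 1) -> int:
    """Example callable: the page arrives first, extra arguments follow."""
    for _ in range(times):
        await page.evaluate(f"window.scrollBy(0, {pixels})")
    return pixels * times


async def call_with_page(
    func: Callable[..., Awaitable[Any]], page: Any, *args: Any, **kwargs: Any
) -> Any:
    """Illustrative dispatcher (an assumption, not the library's code)."""
    return await func(page, *args, **kwargs)


page = FakePage()
result = asyncio.run(call_with_page(scroll_by, page, 500, times=3))
```

With the real library, the corresponding request meta would presumably contain something like `PageMethod(scroll_by, 500)`, with the return value available afterwards in `PageMethod.result`.
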
-### Supported methods
+### Supported Playwright methods

 Refer to the [upstream docs for the `Page` class](https://playwright.dev/python/docs/api/class-page)
 to see available methods.