@@ -22,7 +22,7 @@ to integrate `asyncio`-based projects such as `Playwright`.
 
 ### Minimum required versions
 
-* Python >= 3.8
+* Python >= 3.9
 * Scrapy >= 2.0 (!= 2.4.0)
 * Playwright >= 1.15
 
@@ -76,15 +76,14 @@ requests will be processed by the regular Scrapy download handler.
 
 ### Twisted reactor
 
-When running on GNU/Linux or macOS you'll need to
-[install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):
+[Install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):
 
 ```python
 # settings.py
 TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
 ```
 
-This is not a requirement on Windows (see [Windows support](#windows-support))
+This is the default in new projects since [Scrapy 2.7](https://github.com/scrapy/scrapy/releases/tag/2.7.0).
 
 
 ## Basic usage
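As context for the hunk above: the reactor setting normally sits next to the download-handler configuration in the same `settings.py`. A minimal sketch, assuming the handler path given in the project's activation instructions (`scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler`):

```python
# settings.py -- minimal scrapy-playwright setup sketch.
# The DOWNLOAD_HANDLERS paths are the ones documented by the project;
# the reactor line matches the hunk above.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```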
@@ -282,21 +281,24 @@ PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000  # 10 seconds
 ### `PLAYWRIGHT_PROCESS_REQUEST_HEADERS`
 Type `Optional[Union[Callable, str]]`, default `scrapy_playwright.headers.use_scrapy_headers`
 
-A function (or the path to a function) that processes headers for a given request
-and returns a dictionary with the headers to be used (note that, depending on the browser,
-additional default headers could be sent as well). Coroutine functions (`async def`) are
-supported.
+A function (or the path to a function) that processes a Playwright request and returns a
+dictionary with headers to be overridden (note that, depending on the browser, additional
+default headers could be sent as well). Coroutine functions (`async def`) are supported.
 
-This will be called at least once for each Scrapy request (receiving said request and the
-corresponding Playwright request), but it could be called additional times if the given
-resource generates more requests (e.g. to retrieve assets like images or scripts).
+This will be called at least once for each Scrapy request, but it could be called additional times
+if Playwright generates more requests (e.g. to retrieve assets like images or scripts).
 
-The function must return a `dict` object, and receives the following positional arguments:
+The function must return a `Dict[str, str]` object, and receives the following three **keyword** arguments:
 
 ```python
-- browser_type: str
+- browser_type_name: str
 - playwright_request: playwright.async_api.Request
-- scrapy_headers: scrapy.http.headers.Headers
+- scrapy_request_data: dict
+    * method: str
+    * url: str
+    * headers: scrapy.http.headers.Headers
+    * body: Optional[bytes]
+    * encoding: str
 ```
 
 The default function (`scrapy_playwright.headers.use_scrapy_headers`) tries to
@@ -311,6 +313,38 @@ set by Playwright will be sent. Keep in mind that in this case, headers passed
 via the `Request.headers` attribute or set by Scrapy components are ignored
 (including cookies set via the `Request.cookies` attribute).
 
+Example:
+```python
+async def custom_headers(
+    *,
+    browser_type_name: str,
+    playwright_request: playwright.async_api.Request,
+    scrapy_request_data: dict,
+) -> Dict[str, str]:
+    headers = await playwright_request.all_headers()
+    scrapy_headers = scrapy_request_data["headers"].to_unicode_dict()
+    headers["Cookie"] = scrapy_headers.get("Cookie")
+    return headers
+
+PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
+```
+
+#### Deprecated argument handling
+
+In version 0.0.40 and earlier, arguments were passed to the function positionally,
+and only the Scrapy headers were passed instead of a dictionary with data about the
+Scrapy request.
+This is deprecated since version 0.0.41, and support for this way of handling arguments
+will eventually be removed in accordance with the [Deprecation policy](#deprecation-policy).
+
+Passed arguments:
+```python
+- browser_type: str
+- playwright_request: playwright.async_api.Request
+- scrapy_headers: scrapy.http.headers.Headers
+```
+
+Example:
 ```python
 def custom_headers(
     browser_type: str,
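To complement the hunk above, here is a minimal sketch of a processing function written against the new keyword-only convention. The function name `forward_user_agent` and its override logic (forwarding only the User-Agent from the Scrapy headers) are illustrative assumptions, not part of the library:

```python
from typing import Dict


async def forward_user_agent(
    *,
    browser_type_name: str,
    playwright_request,  # playwright.async_api.Request in real use
    scrapy_request_data: dict,
) -> Dict[str, str]:
    # Override only the User-Agent, taken from the Scrapy request headers;
    # all other headers are left to the browser's defaults.
    overrides: Dict[str, str] = {}
    ua = scrapy_request_data["headers"].get("User-Agent")
    if ua is not None:
        overrides["User-Agent"] = ua.decode("utf-8") if isinstance(ua, bytes) else ua
    return overrides


# settings.py
# PLAYWRIGHT_PROCESS_REQUEST_HEADERS = forward_user_agent
```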
@@ -827,10 +861,12 @@ down or clicking links) and you want to handle only the final result in your cal
 
 ### `PageMethod` class
 
-#### `scrapy_playwright.page.PageMethod(method: str, *args, **kwargs)`:
+#### `scrapy_playwright.page.PageMethod(method: str | callable, *args, **kwargs)`:
 
 Represents a method to be called (and awaited if necessary) on a
 `playwright.page.Page` object (e.g. "click", "screenshot", "evaluate", etc).
+It's also possible to pass a callable object, which will be invoked as a callback
+and receive the Playwright `Page` as its argument.
 `method` is the name of the method, `*args` and `**kwargs`
 are passed when calling such method. The return value
 will be stored in the `PageMethod.result` attribute.
@@ -868,8 +904,34 @@ async def parse(self, response, **kwargs):
868
904
await page.close()
869
905
```
870
906
907
+ ### Passing callable objects
908
+
909
+ If a ` PageMethod ` receives a callable object as its first argument, it will be
910
+ called with the page as its first argument. Any additional arguments are passed
911
+ to the callable after the page.
912
+
913
+ ``` python
914
+ async def scroll_page (page : Page) -> str :
915
+ await page.wait_for_selector(selector = " div.quote" )
916
+ await page.evaluate(" window.scrollBy(0, document.body.scrollHeight)" )
917
+ await page.wait_for_selector(selector = " div.quote:nth-child(11)" )
918
+ return page.url
919
+
920
+
921
+ class MySpyder (scrapy .Spider ):
922
+ name = " scroll"
923
+
924
+ def start_requests (self ):
925
+ yield Request(
926
+ url = " https://quotes.toscrape.com/scroll" ,
927
+ meta = {
928
+ " playwright" : True ,
929
+ " playwright_page_methods" : [PageMethod(scroll_page)],
930
+ },
931
+ )
932
+ ```
871
933
872
- ### Supported methods
934
+ ### Supported Playwright methods
873
935
874
936
Refer to the [ upstream docs for the ` Page ` class] ( https://playwright.dev/python/docs/api/class-page )
875
937
to see available methods.
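The callable form added above also accepts extra arguments after the page. A small sketch of that pattern; the helper name `scroll_until` and the quote selector are assumptions for this example, not library API:

```python
# A callable PageMethod taking one extra argument after the page:
# PageMethod(scroll_until, 11) would call scroll_until(page, 11).
async def scroll_until(page, min_items: int) -> str:
    # Scroll down, then wait until at least `min_items` quotes are present.
    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
    await page.wait_for_selector(f"div.quote:nth-child({min_items})")
    return page.url


# In the request meta, extra arguments follow the callable:
# "playwright_page_methods": [PageMethod(scroll_until, 11)]
```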