
Commit 8b15232
Merge remote-tracking branch 'origin/main' into close-inactive-contexts
2 parents: 891c95b + b387b42

20 files changed: +534 -129 lines

.bumpversion.cfg
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 0.0.39
+current_version = 0.0.44
 commit = True
 tag = True

.github/workflows/checks.yml
Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 name: Checks
-on: [push, pull_request]
+on: [push, pull_request, workflow_dispatch]
 
 jobs:
   checks:

.github/workflows/tests.yml
Lines changed: 5 additions & 5 deletions

@@ -1,5 +1,5 @@
 name: Tests
-on: [push, pull_request]
+on: [push, pull_request, workflow_dispatch]
 
 jobs:
   tests:
@@ -9,12 +9,12 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-latest]
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+        os: [ubuntu-22.04]
+        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
         include:
-          - os: macos-latest
+          - os: macos-14
             python-version: "3.12"
-          - os: windows-latest
+          - os: windows-2022
             python-version: "3.12"
 
     steps:

.gitignore
Lines changed: 2 additions & 0 deletions

@@ -22,3 +22,5 @@ coverage-twisted/
 node_modules/
 package-lock.json
 package.json
+
+.idea

README.md
Lines changed: 78 additions & 16 deletions

@@ -22,7 +22,7 @@ to integrate `asyncio`-based projects such as `Playwright`.
 
 ### Minimum required versions
 
-* Python >= 3.8
+* Python >= 3.9
 * Scrapy >= 2.0 (!= 2.4.0)
 * Playwright >= 1.15
 
@@ -76,15 +76,14 @@ requests will be processed by the regular Scrapy download handler.
 
 ### Twisted reactor
 
-When running on GNU/Linux or macOS you'll need to
-[install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):
+[Install the `asyncio`-based Twisted reactor](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor):
 
 ```python
 # settings.py
 TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
 ```
 
-This is not a requirement on Windows (see [Windows support](#windows-support))
+This is the default in new projects since [Scrapy 2.7](https://github.com/scrapy/scrapy/releases/tag/2.7.0).
 
 
 ## Basic usage
@@ -282,21 +281,24 @@ PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000  # 10 seconds
 ### `PLAYWRIGHT_PROCESS_REQUEST_HEADERS`
 Type `Optional[Union[Callable, str]]`, default `scrapy_playwright.headers.use_scrapy_headers`
 
-A function (or the path to a function) that processes headers for a given request
-and returns a dictionary with the headers to be used (note that, depending on the browser,
-additional default headers could be sent as well). Coroutine functions (`async def`) are
-supported.
+A function (or the path to a function) that processes a Playwright request and returns a
+dictionary with headers to be overridden (note that, depending on the browser, additional
+default headers could be sent as well). Coroutine functions (`async def`) are supported.
 
-This will be called at least once for each Scrapy request (receiving said request and the
-corresponding Playwright request), but it could be called additional times if the given
-resource generates more requests (e.g. to retrieve assets like images or scripts).
+This will be called at least once for each Scrapy request, but it could be called additional times
+if Playwright generates more requests (e.g. to retrieve assets like images or scripts).
 
-The function must return a `dict` object, and receives the following positional arguments:
+The function must return a `Dict[str, str]` object, and receives the following three **keyword** arguments:
 
 ```python
-- browser_type: str
+- browser_type_name: str
 - playwright_request: playwright.async_api.Request
-- scrapy_headers: scrapy.http.headers.Headers
+- scrapy_request_data: dict
+    * method: str
+    * url: str
+    * headers: scrapy.http.headers.Headers
+    * body: Optional[bytes]
+    * encoding: str
 ```
 
 The default function (`scrapy_playwright.headers.use_scrapy_headers`) tries to
@@ -311,6 +313,38 @@ set by Playwright will be sent. Keep in mind that in this case, headers passed
 via the `Request.headers` attribute or set by Scrapy components are ignored
 (including cookies set via the `Request.cookies` attribute).
 
+Example:
+```python
+async def custom_headers(
+    *,
+    browser_type_name: str,
+    playwright_request: playwright.async_api.Request,
+    scrapy_request_data: dict,
+) -> Dict[str, str]:
+    headers = await playwright_request.all_headers()
+    scrapy_headers = scrapy_request_data["headers"].to_unicode_dict()
+    headers["Cookie"] = scrapy_headers.get("Cookie")
+    return headers
+
+PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
+```
+
+#### Deprecated argument handling
+
+In version 0.0.40 and earlier, arguments were passed to the function positionally,
+and only the Scrapy headers were passed instead of a dictionary with data about the
+Scrapy request.
+This is deprecated since version 0.0.41, and support for this way of handling arguments
+will eventually be removed in accordance with the [Deprecation policy](#deprecation-policy).
+
+Passed arguments:
+```python
+- browser_type: str
+- playwright_request: playwright.async_api.Request
+- scrapy_headers: scrapy.http.headers.Headers
+```
+
+Example:
 ```python
 def custom_headers(
     browser_type: str,
@@ -827,10 +861,12 @@ down or clicking links) and you want to handle only the final result in your cal
 
 ### `PageMethod` class
 
-#### `scrapy_playwright.page.PageMethod(method: str, *args, **kwargs)`:
+#### `scrapy_playwright.page.PageMethod(method: str | callable, *args, **kwargs)`:
 
 Represents a method to be called (and awaited if necessary) on a
 `playwright.page.Page` object (e.g. "click", "screenshot", "evaluate", etc).
+It's also possible to pass callable objects that will be invoked as callbacks
+and receive Playwright Page as argument.
 `method` is the name of the method, `*args` and `**kwargs`
 are passed when calling such method. The return value
 will be stored in the `PageMethod.result` attribute.
@@ -868,8 +904,34 @@ async def parse(self, response, **kwargs):
     await page.close()
 ```
 
+### Passing callable objects
+
+If a `PageMethod` receives a callable object as its first argument, it will be
+called with the page as its first argument. Any additional arguments are passed
+to the callable after the page.
+
+```python
+async def scroll_page(page: Page) -> str:
+    await page.wait_for_selector(selector="div.quote")
+    await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
+    await page.wait_for_selector(selector="div.quote:nth-child(11)")
+    return page.url
+
+
+class MySpyder(scrapy.Spider):
+    name = "scroll"
+
+    def start_requests(self):
+        yield Request(
+            url="https://quotes.toscrape.com/scroll",
+            meta={
+                "playwright": True,
+                "playwright_page_methods": [PageMethod(scroll_page)],
+            },
+        )
+```
 
-### Supported methods
+### Supported Playwright methods
 
 Refer to the [upstream docs for the `Page` class](https://playwright.dev/python/docs/api/class-page)
 to see available methods.
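
Note: the new README text above also says that any additional arguments given to `PageMethod` are forwarded to the callable after the page. A minimal sketch of that form (not part of this commit; `click_and_count` and the selector are hypothetical):

```python
from playwright.async_api import Page
from scrapy_playwright.page import PageMethod

async def click_and_count(page: Page, selector: str) -> int:
    """Click every element matching `selector` and return how many were clicked."""
    elements = await page.query_selector_all(selector)
    for element in elements:
        await element.click()
    return len(elements)

# Equivalent to awaiting click_and_count(page, "a.pager-next"); the return
# value is stored in PageMethod.result after the page methods are applied.
pm = PageMethod(click_and_count, "a.pager-next")
```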

docs/changelog.md
Lines changed: 32 additions & 0 deletions

@@ -1,5 +1,37 @@
 # scrapy-playwright changelog
 
+
+### [v0.0.44](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.44) (2025-08-13)
+
+* Fix crawl getting stuck on Windows with Scrapy>=2.13 (#351)
+
+
+### [v0.0.43](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.43) (2025-02-22)
+
+* Only register request and response loggers when needed (#336)
+
+
+### [v0.0.42](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.42) (2024-11-06)
+
+* Allow custom PageMethod callbacks (#318)
+* Fix download errors caused by Content-Encoding header (#322)
+
+
+### [v0.0.41](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.41) (2024-08-13)
+
+* Keyword arguments for PLAYWRIGHT_PROCESS_REQUEST_HEADERS, pass additional Request data (#303).
+  Deprecated positional argument handling for the function passed to the PLAYWRIGHT_PROCESS_REQUEST_HEADERS
+  setting, arguments should now be handled by keyword.
+* Retry to create page on browser crash (#305)
+* Fix typo in log message (#312)
+
+
+### [v0.0.40](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.40) (2024-07-16)
+
+* Enforce asyncio reactor in all platforms (#298)
+* Allow multiple handlers in separate thread (#299)
+
+
 ### [v0.0.39](https://github.com/scrapy-plugins/scrapy-playwright/releases/tag/v0.0.39) (2024-07-11)
 
 * Return proper status and headers for downloads (#293)
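
Note: the v0.0.41 entry is the changelog counterpart of the README's "Deprecated argument handling" section above. A hedged before/after sketch of adapting a user-defined function to the keyword style (function names are hypothetical; the argument names come from the README diff):

```python
# Positional style, deprecated since v0.0.41:
def process_headers_old(browser_type, playwright_request, scrapy_headers):
    return dict(scrapy_headers.to_unicode_dict())

# Keyword style, v0.0.41 and later: scrapy_request_data carries the Scrapy
# request's method, url, headers, body and encoding instead of bare headers.
def process_headers_new(*, browser_type_name, playwright_request, scrapy_request_data):
    return dict(scrapy_request_data["headers"].to_unicode_dict())
```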

pylintrc
Lines changed: 3 additions & 0 deletions

@@ -20,3 +20,6 @@ disable=
 [FORMAT]
 expected-line-ending-format=LF
 max-line-length=99
+
+[MASTER]
+ignored-classes=twisted.internet.reactor
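
Note: the new `[MASTER]` entry is presumably tied to the `from twisted.internet import reactor` import added in `scrapy_playwright/_utils.py` below; the reactor is installed dynamically at runtime, so pylint cannot statically infer members such as `callFromThread`, and without the exemption calls like the following sketch would be flagged as `no-member`:

```python
# Illustrative only (not from this commit): pylint cannot see members of the
# dynamically-installed reactor object, hence ignored-classes above.
from twisted.internet import reactor

def greet() -> None:
    print("called on the reactor thread")
    reactor.stop()

reactor.callWhenRunning(greet)
reactor.run()  # blocks until greet() stops the reactor
```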

scrapy_playwright/__init__.py
Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-__version__ = "0.0.39"
+__version__ = "0.0.44"

scrapy_playwright/_utils.py
Lines changed: 50 additions & 56 deletions

@@ -2,14 +2,15 @@
 import logging
 import platform
 import threading
-from typing import Awaitable, Iterator, Optional, Tuple, Union
+from typing import Awaitable, Dict, Iterator, Optional, Tuple, Union
 
 import scrapy
 from playwright.async_api import Error, Page, Request, Response
 from scrapy.http.headers import Headers
 from scrapy.settings import Settings
 from scrapy.utils.python import to_unicode
 from twisted.internet.defer import Deferred
+from twisted.python import failure
 from w3lib.encoding import html_body_declared_encoding, http_content_type_encoding
 
 
@@ -103,68 +104,61 @@ async def _get_header_value(
         return None
 
 
-if platform.system() == "Windows":
-
-    class _ThreadedLoopAdapter:
-        """Utility class to start an asyncio event loop in a new thread and redirect coroutines.
-        This allows to run Playwright in a different loop than the Scrapy crawler, allowing to
-        use ProactorEventLoop which is supported by Playwright on Windows.
-        """
-
-        _loop: asyncio.AbstractEventLoop
-        _thread: threading.Thread
-        _coro_queue: asyncio.Queue = asyncio.Queue()
-        _stop_event: asyncio.Event = asyncio.Event()
-
-        @classmethod
-        async def _handle_coro(cls, coro, future) -> None:
-            try:
-                future.set_result(await coro)
-            except Exception as exc:
-                future.set_exception(exc)
-
-        @classmethod
-        async def _process_queue(cls) -> None:
-            while not cls._stop_event.is_set():
-                coro, future = await cls._coro_queue.get()
-                asyncio.create_task(cls._handle_coro(coro, future))
-                cls._coro_queue.task_done()
-
-        @classmethod
-        def _deferred_from_coro(cls, coro) -> Deferred:
-            future: asyncio.Future = asyncio.Future()
-            asyncio.run_coroutine_threadsafe(cls._coro_queue.put((coro, future)), cls._loop)
-            return scrapy.utils.defer.deferred_from_coro(future)
-
-        @classmethod
-        def start(cls) -> None:
-            policy = asyncio.WindowsProactorEventLoopPolicy()  # type: ignore[attr-defined]
+class _ThreadedLoopAdapter:
+    """Utility class to start an asyncio event loop in a new thread and redirect coroutines.
+    This allows to run Playwright in a different loop than the Scrapy crawler, allowing to
+    use ProactorEventLoop which is supported by Playwright on Windows.
+    """
+
+    _loop: asyncio.AbstractEventLoop
+    _thread: threading.Thread
+    _coro_queue: asyncio.Queue = asyncio.Queue()
+    _stop_events: Dict[int, asyncio.Event] = {}
+
+    @classmethod
+    async def _handle_coro(cls, coro: Awaitable, dfd: Deferred) -> None:
+        from twisted.internet import reactor
+
+        try:
+            result = await coro
+        except Exception as exc:
+            reactor.callFromThread(dfd.errback, failure.Failure(exc))
+        else:
+            reactor.callFromThread(dfd.callback, result)
+
+    @classmethod
+    async def _process_queue(cls) -> None:
+        while any(not ev.is_set() for ev in cls._stop_events.values()):
+            coro, dfd = await cls._coro_queue.get()
+            asyncio.create_task(cls._handle_coro(coro, dfd))
+            cls._coro_queue.task_done()
+
+    @classmethod
+    def _deferred_from_coro(cls, coro) -> Deferred:
+        dfd: Deferred = Deferred()
+        asyncio.run_coroutine_threadsafe(cls._coro_queue.put((coro, dfd)), cls._loop)
+        return dfd
+
+    @classmethod
+    def start(cls, caller_id: int) -> None:
+        cls._stop_events[caller_id] = asyncio.Event()
+        if not getattr(cls, "_loop", None):
+            policy = asyncio.DefaultEventLoopPolicy()
+            if platform.system() == "Windows":
+                policy = asyncio.WindowsProactorEventLoopPolicy()  # type: ignore[attr-defined]
             cls._loop = policy.new_event_loop()
-            asyncio.set_event_loop(cls._loop)
 
+        if not getattr(cls, "_thread", None):
             cls._thread = threading.Thread(target=cls._loop.run_forever, daemon=True)
             cls._thread.start()
             logger.info("Started loop on separate thread: %s", cls._loop)
-
             asyncio.run_coroutine_threadsafe(cls._process_queue(), cls._loop)
 
-    @classmethod
-    def stop(cls) -> None:
-        cls._stop_event.set()
+    @classmethod
+    def stop(cls, caller_id: int) -> None:
+        """Wait until all handlers are closed to stop the event loop and join the thread."""
+        cls._stop_events[caller_id].set()
+        if all(ev.is_set() for ev in cls._stop_events.values()):
            asyncio.run_coroutine_threadsafe(cls._coro_queue.join(), cls._loop)
            cls._loop.call_soon_threadsafe(cls._loop.stop)
            cls._thread.join()
-
-    _deferred_from_coro = _ThreadedLoopAdapter._deferred_from_coro
-else:
-
-    class _ThreadedLoopAdapter:  # type: ignore[no-redef]
-        @classmethod
-        def start(cls) -> None:
-            pass
-
-        @classmethod
-        def stop(cls) -> None:
-            pass
-
-    _deferred_from_coro = scrapy.utils.defer.deferred_from_coro