added Playwright support to fetch_html #50

tulu-g559 · 2025-10-02T07:13:09Z

Changes made for Issue #29

Rewrote fetch_html to support both httpx (fast path for static sites) and Playwright (fallback for CSR/JS-heavy sites).
Added async_playwright helper that launches Chromium in headless mode and extracts the rendered DOM.
Updated clean_html to avoid shadowed variables and handle empty/invalid HTML safely.
Increased the timeout for Playwright requests (TIMEOUT_SEC from 2 to 10 seconds).
Preserved existing cleaning pipeline (lxml.Cleaner) to keep output consistent.

@yurijmikhalevich, I need suggestions.
Should I use a hybrid approach, like this?

async def fetch_html(url):
    try:
        async with httpx.AsyncClient() as client:
            ...  # fast path: fetch with httpx and return the HTML
    except Exception:
        pass  # ignore and fall back to Playwright

    return await fetch_html_playwright(url)

tulu-g559 · 2025-10-03T11:44:51Z

@yurijmikhalevich, any updates?

yurijmikhalevich · 2025-10-07T11:39:39Z

@tulu-g559, thank you for submitting this PR!

I think it's fine to just always use Playwright. How would you go about detecting whether the website is static or not? It's easier to use Playwright every time, and this keeps the code simpler.

yurijmikhalevich · 2025-10-07T11:41:46Z

minerva/tools/fetch_html.py

Can you please share a screenshot or a video of this solution working?

Also, can you share how much RAM calling this function consumes?

tulu-g559 · Oct 12, 2025

A single fetch_html call typically uses 250–600 MB of RAM, depending on the page; most of that comes from Playwright (Chromium).

yurijmikhalevich · Oct 12, 2025

@tulu-g559, that's not a small amount.

Should we ensure we only have 1 Chromium instance running, and, if there are multiple parallel calls to fetch_html, queue them, so they reuse that single instance?

If we do this, we can keep the RAM usage under control.
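
One way to read the single-instance idea above, as a minimal sketch rather than this PR's actual code: a small wrapper launches Chromium lazily on first use and passes every call through a semaphore, so parallel fetch_html calls wait their turn and share one browser. The names `SharedBrowser`, `launch_chromium`, and `max_pages` are hypothetical; the Playwright calls used (`async_playwright().start()`, `chromium.launch`, `new_page`, `goto`, `content`, `close`) are the library's async API.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def launch_chromium() -> Any:
    """Start Playwright and launch one headless Chromium (imported lazily)."""
    from playwright.async_api import async_playwright

    playwright = await async_playwright().start()
    return await playwright.chromium.launch(headless=True)


class SharedBrowser:
    """Hypothetical helper: one browser instance, page loads queued behind a semaphore."""

    def __init__(
        self,
        launch: Callable[[], Awaitable[Any]] = launch_chromium,
        max_pages: int = 1,
    ) -> None:
        self._launch = launch
        self._browser: Any = None
        self._start_lock = asyncio.Lock()
        self._slots = asyncio.Semaphore(max_pages)

    async def _get_browser(self) -> Any:
        # The lock ensures concurrent first calls launch the browser only once.
        async with self._start_lock:
            if self._browser is None:
                self._browser = await self._launch()
            return self._browser

    async def fetch_html(self, url: str) -> str:
        # Callers beyond max_pages wait here until a slot frees up.
        async with self._slots:
            browser = await self._get_browser()
            page = await browser.new_page()
            try:
                await page.goto(url)
                return await page.content()
            finally:
                await page.close()
```

With `max_pages=1` the calls are fully serialized; raising it would trade some RAM for throughput while still reusing the same Chromium process. A real version would also need to close the browser and stop Playwright on shutdown, which this sketch leaves out.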

yurijmikhalevich · 2025-10-07T11:43:35Z

minerva/tools/fetch_html.py

+            "User-Agent": "Minerva AI Bot - (https://github.com/move-fast-and-break-things/minerva)"
+        })
+
+        await page.goto(url, wait_until="networkidle")  # wait until JS is mostly done


Waiting for networkidle may never resolve because some websites do continuous polling in the background.

Docs:

Maybe a regular page.goto will be enough. Usually, it resolves when the page is loaded.
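
To illustrate the suggestion above, here is a minimal sketch of the navigation step that waits for the `load` event (Playwright's default for `page.goto`) instead of `networkidle`, with an explicit timeout so a slow page fails rather than hangs. The helper name `load_page_html` and its `timeout_sec` parameter are hypothetical, and `page` is assumed to be a Playwright `Page` created elsewhere. Note that Playwright's `timeout` argument is in milliseconds.

```python
async def load_page_html(page, url: str, timeout_sec: float = 10) -> str:
    """Navigate and return the rendered DOM without waiting for network idle."""
    # "load" resolves once the page's load event fires, so background polling
    # cannot keep the call pending the way "networkidle" can.
    await page.goto(url, wait_until="load", timeout=timeout_sec * 1000)
    return await page.content()
```

If some pages render their content only after the load event, a follow-up wait on a specific selector may be a safer addition than going back to `networkidle`.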

yurijmikhalevich

Thank you! This is a great first stab at the problem 🙌 Please review my comments and let me know if I can help!

yurijmikhalevich · 2025-10-12T10:03:09Z

minerva/tools/fetch_html.py

+# TIMEOUT_SEC = 2
+TIMEOUT_SEC = 10  # give more time for JS sites


nit: let's remove the old code here, too

Suggested change

-# TIMEOUT_SEC = 2
-TIMEOUT_SEC = 10  # give more time for JS sites
+TIMEOUT_SEC = 10

yurijmikhalevich · 2025-11-09T23:49:42Z

@tulu-g559, I am closing this PR as stale. Please reopen it if you have an update.

added Playwright support to fetch_html

418dbb7

tulu-g559 requested a review from yurijmikhalevich as a code owner October 2, 2025 07:13

yurijmikhalevich reviewed Oct 7, 2025

yurijmikhalevich reviewed Oct 7, 2025

yurijmikhalevich reviewed Oct 7, 2025

yurijmikhalevich requested changes Oct 7, 2025

tulu-g559 and others added 2 commits October 12, 2025 12:44

Update minerva/tools/fetch_html.py

27bdb9f

Co-authored-by: Yurij Mikhalevich <[email protected]>

Update minerva/tools/fetch_html.py

c9a2ab8

Co-authored-by: Yurij Mikhalevich <[email protected]>

yurijmikhalevich reviewed Oct 12, 2025

yurijmikhalevich closed this Nov 9, 2025
