@tulu-g559 tulu-g559 commented Oct 2, 2025

Changes made for Issue #29

  • Rewrote fetch_html to support both httpx (fast path for static sites) and Playwright (fallback for CSR/JS-heavy sites).
  • Added async_playwright helper that launches Chromium in headless mode and extracts the rendered DOM.
  • Updated clean_html to avoid shadowed variables and handle empty/invalid HTML safely.
  • Increased timeout for Playwright requests.
  • Preserved existing cleaning pipeline (lxml.Cleaner) to keep output consistent.

@yurijmikhalevich, I need suggestions.
Do I need to take a hybrid approach?

try:
    async with httpx.AsyncClient() as client:
        ...  # fast path: fetch with httpx and return the HTML on success
except Exception:
    # ignore and fall back to playwright
    pass

return await fetch_html_playwright(url)

@tulu-g559
Author

@yurijmikhalevich
Any updates?

@yurijmikhalevich
Member

@tulu-g559, thank you for submitting this PR!

I think it's fine to just always use Playwright. How would you go about detecting whether the website is static or not? It's easier to use Playwright every time, and this keeps the code simpler.

Member

Can you please share a screenshot or a video of this solution working?

Also, can you share how much RAM calling this function consumes?

Author

A single fetch_html call typically uses 250–600 MB of RAM, depending on the page. Most of that comes from Playwright (Chromium).

Member

@tulu-g559, that's not a small amount.

Should we ensure we only have 1 Chromium instance running, and, if there are multiple parallel calls to fetch_html, queue them, so they reuse that single instance?

If we do this, we can keep the RAM usage under control.
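
A minimal sketch of the single-instance idea above, assuming Playwright's async API. `SharedBrowser`, `launch_chromium`, and `max_pages` are hypothetical names and not code from this PR. The browser is launched lazily, once, and a semaphore makes extra `fetch_html` calls wait until a page slot is free:

```python
import asyncio
from typing import Any, Awaitable, Callable


async def launch_chromium() -> Any:
    """Start Playwright and launch one headless Chromium (imported lazily)."""
    from playwright.async_api import async_playwright

    playwright = await async_playwright().start()
    # Simplification: the Playwright driver handle is not kept, so it is never
    # stopped here; a fuller version would store it and call stop() on shutdown.
    return await playwright.chromium.launch(headless=True)


class SharedBrowser:
    """Hypothetical helper: one browser, at most `max_pages` open pages."""

    def __init__(
        self,
        launch: Callable[[], Awaitable[Any]] = launch_chromium,
        max_pages: int = 1,
    ) -> None:
        self._launch = launch
        self._max_pages = max_pages
        self._browser: Any = None
        # Created lazily so they are bound to the running event loop.
        self._launch_lock: Any = None
        self._slots: Any = None

    async def _get_browser(self) -> Any:
        if self._launch_lock is None:
            self._launch_lock = asyncio.Lock()
            self._slots = asyncio.Semaphore(self._max_pages)
        async with self._launch_lock:
            # Only the first caller launches; the rest reuse the instance.
            if self._browser is None:
                self._browser = await self._launch()
        return self._browser

    async def fetch_html(self, url: str, timeout_sec: float = 10) -> str:
        browser = await self._get_browser()
        async with self._slots:  # callers beyond max_pages wait here
            page = await browser.new_page()
            try:
                # Playwright timeouts are given in milliseconds.
                await page.goto(url, timeout=timeout_sec * 1000)
                return await page.content()
            finally:
                await page.close()

    async def close(self) -> None:
        if self._browser is not None:
            await self._browser.close()
            self._browser = None
```

With `max_pages=1`, parallel callers are effectively queued behind a single page; raising it trades RAM for throughput. The application would create one `SharedBrowser` at startup and call `close()` on shutdown.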

"User-Agent": "Minerva AI Bot - (https://github.com/move-fast-and-break-things/minerva)"
})

await page.goto(url, wait_until="networkidle") # wait until JS is mostly done
Member

Waiting for networkidle may never resolve because some websites do continuous polling in the background.

Docs: [image]

Member

Maybe a regular page.goto will be enough. Usually, it resolves when the page is loaded.
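
A small sketch of that suggestion, assuming `page` is a Playwright `Page` that the caller has already opened. `fetch_rendered_html` is a hypothetical name. It waits for the `load` event (Playwright's default for `wait_until`) instead of `networkidle`, and passes an explicit timeout so a slow page cannot hang the call:

```python
TIMEOUT_SEC = 10


async def fetch_rendered_html(page, url: str) -> str:
    """Navigate and return the rendered DOM once the load event has fired."""
    # "load" is the default wait_until value; it is spelled out here only to
    # contrast with "networkidle". Playwright timeouts are in milliseconds.
    await page.goto(url, wait_until="load", timeout=TIMEOUT_SEC * 1000)
    return await page.content()
```

If the timeout elapses, the error from `page.goto` propagates to the caller in this sketch; whether to retry or give up is left open.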

Member

@yurijmikhalevich left a comment

Thank you! This is a great first stab at the problem 🙌 Please review my comments and let me know if I can help!

tulu-g559 and others added 2 commits October 12, 2025 12:44
Co-authored-by: Yurij Mikhalevich <[email protected]>
Co-authored-by: Yurij Mikhalevich <[email protected]>
Comment on lines +8 to +9
# TIMEOUT_SEC = 2
TIMEOUT_SEC = 10 # give more time for JS sites
Member

nit: let's remove the old code here, too

Suggested change
- # TIMEOUT_SEC = 2
- TIMEOUT_SEC = 10 # give more time for JS sites
+ TIMEOUT_SEC = 10

@yurijmikhalevich
Member

@tulu-g559, I am closing this PR as stale. Please reopen it if you have an update.
