-
Notifications
You must be signed in to change notification settings - Fork 4
added Playwright support to fetch_html #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@yurijmikhalevich |
|
@tulu-g559, thank you for submitting this PR! I think it's fine to just always use playwright. How would you go about detecting whether the website is static or not? Easier to just use playwright everytime + this keeps the code simpler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please share a screenshot or a video of this solution working?
Also, can you share how much RAM does calling this function consume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single fetch_html call typically uses 250 MB–600 MB of RAM, depending on the page, as depends mostly on Playwright (Chromium)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tulu-g559, that's not little.
Should we ensure we only have 1 Chromium instance running, and, if there are multiple parallel calls to fetch_html, queue them, so they reuse that single instance?
If we do this, we can keep the RAM usage under control.
| "User-Agent": "Minerva AI Bot - (https://github.com/move-fast-and-break-things/minerva)" | ||
| }) | ||
|
|
||
| await page.goto(url, wait_until="networkidle") # wait until JS is mostly done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a regular page.goto will be enough. Usually, it resolves when the page is loaded.
yurijmikhalevich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is a great first stab at a problem 🙌 Please review my comments and let me know if I can help!
Co-authored-by: Yurij Mikhalevich <[email protected]>
Co-authored-by: Yurij Mikhalevich <[email protected]>
| # TIMEOUT_SEC = 2 | ||
| TIMEOUT_SEC = 10 # give more time for JS sites |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's remove the old code here, too
| # TIMEOUT_SEC = 2 | |
| TIMEOUT_SEC = 10 # give more time for JS sites | |
| TIMEOUT_SEC = 10 |
|
@tulu-g559, I am closing this PR as stale. Please reopen it if you have an update. |

Changes made for Issue #29
async_playwrighthelper that launches Chromium in headless mode and extracts the rendered DOM.@yurijmikhalevich Need suggestions
Do I need to do hybrid approach???