Is your feature request related to a problem? Please describe.
The library uses Playwright, but Playwright has issues running on Linux distributions other than the officially supported ones, which essentially means Ubuntu. This is an excerpt of `playwright install` on my local Arch Linux system:
BEWARE: your OS is not officially supported by Playwright; downloading fallback build for ubuntu20.04-x64.
Downloading FFMPEG playwright build v1011 from https://cdn.playwright.dev/dbazure/download/playwright/builds/ffmpeg/1011/ffmpeg-linux.zip
2.3 MiB [====================] 100% 0.0s
FFMPEG playwright build v1011 downloaded to /home/esistgut/.cache/ms-playwright/ffmpeg-1011
Playwright Host validation warning:
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries: ║
║ libicudata.so.66 ║
║ libicui18n.so.66 ║
║ libicuuc.so.66 ║
║ libxml2.so.2 ║
║ libwebp.so.6 ║
║ libffi.so.7 ║
╚══════════════════════════════════════════════════════╝
Describe the solution you'd like
Playwright already provides a solution for this problem: remote connections to a browser server running in Docker (https://playwright.dev/docs/docker#remote-connection). ScrapeGraph-AI should expose a way to use it.
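For reference, the linked docs describe starting a browser server in a container and connecting to it over WebSocket. A minimal setup sketch (the image tag and port here are assumptions; the tag must match the Playwright version used by the Python client):

```shell
# Run a Playwright browser server in Docker; clients connect via WebSocket.
# The version tag (v1.46.0) is an assumption -- match it to your installed
# Playwright client version.
docker run -p 3000:3000 --rm --init -it \
  mcr.microsoft.com/playwright:v1.46.0-jammy \
  /bin/sh -c "npx -y playwright@1.46.0 run-server --port 3000 --host 0.0.0.0"
```

The client side then connects with `p.chromium.connect("ws://127.0.0.1:3000/")`, as in the example below.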
Describe alternatives you've considered
Here is a monkey-patched example adapted from the documentation:
import os
import asyncio
from dotenv import load_dotenv

# --- ScrapeGraphAI imports ---
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# --- Monkey-patch: use a remote Playwright server for ChromiumLoader ---
from scrapegraphai.docloaders.chromium import ChromiumLoader
from playwright.async_api import async_playwright

load_dotenv()

# Read your LLM key + WS endpoint
openai_key = os.getenv("OPENAI_API_KEY")
WS_ENDPOINT = os.getenv("PW_WS_ENDPOINT", "ws://127.0.0.1:3000/")

# Replace ChromiumLoader.ascrape_playwright with a remote-connecting version
async def _remote_ascrape_playwright(self, url: str) -> str:
    # self.browser_config / self.TIMEOUT come from ChromiumLoader (ScrapeGraphAI)
    timeout_ms = getattr(self, "TIMEOUT", 30000)
    browser_config = getattr(self, "browser_config", {}) or {}
    async with async_playwright() as p:
        # Connect to the remote Playwright server (Docker)
        browser = await p.chromium.connect(WS_ENDPOINT)
        ctx = None
        try:
            ctx = await browser.new_context(**browser_config)
            page = await ctx.new_page()
            # networkidle is a good default for JS-heavy pages
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            html = await page.content()
        finally:
            # Clean up the context; closing the browser ends the remote session
            if ctx is not None:
                await ctx.close()
            await browser.close()
    return html

# Monkey-patch it on the class so all internal calls use the remote connection
ChromiumLoader.ascrape_playwright = _remote_ascrape_playwright

# ---------------------------------------------------------------
# Your graph as usual
# ---------------------------------------------------------------
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    # Optional: pass extra options to the internal loader (e.g., proxy)
    # This is how ScrapeGraphAI forwards extra params to the loader
    "loader_kwargs": {
        # "proxy": {"http": "http://user:pass@host:port"}  # example if needed
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Additional context
Please note: I'm new to both ScrapeGraphAI and Playwright, so I may have missed something very obvious.
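One possible shape for a first-class solution, sketched purely as a proposal: forward the WebSocket endpoint through the existing `loader_kwargs` mechanism so the loader connects instead of launching a local browser. The `ws_endpoint` key is hypothetical; ScrapeGraph-AI does not currently support it.

```python
# Hypothetical usage sketch: "ws_endpoint" is an assumed option name, not an
# existing ScrapeGraph-AI parameter. The idea is that ChromiumLoader would
# call p.chromium.connect(ws_endpoint) when this key is present.
import os

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_API_KEY"),
        "model": "openai/gpt-4o",
    },
    "loader_kwargs": {
        # Falls back to the local Docker server from the example above
        "ws_endpoint": os.getenv("PW_WS_ENDPOINT", "ws://127.0.0.1:3000/"),
    },
}

print(graph_config["loader_kwargs"]["ws_endpoint"])
```

This would keep the public API unchanged for users who run browsers locally, while letting remote users opt in with a single config key.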