Remote Playwright support #1006

@esistgut

Is your feature request related to a problem? Please describe.
The library uses Playwright, but Playwright has trouble running on Linux distributions other than the officially supported ones, which in practice means Ubuntu and little else. This is an excerpt of the playwright install output on my local Arch Linux system:

BEWARE: your OS is not officially supported by Playwright; downloading fallback build for ubuntu20.04-x64.
Downloading FFMPEG playwright build v1011 from https://cdn.playwright.dev/dbazure/download/playwright/builds/ffmpeg/1011/ffmpeg-linux.zip
2.3 MiB [====================] 100% 0.0s
FFMPEG playwright build v1011 downloaded to /home/esistgut/.cache/ms-playwright/ffmpeg-1011
Playwright Host validation warning: 
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries:                                   ║
║     libicudata.so.66                                 ║
║     libicui18n.so.66                                 ║
║     libicuuc.so.66                                   ║
║     libxml2.so.2                                     ║
║     libwebp.so.6                                     ║
║     libffi.so.7                                      ║
╚══════════════════════════════════════════════════════╝

Describe the solution you'd like
Playwright already has a solution for this problem, connecting to a browser server running remotely (for example in Docker): https://playwright.dev/docs/docker#remote-connection. ScrapeGraph-AI should provide a way to use it.
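
As a rough illustration of what such support could look like, here is a minimal sketch of a helper that connects to a remote server when an endpoint is configured and otherwise launches a local browser. The `ws_endpoint` key and the `open_browser` name are hypothetical and are not existing ScrapeGraph-AI options; only `chromium.connect(...)` and `chromium.launch(...)` are real Playwright calls. The `playwright` argument is assumed to be the object yielded by `async_playwright()`.

```python
import asyncio
from typing import Any


async def open_browser(playwright: Any, loader_kwargs: dict) -> Any:
    """Connect to a remote Playwright server if configured, else launch locally.

    NOTE: `ws_endpoint` is a hypothetical key used for illustration only;
    it is not an existing ScrapeGraph-AI setting.
    """
    kwargs = dict(loader_kwargs)  # copy so the caller's dict is left untouched
    ws_endpoint = kwargs.pop("ws_endpoint", None)
    if ws_endpoint:
        # Attach to a browser server that is already running elsewhere
        return await playwright.chromium.connect(ws_endpoint)
    # No endpoint configured: fall back to a locally installed browser,
    # forwarding the remaining options (e.g. headless) to launch()
    return await playwright.chromium.launch(**kwargs)
```

With something like this, users on unsupported distributions would only need to set the endpoint, and everyone else would keep the current local-launch behaviour.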

Describe alternatives you've considered
This is an example from the documentation, monkey-patched to connect to a remote Playwright server:

import os
from dotenv import load_dotenv

# --- ScrapeGraphAI imports ---
from scrapegraphai.graphs import SmartScraperGraph

# --- Monkey-patch: use a remote Playwright server for ChromiumLoader ---
from scrapegraphai.docloaders.chromium import ChromiumLoader
from playwright.async_api import async_playwright

load_dotenv()

# Read your LLM key + WS endpoint
openai_key = os.getenv("OPENAI_API_KEY")
WS_ENDPOINT = os.getenv("PW_WS_ENDPOINT", "ws://127.0.0.1:3000/")

# Replace ChromiumLoader.ascrape_playwright with a remote-connecting version
async def _remote_ascrape_playwright(self, url: str) -> str:
    # self.browser_config / self.TIMEOUT come from ChromiumLoader (ScrapeGraphAI)
    timeout_ms = getattr(self, "TIMEOUT", 30000)
    browser_config = getattr(self, "browser_config", {}) or {}

    async with async_playwright() as p:
        # Connect to the remote Playwright server (Docker)
        browser = await p.chromium.connect(WS_ENDPOINT)
        ctx = None
        try:
            ctx = await browser.new_context(**browser_config)
            page = await ctx.new_page()
            # networkidle is a good default for JS-heavy pages
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            html = await page.content()
        finally:
            # Clean up the context if it was created; closing the browser ends the remote session
            if ctx is not None:
                await ctx.close()
            await browser.close()
        return html

# Monkey-patch it on the class so all internal calls use the remote connection
ChromiumLoader.ascrape_playwright = _remote_ascrape_playwright

# ---------------------------------------------------------------
# Your graph as usual
# ---------------------------------------------------------------
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    # Optional: pass extra options to the internal loader (e.g., proxy)
    # This is how ScrapeGraphAI forwards extra params to the loader
    "loader_kwargs": {
        # "proxy": {"http": "http://user:pass@host:port"}  # example if needed
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
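
Because the patch above replaces the method on the class for the rest of the process, it may be safer to scope it. The following is a minimal sketch of a generic context manager that swaps an attribute and restores the original afterwards; `patched_attr` is a hypothetical helper name, not part of ScrapeGraphAI or Playwright.

```python
from contextlib import contextmanager


@contextmanager
def patched_attr(target, name, replacement):
    """Temporarily replace `target.name` and restore the original on exit."""
    original = getattr(target, name)
    setattr(target, name, replacement)
    try:
        yield
    finally:
        # Restore the original attribute even if the body raised
        setattr(target, name, original)


# Intended usage with the example above (assumes those names are defined):
# with patched_attr(ChromiumLoader, "ascrape_playwright", _remote_ascrape_playwright):
#     result = smart_scraper_graph.run()
```

This keeps the workaround local to one call, so other code using ChromiumLoader in the same process still gets the stock behaviour.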

Additional context
Please note: I'm new to both ScrapeGraphAI and Playwright, so I may have missed something very obvious.
