povilasb/scrapy-html-storage

About

A Scrapy downloader middleware that stores response HTMLs to disk.

Usage

Enable the downloader middleware, e.g. by adding it to DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_html_storage.HtmlStorageMiddleware': 10,
}

By default, no responses are saved to disk. You must mark the requests whose response HTML should be saved by setting save_html in the request meta:

def parse(self, response):
    """Processes start urls.

    Args:
        response (HtmlResponse): scrapy HTML response object.
    """
    yield scrapy.Request(
        'http://target.com',
        callback=self.parse_target,
        meta={
            'save_html': True,
        }
    )
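
The callback named above receives the target response as usual; the middleware stores the HTML as a side effect, independently of what the callback yields. A minimal sketch (the extracted fields are hypothetical):

def parse_target(self, response):
    """Parses the page whose HTML was saved to disk.

    Args:
        response (HtmlResponse): scrapy HTML response object.
    """
    # Hypothetical extraction; the HTML file is written by the middleware
    # regardless of what is yielded here.
    yield {
        'url': response.url,
        'title': response.css('title::text').get(),
    }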

The file path where the HTML will be stored is resolved by the spider's response_html_path method, e.g.:

class TargetSpider(scrapy.Spider):
    def response_html_path(self, request):
        """
        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        return 'html/last_response.html'
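
To keep one file per page instead of overwriting a single file, the path can be derived from the request URL. A sketch that hashes the URL (the html/ directory is an assumption and must already exist):

import hashlib

import scrapy


class TargetSpider(scrapy.Spider):
    name = 'target'

    def response_html_path(self, request):
        """Builds a unique, filesystem-safe path per request URL.

        Args:
            request (scrapy.http.request.Request): request that produced the
                response.
        """
        digest = hashlib.sha1(request.url.encode('utf-8')).hexdigest()
        return 'html/{}.html'.format(digest)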

Configuration

The HTML storage downloader middleware supports the following options:

  • gzip_output (bool) - if True, HTML output will be stored in gzip format. Default is False.
  • save_html_on_status (list) - if not empty, a list of response status codes for which HTML will be saved. If the list is empty or not provided, responses with any status code will be saved.

Sample:

HTML_STORAGE = {
    "gzip_output": True,
    "save_html_on_status": [200, 202],
    "save_by_url" = [
        "https://website.com/index.html",
        ".*",
        "website.com/section-\w+.html", 
    ]
}
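
With gzip_output enabled, the stored file contains gzip-compressed bytes and can be read back with Python's standard gzip module. A sketch, assuming the file sits at the path returned by response_html_path above:

import gzip

# Example path; use whatever your response_html_path() returns.
with gzip.open('html/last_response.html', 'rb') as stored:
    html = stored.read().decode('utf-8')

print(html[:200])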
