marindigen commented Nov 24, 2025

Summary

This PR improves the download_file_from_link utility to support robust, memory-efficient downloads for large datasets and adds a dedicated test suite to ensure correct behaviour under different network conditions.

Motivation

Some of the datasets used in TopoBench (e.g. those hosted on external academic servers) can be:

  • Very large, making response.content downloads memory-inefficient.
  • Slow or unstable, leading to timeouts or partial downloads.
  • Occasionally reachable only with verify=False, which previously wasn’t configurable.

The old implementation used a single requests.get call, loaded the entire response into memory, and did not retry on transient failures. This could lead to frequent failures or hangs when downloading large files over slow connections.
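For context, the single-shot pattern this paragraph describes looks roughly like the sketch below (a reconstruction for illustration, not the exact original code; the error message is illustrative):

```python
import requests


def download_file_from_link_old(file_link, path_to_save, dataset_name,
                                file_format="zip"):
    # Single request: no timeout, no retries, no SSL toggle.
    response = requests.get(file_link)
    if response.status_code == 200:
        with open(f"{path_to_save}/{dataset_name}.{file_format}", "wb") as f:
            # The entire payload is buffered in memory before being written.
            f.write(response.content)
    else:
        print(f"Failed to download: HTTP {response.status_code}")
```

For multi-gigabyte datasets, `response.content` forces the whole file into RAM at once, which is exactly the memory inefficiency the streaming rewrite removes.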

What this PR does

1. Improve download_file_from_link

The function download_file_from_link in topobench.data.utils.io_utils is updated to:

  • Stream the response in 5MB chunks instead of loading it all into memory.
  • Ensure the target directory exists via os.makedirs(path_to_save, exist_ok=True).
  • Support SSL verification control via a verify argument (default True).
  • Support configurable per-chunk read timeout via a timeout argument
    (default: 60 seconds for the read timeout, 30 seconds for connection).
  • Add retry logic with exponential backoff on failures, controlled by a retries argument.
  • Print download progress when content-length is available:
    • Total size (in GB)
    • Percentage completed
    • Approximate download speed (MB/s)
    • ETA in hours and minutes
  • Handle unknown content length gracefully and still stream the file.
  • Raise an exception after all retry attempts are exhausted, instead of silently failing.
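Taken together, the bullets above describe behaviour along these lines (a minimal sketch of the described semantics, not the actual TopoBench implementation; the progress-message format and error text are illustrative):

```python
import os
import time

import requests

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, matching the chunk size described above


def download_file_from_link(
    file_link, path_to_save, dataset_name, file_format="zip",
    verify=True, timeout=60, retries=3,
):
    """Streaming download with retries (sketch of the behaviour above)."""
    os.makedirs(path_to_save, exist_ok=True)
    target = os.path.join(path_to_save, f"{dataset_name}.{file_format}")
    for attempt in range(retries):
        try:
            response = requests.get(
                file_link, stream=True, verify=verify, timeout=(30, timeout)
            )
            if response.status_code != 200:
                print(f"Failed to download: HTTP {response.status_code}")
                return
            total = int(response.headers.get("content-length", 0))
            done = 0
            start = time.time()
            with open(target, "wb") as f:
                for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                    if not chunk:  # skip empty keep-alive chunks
                        continue
                    f.write(chunk)
                    done += len(chunk)
                    if total:  # progress only when content-length is known
                        elapsed = time.time() - start
                        speed = done / elapsed / 1e6 if elapsed else 0.0  # MB/s
                        eta = (total - done) / (done / elapsed) if elapsed else 0.0
                        print(f"{done / total:.1%} of {total / 1e9:.2f} GB "
                              f"at {speed:.1f} MB/s, "
                              f"ETA {int(eta // 3600)}h {int(eta % 3600 // 60)}m")
            return
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # surface the error once all retries are exhausted
            time.sleep(2 ** attempt)  # exponential backoff between attempts
```

Note the `timeout=(30, timeout)` tuple: `requests` interprets it as (connect timeout, read timeout), which matches the 30 s / 60 s defaults listed above.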

Behavioural notes:

  • For HTTP status codes other than 200, the function logs an error and returns without creating a file (same high-level behaviour as before, but now explicit).
  • For persistent network errors (e.g. repeated timeouts), the function retries and finally raises the underlying exception on the last failure.

2. Add tests for download_file_from_link

This PR introduces a new test file (e.g. tests/data/utils/test_io_utils.py) containing a test suite for download_file_from_link. The tests use unittest.mock and pytest to cover:

  • Successful streaming download with progress:
    • Mocks iter_content with multiple chunks totalling 5MB.
    • Asserts the output file exists and has the expected size.
  • Automatic directory creation:
    • Uses a nested, non-existent directory in path_to_save.
    • Verifies that the directory is created and the file is written.
  • HTTP error handling:
    • Mocks a 404 response.
    • Asserts that no file is created.
  • Retry on timeout:
    • First requests.get call raises requests.exceptions.Timeout.
    • Second call returns a successful mock response.
    • Verifies that the file is created and that requests.get is called twice.
  • Exhausting retries:
    • All requests.get calls raise requests.exceptions.Timeout.
    • Asserts that the function raises after the configured number of retries.
  • Support for multiple file formats:
    • Loops over ["zip", "tar", "tar.gz"].
    • Verifies that files with the correct extensions are created.
  • Handling empty chunks:
    • Includes empty chunks in iter_content.
    • Ensures the final file size only includes non-empty chunks.
  • Unknown content length:
    • Omits the content-length header.
    • Verifies that the file is still correctly written.
  • SSL verification toggle:
    • Calls download_file_from_link with verify=False.
    • Asserts that requests.get was invoked with verify=False.
  • Custom timeout:
    • Calls the function with a custom timeout value.
    • Asserts that requests.get uses (30, custom_timeout) for (connect, read) timeouts.

Backwards compatibility

  • The function name, module location, and core signature (file_link, path_to_save, dataset_name, file_format) are unchanged.
  • New keyword arguments (verify, timeout, retries) have sensible defaults and should not break existing call sites.
  • The main change in behaviour is that persistent network failures now raise an exception after all retries instead of only printing an error. This makes failures explicit and easier to debug, while not affecting successful downloads.
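The compatibility claim can be checked mechanically against the signature. The stand-in below mirrors the signature described in this PR (parameter names come from the description above; the body is elided):

```python
import inspect


def download_file_from_link(file_link, path_to_save, dataset_name,
                            file_format="zip", verify=True,
                            timeout=60, retries=3):
    ...  # body elided; only the signature matters here


sig = inspect.signature(download_file_from_link)

# Old call sites pass only the original arguments and still bind cleanly:
sig.bind("http://example.com/f", "./data", "cora", "zip")

# New options are reachable by keyword without disturbing existing calls:
sig.bind("http://example.com/f", "./data", "cora",
         verify=False, timeout=120, retries=5)
```

Because every new parameter has a default, any call that was valid before this PR binds against the new signature unchanged.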

Testing

  • New tests added for download_file_from_link (see tests/data/utils/test_io_utils.py).
  • All tests pass locally:
pytest tests/data/utils/test_io_utils.py

…o check and test the training. To be able to download the dataset, the verify parameter of requests.get() inside 'download_file_from_link' should be set to False. Note also that the run script currently fails to download the data even with verify=False.
…ll_to_dict and process_mat. I have also modified download_file_from_link by specifying verify=False in requests.get()
marindigen added a commit to marindigen/TopoBenchForNeuro that referenced this pull request Nov 26, 2025
…ion. Move the test class to the appropriate file. Note that the same changes were made in PR geometric-intelligence#241 (they are duplicated here, as the script wouldn't run otherwise and would require additional adaptation to the old download_file_from_link function).