PR for Download File From Link function #241
Open
Summary
This PR improves the `download_file_from_link` utility to support robust, memory-efficient downloads for large datasets and adds a dedicated test suite to ensure correct behaviour under different network conditions.

Motivation
Some of the datasets used in TopoBench (e.g. those hosted on external academic servers) can be:
- … `response.content` downloads memory-inefficient.
- … `verify=False`, which previously wasn't configurable.

The old implementation used a single `requests.get` call, loaded the entire response into memory, and did not retry on transient failures. This could lead to frequent failures or hangs when downloading large files over slow connections.

What this PR does
1. Improve `download_file_from_link`
The function `download_file_from_link` in `topobench.data.utils.io_utils` is updated to:
- … `os.makedirs(path_to_save, exist_ok=True)`.
- … `verify` argument (default `True`).
- … `timeout` argument (default: 60 seconds for the read timeout, 30 seconds for connection).
- … `retries` argument.
- … `content-length` is available:

Behavioural notes:
- … `200`, the function logs an error and returns without creating a file (same high-level behaviour as before, but now explicit).

2. Add tests for `download_file_from_link`
This PR introduces a new test file (e.g. `tests/data/utils/test_io_utils.py`) containing a test suite for `download_file_from_link`. The tests use `unittest.mock` and `pytest` to cover:
- … `iter_content` with multiple chunks totalling 5 MB.
- … `path_to_save`.
- … `404` response.
- … `requests.get` call raises `requests.exceptions.Timeout`.
- … `requests.get` is called twice.
- … `requests.get` calls raise `requests.exceptions.Timeout`.
- … `["zip", "tar", "tar.gz"]`.
- … `iter_content`.
- … `content-length` header.
- … `download_file_from_link` with `verify=False`.
- … `requests.get` was invoked with `verify=False`.
- … `timeout` value.
- … `requests.get` uses `(30, custom_timeout)` for `(connect, read)` timeouts.

Backwards compatibility
- … (`file_link`, `path_to_save`, `dataset_name`, `file_format`) are unchanged.
- … (`verify`, `timeout`, `retries`) have sensible defaults and should not break existing call sites.

Testing
- … `download_file_from_link` (see `tests/data/utils/test_io_utils.py`).
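
For reviewers who want a quick mental model of the behaviour described in section 1, here is a minimal sketch of a streamed download with retries. It is not the code in this PR. The function name `download_file_sketch`, the `chunk_size` and `backoff` parameters, the default values for `file_format` and `retries`, the `<dataset_name>.<file_format>` file name, and the choice to return `None` on failure are all assumptions made for illustration. Only the `requests` calls (`requests.get` with `stream`, `verify`, `timeout`, and `Response.iter_content`) are real library API.

```python
import logging
import os
import time

import requests

logger = logging.getLogger(__name__)


def download_file_sketch(
    file_link,
    path_to_save,
    dataset_name,
    file_format="tar.gz",  # assumed default
    verify=True,
    timeout=60,            # read timeout in seconds; connect timeout fixed at 30
    retries=3,             # assumed default; the PR text does not state it
    chunk_size=1 << 20,    # hypothetical parameter: 1 MiB chunks
    backoff=1.0,           # hypothetical parameter: pause between attempts
):
    """Stream file_link to disk; return the saved path, or None on failure."""
    os.makedirs(path_to_save, exist_ok=True)
    # Assumed naming scheme: <dataset_name>.<file_format>
    target = os.path.join(path_to_save, f"{dataset_name}.{file_format}")

    for attempt in range(1, retries + 1):
        try:
            response = requests.get(
                file_link, stream=True, verify=verify, timeout=(30, timeout)
            )
            if response.status_code != 200:
                # Non-200: log and return without creating a file.
                logger.error("Download failed with status %s", response.status_code)
                response.close()
                return None

            total = response.headers.get("content-length")  # may be absent
            written = 0
            with open(target, "wb") as fh:
                # iter_content yields the body chunk by chunk, so the whole
                # file is never held in memory at once.
                for chunk in response.iter_content(chunk_size=chunk_size):
                    if chunk:
                        fh.write(chunk)
                        written += len(chunk)
            response.close()
            logger.info("Saved %s (%d bytes, content-length=%s)", target, written, total)
            return target
        except requests.exceptions.RequestException as exc:
            # Timeout and ConnectionError are subclasses of RequestException.
            logger.warning("Attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt < retries:
                time.sleep(backoff)

    logger.error("Giving up on %s after %d attempts", file_link, retries)
    return None
```

Under these assumptions, a call such as `download_file_sketch(url, "data/raw", "my_dataset", "zip", verify=False, timeout=120)` would pass `verify=False` and `timeout=(30, 120)` through to `requests.get`, which is the kind of behaviour the test list in section 2 checks with mocked responses.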