Description
Background
A Filename registry persists even after files have been removed, to prevent re-upload or re-use of that exact filename.
When a release's files are removed, the ability to surface their storage location is effectively lost, because the path generator derives the placement from the file's hashes, which are also not preserved.
warehouse/warehouse/forklift/legacy.py
Lines 1525 to 1536 in aeb9ccd
    # Figure out what our filepath is going to be, we're going to use a
    # directory structure based on the hash of the file contents. This
    # will ensure that the contents of the file cannot change without
    # it also changing the path that the file is saved too.
    path="/".join(
        [
            file_hashes[PATH_HASHER][:2],
            file_hashes[PATH_HASHER][2:4],
            file_hashes[PATH_HASHER][4:],
            filename,
        ]
    ),
The `path` data exists in the BigQuery Project Metadata Table, so it's semi-available, but harder to get to during routine operations and investigations.
This could also feasibly be used via Inspector or something similar, if made available via some API.
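To illustrate why the location cannot be recovered once the hashes are gone, here is a minimal, self-contained sketch of the path scheme from the snippet above. It assumes `PATH_HASHER` names a 256-bit BLAKE2b digest (an assumption, not confirmed by the excerpt), and `storage_path` is a hypothetical helper, not a function in Warehouse.

```python
import hashlib

# Assumption: the path hasher is a 256-bit BLAKE2b digest keyed as "blake2_256".
PATH_HASHER = "blake2_256"


def storage_path(content: bytes, filename: str) -> str:
    """Hypothetical helper mirroring the path layout in the snippet above."""
    file_hashes = {
        PATH_HASHER: hashlib.blake2b(content, digest_size=32).hexdigest(),
    }
    # Two 2-character directory levels, then the rest of the digest, then the filename.
    return "/".join(
        [
            file_hashes[PATH_HASHER][:2],
            file_hashes[PATH_HASHER][2:4],
            file_hashes[PATH_HASHER][4:],
            filename,
        ]
    )


example_path = storage_path(b"example file contents", "example-1.0.tar.gz")
print(example_path)
```

The filename is only the last path segment; everything before it depends on the file contents, so a registry entry that keeps the filename alone is not enough to rebuild the path.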
Proposal
A few steps to tackle the problem. These can definitely change based on further findings or ideas.
- Add `Filename.path` column (easy)
- Populate `Filename.path` during file upload, around here (easy)
- Backfill the column from existing data in `File.path` for filenames that match (easyish)
- Backfill the column for remaining empty entries from BigQuery data (medium to hard)
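The third step (backfilling from existing file rows) could look roughly like the sketch below. It uses an in-memory SQLite database purely for illustration; the table names `files` and `filenames`, their columns, and the `backfill_paths` helper are all hypothetical stand-ins, not the actual Warehouse schema or migration code.

```python
import sqlite3


def backfill_paths(conn: sqlite3.Connection) -> int:
    """Copy path from files into filenames where the filename matches.

    Hypothetical schema: files(filename, path), filenames(filename, path).
    Returns the number of registry rows that were filled in.
    """
    cur = conn.execute(
        """
        UPDATE filenames
        SET path = (
            SELECT f.path FROM files AS f
            WHERE f.filename = filenames.filename
        )
        WHERE path IS NULL
          AND EXISTS (
            SELECT 1 FROM files AS f
            WHERE f.filename = filenames.filename
          )
        """
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (filename TEXT PRIMARY KEY, path TEXT NOT NULL);
    CREATE TABLE filenames (filename TEXT PRIMARY KEY, path TEXT);
    INSERT INTO files VALUES ('pkg-1.0.tar.gz', 'ab/cd/ef01/pkg-1.0.tar.gz');
    INSERT INTO filenames (filename) VALUES ('pkg-1.0.tar.gz');
    INSERT INTO filenames (filename) VALUES ('pkg-0.9.tar.gz');
    """
)

updated = backfill_paths(conn)
# Rows still empty here would be candidates for the BigQuery backfill step.
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT filename FROM filenames WHERE path IS NULL ORDER BY filename"
    )
]
print(updated, remaining)
```

Entries whose files were already removed have no matching row to copy from, which is why the last step needs an external source such as the BigQuery data.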