Skip to content

Conversation

crepererum
Copy link
Contributor

@crepererum crepererum commented May 26, 2025

Which issue does this PR close?

Rationale for this change

The file handle has several drawbacks as it does NOT specify the following properties:

  • mode: Is the file opened read-only or does it also have write access (which would be bad)?
  • seek position: At which position is the current file "read head"? Are users expected to seek to the file start if they want to read the entire file?
  • direct IO: Is the file opened using direct IO? (in which case the user would have to read in aligned blocks)
  • exclusive access: Does the API user have exclusive access to the said file handle (and hence can use seek-based operations) or not (in which case only non-seeking operations like read_at would be required)

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile.

All object store users and integrating libraries potentially have to deal with streams AND file handles.

What changes are included in this PR?

Remove GetResultPayload::File (and the entire enum).

Are there any user-facing changes?

Files are no longer exposed.

@crepererum crepererum added the next-major-release the PR has API changes and it waiting on the next major version label May 26, 2025
@crepererum crepererum force-pushed the crepererum/issue18 branch 2 times, most recently from 5a107b5 to afc7f6c Compare May 26, 2025 14:16
@crepererum crepererum force-pushed the crepererum/issue18 branch from afc7f6c to 21531f3 Compare June 16, 2025 08:17
@crepererum crepererum marked this pull request as ready for review June 16, 2025 08:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about this change -- in theory it is sound, but given how widely used object_store is I am not sure I have any great idea of how it is used

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile

Do we have any way to measure this? object_store has 175 reverse dependencies on crates.io (which is significantly more than just parquet and direct dependents): https://crates.io/crates/object_store/reverse_dependencies?page=13

Screenshot 2025-06-17 at 6 35 33 AM

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

FYI @Xuanwo

@crepererum
Copy link
Contributor Author

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

I don't think there's a way to make this non-breaking and TBH I think we have to face one reality: the object_store interface is messy at the moment and to make it 1.0 proof, there will be breaking changes; and not all of them can be phased.

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

If we want to proceed with removing the GetResultPayload I think we should first deprecate the code for a release cycle or two and see how many users show up and worry / offer opinions about it.

I don't think there's a way to make this non-breaking and TBH I think we have to face one reality: the object_store interface is messy at the moment and to make it 1.0 proof, there will be breaking changes; and not all of them can be phased.

I am not convinced that this particular API needs to be changed and I think we need a better justification than that the API is complicated. The original idea as I understood it, for this API, is that special casing File operations could make significant performance improvements so the more complicated API was justified.

At the same time I suspect that most users copy the file data into a buffer anyways and don't use any kernel-based IO operations like sendfile.

For example, looking at just the DataFusion code, I see quite a few different code paths for File vs Stream:

https://github.com/search?q=repo%3Aapache%2Fdatafusion%20GetResultPayload%3A%3AFile&type=code

I am not claiming that all those cases couldn't be made as performant using a Stream, but I also haven't done the diligence.

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

Specifically, I am concerned that this PR removes functionality that there is no alternative to implement with the object_store crate, rather than just making the API easier to work with

@crepererum
Copy link
Contributor Author

What I see in DataFusion is a bunch of complicated code that is there to deal with the "file" case, not a true optimization for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
next-major-release the PR has API changes and it waiting on the next major version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove file-handle from object store GET operations
2 participants