Collaborator

@CTTY CTTY commented Nov 25, 2025

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?


// Deletion operations
async fn delete(&self, path: &str) -> Result<()>;
async fn remove_dir_all(&self, path: &str) -> Result<()>;
Collaborator Author

Migrating comments from @Fokko

This name feels very file-system-like, while Iceberg is designed to work against object stores. How about delete_prefix?

Contributor

+1.
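
To illustrate the difference the proposed `delete_prefix` name points at, here is a minimal sketch over a flat in-memory key space. `MemoryStore` and its methods are hypothetical and not part of the RFC; the point is only that an object store matches on key prefixes, not on directory boundaries.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory object store: a flat map from key to bytes.
#[derive(Debug, Default)]
pub struct MemoryStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl MemoryStore {
    pub fn put(&mut self, key: &str, data: &[u8]) {
        self.objects.insert(key.to_string(), data.to_vec());
    }

    pub fn exists(&self, key: &str) -> bool {
        self.objects.contains_key(key)
    }

    /// Removes every object whose key starts with `prefix` and returns
    /// how many objects were removed.
    pub fn delete_prefix(&mut self, prefix: &str) -> usize {
        let before = self.objects.len();
        self.objects.retain(|key, _| !key.starts_with(prefix));
        before - self.objects.len()
    }
}
```

Note that a prefix without a trailing slash, such as `warehouse/t1`, also matches keys under `warehouse/t10/`, which is one way prefix semantics differ from `remove_dir_all` on a file system.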

#[async_trait]
pub trait Storage: Debug + Send + Sync {
// File existence and metadata
async fn exists(&self, path: &str) -> Result<bool>;
Collaborator Author

Migrating comments from @c-thiel

I know that we use these results everywhere, but I think introducing more specific error types that we can match on for storage operations makes sense. They can implement Into of course.
For example, a RateLimited error that we got from the storage service should be treated differently from NotFound or CredentialsExpired.
With Lakekeeper we are currently using our own trait-based IO due to many limitations in iceberg-rust, mainly unsupported signing mechanisms, missing refresh mechanisms, opaque errors, and missing extensibility.
I would gladly switch to iceberg-rust if we get these solved.
Maybe this can serve as some inspiration: https://github.com/lakekeeper/lakekeeper/blob/b8fcf54c627d48a547ef0baf6863949b68579388/crates/io/src/error.rs#L291

Contributor

To address @c-thiel's comments, we have several approaches:

  1. Introduce another set of errors for storage.
  2. Extend current ErrorKind for storage errors.
  3. Extend current ErrorKind, but with another enum, for example
pub enum IoErrorKind {
    FileNotFound,
    CredentialExpired,
}

pub enum ErrorKind {
     // Existing variants
    ...
    Io(IoErrorKind)
}
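
As a sketch of how option 3 could let callers react differently per failure, the following uses only the standard library. The `RateLimited` variant comes from the earlier comment; the `Unexpected` variant, the `Error` struct, and the `Action` policy are assumptions made for illustration and are not the crate's actual types.

```rust
use std::fmt;

/// Storage-specific failure categories (the exact set is an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IoErrorKind {
    FileNotFound,
    CredentialExpired,
    RateLimited,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    /// Stand-in for the existing variants, which are elided in the thread.
    Unexpected,
    Io(IoErrorKind),
}

/// Hypothetical error type carrying a kind and a message.
#[derive(Debug)]
pub struct Error {
    kind: ErrorKind,
    message: String,
}

impl Error {
    pub fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Self { kind, message: message.into() }
    }

    pub fn kind(&self) -> ErrorKind {
        self.kind
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:?}: {}", self.kind, self.message)
    }
}

impl std::error::Error for Error {}

/// Hypothetical caller-side policy derived from the error kind.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Retry,
    RefreshCredentials,
    Fail,
}

pub fn action_for(err: &Error) -> Action {
    match err.kind() {
        ErrorKind::Io(IoErrorKind::RateLimited) => Action::Retry,
        ErrorKind::Io(IoErrorKind::CredentialExpired) => Action::RefreshCredentials,
        ErrorKind::Io(IoErrorKind::FileNotFound) | ErrorKind::Unexpected => Action::Fail,
    }
}
```

Because the match is exhaustive, adding a new `IoErrorKind` variant forces every such policy to be revisited at compile time, which is one argument for a nested enum over free-form error messages.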

// File object creation
fn new_input(&self, path: &str) -> Result<InputFile>;
fn new_output(&self, path: &str) -> Result<OutputFile>;
}
Collaborator Author

Migrating comments from @c-thiel

Many object stores have a good way of running batch deletions, for example the DeleteObjects API in AWS S3. How would you feel about including a delete_batch method too?
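
One possible shape for `delete_batch` is a trait method with a default that falls back to per-object deletes, which backends with a bulk API can override. The sketch below is synchronous and uses `String` errors for brevity, unlike the async trait in the RFC; `BulkStore` and `PerObjectStore` are hypothetical, and the bulk request is only simulated. The 1000-key chunk size reflects the per-request limit of S3's DeleteObjects.

```rust
use std::sync::Mutex;

/// Simplified, synchronous stand-in for the RFC's async `Storage` trait.
pub trait Storage: Send + Sync {
    fn delete(&self, path: &str) -> Result<(), String>;

    /// Default: one `delete` call per path, stopping at the first failure.
    fn delete_batch(&self, paths: &[String]) -> Result<(), String> {
        paths.iter().try_for_each(|path| self.delete(path))
    }
}

/// S3's DeleteObjects accepts at most 1000 keys per request.
pub const MAX_KEYS_PER_REQUEST: usize = 1000;

/// Hypothetical backend that records the size of each simulated bulk request.
#[derive(Debug, Default)]
pub struct BulkStore {
    request_sizes: Mutex<Vec<usize>>,
}

impl BulkStore {
    pub fn request_sizes(&self) -> Vec<usize> {
        self.request_sizes.lock().unwrap().clone()
    }
}

impl Storage for BulkStore {
    fn delete(&self, path: &str) -> Result<(), String> {
        self.delete_batch(&[path.to_string()])
    }

    fn delete_batch(&self, paths: &[String]) -> Result<(), String> {
        for chunk in paths.chunks(MAX_KEYS_PER_REQUEST) {
            // A real backend would send one bulk-delete request per chunk here.
            self.request_sizes.lock().unwrap().push(chunk.len());
        }
        Ok(())
    }
}

/// Hypothetical backend without a bulk API; it relies on the default method.
#[derive(Debug, Default)]
pub struct PerObjectStore {
    deleted: Mutex<Vec<String>>,
}

impl PerObjectStore {
    pub fn deleted(&self) -> Vec<String> {
        self.deleted.lock().unwrap().clone()
    }
}

impl Storage for PerObjectStore {
    fn delete(&self, path: &str) -> Result<(), String> {
        self.deleted.lock().unwrap().push(path.to_string());
        Ok(())
    }
}
```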

@CTTY CTTY changed the title from "rfc: Making Storage a trait" to "rfc: Making Storage a Trait" Nov 25, 2025
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @CTTY for this PR, generally LGTM! One missing point: I want the StorageBuilderRegistry to have some built-in StorageBuilders registered when a user creates a new catalog instance. I currently don't have a good solution. One approach would be a standalone crate that loads the built-in StorageBuilders when the StorageBuilderRegistry is initialized, so that the catalog crates can depend on it.
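
A rough sketch of what a registry with pre-registered built-ins could look like follows. The `with_builtins` constructor, the `name` method, and the two builder types are hypothetical; only `new`, `register`, `get_builder`, and `supported_types` mirror the signatures quoted in this review, with `String` standing in for the crate's error type.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

pub trait StorageBuilder: Debug + Send + Sync {
    /// Hypothetical method, used here only to tell builders apart.
    fn name(&self) -> &'static str;
}

/// Hypothetical built-in builders.
#[derive(Debug)]
pub struct MemoryStorageBuilder;

#[derive(Debug)]
pub struct LocalFsStorageBuilder;

impl StorageBuilder for MemoryStorageBuilder {
    fn name(&self) -> &'static str {
        "memory"
    }
}

impl StorageBuilder for LocalFsStorageBuilder {
    fn name(&self) -> &'static str {
        "file"
    }
}

#[derive(Debug, Default)]
pub struct StorageBuilderRegistry {
    builders: HashMap<String, Arc<dyn StorageBuilder>>,
}

impl StorageBuilderRegistry {
    /// Empty registry, as in the RFC.
    pub fn new() -> Self {
        Self::default()
    }

    /// Assumed convenience constructor: a registry pre-populated with built-ins.
    pub fn with_builtins() -> Self {
        let mut registry = Self::new();
        registry.register("memory", Arc::new(MemoryStorageBuilder));
        registry.register("file", Arc::new(LocalFsStorageBuilder));
        registry
    }

    pub fn register(&mut self, scheme: impl Into<String>, builder: Arc<dyn StorageBuilder>) {
        self.builders.insert(scheme.into(), builder);
    }

    pub fn get_builder(&self, scheme: &str) -> Result<Arc<dyn StorageBuilder>, String> {
        self.builders
            .get(scheme)
            .cloned()
            .ok_or_else(|| format!("no storage builder registered for scheme `{scheme}`"))
    }

    /// Registered schemes, sorted so the output is deterministic.
    pub fn supported_types(&self) -> Vec<String> {
        let mut schemes: Vec<String> = self.builders.keys().cloned().collect();
        schemes.sort();
        schemes
    }
}
```

In the standalone-crate approach described above, `with_builtins` would live in that crate so the core crate does not have to depend on every backend.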

#[async_trait]
pub trait Storage: Debug + Send + Sync {
// File existence and metadata
async fn exists(&self, path: &str) -> Result<bool>;
Contributor

Suggested change
- async fn exists(&self, path: &str) -> Result<bool>;
+ async fn exists(&self, path: impl AsRef<str>) -> Result<bool>;

A little more rusty.

Collaborator Author

@CTTY CTTY Nov 25, 2025

I think this breaks dyn compatibility (object safety).
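
The concern can be sketched as follows: a method that accepts a generic `AsRef<str>` argument is generic, and a trait with generic methods cannot be used as `dyn Storage` unless those methods are restricted to `Self: Sized`. One way to keep both is a non-generic trait method plus an extension trait with a blanket impl. This is a synchronous simplification; `StorageExt`, `exists_at`, `FixedStorage`, and `demo_storage` are hypothetical names.

```rust
use std::fmt::Debug;
use std::sync::Arc;

pub trait Storage: Debug + Send + Sync {
    /// Non-generic, so it can be called through `dyn Storage`.
    fn exists(&self, path: &str) -> bool;
}

/// Extension trait carrying the generic convenience method.
/// It is never used as a trait object itself.
pub trait StorageExt: Storage {
    fn exists_at(&self, path: impl AsRef<str>) -> bool {
        self.exists(path.as_ref())
    }
}

/// Blanket impl; `?Sized` makes it apply to `dyn Storage` as well.
impl<T: Storage + ?Sized> StorageExt for T {}

/// Hypothetical storage that knows a fixed set of paths.
#[derive(Debug)]
pub struct FixedStorage {
    pub paths: Vec<String>,
}

impl Storage for FixedStorage {
    fn exists(&self, path: &str) -> bool {
        self.paths.iter().any(|p| p.as_str() == path)
    }
}

/// Builds a trait object, which only works because `Storage` has no generic methods.
pub fn demo_storage() -> Arc<dyn Storage> {
    Arc::new(FixedStorage {
        paths: vec!["s3://bucket/a".to_string()],
    })
}
```

With this split, `storage.exists_at(String::from("s3://bucket/a"))` works on an `Arc<dyn Storage>` while the core trait stays usable as a trait object.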

The `StorageBuilder` trait defines how storage backends are constructed:

```rust
pub trait StorageBuilder: Debug + Send + Sync {
```
Contributor

Suggested change
- pub trait StorageBuilder: Debug + Send + Sync {
+ pub trait StorageFactory: Debug + Send + Sync {

nit: Factory sounds a little better.

Contributor

It should also be Serializable

Collaborator Author

+1, I haven't figured out how to have both Serializable and dyn Trait yet, and I'm planning to explore more with typetag. Serializability is a huge pain and I'll add more details once I have more clarity
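
For context on why serializability and `dyn Trait` pull against each other: a trait object has no statically known concrete type, so deserialization needs some tag that selects a decoder. The std-only sketch below hand-rolls that tag dispatch; it is not how typetag is implemented or used, and `type_tag`, `to_props`, `Decoder`, `S3Builder`, `builtin_decoders`, and `decode` are all hypothetical.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

pub trait StorageBuilder: Debug + Send + Sync {
    /// Stable tag identifying the concrete type in serialized form.
    fn type_tag(&self) -> &'static str;
    /// Flattens the builder's configuration into string properties.
    fn to_props(&self) -> HashMap<String, String>;
}

/// Rebuilds a builder from its serialized properties.
pub type Decoder = fn(&HashMap<String, String>) -> Result<Arc<dyn StorageBuilder>, String>;

/// Hypothetical builder with a single setting.
#[derive(Debug)]
pub struct S3Builder {
    pub region: String,
}

impl StorageBuilder for S3Builder {
    fn type_tag(&self) -> &'static str {
        "s3"
    }

    fn to_props(&self) -> HashMap<String, String> {
        HashMap::from([("region".to_string(), self.region.clone())])
    }
}

fn decode_s3(props: &HashMap<String, String>) -> Result<Arc<dyn StorageBuilder>, String> {
    let region = props.get("region").ok_or("missing `region`")?.clone();
    Ok(Arc::new(S3Builder { region }))
}

/// Table mapping each known tag to its decoder.
pub fn builtin_decoders() -> HashMap<&'static str, Decoder> {
    HashMap::from([("s3", decode_s3 as Decoder)])
}

/// Picks the decoder by tag and rebuilds the trait object.
pub fn decode(
    decoders: &HashMap<&'static str, Decoder>,
    tag: &str,
    props: &HashMap<String, String>,
) -> Result<Arc<dyn StorageBuilder>, String> {
    let decoder = decoders
        .get(tag)
        .ok_or_else(|| format!("unknown storage type `{tag}`"))?;
    decoder(props)
}
```

A crate such as typetag automates this kind of tag-to-type mapping on top of serde, which is why it is a candidate to explore here.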

pub fn new() -> Self { /* ... */ }
pub fn register(&mut self, scheme: impl Into<String>, builder: Arc<dyn StorageBuilder>);
pub fn get_builder(&self, scheme: &str) -> Result<Arc<dyn StorageBuilder>>;
pub fn supported_types(&self) -> Vec<String>;
Contributor

Suggested change
- pub fn supported_types(&self) -> Vec<String>;
+ pub fn supported_types(&self) -> impl Iterator<Item = &str>;
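
A minimal sketch of the iterator-returning variant, assuming the schemes are stored as owned strings. `SchemeTable` is a hypothetical stand-in for the registry; the explicit `+ '_` bound spells out that the iterator borrows from `self`.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the registry that tracks only scheme names.
#[derive(Debug, Default)]
pub struct SchemeTable {
    schemes: BTreeSet<String>,
}

impl SchemeTable {
    pub fn register(&mut self, scheme: impl Into<String>) {
        self.schemes.insert(scheme.into());
    }

    /// Yields borrowed scheme names without allocating a new `Vec<String>`.
    pub fn supported_types(&self) -> impl Iterator<Item = &str> + '_ {
        self.schemes.iter().map(String::as_str)
    }
}
```

Callers that still need an owned list can write `table.supported_types().map(str::to_owned).collect::<Vec<_>>()`.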

use iceberg::io::FileIOBuilder;

// Basic usage (same as the existing code)
let file_io = FileIOBuilder::new("s3")
Contributor

We no longer need FileIOBuilder?

Collaborator Author

I agree that FileIOBuilder seems excessive. I'm keeping it here for now mainly because I'm not yet sure where to keep Extensions, given the serde-related concerns.

Contributor

After StorageBuilder becomes serializable, we should be able to remove it.

FileIOBuilder::build()
StorageBuilderRegistry::new()
Contributor

  1. I think we should remove FileIOBuilder then?
  2. Should the StorageBuilderRegistry be an instance held by the catalog instance?