Member

@rok rok commented Aug 16, 2025

Rationale for this change

#8029 introduced pub ArrowWriter.get_column_writers and pub ArrowWriter.append_row_group to enable multi-threaded writing of encrypted parquet files. However, downstream testing showed the API is not feasible; see #8115.

What changes are included in this PR?

This introduces pub ArrowWriter.into_serialized_writer and deprecates pub ArrowWriter.get_column_writers and pub ArrowWriter.append_row_group. It also makes ArrowRowGroupWriterFactory public and adds a pub ArrowRowGroupWriterFactory.create_column_writers.

Are these changes tested?

This includes a DataFusion-inspired test for concurrent writing across columns and row groups, to make sure parallel writing is and remains possible with ArrowWriter's API. Further, we created a draft PR in DataFusion, apache/datafusion#16738, to test for multithreaded writing support.

Are there any user-facing changes?

See description of changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 16, 2025
@rok rok force-pushed the multi-threaded_encrypted_writing_3 branch 4 times, most recently from bcab6f9 to b701695 Compare August 22, 2025 15:03
@rok rok marked this pull request as ready for review August 22, 2025 15:12
@rok
Member Author

rok commented Aug 22, 2025

@alamb @adamreeve I think this will make for a better multithreading API.
Unfortunately this misses the release window and I assume we now have to deprecate a couple of ArrowWriter methods. Do we have a policy for this @alamb ?

Contributor

@adamreeve adamreeve left a comment

Looks good to me thanks Rok, just some minor comments

@rok
Member Author

rok commented Sep 1, 2025

Thanks for the review @adamreeve ! I suppose it's time to ask @alamb to do a pass :)

Contributor

@albertlockett albertlockett left a comment

LGTM

Contributor

@alamb alamb left a comment

Thanks @rok and @adamreeve -- I went over the PR more carefully and it makes sense to me. The examples are quite cool.

Thanks again

@alamb
Contributor

alamb commented Sep 15, 2025

I merged up from main and plan to merge this PR in once the tests pass

@alamb
Contributor

alamb commented Sep 15, 2025

Hi @rok and @adamreeve -- there appears to be a clippy error now (due to a new API that was added on main). Can you possibly take a look?

Again, I apologize for the delay in review

@rok
Member Author

rok commented Sep 15, 2025

Thanks for the review @alamb!
It seems we missed a codepath. I'll fix it tomorrow.

@rok
Member Author

rok commented Sep 15, 2025

@alamb I've pushed a couple of ignores just to confirm the codepath. Please don't merge yet.

@mbrobbel mbrobbel added this to the 56.2.0 milestone Sep 16, 2025
@rok
Member Author

rok commented Sep 17, 2025

@alamb the deprecation warning comes from pub get_column_writers and pub append_row_group being added to AsyncArrowWriter since the last release. These (get_column_writers, append_row_group) were added to ArrowWriter just prior to the last release, but turned out not to be a great API for downstream writing due to locking issues, so we decided to deprecate them, as they were not needed, and to reduce the pub area as suggested here.
We would suggest removing pub get_column_writers and pub append_row_group from both ArrowWriter and AsyncArrowWriter before the next release and this PR reflects that.
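
For context on the mechanics: Rust's `#[deprecated]` attribute makes each remaining call site emit a lint, so a CI job that denies warnings will fail when new code calls a deprecated method. The following is a minimal std-only sketch of that behaviour. `ToyWriter`, `take_rows`, `into_rows`, `legacy_caller`, and `run_example` are hypothetical names made up for illustration and are not parquet APIs; the version string is likewise a placeholder.

```rust
/// Hypothetical writer used only for illustration; not a parquet type.
pub struct ToyWriter {
    rows: Vec<u32>,
}

impl ToyWriter {
    pub fn new() -> Self {
        ToyWriter { rows: Vec::new() }
    }

    pub fn push(&mut self, v: u32) {
        self.rows.push(v);
    }

    /// Old entry point, kept so existing callers keep compiling.
    /// Any call to it emits the `deprecated` lint with this note.
    #[deprecated(since = "0.2.0", note = "Use into_rows instead")]
    pub fn take_rows(&mut self) -> Vec<u32> {
        std::mem::take(&mut self.rows)
    }

    /// Replacement entry point: consumes the writer.
    pub fn into_rows(self) -> Vec<u32> {
        self.rows
    }
}

/// A known remaining caller can silence the lint locally
/// instead of disabling it for the whole crate.
#[allow(deprecated)]
pub fn legacy_caller(w: &mut ToyWriter) -> Vec<u32> {
    w.take_rows()
}

/// Exercises both the deprecated and the replacement path.
pub fn run_example() -> (Vec<u32>, Vec<u32>) {
    let mut w = ToyWriter::new();
    w.push(1);
    w.push(2);
    let old = legacy_caller(&mut w); // drains the rows pushed so far
    w.push(3);
    let new = w.into_rows(); // returns what is left
    (old, new)
}
```

Removing an unreleased method outright, as suggested above for AsyncArrowWriter, avoids the lint entirely; the attribute is only needed for methods that have already shipped.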

@rok rok force-pushed the multi-threaded_encrypted_writing_3 branch from 15ed2de to c4e38db Compare September 17, 2025 10:39
@rok
Member Author

rok commented Sep 17, 2025

I don't have the mixed sync/async-write test quite right yet, but I wonder if we really need it?

@alamb
Contributor

alamb commented Sep 17, 2025

I am getting ready to create the 56.2.0 release candidate. Shall we try and get this one in, or can it wait for 57.0.0 in October?

@rok
Member Author

rok commented Sep 17, 2025

@alamb it'd be great to get this one in. Let me take a look if I can fix up the test_async_arrow_group_writer test.

@adamreeve
Contributor

The motivation for #8262 seems to be making the AsyncArrowWriter API match ArrowWriter. Is the appropriate fix then to add into_serialized_writer to AsyncArrowWriter instead of get_column_writers and append_row_group?

cc @lilianm

@alamb
Contributor

alamb commented Sep 18, 2025

The motivation for #8262 seems to be making the AsyncArrowWriter API match ArrowWriter. Is the appropriate fix then to add into_serialized_writer to AsyncArrowWriter instead of get_column_writers and append_row_group?

cc @lilianm

I am not sure -- what I would love to see is an example like this https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html#example-encoding-two-arrow-arrays-in-parallel

That shows how to use whatever APIs we have to write data in parallel

I am sorry I haven't found time to study this PR in detail. @adamreeve it sounds like you have some good ideas and have devoted time here. What would you recommend?

Contributor

@alamb alamb left a comment

Ok, I went over this PR quite carefully this morning, and reviewed all the PRs / APIs and I think it makes sense. Thank you @rok and @adamreeve

Divergent Writing APIs

One thought I had (which we can do as a follow on PR) would be to unify the APIs for doing concurrent writing so they always use ArrowRowGroupWriterFactory, which I think would mean:

  1. Deprecate get_column_writers
  2. Make ArrowRowGroupWriterFactory::new public
  3. Update the examples to use ArrowRowGroupWriterFactory

If this sounds reasonable, I can file a ticket.

I will also see if I can debug the test failures

}

/// Create a new row group writer and return its column writers.
pub async fn get_column_writers(&mut self) -> Result<Vec<ArrowColumnWriter>> {
Contributor

I double checked and this code is not yet released, so this is not a public API change. It was added in

Member Author

It is very fresh indeed. I hope we're not spoiling some plan here. We did find the ArrowRowGroupWriterFactory route better so I think it should be ok.

}

/// Create a new row group writer and return its column writers.
#[deprecated(since = "56.2.0", note = "Use into_serialized_writer instead")]
Contributor

Note to myself, these APIs were added in


let mut writers = writer.get_column_writers().await.unwrap();
// Use low-level API to write an Arrow group
let arrow_writer = writer.sync_writer;
Contributor

I think it would be good to use public APIs in the tests, rather than accessing the inner fields directly (which is not possible from other crates)


/// Converts this writer into a lower-level [`SerializedFileWriter`] and [`ArrowRowGroupWriterFactory`].
/// This can be useful to provide more control over how files are written.
pub fn into_serialized_writer(
Contributor

I spent quite some time trying to figure out why this API is needed. Specifically: why do we need an ArrowWriter at all, and why not use SerializedFileWriter and get_column_writers directly, as shown in this example?

After study I concluded the reason we need to expose ArrowRowGroupWriterFactory is that ArrowRowGroupWriterFactory::create_column_writers also has the appropriate encryption properties.

It is unfortunate that we'll now have two different sets of APIs for creating column writers -- via get_column_writers AND ArrowRowGroupWriterFactory

Member Author

Agreed.

Contributor

I filed a ticket with a suggestion for a unified API:

Ok(())
}

/// Converts this writer into a lower-level [`SerializedFileWriter`] and [`ArrowRowGroupWriterFactory`].
Contributor

I spent quite a while trying to figure out why we need ArrowRowGroupWriterFactory rather than just using get_column_writers as in the example, and I (finally) realized the reason: the encryption configuration isn't passed to get_column_writers. ArrowRowGroupWriterFactory does have the encryption details and thus can make the correct ArrowColumnWriters.

I think it is a somewhat strange API to create an ArrowWriter only to immediately destructure it into a SerializedFileWriter / the underlying writer. It is also unfortunate that we now have two different APIs for writing row groups in parallel, depending on encryption.

I have an idea to make the APIs better as a follow on.

Member Author

Agreed it's odd. Motivation was to introduce as few new pubs as possible. Would be very curious about alternative API shapes.

Contributor

#8162 (review) is my suggestion. The TLDR is to make the ArrowRowGroupWriterFactory constructor public and deprecate get_column_writers.

Contributor

I spent quite a while trying to figure out why we can't just use get_column_writers.

Sorry, I realise now that I didn't make this very clear in my original issue (#7359).

One other factor is that we couldn't just use the WriterProperties passed to get_column_writers to internally create a new FileEncryptor. When a FileEncryptor is created for the SerializedFileWriter, it generates a random AAD (additional authenticated data), and this AAD has to be the same for all encrypted modules in the file.
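
To make the constraint concrete, here is a small std-only toy model (a sketch, not the parquet implementation). All types and functions in it (`ToyFileEncryptor`, `ToyArrowWriter`, `ToyFileWriter`, `ToyRowGroupWriterFactory`, `ToyColumnWriter`, `into_parts`, `column_writers_from_properties`, `run_example`) are hypothetical, and an atomic counter stands in for the random AAD so the result is deterministic. It shows why column writers have to be handed the file's existing encryptor rather than building a fresh one from properties.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Stand-in for a random source: every new encryptor gets a distinct value.
static NEXT_AAD: AtomicU64 = AtomicU64::new(1);

/// Toy file-level encryptor; constructing one picks a new file-unique AAD.
struct ToyFileEncryptor {
    file_aad: u64,
}

impl ToyFileEncryptor {
    fn new() -> Self {
        ToyFileEncryptor { file_aad: NEXT_AAD.fetch_add(1, Ordering::Relaxed) }
    }
}

/// Toy column writer that remembers which encryptor it encrypts with.
struct ToyColumnWriter {
    encryptor: Arc<ToyFileEncryptor>,
}

/// Toy low-level file writer; it holds the file's encryptor.
struct ToyFileWriter {
    encryptor: Arc<ToyFileEncryptor>,
}

/// Toy factory that hands out column writers bound to the file's encryptor.
struct ToyRowGroupWriterFactory {
    encryptor: Arc<ToyFileEncryptor>,
}

impl ToyRowGroupWriterFactory {
    fn create_column_writers(&self, num_columns: usize) -> Vec<ToyColumnWriter> {
        (0..num_columns)
            .map(|_| ToyColumnWriter { encryptor: Arc::clone(&self.encryptor) })
            .collect()
    }
}

/// Toy high-level writer that owns the single encryptor for the file.
struct ToyArrowWriter {
    encryptor: Arc<ToyFileEncryptor>,
}

impl ToyArrowWriter {
    fn new() -> Self {
        ToyArrowWriter { encryptor: Arc::new(ToyFileEncryptor::new()) }
    }

    /// Consume the writer into low-level parts that share one encryptor.
    fn into_parts(self) -> (ToyFileWriter, ToyRowGroupWriterFactory) {
        let file_writer = ToyFileWriter { encryptor: Arc::clone(&self.encryptor) };
        let factory = ToyRowGroupWriterFactory { encryptor: self.encryptor };
        (file_writer, factory)
    }
}

/// The problematic alternative: build writers from "properties" alone,
/// which forces a brand-new encryptor (and thus a different AAD).
fn column_writers_from_properties(num_columns: usize) -> Vec<ToyColumnWriter> {
    let encryptor = Arc::new(ToyFileEncryptor::new());
    (0..num_columns)
        .map(|_| ToyColumnWriter { encryptor: Arc::clone(&encryptor) })
        .collect()
}

/// Returns (factory writers match the file AAD, detached writers match it).
pub fn run_example() -> (bool, bool) {
    let (file_writer, factory) = ToyArrowWriter::new().into_parts();
    let file_aad = file_writer.encryptor.file_aad;
    let shared = factory.create_column_writers(3);
    let detached = column_writers_from_properties(3);
    (
        shared.iter().all(|w| w.encryptor.file_aad == file_aad),
        detached.iter().all(|w| w.encryptor.file_aad == file_aad),
    )
}
```

In this model only the writers obtained through the factory agree with the file writer on the AAD, which mirrors the reason given above for routing column-writer creation through something that already holds the file's encryptor.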

Contributor

I filed a ticket with a suggestion for a unified API:

}

/// Create column writers for a new row group.
pub fn create_column_writers(&self, row_group_index: usize) -> Result<Vec<ArrowColumnWriter>> {
Contributor

I see now that this is the key API -- create column writers with the relevant encryption properties, if relevant

Contributor

What do you think about making create_row_group_writer pub, removing this function, and using the into_writers function added in issue #8260?

Contributor

So I am not sure making ArrowRowGroupWriter public gets us much of anything, and it would not allow per-column parallel encoding

One benefit of getting the column writers individually is that the columns can then be encoded in parallel. The ArrowRowGroupWriter can only write RowGroups in parallel.

I looked at ArrowRowGroupWriter a bit more, and the only substantial thing it does is call a loop with compute_leaves which is already public.

struct ArrowRowGroupWriter {
    writers: Vec<ArrowColumnWriter>,
    schema: SchemaRef,
    buffered_rows: usize,
}
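
As an illustration of the difference in granularity, the sketch below encodes every column of every row group on its own thread and then collects the chunks in a fixed order. It is a std-only toy, assuming a hypothetical `ToyColumnWriter` that simply serializes `u32` values as little-endian bytes; it does not use or mirror the signatures of ArrowColumnWriter or compute_leaves.

```rust
use std::thread;

/// Hypothetical stand-in for a column writer: "encodes" u32 values to bytes.
struct ToyColumnWriter {
    buf: Vec<u8>,
}

impl ToyColumnWriter {
    fn new() -> Self {
        ToyColumnWriter { buf: Vec::new() }
    }

    fn write(&mut self, values: &[u32]) {
        for v in values {
            self.buf.extend_from_slice(&v.to_le_bytes());
        }
    }

    /// Finish the column and hand back the encoded chunk.
    fn close(self) -> Vec<u8> {
        self.buf
    }
}

/// Encode each (row group, column) pair on its own thread.
/// Input and output are indexed as [row_group][column].
pub fn encode_parallel(row_groups: &[Vec<Vec<u32>>]) -> Vec<Vec<Vec<u8>>> {
    thread::scope(|s| {
        // One independent writer per column means one task per column.
        let handles: Vec<Vec<_>> = row_groups
            .iter()
            .map(|columns| {
                columns
                    .iter()
                    .map(|values| {
                        s.spawn(move || {
                            let mut writer = ToyColumnWriter::new();
                            writer.write(values);
                            writer.close()
                        })
                    })
                    .collect()
            })
            .collect();

        // Join in spawn order so chunks land in (row group, column) order,
        // regardless of which thread finished first.
        handles
            .into_iter()
            .map(|cols| {
                cols.into_iter()
                    .map(|h| h.join().expect("encoder thread panicked"))
                    .collect()
            })
            .collect()
    })
}
```

If only a whole-row-group writer were exposed, the unit of work would be the outer loop; handing out individual column writers is what allows the inner loop to be spread across threads as well.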

Contributor

@alamb alamb left a comment

I removed the failing test (it was added in #8262, whose APIs we also removed in this PR)

assert_eq!(to_write, read);
}

#[tokio::test]
Contributor

Since we removed those APIs in this PR, we should also remove the test

Member Author

Agreed.

@rok
Member Author

rok commented Sep 18, 2025

Sorry for the late response @alamb and thanks for the added changes. I think this is good as is and can be included in the release.

@alamb
Contributor

alamb commented Sep 18, 2025

Sorry for the late response @alamb and thanks for the added changes. I think this is good as is and can be included in the release.

This was my fault for not reviewing this PR before in more detail.

I'll leave this open until tomorrow to give people a chance to respond, and if there are no objections I'll merge it in and make the RC

@adamreeve
Contributor

I am sorry I haven't found time to study this PR in detail. @adamreeve it sounds like you have some good ideas and have devoted time here. What would you recommend?

I've read through the latest comments and it sounds like you have a good understanding of the motivation for this change now, thanks for taking a close look Andrew.

I agree it would be good to consolidate on using ArrowRowGroupWriterFactory for concurrent writing and update the examples. And also agree that creating an ArrowWriter just to convert it straight away to an ArrowRowGroupWriterFactory is a bit awkward. It does conveniently also create a SerializedFileWriter using the Arrow schema, but that's not too complicated to do explicitly, and the existing concurrent writing example already does this.

If you'd rather not have the new into_serialized_writer method and require creating the ArrowRowGroupWriterFactory directly, maybe we should take some more time to implement your suggested approach and wait for the next release. But I think into_serialized_writer is convenient and unlikely to cause much extra maintenance burden.

@alamb
Contributor

alamb commented Sep 19, 2025

If you'd rather not have the new into_serialized_writer method and require creating the ArrowRowGroupWriterFactory directly, maybe we should take some more time to implement your suggested approach and wait for the next release. But I think into_serialized_writer is convenient and unlikely to cause much extra maintenance burden.

Yeah I agree -- one method is not a big deal if we are going to go with the ArrowRowGroupWriterFactory

I'll merge this PR and file some follow on tickets shortly

Contributor

@alamb alamb left a comment

I think this one is ready to go so I am going to merge it in

I filed a ticket to track simplifying / unifying the API here

@alamb alamb merged commit 322745d into apache:main Sep 19, 2025
17 checks passed
@alamb
Contributor

alamb commented Sep 19, 2025

Thank you @rok @lilianm and @adamreeve

alamb added a commit that referenced this pull request Sep 19, 2025
# Which issue does this PR close?

- related to #8162 


# Rationale for this change

- While reviewing #8162 I read a
bunch more of the parquet code and I wanted to capture some of my
understanding in comments.

# What changes are included in this PR?

Add more documentation to various parquet writing APIs

# Are these changes tested?

By CI

# Are there any user-facing changes?

Documentation only, no function changes

---------

Co-authored-by: Ed Seidl
Development

Successfully merging this pull request may close these issues.

[Parquet] Expose ArrowRowGroupWriter
[Parquet] Concurrent writes with ArrowWriter.get_column_writers should parallelize across row groups
6 participants