Conversation

@RussellSpitzer
Member

While I was working on V4 Parquet manifests, I found a bunch of test classes that do not properly parameterize formatVersion and therefore never run on V4. I've fixed all the ones I could find.

List<Object> parameters = Lists.newArrayList();
for (Boolean isStreamingMode : new Boolean[] {true, false}) {
-  for (int formatVersion : new int[] {1, 2}) {
+  for (int formatVersion : TestHelpers.ALL_VERSIONS) {
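As a standalone sketch of the corrected pattern, a cross-product parameter generator might look like the following (ALL_VERSIONS here is a made-up stand-in for Iceberg's TestHelpers.ALL_VERSIONS, assumed to list the supported format versions):

```java
import java.util.ArrayList;
import java.util.List;

public class ParameterSketch {
  // Hypothetical stand-in for TestHelpers.ALL_VERSIONS; the real constant
  // lives in Iceberg's test helpers and tracks the supported format versions.
  private static final int[] ALL_VERSIONS = {1, 2, 3, 4};

  public static List<Object[]> parameters() {
    List<Object[]> parameters = new ArrayList<>();
    for (boolean isStreamingMode : new boolean[] {true, false}) {
      for (int formatVersion : ALL_VERSIONS) {
        parameters.add(new Object[] {isStreamingMode, formatVersion});
      }
    }
    return parameters;
  }

  public static void main(String[] args) {
    // 2 streaming modes x 4 format versions = 8 parameter combinations
    System.out.println(parameters().size());
  }
}
```

Looping over a shared constant instead of a literal {1, 2} means new format versions are picked up automatically when the constant is updated, which is the point of the fix in this PR.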
Member Author

Someone who is a Flink expert, maybe @stevenzwu or @pvary, let me know whether it was intentional that this test does not check all versions.

Contributor

Here it is not intentional.

new FileFormat[] {FileFormat.AVRO, FileFormat.ORC, FileFormat.PARQUET}) {
for (Object[] catalogParams : CatalogTestBase.parameters()) {
-  for (int version : Arrays.asList(2, 3)) {
+  for (int version : TestHelpers.V2_AND_ABOVE) {
Member Author

Same here

Contributor

Here we have tests like testRewriteNoConflictWithEqualityDeletes which will not work with V1.
I think @nastra intentionally only added V2 and V3; maybe adding V4 would be nice.

List<Object> parameters = Lists.newArrayList();
for (Boolean isStreamingMode : new Boolean[] {true, false}) {
-  for (int formatVersion : new int[] {1, 2}) {
+  for (int formatVersion : TestHelpers.ALL_VERSIONS) {
Member Author

and here

new Object[] {FileFormat.PARQUET, 3, false, PlanningMode.DISTRIBUTED},
new Object[] {FileFormat.PARQUET, 3, true, PlanningMode.LOCAL},
};
List<Object[]> parameters = Lists.newArrayList();
Member Author

We had a really odd pattern for what we tested here before; I mimicked it below, only testing V4 with Parquet.

Contributor

@stevenzwu left a comment

this looks good to me after CI failure is fixed. Thanks for doing this. With these examples, hopefully new tests will be able to follow the same pattern.

It is hard to be 100% accurate because they are just plain numbers (not enum).

@RussellSpitzer
Member Author

this looks good to me after CI failure is fixed. Thanks for doing this. With these examples, hopefully new tests will be able to follow the same pattern.

It is hard to be 100% accurate because they are just plain numbers (not enum).

Yep, let me keep fixing CI issues; there are so many configs that I've been leaning on the automated CI to find any remaining bugs.

parameters.add(new Object[] {FileFormat.ORC, false, version});
parameters.add(new Object[] {FileFormat.ORC, true, version});
}
return parameters.toArray(new Object[0][]);
Contributor

Why are we only updating v3.4? What about v4.0 and v3.5?

Member Author

Ah, I missed that. Sorry, I've been going back and forth between build configs!

@RussellSpitzer
Member Author

I couldn't figure out exactly what the memory leak in our test suite is, but it seems to be related to the task statuses never getting cleared from the Spark context during the TestRewriteDataFilesAction test suite. Because the suite now runs with an additional config, the number of tasks increased dramatically, and I believe this was the root cause of the OOM.

I tried disabling the UI, but that didn't seem to help in any way; the statuses still stuck around.

So I decided to take a different tack and just optimize the test suite instead. The main thing I did was go through all of the "Spark sorts" and switch them to plain Java collection sorts. This has two outcomes: first, the test suite runs much faster, since adaptive shuffle was disabled for this suite, each sort required 200 tasks, and a local sort is much faster than the Spark mechanism; second, the number of tasks is reduced dramatically, which decreases the number of "task status" objects that hang around.

If this ends up still being an issue in the future, we can either track down the status issue or move some of these tests into a different test suite.
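A minimal sketch of the kind of substitution described above: sorting an already-collected result locally instead of asking Spark to sort it before collecting (the rows and their shape here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LocalSortSketch {
  public static void main(String[] args) {
    // Stand-in for rows collected from Spark via collectAsList();
    // each row is {id, value}.
    List<long[]> rows = new ArrayList<>();
    rows.add(new long[] {3, 30});
    rows.add(new long[] {1, 10});
    rows.add(new long[] {2, 20});

    // Instead of something like df.sort("id").collectAsList(), which would
    // schedule a Spark shuffle (200 tasks with adaptive shuffle disabled),
    // sort the collected rows locally in the test JVM.
    rows.sort(Comparator.comparingLong(r -> r[0]));

    StringBuilder ids = new StringBuilder();
    for (long[] row : rows) {
      ids.append(row[0]).append(' ');
    }
    System.out.println(ids.toString().trim());
  }
}
```

For test-sized data this is both faster and creates no extra Spark tasks, which is why it also reduces the number of retained task-status objects.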

@RussellSpitzer
Member Author

OK, apparently that wasn't enough. It's fixed locally, but on Gradle I'm still getting an "Exited with return code 52" error from the workflow.

@RussellSpitzer
Member Author

OK, the issue is with direct memory... so now I have to track down memory usage in our Arrow readers.

@RussellSpitzer
Member Author

OK, so compaction is really leaking direct memory (at least in our tests).

[screenshot: direct memory usage graph]

Spark 3.4 - Testing

TestRewriteDataFilesAction starts at 5:06
-- testBinPackCombineMixedFiles at 5:11

TestRewritePositionDeleteFilesAction at 5:15
OOM (direct memory)

@RussellSpitzer
Member Author

RussellSpitzer commented Aug 25, 2025

My current understanding is that at least within our compaction code (and the rewrite-delete-files action code) we are leaking memory. The biggest contributor is TestCombineMixedFiles (probably because of the amount of data?). When I added additional test versions, the leak just got worse.

When I removed the Spark shuffles I reduced the leakage somewhat, but that just pushed out the OOM, since the memory is never released.

My current targets are:
- A memory leak during the Spark shuffle?
- A memory leak in our vectorized read code?

My guess is that these are the culprits, because we see this happen a ton during compaction but not really anywhere else in the test suite.

Tomorrow I'm going to disable our vectorized read path; if that fixes the issue, I'll have narrowed it down a bit more.

@RussellSpitzer
Member Author

With vectorized reads disabled (there's still a tiny leak, but if you look at the scale, it's in the range of 0 to 13 MB):

[screenshot: direct memory usage graph with vectorized reads disabled]

So we definitely have some issue with the vectorized read path.

@RussellSpitzer
Member Author

Test for the memory leak; I'm still working on nailing down where this is happening, but it's unrelated to the test parameterization.

diff --git a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestParquetVectorizedScan.java b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestParquetVectorizedScan.java
index a6b5166b3..316ef762e 100644
--- a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestParquetVectorizedScan.java
+++ b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestParquetVectorizedScan.java
@@ -18,9 +18,163 @@
  */
 package org.apache.iceberg.spark.source;
 
+import static org.apache.iceberg.Files.localOutput;
+import static org.assertj.core.api.Assertions.assertThat;
+
+import java.io.File;
+import java.io.IOException;
+import java.lang.management.BufferPoolMXBean;
+import java.lang.management.ManagementFactory;
+import java.nio.file.Path;
+import java.util.List;
+import java.util.UUID;
+import org.apache.avro.generic.GenericData;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.DataFiles;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.io.FileAppender;
+import org.apache.iceberg.parquet.Parquet;
+import org.apache.iceberg.spark.data.RandomData;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
 public class TestParquetVectorizedScan extends TestParquetScan {
+  
+  private static final Configuration CONF = new Configuration();
+  
+  @TempDir private Path temp;
+
   @Override
   protected boolean vectorized() {
     return true;
   }
+
+  /**
+   * Test to verify that direct memory used during vectorized parquet reading is properly released.
+   * This creates a large (128MB) parquet file, reads it using the vectorized reader, collects all
+   * results, and verifies that direct memory is released after the read operation completes.
+   */
+  @Test
+  public void testDirectMemoryReleaseAfterLargeVectorizedRead() throws IOException {
+    // Create a schema with enough columns to generate significant data
+    org.apache.iceberg.Schema schema =
+        new org.apache.iceberg.Schema(
+            Types.NestedField.required(1, "id", Types.LongType.get()),
+            Types.NestedField.required(2, "data1", Types.StringType.get()),
+            Types.NestedField.required(3, "data2", Types.StringType.get()),
+            Types.NestedField.required(4, "data3", Types.StringType.get()),
+            Types.NestedField.required(5, "data4", Types.StringType.get()),
+            Types.NestedField.required(6, "data5", Types.StringType.get()),
+            Types.NestedField.required(7, "number1", Types.DoubleType.get()),
+            Types.NestedField.required(8, "number2", Types.DoubleType.get()));
+
+    File location = temp.resolve("memory_leak_test").toFile();
+
+    HadoopTables tables = new HadoopTables(CONF);
+    Table table = tables.create(schema, PartitionSpec.unpartitioned(), location.toString());
+    configureTable(table);
+
+    List<GenericData.Record> records = RandomData.generateList(schema, 1000000, 42L);
+
+    // Write the large parquet file
+    File dataFolder = new File(table.location(), "data");
+    File parquetFile = new File(dataFolder, FileFormat.PARQUET.addExtension(UUID.randomUUID().toString()));
+
+    try (FileAppender<GenericData.Record> writer =
+        Parquet.write(localOutput(parquetFile)).schema(schema).build()) {
+      writer.addAll(records);
+    }
+
+    // Verify the file is actually large enough (~128MB)
+    long fileSizeBytes = parquetFile.length();
+    assertThat(fileSizeBytes)
+        .as("Generated file should be at least 50MB")
+        .isGreaterThan(50L * 1024 * 1024);
+
+    DataFile file =
+        DataFiles.builder(PartitionSpec.unpartitioned())
+            .withFileSizeInBytes(fileSizeBytes)
+            .withPath(parquetFile.toString())
+            .withRecordCount(records.size())
+            .build();
+
+    table.newAppend().appendFile(file).commit();
+
+    // Get direct memory usage before reading
+    long directMemoryBefore = getDirectMemoryUsed();
+
+    // Read the file using vectorized parquet reader and collect all results
+    Dataset<Row> df = spark.read().format("iceberg").load(table.location());
+    List<Row> rows = df.collectAsList();
+
+    // Get direct memory usage after reading but before cleanup
+    long directMemoryAfterRead = getDirectMemoryUsed();
+
+    // Verify we read the expected number of rows
+    assertThat(rows).as("Should contain all records").hasSize(records.size());
+
+    // Clear the collected data to release references
+    rows = null;
+    df = null;
+
+    // Force garbage collection to ensure any memory that should be released is released
+    System.gc();
+    System.gc();
+
+    // Wait a bit for GC to complete
+    try {
+      Thread.sleep(1000);
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+    }
+
+    // Get direct memory usage after cleanup
+    long directMemoryAfterCleanup = getDirectMemoryUsed();
+
+    // Calculate memory increases
+    long memoryIncreaseFromRead = directMemoryAfterRead - directMemoryBefore;
+    long memoryLeakAfterCleanup = directMemoryAfterCleanup - directMemoryBefore;
+
+    // Log memory usage for debugging
+    System.out.printf(
+        "Direct memory usage - Before: %d bytes, After read: %d bytes, After cleanup: %d bytes%n",
+        directMemoryBefore, directMemoryAfterRead, directMemoryAfterCleanup);
+    System.out.printf(
+        "Memory increase from read: %d bytes, Potential leak: %d bytes%n",
+        memoryIncreaseFromRead, memoryLeakAfterCleanup);
+
+    // We expect some memory to be used during reading (this verifies the test is actually testing something meaningful)
+    assertThat(memoryIncreaseFromRead)
+        .as("Reading a large file should use some direct memory")
+        .isGreaterThan(0);
+
+    // The key assertion: after cleanup, direct memory usage should return close to the initial level
+    // We allow for some small variance (1MB) due to JVM internals and other concurrent operations
+    long allowableMemoryVariance = 1024 * 1024; // 1MB
+    assertThat(memoryLeakAfterCleanup)
+        .as("Direct memory should be released after reading (potential memory leak detected)")
+        .isLessThanOrEqualTo(allowableMemoryVariance);
+  }
+
+  /**
+   * Gets the current direct memory usage by summing up all direct buffer pools.
+   */
+  private long getDirectMemoryUsed() {
+    List<BufferPoolMXBean> bufferPoolMXBeans =
+        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
+    long directMemory = 0;
+    for (BufferPoolMXBean bufferPoolMXBean : bufferPoolMXBeans) {
+      if (bufferPoolMXBean.getName().equals("direct")) {
+        directMemory += bufferPoolMXBean.getMemoryUsed();
+      }
+    }
+    return directMemory;
+  }
 }
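The getDirectMemoryUsed helper above reads the JDK's "direct" buffer pool via BufferPoolMXBean. As a standalone sketch, the same measurement around a plain direct allocation looks like this (the 8 MB size is arbitrary):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryProbe {
  // Sum the "direct" buffer pool, mirroring the helper in the test above.
  static long directMemoryUsed() {
    List<BufferPoolMXBean> pools =
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    long used = 0;
    for (BufferPoolMXBean pool : pools) {
      if ("direct".equals(pool.getName())) {
        used += pool.getMemoryUsed();
      }
    }
    return used;
  }

  public static void main(String[] args) {
    long before = directMemoryUsed();
    ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8 MB
    buf.put(0, (byte) 1); // touch the buffer so it is clearly in use
    long after = directMemoryUsed();
    // The pool reflects the allocation immediately; it only shrinks again
    // once the buffer is garbage collected, which is why the test above has
    // to force GC before checking for a leak.
    System.out.println(after - before >= 8 * 1024 * 1024);
  }
}
```

Note that this pool does not track Arrow's own allocator accounting, which is why the later comments switch to ArrowAllocation.rootAllocator().getAllocatedMemory() to localize the leak.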

@RussellSpitzer
Member Author

RussellSpitzer commented Aug 26, 2025

@TestTemplate
public void testReadRepeated() {
  Table table = createTable(1); // 400000
  shouldHaveFiles(table, 1);

  // Add one more small file, and one large file
  writeRecords(1, SCALE * 3);
  int i = 0;

  while (i < 100) {
    System.out.println(currentData().size());
    System.out.println("Arrow Allocations : " + ArrowAllocation.rootAllocator().getAllocatedMemory());
    i++;
  }
}

Less complicated repro: this test will OOM after ~40 iterations (default off-heap is ~4GB).

@RussellSpitzer
Copy link
Member Author

RussellSpitzer commented Aug 26, 2025

OK, doing some more investigating: we are always allocating one more VectorizedReaderBuilder than we use. I.e., if you have 1 Spark task we make 2 readers; if you have 110 tasks we make 111 readers. The final reader is never closed... trying to track down who is making it now.

I was wrong; it's more that the last column in the last reader of the last task allocates more memory than it frees.

@RussellSpitzer
Copy link
Member Author

RussellSpitzer commented Aug 26, 2025

I figured it out:

In VectorizedArrowReader

  private void allocateFieldVector(boolean dictionaryEncodedVector) {
    if (dictionaryEncodedVector) {
      allocateDictEncodedVector();
    } else {
      Field arrowField = ArrowSchemaUtil.convert(getPhysicalType(columnDescriptor, icebergField));
      if (columnDescriptor.getPrimitiveType().getOriginalType() != null) {
        allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
      } else {
        allocateVectorBasedOnTypeName(columnDescriptor.getPrimitiveType(), arrowField);
      }
    }
  }

This makes the assumption that all pages will have the same encoding, which is a big problem if the first page is dictionary-encoded and the following ones are not. The first pass through this function will call

allocateDictEncodedVector()

Which does this

    this.vec = field.createVector(rootAlloc);
    ((IntVector) vec).allocateNew(batchSize);

But what happens if we then read a non-dictionary-encoded page? We then go down the other path, allocateVectorBasedOnOriginalType, and hit this:

     switch (primitive.getOriginalType()) {
      case ENUM:
      case JSON:
      case UTF8:
      case BSON:
        this.vec = arrowField.createVector(rootAlloc);
        // TODO: Possibly use the uncompressed page size info to set the initial capacity
        vec.setInitialCapacity(batchSize * AVERAGE_VARIABLE_WIDTH_RECORD_SIZE);
        vec.allocateNewSafe();
        this.readType = ReadType.VARCHAR;
        this.typeWidth = UNKNOWN_WIDTH;
        break;

This creates a new vector for this.vec, causing us to lose our first vector.

This is easy enough to fix: in both of these functions we just need to clear out this.vec if it is already set.

So in the case above, dictionary-encoded pages allocate IntVectors, which are then replaced with BaseVarWidthVectors for non-encoded pages. This means we drop the previous vector and its allocation with each swap; the more pages, the worse it is.
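The swap-without-release pattern can be illustrated with a toy allocation counter. CountingVector here is a made-up stand-in for an Arrow FieldVector whose close() returns memory to the allocator; the real fix operates on this.vec in VectorizedArrowReader:

```java
import java.util.concurrent.atomic.AtomicLong;

public class VectorSwapSketch {
  // Stand-in for the allocator's outstanding-bytes accounting.
  static final AtomicLong allocatedBytes = new AtomicLong();

  static class CountingVector implements AutoCloseable {
    final long size;
    CountingVector(long size) { this.size = size; allocatedBytes.addAndGet(size); }
    @Override public void close() { allocatedBytes.addAndGet(-size); }
  }

  CountingVector vec;

  // Buggy pattern: reassigning vec drops the old allocation without freeing it.
  void allocateLeaky(long size) {
    vec = new CountingVector(size);
  }

  // Fixed pattern: release the previous vector before allocating a replacement.
  void allocateSafely(long size) {
    if (vec != null) {
      vec.close();
    }
    vec = new CountingVector(size);
  }

  public static void main(String[] args) {
    VectorSwapSketch leaky = new VectorSwapSketch();
    for (int page = 0; page < 3; page++) leaky.allocateLeaky(1024); // alternating encodings
    long leaked = allocatedBytes.get(); // 3 allocations outstanding, only 1 reachable
    leaky.vec.close();
    allocatedBytes.set(0);

    VectorSwapSketch fixed = new VectorSwapSketch();
    for (int page = 0; page < 3; page++) fixed.allocateSafely(1024);
    long retained = allocatedBytes.get(); // only the live vector remains
    fixed.vec.close();

    System.out.println(leaked + " " + retained);
  }
}
```

With the buggy pattern the outstanding bytes grow with every encoding switch, which matches the observed behavior: the more pages, the worse the leak.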

@pvary
Contributor

pvary commented Aug 27, 2025

CC: @nandorKollar

@nandorKollar
Contributor

So in the case above we have dictionary-encoded pages which allocate IntVectors which are then replaced with BaseVarWidthVectors for non-encoded pages. This means we drop the previous vector and its allocation with each swap. The more pages the worse it is.

Why did the type of the vector change from IntVectors to BaseVarWidthVectors? If we clear out "this.vec" if it is set, wouldn't this type change in the vector cause problems? Shouldn't we explicitly close this.vec if it is not null, before setting it to a new vector?

@RussellSpitzer
Member Author

Why did the type of the vector change from IntVectors to BaseVarWidthVectors?

The vector changes because dictionary-encoded pages are a sequence of ints, {1, 2, 3, 4}, that refer to entries in the dictionary, which maps each int to the actual column value: {1: "foo", 2: "bar", ...}. Other pages have literal representations of the values stored as binary: {foo, bar, bazz}. So you have to switch vector types when you alternate.

If we clear out "this.vec" if it is set, wouldn't this type change in the vector cause problems? Shouldn't we explicitly close this.vec if it is not null, before setting it to a new vector?

No. To be clear, the code has always cleared out this.vec, and we don't have correctness issues, because essentially what is happening is:

  1. Reader looks to see if it can read the page
  2. If it can't re-use the container do an allocate for the correct container

What is missing here is
2.a If I previously had a container but it cannot be re-used, clear it
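To illustrate the encoding difference behind the container switch: a dictionary-encoded page holds int indices into a per-column dictionary (hence an IntVector), while a plain page holds the values themselves (hence a variable-width vector). The data here is invented:

```java
public class DictionaryPageSketch {
  public static void main(String[] args) {
    // Dictionary built from the column's distinct values.
    String[] dictionary = {"foo", "bar", "bazz"};

    // A dictionary-encoded page: ints referencing dictionary entries,
    // naturally read into an int-typed vector.
    int[] encodedPage = {0, 1, 0, 2};

    // Decoding yields the same values a plain (non-dictionary) page would
    // store literally, which needs a variable-width vector instead.
    StringBuilder decoded = new StringBuilder();
    for (int index : encodedPage) {
      decoded.append(dictionary[index]).append(' ');
    }
    System.out.println(decoded.toString().trim());
  }
}
```

Parquet writers can fall back from dictionary encoding mid-column (e.g. when the dictionary grows too large), which is exactly the alternation that forces the reader to swap vector types.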

@nandorKollar
Contributor

Why did the type of the vector change from IntVectors to BaseVarWidthVectors?

The vector changes because Dictionary encoded pages are a sequence of ints, {1, 2, 3, 4} that refer to entries in the Dictionary which maps the int to the actual column value. {1: "foo", 2: "bar", ....}. Other pages have literal representations of the values stored as binary {foo, bar, bazz }. So you have to switch vector types when you alternate.

If we clear out "this.vec" if it is set, wouldn't this type change in the vector cause problems? Shouldn't we explicitly close the this.vec if it is not null, before setting it to a new vector?

No. To be clear, the code has always cleared out this.vec and we don't have correctness issues because essentially what is happening is:

  1. Reader looks to see if it can read the page
  2. If it can't re-use the container do an allocate for the correct container

What is missing here is 2.a If I previously had a container but it cannot be re-used, clear it

Thanks for clarifying why the type change happens, makes sense. We can't reuse the vector only when there's a switch from/to dictionary-encoded pages, right? When you mention that it is always cleared, do you mean that the value count is set to 0 in this block:

    if (reuse == null
        || (!dictEncoded && readType == ReadType.DICTIONARY)
        || (dictEncoded && readType != ReadType.DICTIONARY)) {
      allocateFieldVector(dictEncoded);
      nullabilityHolder = new NullabilityHolder(batchSize);
    } else {
      vec.setValueCount(0);
      nullabilityHolder.reset();
    }

@RussellSpitzer
Member Author

Previously it would just completely drop the reference to the previous vec, so it didn't matter. (That's the leak.) I'll raise a new PR with a fix. Also, fun fact: we already have a test THAT WOULD FAIL if we were actually tracking memory usage. See TestParquetDictionaryEncodedVectorizedReads.testMixedDictionaryNonDictionaryReads. One sec and I'll have a new PR up to show this.

@RussellSpitzer
Member Author

Rebased now that the leak is gone; removed the test speedups, which are also being done in a separate PR, #13947.

Contributor

@amogh-jahagirdar left a comment

Thank you @RussellSpitzer

@RussellSpitzer merged commit 8bcfa61 into apache:main on Sep 2, 2025
42 checks passed
@RussellSpitzer
Member Author

Finally merged! Thanks to everyone who helped with this and the associated memory-leak issue.

Thanks to everyone:
@pvary
@amogh-jahagirdar
@ebyhr
@stevenzwu
@nastra
