Handle NPE on null vector columns #13938

nyxtom · 2025-08-27T22:36:49Z

Currently, a NullPointerException is thrown when a column's vector is null:

java.lang.NullPointerException: Cannot invoke "Object.getClass()" because "vector" is null
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getPlainVectorAccessor(GenericArrowVectorAccessorFactory.java:224)
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getVectorAccessor(GenericArrowVectorAccessorFactory.java:110)
        at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessors.getVectorAccessor(ArrowVectorAccessors.java:54)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.getVectorAccessor(ColumnVector.java:136)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.<init>(ColumnVector.java:56)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:54)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
        at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:145)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:314)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:190)

This only happens when I use vectorized stream reading. This change might fix the issue.
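
For illustration, the failure mode in the trace above and one possible guard can be modelled without Arrow or Iceberg on the classpath. This is a minimal sketch: `AccessorSketch`, `Accessor`, `unguarded`, and `guarded` are hypothetical names, a plain `List` stands in for an Arrow vector, and the actual change in this PR may look different.

```java
import java.util.List;

final class AccessorSketch {
  /** Hypothetical stand-in for a per-column accessor. */
  interface Accessor {
    boolean isNullAt(int rowId);
    Object get(int rowId);
  }

  /** Accessor for a column with no backing vector: every row reads as null. */
  static final Accessor ALL_NULL = new Accessor() {
    public boolean isNullAt(int rowId) { return true; }
    public Object get(int rowId) { return null; }
  };

  /** Unguarded lookup: dispatching on the vector's class throws for a null vector. */
  static Accessor unguarded(List<?> vector) {
    vector.getClass(); // NullPointerException when vector == null
    return listAccessor(vector);
  }

  /** Guarded lookup: a null vector maps to the all-null accessor instead of throwing. */
  static Accessor guarded(List<?> vector) {
    return vector == null ? ALL_NULL : listAccessor(vector);
  }

  private static Accessor listAccessor(List<?> vector) {
    return new Accessor() {
      public boolean isNullAt(int rowId) { return vector.get(rowId) == null; }
      public Object get(int rowId) { return vector.get(rowId); }
    };
  }
}
```

With this shape, `AccessorSketch.guarded(null).isNullAt(0)` reports a null row, while `AccessorSketch.unguarded(null)` fails the same way the trace does.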

RussellSpitzer · 2025-08-27T22:57:04Z

Could you elaborate more on where this is occurring or is this just a library issue?

nyxtom · 2025-08-27T23:01:58Z

This is related to #10275.

Unless there's a different or better way to do this, here is what I'm doing:

try (CloseableIterable<org.apache.iceberg.CombinedScanTask> tasks = scan.planTasks();
     ArrowReader arrowReader = new ArrowReader(scan, arrowBatchRows, true);
     CloseableIterator<ColumnarBatch> batches = arrowReader.open(tasks)) {

    // The writer is created lazily, once the first batch supplies the vectors
    ArrowStreamWriter writer = null;
    boolean writerStarted = false;

    while (batches.hasNext() && !reachedLimit(totalRows, limit)) {
        ColumnarBatch batch = batches.next();

        if (batch == null) {
            throw new IllegalStateException("Batch is null - ArrowReader returned invalid data");
        }

        // Initialize writer and schema on first batch
        if (!writerStarted) {
            root = batch.createVectorSchemaRootFromVectors();
            if (root == null) {
                throw new IllegalStateException("Failed to create VectorSchemaRoot from batch");
            }

            columnCount = root.getSchema().getFields().size();
            logger.debug("VectorSchemaRoot created with {} columns", columnCount);

            writer = new ArrowStreamWriter(root, null, os);
            writer.start();
            writerStarted = true;
            ttfbMs = (System.nanoTime() - t0) / 1_000_000;
        }

        // Write batch, truncated to the remaining row budget
        int rows = limitRows(batch.numRows(), totalRows, limit);
        root.setRowCount(rows);
        writer.writeBatch();

        totalRows += rows;
        batchIdx++;
    }

    if (writerStarted) {
        writer.end();
    } else {
        // empty result: write schema-only stream
        Schema arrowSchema = ArrowSchemaUtil.convert(scan.table().schema());
        columnCount = arrowSchema.getFields().size();

        tmpAllocator = new RootAllocator(Long.MAX_VALUE);
        root = VectorSchemaRoot.create(arrowSchema, tmpAllocator);

        // Reuse the variable declared above; redeclaring it here would not compile
        writer = new ArrowStreamWriter(root, null, os);
        writer.start();
        writer.end();
    }
}
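
The snippet above calls `reachedLimit` and `limitRows` without showing them. A plausible sketch of the two helpers follows. The signatures are inferred from the call sites, and the convention that a negative `limit` means "no limit" is an assumption, not something stated in this thread.

```java
final class RowLimits {
  /** True once the row budget is used up. Assumption: a negative limit means unlimited. */
  static boolean reachedLimit(long totalRows, long limit) {
    return limit >= 0 && totalRows >= limit;
  }

  /** Number of rows of the current batch that still fit within the limit. */
  static int limitRows(int batchRows, long totalRows, long limit) {
    if (limit < 0) {
      return batchRows; // unlimited: write the whole batch
    }
    long remaining = Math.max(0L, limit - totalRows);
    return (int) Math.min((long) batchRows, remaining);
  }
}
```

For example, with a limit of 1000 and 950 rows already written, a 100-row batch is truncated to 50 rows, after which `reachedLimit` ends the loop.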

nyxtom · 2025-08-28T17:43:40Z

@RussellSpitzer any updates on what you think? I'm trying to get this landed, as it's currently blocking some internal tooling, which can't function without excluding the columns that hit this behavior.

RussellSpitzer · 2025-08-28T18:25:05Z

Ah, I think the issue is that the library code assumes the Parquet reader already has a projection that selects only the columns that need to be read before the file is opened. We have to do this anyway, because we map the names in the schema to the names in the file based on field IDs.

iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java

Lines 171 to 174 in cf74b65

    
           Types.NestedField icebergField = icebergSchema.findField(parquetFieldId); 
        
           if (icebergField == null) { 
        
             return null; 
        
           }

It seems like we aren't doing a similar thing with the Arrow reader?

Does that sound right? I'm still trying to figure this out, but I think ideally we wouldn't try to read null vectors at all, and would handle this at a higher level.
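
To make the field-ID projection idea concrete, here is a small self-contained sketch. It does not use the Iceberg or Parquet APIs: `FieldIdProjection` and `project` are hypothetical, and plain maps from field ID to column name stand in for the file schema and the requested schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldIdProjection {
  /**
   * fileColumns: field ID -> column name as stored in the file.
   * expected:    field ID -> column name in the requested schema.
   * Returns requested name -> file name for columns present in both, in file order.
   */
  static Map<String, String> project(
      Map<Integer, String> fileColumns, Map<Integer, String> expected) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<Integer, String> column : fileColumns.entrySet()) {
      String expectedName = expected.get(column.getKey());
      if (expectedName == null) {
        continue; // not requested: no reader, and so no vector, is created for it
      }
      result.put(expectedName, column.getValue());
    }
    return result;
  }
}
```

Matching on IDs rather than names is what lets a renamed column resolve correctly, and a column that was never requested simply has no entry instead of showing up later as a null vector.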

nyxtom · 2025-08-28T18:36:52Z

That sounds correct

shangxinli · 2025-09-26T14:19:37Z

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java

+      return new NullAccessor(t);
+    }
+
+    // Primitive typed fast-paths return boxed nulls; callers should check nullability separately.


It seems getDecimal() and other accessor methods are missing.
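
The concern can be sketched as follows. It assumes a base accessor whose typed getters throw unless overridden, so any getter the null accessor forgets still fails at read time. The classes below are hypothetical and use boxed return types for simplicity; they are not the classes from the diff.

```java
import java.math.BigDecimal;

final class NullAccessorSketch {
  /** Hypothetical base: each typed getter throws unless a subclass overrides it. */
  static class BaseAccessor {
    boolean isNullAt(int rowId) { return false; }
    Long getLong(int rowId) { throw new UnsupportedOperationException("getLong"); }
    Double getDouble(int rowId) { throw new UnsupportedOperationException("getDouble"); }
    String getString(int rowId) { throw new UnsupportedOperationException("getString"); }
    BigDecimal getDecimal(int rowId, int precision, int scale) {
      throw new UnsupportedOperationException("getDecimal");
    }
  }

  /** All-null column: every getter is overridden so none falls through to the base. */
  static final class NullAccessor extends BaseAccessor {
    @Override boolean isNullAt(int rowId) { return true; }
    @Override Long getLong(int rowId) { return null; }
    @Override Double getDouble(int rowId) { return null; }
    @Override String getString(int rowId) { return null; }
    @Override BigDecimal getDecimal(int rowId, int precision, int scale) { return null; }
  }
}
```

If `getDecimal` were left out of `NullAccessor`, reading a decimal column backed by a null vector would throw instead of returning null, which is the gap the review comment points at.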

github-actions · 2025-10-27T00:19:17Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Handle NPE on null vector columns

81f876d

github-actions bot added the arrow label Aug 27, 2025

shangxinli reviewed Sep 26, 2025

github-actions bot added the stale label Oct 27, 2025

3 participants