@nyxtom nyxtom commented Aug 27, 2025

Currently a NullPointerException is thrown when the reader encounters a null vector:

java.lang.NullPointerException: Cannot invoke "Object.getClass()" because "vector" is null
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getPlainVectorAccessor(GenericArrowVectorAccessorFactory.java:224)
        at org.apache.iceberg.arrow.vectorized.GenericArrowVectorAccessorFactory.getVectorAccessor(GenericArrowVectorAccessorFactory.java:110)
        at org.apache.iceberg.arrow.vectorized.ArrowVectorAccessors.getVectorAccessor(ArrowVectorAccessors.java:54)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.getVectorAccessor(ColumnVector.java:136)
        at org.apache.iceberg.arrow.vectorized.ColumnVector.<init>(ColumnVector.java:56)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:54)
        at org.apache.iceberg.arrow.vectorized.ArrowBatchReader.read(ArrowBatchReader.java:29)
        at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:145)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:314)
        at org.apache.iceberg.arrow.vectorized.ArrowReader$VectorizedCombinedScanIterator.next(ArrowReader.java:190)

This only happens when I use vectorized stream reading. The change in this PR might fix the issue.
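
One way to read the proposed fix, as a minimal sketch: check for a null vector before inspecting its class, and return an accessor that reports every row as null. The types below (`SketchAccessor`, `SketchNullAccessor`, `SketchListAccessor`, `SketchAccessorFactory`) are hypothetical stand-ins that use a `List` in place of an Arrow vector. They are not the Iceberg classes and not the actual patch.

```java
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the Iceberg or Arrow classes.
interface SketchAccessor {
    boolean isNullAt(int rowId);
    Object getObject(int rowId);
}

// Accessor used when no vector was materialized: every row reads as null.
final class SketchNullAccessor implements SketchAccessor {
    public boolean isNullAt(int rowId) { return true; }
    public Object getObject(int rowId) { return null; }
}

// Accessor backed by real values (a List stands in for an Arrow vector).
final class SketchListAccessor implements SketchAccessor {
    private final List<?> values;
    SketchListAccessor(List<?> values) { this.values = values; }
    public boolean isNullAt(int rowId) { return values.get(rowId) == null; }
    public Object getObject(int rowId) { return values.get(rowId); }
}

final class SketchAccessorFactory {
    // Guard first: dereferencing a null vector (for example vector.getClass())
    // is the kind of call that raises the NPE shown in the stack trace.
    static SketchAccessor accessorFor(List<?> vector) {
        if (vector == null) {
            return new SketchNullAccessor();
        }
        return new SketchListAccessor(vector);
    }
}
```

With this shape, `SketchAccessorFactory.accessorFor(null)` yields an accessor whose `isNullAt` is always true instead of throwing.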

@github-actions github-actions bot added the arrow label Aug 27, 2025
@RussellSpitzer
Member

Could you elaborate more on where this is occurring or is this just a library issue?

@nyxtom
Author

nyxtom commented Aug 27, 2025

> Could you elaborate more on where this is occurring or is this just a library issue?

This is related to #10275.

Unless there's a different/better way to do this

try (CloseableIterable<org.apache.iceberg.CombinedScanTask> tasks = scan.planTasks();
             ArrowReader arrowReader = new ArrowReader(scan, arrowBatchRows, true);
             CloseableIterator<ColumnarBatch> batches = arrowReader.open(tasks)) {

            // Initialize writer on first batch
            ArrowStreamWriter writer = null;
            boolean writerStarted = false;
            
            while (batches.hasNext() && !reachedLimit(totalRows, limit)) {
                ColumnarBatch batch = batches.next();
                
                if (batch == null) {
                    throw new IllegalStateException("Batch is null - ArrowReader returned invalid data");
                }
                
                // Initialize writer and schema on first batch
                if (!writerStarted) {
                    root = batch.createVectorSchemaRootFromVectors();
                    if (root == null) {
                        throw new IllegalStateException("Failed to create VectorSchemaRoot from batch");
                    }
                    
                    columnCount = root.getSchema().getFields().size();
                    logger.debug("VectorSchemaRoot created with {} columns", columnCount);
                    
                    writer = new ArrowStreamWriter(root, null, os);
                    writer.start();
                    writerStarted = true;
                    ttfbMs = (System.nanoTime() - t0) / 1_000_000;
                }
                
                // Write batch
                int rows = limitRows(batch.numRows(), totalRows, limit);
                root.setRowCount(rows);
                writer.writeBatch();
                
                totalRows += rows;
                batchIdx++;
            }
            
            if (writerStarted) {
                writer.end();
            } else {
                // empty result: write schema-only stream
                Schema arrowSchema = ArrowSchemaUtil.convert(scan.table().schema());
                columnCount = arrowSchema.getFields().size();

                tmpAllocator = new RootAllocator(Long.MAX_VALUE);
                root = VectorSchemaRoot.create(arrowSchema, tmpAllocator);

                writer = new ArrowStreamWriter(root, null, os); // assign to the writer declared above; redeclaring it does not compile
                writer.start();
                writer.end();
            }
        }

@nyxtom
Author

nyxtom commented Aug 28, 2025

@RussellSpitzer any updates on what you think? I'm trying to get this functionality landed, as it's currently blocking some internal tooling from working unless we exclude the columns that trigger this behavior.

@RussellSpitzer
Member

Ah, I think the issue is that our library code assumes the Parquet reader already has a projection that selects only the columns that need to be read before the file is opened. We have to do this anyway, because we have to map the names in the schema to the names in the file based on field IDs.

Types.NestedField icebergField = icebergSchema.findField(parquetFieldId);
if (icebergField == null) {
  return null;
}

It seems like we aren't doing a similar thing with the Arrow reader?

Is that on track? I'm trying to figure this out but I think ideally we just don't try to read null vectors at all at a higher level?
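
The projection idea described above can be sketched as follows: match file columns to the requested schema by field ID and skip anything that is not requested, so that no reader or accessor is ever built for it. `FieldIdProjectionSketch` and its map-based inputs are hypothetical and only illustrate the mapping. They are not Iceberg's reader-building code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of projecting file columns by field ID.
final class FieldIdProjectionSketch {

    // fileColumnsById:  field ID -> column name as written in the data file
    // requestedById:    field ID -> column name in the requested (table) schema
    // Returns file column name -> requested column name, in file order,
    // for the columns that are actually requested.
    static Map<String, String> project(
            Map<Integer, String> fileColumnsById,
            Map<Integer, String> requestedById) {
        Map<String, String> projected = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> fileColumn : fileColumnsById.entrySet()) {
            String requestedName = requestedById.get(fileColumn.getKey());
            if (requestedName == null) {
                // Not in the requested schema: skip it so nothing downstream
                // ever sees a missing (null) vector for this column.
                continue;
            }
            projected.put(fileColumn.getValue(), requestedName);
        }
        return projected;
    }
}
```

Because the match is on IDs rather than names, a column that was renamed after the file was written still resolves to its current name, and a column outside the projection is dropped before any vector is read.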

@nyxtom
Author

nyxtom commented Aug 28, 2025

> Ah, I think the issue is that our library code assumes the Parquet reader already has a projection that selects only the columns that need to be read before the file is opened. We have to do this anyway, because we have to map the names in the schema to the names in the file based on field IDs.
>
> Types.NestedField icebergField = icebergSchema.findField(parquetFieldId);
> if (icebergField == null) {
>   return null;
> }
>
> It seems like we aren't doing a similar thing with the Arrow reader?
>
> Is that on track? I'm trying to figure this out but I think ideally we just don't try to read null vectors at all at a higher level?

That sounds correct

return new NullAccessor(t);
}

// Primitive typed fast-paths return boxed nulls; callers should check nullability separately.
Contributor

It seems getDecimal() and other typed getters are missing here.
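
To illustrate the concern, here is a small sketch that assumes a base accessor whose typed getters throw unless overridden. Under that assumption, a null accessor that overrides only some getters still fails on the rest. `SketchTypedAccessor`, `SketchPartialNullAccessor`, and `SketchCompleteNullAccessor` are hypothetical names, and the getter set shown is illustrative rather than the full Iceberg accessor API.

```java
import java.math.BigDecimal;

// Hypothetical base type: each typed getter throws unless a subclass overrides it.
abstract class SketchTypedAccessor {
    boolean isNullAt(int rowId) { return false; }
    String getString(int rowId) {
        throw new UnsupportedOperationException("getString");
    }
    BigDecimal getDecimal(int rowId, int precision, int scale) {
        throw new UnsupportedOperationException("getDecimal");
    }
    byte[] getBinary(int rowId) {
        throw new UnsupportedOperationException("getBinary");
    }
}

// Incomplete null accessor: getDecimal and getBinary still throw.
final class SketchPartialNullAccessor extends SketchTypedAccessor {
    @Override boolean isNullAt(int rowId) { return true; }
    @Override String getString(int rowId) { return null; }
}

// Complete null accessor: every object-typed getter returns null.
final class SketchCompleteNullAccessor extends SketchTypedAccessor {
    @Override boolean isNullAt(int rowId) { return true; }
    @Override String getString(int rowId) { return null; }
    @Override BigDecimal getDecimal(int rowId, int precision, int scale) { return null; }
    @Override byte[] getBinary(int rowId) { return null; }
}
```

A caller that reads a decimal column through the partial variant would trade the original NullPointerException for an UnsupportedOperationException, which is why covering every getter matters.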

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 27, 2025