@bubulalabu commented Oct 22, 2025

LATERAL Table Function Support

Which issue does this PR close?

Closes #18121.

Rationale for this change

DataFusion's TableFunction API currently cannot access data from columns in outer queries, preventing LATERAL joins with table functions. Users attempting queries like:

SELECT t1.id, t2.x, t2.y
FROM my_table AS t1,
     LATERAL my_transform(t1.a, t1.b, t1.c) AS t2(x, y)

encounter the error: This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn.

What changes are included in this PR?

This PR attempts to add LATERAL table function support, enabling table functions to reference columns from outer queries. I'm sure there are improvements that can be made; feedback is very welcome.

Core Components

  1. Logical Plan Node (LateralTableFunction)

    • Represents table function calls with outer column references
    • Stores both combined output schema and function-only schema for clarity
  2. Physical Execution Plan (LateralTableFunctionExec)

    • Evaluates table function once per input row
    • Handles outer column references by extracting values from each row
    • Combines input columns with function output
    • Implements streaming execution via RecordBatchStream
  3. SQL Planning (relation/join.rs)

    • Detects LATERAL table function calls during join planning
    • Performs schema inference using placeholder values when outer references exist
    • Creates LateralTableFunction nodes instead of regular scans
  4. Schema Inference

    • When table function arguments contain outer references, replaces them with placeholder values to infer output schema
    • Uses incrementing values (1, 2, 3...) to ensure valid ranges for functions like generate_series(start, end)
    • Stores inferred schema in logical plan for use during physical planning
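To illustrate the placeholder idea in isolation (a self-contained sketch with hypothetical `Arg` and `substitute_placeholders` names, not DataFusion's actual types), each outer reference is swapped for the next incrementing literal before the function is instantiated for schema inference:

```rust
// Hypothetical stand-ins for expression arguments; the real code works on
// DataFusion `Expr` values containing `OuterReferenceColumn`.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Literal(i64),
    OuterRef(String), // column reference into the outer query
}

/// Replace each outer reference with an incrementing placeholder literal
/// (1, 2, 3, ...) so range-like functions such as generate_series(start, end)
/// see start <= end and schema inference can succeed.
fn substitute_placeholders(args: &[Arg]) -> Vec<Arg> {
    let mut next = 1;
    args.iter()
        .map(|a| match a {
            Arg::OuterRef(_) => {
                let lit = Arg::Literal(next);
                next += 1;
                lit
            }
            other => other.clone(),
        })
        .collect()
}
```

With `generate_series(t.start_val, t.end_val)`, the two outer references become the literals 1 and 2, so planning-time instantiation sees a valid range.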

Example Usage

-- Generate series based on table values
SELECT t.id, s.value
FROM my_table t,
     LATERAL generate_series(t.start_val, t.end_val) s;

-- Transform row data through UDTF
SELECT t.id, result.x, result.y
FROM data t,
     LATERAL my_transform(t.a, t.b, t.c) AS result(x, y);

Implementation Approach

I'd appreciate feedback on these design choices; there may be better approaches I haven't considered:

  1. Schema Inference via Placeholders

    The implementation replaces outer references with placeholder literal values (1, 2, 3...) to infer the table function's output schema. This works because table functions need concrete values to execute, but outer references aren't resolved until runtime.

    Known limitation: This approach has an inherent weakness: schema inference can fail if the placeholder values are semantically invalid for the function. For example, generate_series(start, end) will error during planning if the placeholder for start is greater than the one for end, even though the actual runtime values might be valid. For the sake of getting it to work I used incrementing placeholder values, but that's by no means a robust solution.

    I considered having TableFunction provide explicit schema declarations instead. If there's a better approach for schema inference, I'd love to hear suggestions!

  2. Row-by-Row Sequential Execution

    The current implementation executes the table function once per input row, sequentially (not parallelized or batched). I chose this conservative approach because we don't have metadata about whether table functions have side effects, are thread-safe, maintain internal state, or can be safely executed in parallel.

    Trade-off: This has performance implications. Batched or parallel execution would likely be faster, but would require additional API changes to let table functions declare their safety characteristics.
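    The per-row evaluation loop can be sketched in plain Rust (toy tuple types and a hypothetical `lateral_apply` name; the real operator consumes Arrow RecordBatches and streams its output):

```rust
/// Toy model of LateralTableFunctionExec: invoke the table function once per
/// input row, sequentially, and pair every produced row with the input row
/// that generated it.
fn lateral_apply<F>(input: &[(i64, i64)], table_fn: F) -> Vec<(i64, i64, i64)>
where
    F: Fn(i64, i64) -> Vec<i64>,
{
    let mut out = Vec::new();
    for &(start, end) in input {
        // Outer references are resolved by extracting this row's values.
        for v in table_fn(start, end) {
            out.push((start, end, v)); // input columns + function output
        }
    }
    out
}
```

    With a generate_series-style closure, the input row (1, 2) expands to two output rows and (3, 3) to one, mirroring LATERAL generate_series(t.start_val, t.end_val).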

  3. Dual Schema Storage

    The logical plan stores both the combined schema (input + function output) and the function-only schema. This is slightly redundant (we could derive one from the other), but I found it made the code clearer and easier to understand.
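    The redundancy is mild because either schema is derivable from the other given the input arity; a toy sketch with plain field names (hypothetical helpers, not the real DFSchema API):

```rust
/// Build the combined schema: input columns followed by the table function's
/// output columns.
fn combined_fields(input: &[&str], func_output: &[&str]) -> Vec<String> {
    input
        .iter()
        .chain(func_output.iter())
        .map(|s| s.to_string())
        .collect()
}

/// Recover the function-only schema by slicing the input columns back off.
fn function_only(combined: &[String], input_len: usize) -> &[String] {
    &combined[input_len..]
}
```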

Error Handling

Updated error message for non-LATERAL table functions with column references:

Before: "Table functions with outer references are not yet supported. This requires LATERAL table function support."
After:  "Table function arguments cannot reference columns without LATERAL keyword. Use: FROM other_table, LATERAL table_func(other_table.column)"

Are these changes tested?

Yes, I added a dedicated sqllogictest (.slt) file.

Are there any user-facing changes?

New Functionality

Users can now use LATERAL with table functions to reference outer columns:

-- This now works! (previously failed)
SELECT t.id, s.value
FROM my_table t,
     LATERAL generate_series(t.min, t.max) s;

API Changes

  1. ScalarValue::try_new_placeholder() (new)

  • Public utility method for creating placeholder values
  • Useful for schema inference scenarios

  2. No Breaking Changes

  • All existing APIs remain unchanged
  • Backward compatible with existing table functions
  • Existing SQL queries continue to work

Performance Characteristics

  • LATERAL table functions execute once per input row (not batched)

Additional Context

Future Enhancements

Potential follow-up work (out of scope for this PR):

  1. Batch Execution: Execute table function on multiple rows at once (requires UDTF API changes)
  2. Parallel Execution: Partition input and execute table functions in parallel

Disclosure: This PR was developed with the assistance of an LLM and has been thoroughly reviewed and tested by me. All design decisions, code implementation, and testing were validated through manual review and execution.

@github-actions bot added labels on Oct 22, 2025: sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), substrait (Changes to the substrait crate), common (Related to common crate), proto (Related to proto crate)
@alamb (Contributor) left a comment

Thanks @bubulalabu -- I left some questions. Let me know what you think


let provider = self
    .context_provider
    .get_table_function_source(&tbl_func_name, placeholder_args)?;
@alamb (Contributor):

Maybe it would be simpler if we added a new function to the table provider so it could report its schema, rather than making a fake set of arguments just to instantiate the function to get the schema.

@bubulalabu (Author):

If extending the public API is an option, the following trait would be ideal for implementing correlated table functions:

pub trait StreamingTableFunctionImpl: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn signature(&self) -> &Signature;
    fn return_type(&self, arg_types: &[DataType]) -> Result<Schema>;
    fn create_plan(
        &self,
        args: Vec<Arc<dyn PhysicalExpr>>,
        input: Arc<dyn ExecutionPlan>,
        projected_schema: SchemaRef,
    ) -> Result<Arc<dyn ExecutionPlan>>;
}


use crate::execution::session_state::SessionState;

/// Execution plan for LATERAL table functions
@alamb (Contributor):

Does this mirror the implementation in another system (e.g. DuckDB)? It seems very special-purpose.

@bubulalabu (Author):

No, it doesn't. DuckDB uses generic correlated query unnesting and batched table functions. We could do the same specifically for table functions, but for that we need something like create_plan as described in the comment above.

The implementation in this PR is suboptimal. It's just the best I can come up with without extending the public API.

If the extension of the public API is an option, I'd ideally go with the new StreamingTableFunctionImpl trait and have an internal adapter that bridges current TableFunction implementations to adhere to StreamingTableFunctionImpl.

@alamb (Contributor) commented Oct 25, 2025:

> The implementation in this PR is suboptimal. It's just the best I can come up with without extending the public API.

I am not sure what you mean by this. This PR does include substantial API changes.

Adding LogicalPlan::LateralTableFunction is a substantial change to the public API in my opinion -- it will require non-trivial changes in downstream crates that match on LogicalPlan.

@bubulalabu (Author):

my bad, you're absolutely right

I'm implementing the support on top of a dedicated trait right now, and will present it in a couple of days. I think the approach chosen here is a dead end.

I'll close this PR for now.

@alamb added the api change (Changes the API exposed to users of the crate) label on Oct 25, 2025
@bubulalabu closed this Oct 27, 2025


Development

Successfully merging this pull request may close these issues:

Support table inputs for user defined table functions