feat: Add LATERAL table function support (#18121) #18224
Thanks @bubulalabu -- I left some questions. Let me know what you think
let provider = self
    .context_provider
    .get_table_function_source(&tbl_func_name, placeholder_args)?;
Maybe it would be simpler if we added a new function to the table provider so it could report its schema, rather than making a fake set of arguments just to instantiate the function to get the schema.
If the extension of the public API is an option, the following trait would be ideal to implement correlated table functions:
pub trait StreamingTableFunctionImpl: Send + Sync + Debug {
    fn name(&self) -> &str;
    fn signature(&self) -> &Signature;
    fn return_type(&self, arg_types: &[DataType]) -> Result<Schema>;
    fn create_plan(
        &self,
        args: Vec<Arc<dyn PhysicalExpr>>,
        input: Arc<dyn ExecutionPlan>,
        projected_schema: SchemaRef,
    ) -> Result<Arc<dyn ExecutionPlan>>;
}
use crate::execution::session_state::SessionState;

/// Execution plan for LATERAL table functions
Does this mirror the implementation in another system (e.g. DuckDB)? It seems very special-purpose.
No, it doesn't. DuckDB uses generic correlated query unnesting and batched table functions. We can do the same specifically for table functions, but for that we need something like create_plan as described in the comment above.
The implementation in this PR is suboptimal. It's just the best I can come up with without extending the public API.
If the extension of the public API is an option, I'd ideally go with the new StreamingTableFunctionImpl trait and have an internal adapter that bridges current TableFunction implementations to adhere to StreamingTableFunctionImpl.
> The implementation in this PR is suboptimal. It's just the best I can come up with without extending the public API.
I am not sure what you mean by this. This PR does include substantial API changes
Adding LogicalPlan::LateralTableFunction is a substantial change to the public API in my opinion -- it will require non trivial changes in downstream crates that match on LogicalPlan
my bad, you're absolutely right
I'm implementing the support on top of a dedicated trait right now, and will present it in a couple of days. I think the approach chosen here is a dead end.
I'll close this PR for now.
LATERAL Table Function Support
Which issue does this PR close?
Closes #18121.
Rationale for this change
DataFusion's TableFunction API currently cannot access data from columns in outer queries, preventing LATERAL joins with table functions. Users attempting such queries encounter this error:
This feature is not implemented: Physical plan does not support logical expression OuterReferenceColumn.
What changes are included in this PR?
This PR attempts to add LATERAL table function support, enabling table functions to reference columns from outer queries. I'm sure there are improvements that can be made; feedback is very welcome.
Core Components
Logical Plan Node (LateralTableFunction)
Physical Execution Plan (LateralTableFunctionExec)
RecordBatchStream
SQL Planning (relation/join.rs)
LateralTableFunction nodes instead of regular scans
Schema Inference
generate_series(start, end)
Example Usage
Implementation Approach
I'd appreciate feedback on these design choices - there may be better approaches I haven't considered:
Schema Inference via Placeholders
The implementation replaces outer references with placeholder literal values (1, 2, 3...) to infer the table function's output schema. This works because table functions need concrete values to execute, but outer references aren't resolved until runtime.
Known limitation: This approach has an inherent weakness. Schema inference can fail if the placeholder values are semantically invalid for the function. For example, generate_series(start, end) will error during planning if the placeholder for start is greater than end, even though the actual runtime values might be valid. For the sake of getting it to work I used incrementing placeholder values, but that's by no means a robust solution.
I considered having the TableFunction provide explicit schema declarations. If there's a better approach for schema inference, I'd love to hear suggestions!
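To make the limitation concrete, here is a minimal toy sketch in plain Rust. It does not use any DataFusion API: `generate_series_schema`, `countdown_schema`, and `placeholder_args` are hypothetical stand-ins, and `countdown` is an invented function whose arguments must be descending. It only illustrates how incrementing placeholders can pass validation for one function and fail for another whose runtime values would have been fine.

```rust
// Toy model only: none of these names are DataFusion APIs.

/// Stand-in for a table function that validates its arguments before it
/// can report an output schema (modelled here as a list of column names).
fn generate_series_schema(start: i64, end: i64) -> Result<Vec<&'static str>, String> {
    if start > end {
        return Err(format!("start ({}) must not exceed end ({})", start, end));
    }
    Ok(vec!["value"])
}

/// Hypothetical function with the opposite constraint: `from` must be >= `to`.
fn countdown_schema(from: i64, to: i64) -> Result<Vec<&'static str>, String> {
    if from < to {
        return Err(format!("from ({}) must not be less than to ({})", from, to));
    }
    Ok(vec!["value"])
}

/// Hypothetical planner helper: one incrementing literal (1, 2, 3, ...)
/// per outer reference, in argument order.
fn placeholder_args(n: usize) -> Vec<i64> {
    (1..=n as i64).collect()
}

fn main() {
    let p = placeholder_args(2); // [1, 2]

    // Ascending placeholders happen to satisfy generate_series.
    println!("{:?}", generate_series_schema(p[0], p[1]));

    // The same placeholders fail for countdown at planning time,
    // even though runtime values such as (10, 1) would be valid.
    println!("{:?}", countdown_schema(p[0], p[1]));
    println!("{:?}", countdown_schema(10, 1));
}
```

An explicit schema declaration, as considered above, would sidestep this because no argument values would be needed to obtain the schema.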
Row-by-Row Sequential Execution
The current implementation executes the table function once per input row, sequentially (not parallelized or batched). I chose this conservative approach because we don't have metadata about whether table functions have side effects, are thread-safe, maintain internal state, or can be safely executed in parallel.
Trade-off: This has performance implications. Batched or parallel execution would likely be faster, but would require additional API changes to let table functions declare their safety characteristics.
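The row-by-row strategy can be sketched with plain Rust collections. This is an illustration of the semantics only, not the code in this PR: `generate_series` and `lateral_apply` are hypothetical toy functions, and rows are modelled as tuples rather than record batches.

```rust
// Toy model only: rows are tuples, not Arrow record batches.

/// Toy table function: one output row per integer in [start, end].
/// An empty range (start > end) yields no rows.
fn generate_series(start: i64, end: i64) -> Vec<i64> {
    (start..=end).collect()
}

/// Invoke the function once per input row, strictly in order, and pair
/// every function output row with the input row that produced it.
fn lateral_apply(input: &[(i64, i64)]) -> Vec<(i64, i64, i64)> {
    let mut out = Vec::new();
    for &(a, b) in input {
        for v in generate_series(a, b) {
            out.push((a, b, v));
        }
    }
    out
}

fn main() {
    let input = [(1, 3), (5, 5), (4, 2)];
    for row in lateral_apply(&input) {
        println!("{:?}", row);
    }
}
```

Because each call completes before the next begins, the sketch makes no assumption about thread safety or internal state, which is the property the sequential design is meant to preserve. An input row whose function call returns nothing contributes no output rows.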
Dual Schema Storage
The logical plan stores both the combined schema (input + function output) and the function-only schema. This is slightly redundant (we could derive one from the other), but I found it made the code clearer and easier to understand.
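A small sketch of the redundancy, using hypothetical types (`Field`, `LateralNodeSketch`) rather than the actual plan node from this PR: the combined schema is the input fields followed by the function fields, so the function-only schema could be recovered as the trailing slice.

```rust
// Toy model only: these types are illustrative, not DataFusion's.

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

fn field(name: &str, data_type: &str) -> Field {
    Field { name: name.to_string(), data_type: data_type.to_string() }
}

struct LateralNodeSketch {
    /// Function output columns only.
    function_schema: Vec<Field>,
    /// Input columns followed by function output columns.
    schema: Vec<Field>,
}

impl LateralNodeSketch {
    fn new(input_schema: &[Field], function_schema: Vec<Field>) -> Self {
        let mut schema = input_schema.to_vec();
        schema.extend(function_schema.iter().cloned());
        Self { function_schema, schema }
    }

    /// The alternative to storing both: derive the function-only schema
    /// from the combined schema, given the number of input columns.
    fn derived_function_schema(&self, input_len: usize) -> &[Field] {
        &self.schema[input_len..]
    }
}

fn main() {
    let input = vec![field("a", "Int64"), field("b", "Int64")];
    let node = LateralNodeSketch::new(&input, vec![field("value", "Int64")]);
    println!("combined: {:?}", node.schema);
    println!("stored:   {:?}", node.function_schema);
    println!("derived:  {:?}", node.derived_function_schema(input.len()));
}
```

Storing both avoids having to carry the input column count wherever the function schema is needed, at the cost of keeping the two in sync.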
Error Handling
Updated error message for non-LATERAL table functions with column references:
Are these changes tested?
Yes, I added a dedicated sql logic test file.
Are there any user-facing changes?
New Functionality
Users can now use LATERAL with table functions to reference outer columns:
API Changes
1. ScalarValue::try_new_placeholder() (new)
2. No Breaking Changes
Performance Characteristics
Additional Context
Future Enhancements
Potential follow-up work (out of scope for this PR):
Disclosure: This PR was developed with the assistance of an LLM and has been thoroughly reviewed and tested by me. All design decisions, code implementation, and testing were validated through manual review and execution.