Relax constraint that file sort order must only reference individual columns #17419
Thanks @pepijnve -- this looks good to me. I'll kick off some planning benchmarks just to make sure this doesn't affect them, but I don't expect to see any slowdown
}
None => create_ordering(self.0.source.schema(), &self.0.order)?,
let schema = self.0.source.schema();
let df_schema = DFSchema::try_from(Arc::clone(schema))?;
I wonder if (re)creating this DFSchema is necessary -- it feels like at this point we know the schema information
However, I also see we need a DFSchema to correctly create arbitrary PhysicalExprs, so this is probably fine
I was a bit concerned about the waste here as well, but I couldn't figure out a simple way to avoid this.
----
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/data/composite_order.csv]]}, projection=[a, b], output_ordering=[a@0 + b@1 ASC NULLS LAST], file_type=csv, has_header=true

# Query ordered by the declared order should be just a table scan
nice
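To make the declared ordering in the plan above concrete, here is a minimal sketch of what it means for a file to be sorted by `a + b` without being sorted by either column on its own. The `Row` type, the helper names, and the sample values are all hypothetical and not taken from `composite_order.csv`; null handling (`NULLS LAST`) is ignored for brevity.

```rust
/// Hypothetical row of a two-column file with columns `a` and `b`.
#[derive(Debug, Clone, Copy)]
struct Row {
    a: i64,
    b: i64,
}

/// Returns true if `rows` is non-decreasing under the given sort key.
fn is_sorted_by_key<K: PartialOrd>(rows: &[Row], key: impl Fn(&Row) -> K) -> bool {
    rows.windows(2).all(|w| key(&w[0]) <= key(&w[1]))
}

/// Made-up sample data: ascending by `a + b`, but by neither column alone.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { a: 3, b: 0 }, // a + b = 3
        Row { a: 1, b: 3 }, // a + b = 4
        Row { a: 4, b: 1 }, // a + b = 5
        Row { a: 2, b: 4 }, // a + b = 6
    ]
}
```

With data like this, an ordering declared on individual columns could not describe the file at all, while an ordering on the expression `a + b` lets a matching `ORDER BY a + b` query skip the sort.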
@@ -3532,7 +3532,7 @@ physical_plan
 01)BoundedWindowAggExec: wdw=[sum(multiple_ordered_table.a) ORDER BY [multiple_ordered_table.b ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: Field { name: "sum(multiple_ordered_table.a) ORDER BY [multiple_ordered_table.b ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW], mode=[Sorted]
 02)--CoalesceBatchesExec: target_batch_size=4096
 03)----FilterExec: b@2 = 0
-04)------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_orderings=[[a@1 ASC NULLS LAST, b@2 ASC NULLS LAST], [c@3 ASC NULLS LAST]], file_type=csv, has_header=true
+04)------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_orderings=[[c@3 ASC NULLS LAST], [a@1 ASC NULLS LAST, b@2 ASC NULLS LAST]], file_type=csv, has_header=true
Do you know why the output orderings come out in a different (reverse) order now?
No, I didn't take the time to try to understand why. That's how they're being emitted by the EquivalenceClass code. I had assumed the order was not important, but if it is I can take a closer look.
I don't think it is important
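If the listing order of alternative orderings really is irrelevant, as suggested here, the two plan lines in the diff describe the same thing. A small sketch of that idea, using a hypothetical `SortKeys` alias and plain strings instead of DataFusion's actual types: the outer list is compared as a set, while the key sequence inside each ordering still matters.

```rust
use std::collections::BTreeSet;

/// One ordering, with each sort key rendered as text (illustrative only),
/// e.g. ["a@1 ASC NULLS LAST", "b@2 ASC NULLS LAST"].
type SortKeys = Vec<String>;

/// Compares two lists of alternative orderings as sets, so the position of an
/// ordering in the outer list is ignored. Key order within an ordering is kept.
fn same_orderings(left: &[SortKeys], right: &[SortKeys]) -> bool {
    let l: BTreeSet<&SortKeys> = left.iter().collect();
    let r: BTreeSet<&SortKeys> = right.iter().collect();
    l == r
}
```

Under this comparison `[[a, b], [c]]` and `[[c], [a, b]]` are equal, whereas `[[a, b]]` and `[[b, a]]` are not.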
🤖: Benchmark completed
🤖: Benchmark completed
I am a little worried about the reported slowdowns in the SQL planning benchmarks. I'll try to reproduce them locally.
Perhaps a two-step approach would be better then, where we try the “column only” version first and only use the more complex code path as a fallback.
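The two-step idea could look roughly like the sketch below. Everything in it is a hypothetical stand-in (`Expr`, `SortKey`, and the function names are not DataFusion APIs); it only illustrates trying a cheap column-only resolution first and falling back to a general expression path when any sort expression is not a plain column.

```rust
/// Hypothetical, simplified stand-in for a parsed sort expression.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Hypothetical resolved sort key.
#[derive(Debug, Clone, PartialEq)]
enum SortKey {
    /// Fast-path result: index of a column in the schema.
    ColumnIndex(usize),
    /// Fallback result: an arbitrary expression rendered against the schema.
    Expression(String),
}

/// Step 1: succeeds only if every expression is a known plain column.
fn try_column_only(schema: &[&str], exprs: &[Expr]) -> Option<Vec<SortKey>> {
    exprs
        .iter()
        .map(|e| match e {
            Expr::Column(name) => schema
                .iter()
                .position(|c| *c == name.as_str())
                .map(SortKey::ColumnIndex),
            _ => None,
        })
        .collect()
}

/// Step 2: general (more expensive) resolution of an arbitrary expression.
fn resolve_general(schema: &[&str], expr: &Expr) -> Option<String> {
    match expr {
        Expr::Column(name) => {
            let idx = schema.iter().position(|c| *c == name.as_str())?;
            Some(format!("{}@{}", name, idx))
        }
        Expr::Add(l, r) => Some(format!(
            "{} + {}",
            resolve_general(schema, l)?,
            resolve_general(schema, r)?
        )),
    }
}

/// Tries the column-only path first; uses the general path only as a fallback.
fn create_sort_keys(schema: &[&str], exprs: &[Expr]) -> Option<Vec<SortKey>> {
    try_column_only(schema, exprs).or_else(|| {
        exprs
            .iter()
            .map(|e| resolve_general(schema, e).map(SortKey::Expression))
            .collect()
    })
}
```

In this shape the common case of plain column references never touches the general path, which is the property that would keep planning benchmarks unaffected. Whether that holds for the real code would still need to be measured.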
Which issue does this PR close?

Rationale for this change

The documentation states that `WITH ORDER` clauses may use non-trivial expressions. It even has an example showing the usage of this feature. In practice this does not work and the implementation is limited to simple column references.

What changes are included in this PR?

- `physical_expr::create_lex_ordering` function that provides a more flexible version of `physical_expr::create_ordering`.
- `create_ordering` with its single column constraint has been retained for backwards compatibility, but should perhaps be deprecated. It does not seem possible to reimplement it in terms of `create_lex_ordering` since an `ExecutionProps` instance is required.
- `physical_expr::equivalence::project_orderings` convenience function that uses the existing sort order projection logic

Are these changes tested?

- `with order` case

Are there any user-facing changes?

- `PhysicalExpr` instances rather than only `Column`.