Support chart command in PPL #4579
        
chart command in PPL
      Signed-off-by: Yuanchun Shen <[email protected]>
Signed-off-by: Yuanchun Shen <[email protected]> # Conflicts: # core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java # integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
Code explanation.
    // Convert the column split to string if necessary: column split was supposed to be pivoted to
    // column names. This guarantees that its type compatibility with useother and usenull
    RexNode colSplit = relBuilder.field(1);
The fields are [row-split, col-split, aggregation] now
    if (!SqlTypeUtil.isCharacter(colSplit.getType())) {
      colSplit =
          relBuilder.alias(
              context.rexBuilder.makeCast(
                  UserDefinedFunctionUtils.NULLABLE_STRING, colSplit, true, true),
              columSplitName);
    }
Convert the column split to a string so that its values can serve as column labels once pivoted. This also guarantees that its type is compatible with nullstr and otherstr.
    // 1: column-split, 2: agg
    relBuilder.project(relBuilder.field(1), relBuilder.field(2));
    // Make sure that rows who don't have a column split not interfere grand total calculation
    relBuilder.filter(relBuilder.isNotNull(relBuilder.field(0)));
testChartWithNullAndLimit covers this case. Without this line, rows that have no column split would be ranked whenever their aggregation result is large.
    // Apply sorting: for MIN/EARLIEST, reverse the top/bottom logic
    boolean smallestFirst =
        aggFunction == BuiltinFunctionName.MIN || aggFunction == BuiltinFunctionName.EARLIEST;
    if (config.top != smallestFirst) {
      grandTotal = relBuilder.desc(grandTotal);
    }
See explanations in #4594
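As a rough illustration of the top/bottom reversal in the snippet above, the sort direction can be modelled in plain Java without Calcite. The class `ChartSortDirection`, the enum `Agg`, and the method `sortDescending` are hypothetical names introduced only for this sketch; they are not part of the PR.

```java
/** Hypothetical stand-alone model of the sort-direction rule; not the PR's code. */
final class ChartSortDirection {
    /** Illustrative stand-in for the aggregation functions named in the snippet. */
    enum Agg { COUNT, SUM, AVG, MAX, MIN, EARLIEST }

    /**
     * Returns true when grand totals should be sorted in descending order
     * before the first `limit` categories are kept.
     */
    static boolean sortDescending(boolean top, Agg agg) {
        // For MIN/EARLIEST the smallest value ranks first, so top/bottom is reversed.
        boolean smallestFirst = agg == Agg.MIN || agg == Agg.EARLIEST;
        return top != smallestFirst;
    }

    public static void main(String[] args) {
        for (Agg agg : new Agg[] {Agg.SUM, Agg.MIN}) {
            for (boolean top : new boolean[] {true, false}) {
                System.out.printf(
                    "%s %s -> %s%n",
                    top ? "top" : "bottom",
                    agg,
                    sortDescending(top, agg) ? "descending" : "ascending");
            }
        }
    }
}
```

Under this model, `top` with `SUM` sorts descending, while `top` with `MIN` sorts ascending, which matches the `config.top != smallestFirst` condition in the diff.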
    relBuilder.push(aggregated);
    relBuilder.push(ranked);

    // on column-split = group key
    relBuilder.join(
        JoinRelType.LEFT, relBuilder.equals(relBuilder.field(2, 0, 1), relBuilder.field(2, 1, 0)));
aggregated: [row-split, col-split, aggregation]
ranked: [col-split, grand-total, row-number]
    relBuilder.aggregate(
        relBuilder.groupKey(relBuilder.field(0), relBuilder.field(1)),
        buildAggCall(context.relBuilder, aggFunction, relBuilder.field(2)).as(aggFieldName));
Final aggregation: to merge values in the OTHER categories.
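To make the relabel-then-merge step concrete, here is a minimal in-memory sketch in plain Java. It assumes the kept categories have already been chosen by the ranking step and that the aggregation is a count, so merged values can simply be added. `MergeOther`, `Row`, and `merge` are hypothetical names for this illustration; the PR itself does this with a Calcite left join followed by an aggregate.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of relabelling column splits and re-aggregating; not the PR's code. */
final class MergeOther {
    /** One aggregated row: [row-split, col-split, aggregation]. */
    static final class Row {
        final String rowSplit;
        final String colSplit; // may be null when the row has no column split
        final long count;

        Row(String rowSplit, String colSplit, long count) {
            this.rowSplit = rowSplit;
            this.colSplit = colSplit;
            this.count = count;
        }
    }

    /**
     * Relabels column splits outside `kept` as `otherLabel` and null column
     * splits as `nullLabel`, then sums counts per (row-split, label) pair.
     */
    static Map<List<String>, Long> merge(
            List<Row> aggregated, Set<String> kept, String otherLabel, String nullLabel) {
        Map<List<String>, Long> result = new LinkedHashMap<>();
        for (Row r : aggregated) {
            String label;
            if (r.colSplit == null) {
                label = nullLabel;
            } else if (kept.contains(r.colSplit)) {
                label = r.colSplit;
            } else {
                label = otherLabel;
            }
            // Arrays.asList tolerates a null row split, unlike List.of.
            result.merge(Arrays.asList(r.rowSplit, label), r.count, Long::sum);
        }
        return result;
    }

    /** Made-up sample data: counts per (gender, age). */
    static List<Row> sample() {
        return Arrays.asList(
            new Row("M", "32", 1), new Row("M", "33", 1),
            new Row("M", "36", 1), new Row("F", "28", 1));
    }

    public static void main(String[] args) {
        Map<List<String>, Long> merged = merge(sample(), Set.of("33"), "OTHER", "NULL");
        merged.forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

For aggregations that are not additive (for example an average), re-aggregating already aggregated values would need a different combining rule; this sketch does not cover that case.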
        
          
docs/user/ppl/cmd/chart.rst (Outdated)

    * **limit**: optional. Specifies the number of distinct values to display when using column split.

      * Default: 10
      * Syntax: ``limit=(top|bottom) <number>`` or ``limit=<number>`` (defaults to top)
Explain the definition of top in the doc. Is it stats dc by col | sort -dc, +col?
It keeps the top-K categories (distinct column splits).
E.g.
- chart limit=1 count() by a b keeps the top 1 b category with the most rows.
- chart limit=bottom3 sum(value) by a b keeps the 3 b categories with the smallest sums of values.
- chart limit=top2 min(value) by a b keeps the 2 b categories whose minimum values within their groups are the smallest 2.
- chart limit=bottom2 min(value) by a b keeps the 2 b categories whose minimum values within their groups are the largest 2.
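The selection described in these examples can be sketched in plain Java over in-memory rows. This is only an illustration under stated assumptions: the grand total per b category is obtained by folding the per-(a, b) values with the same function (which holds for sum and min, but not for every aggregation), and `TopCategories`, `Row`, `keep`, and `sample` are hypothetical names, not the PR's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical in-memory model of ranking column-split categories; not the PR's code. */
final class TopCategories {
    /** One pre-aggregated row: row split a, column split b, aggregated value. */
    static final class Row {
        final String a;
        final String b;
        final double value;

        Row(String a, String b, double value) {
            this.a = a;
            this.b = b;
            this.value = value;
        }
    }

    /**
     * Folds values per b category with `combine`, sorts the categories by that
     * grand total, and returns the first `limit` names (limit 0 keeps all).
     */
    static List<String> keep(
            List<Row> rows, BinaryOperator<Double> combine, boolean descending, int limit) {
        Map<String, Double> grandTotal = new LinkedHashMap<>();
        for (Row r : rows) {
            if (r.b != null) { // rows without a column split are not ranked
                grandTotal.merge(r.b, r.value, combine);
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(grandTotal.entrySet());
        Comparator<Map.Entry<String, Double>> byTotal = Map.Entry.comparingByValue();
        ranked.sort(descending ? byTotal.reversed() : byTotal);

        int n = limit == 0 ? ranked.size() : Math.min(limit, ranked.size());
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Double> e : ranked.subList(0, n)) {
            kept.add(e.getKey());
        }
        return kept;
    }

    /** Made-up sample data. */
    static List<Row> sample() {
        return Arrays.asList(
            new Row("x", "b1", 5), new Row("y", "b1", 1), new Row("x", "b2", 3),
            new Row("x", "b3", 10), new Row("y", "b3", 2));
    }

    public static void main(String[] args) {
        List<Row> rows = sample();
        System.out.println("limit=1 sum:       " + keep(rows, Double::sum, true, 1));
        System.out.println("limit=bottom2 sum: " + keep(rows, Double::sum, false, 2));
        System.out.println("limit=top2 min:    " + keep(rows, Double::min, false, 2));
        System.out.println("limit=bottom2 min: " + keep(rows, Double::min, true, 2));
    }
}
```

The `descending` flag is passed in by the caller, so `top` with `min` is expressed here as an ascending sort, in line with the reversed top/bottom rule for MIN.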
        
          
docs/user/ppl/cmd/chart.rst (Outdated)

      * Set to 0 to show all distinct values without any limit.
      * Only applies when column split presents (by 2 fields or over...by... coexists).

    * **useother**: optional. Controls whether to create an "OTHER" category for distinct column values beyond the limit.
column -> column_split
Fixed
        
          
docs/user/ppl/cmd/chart.rst (Outdated)

      * When set to true, distinct values beyond the limit are grouped into an "OTHER" category.
      * Only applies when using column split and when there are more distinct column values than the limit.

    * **usenull**: optional. Controls whether to include null values as a separate category.
Change the doc to make this clear:
- usenull=true only applies to column_split
- row_split should always be a non-null value.
Fixed. row_split can actually contain null; it is handled in the same manner as in normal aggregations like stats count() by a, b where column a contains null values.
        
          
docs/user/ppl/cmd/chart.rst (Outdated)

    Notes
    =====

    * The column split field in the result will become strings so that they are compatible with ``nullstr`` and ``otherstr`` and can be used as column names once pivoted.
The column split field in the result will become strings ->
The fields generated by column splitting are converted to strings
Fixed. Thanks for the suggestion!
    os> source=accounts | chart limit=1 count() over gender by age
    fetched rows / total rows = 3/3
    +--------+-------+---------+
    | gender | age   | count() |
    |--------+-------+---------|
    | M      | OTHER | 2       |
    | M      | 33    | 1       |
    | F      | OTHER | 1       |
    +--------+-------+---------+
I would expect the result to include another row:
F 33 0
Then the pivot table would be:
gender 33 OTHER
M,1,2
F,0,1
Since we are not really doing pivoting, I think it's better to omit empty groups? This avoids a large sparse response and reduces traffic. Besides, other aggregations also don't return results for empty buckets.
timechart also claims to omit those buckets:
Only combinations with actual data are included in the results - empty combinations are omitted rather than showing null or zero values.
If the front-end wants to add them back, it can easily fill null or 0 into the missing groups.
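For illustration, a client could pivot the long-format result and fill the omitted combinations roughly as follows. This is a sketch in plain Java under the assumption that the client receives (row split, column split, value) triples; `PivotFill`, `Cell`, and `pivot` are hypothetical names and not part of any OpenSearch front-end API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical client-side pivot that fills omitted groups; not part of the PR. */
final class PivotFill {
    /** One row of the chart result in long format. */
    static final class Cell {
        final String row;
        final String col;
        final long value;

        Cell(String row, String col, long value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    /** Pivots cells into row -> (column -> value), filling missing pairs with `fill`. */
    static Map<String, Map<String, Long>> pivot(List<Cell> cells, long fill) {
        // Collect every column label first so each output row has the same columns.
        Set<String> columns = new LinkedHashSet<>();
        for (Cell c : cells) {
            columns.add(c.col);
        }
        Map<String, Map<String, Long>> table = new LinkedHashMap<>();
        for (Cell c : cells) {
            Map<String, Long> row = table.get(c.row);
            if (row == null) {
                row = new LinkedHashMap<>();
                for (String col : columns) {
                    row.put(col, fill);
                }
                table.put(c.row, row);
            }
            row.put(c.col, c.value);
        }
        return table;
    }

    /** The three result rows from the limit=1 example in this thread. */
    static List<Cell> sample() {
        return Arrays.asList(
            new Cell("M", "OTHER", 2), new Cell("M", "33", 1), new Cell("F", "OTHER", 1));
    }

    public static void main(String[] args) {
        pivot(sample(), 0L).forEach((row, cols) -> System.out.println(row + " " + cols));
    }
}
```

Keeping the fill value a parameter leaves the choice between 0 and another sentinel to the caller; filling with null would need a nullable value type instead of `long`.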
        
          
docs/user/ppl/cmd/chart.rst (Outdated)

    PPL query::

        os> source=accounts | chart limit=top 1 useother=true otherstr='minor_gender' count() over state by gender
syntax should be limit=top10
Fixed. Thanks for double checking
    PPL query::

        os> source=accounts | chart usenull=true nullstr='employer not specified' count() over firstname by employer
Add an example to demonstrate converting column_split to a string.
It's actually covered by example 3 and 4. Updated their descriptions.
Description
The chart command returns an aggregation result in a two-dimensional table format.
Work items:
- limit, limit=top x, limit=bottom x
- useother, otherstr
- usenull, nullstr
Related Issues
Resolves #399
Implementation Walk-through
Ideally, chart should pivot the result into a two-dimensional table. E.g. for the following table:
| chart avg(val) by a, b
should make it a table like this:
However, it seems dynamic pivoting is not supported in SQL/Calcite (see original discussion in #3965 (comment)). Therefore, the result table for the implemented chart is like:
The pivoting can be performed in the front-end.
The above operation is equivalent to stats avg(val) by a, b. This is the case when parameters like usenull, useother, and limit are not involved in the result.
When these parameters are involved, the chart command will find the top-N categories of b, aggregating the rest into an OTHER category and aggregating those whose b is null into a "NULL" category. This leads to the following implementation:
stats agg_func by a, b)
Note:
This implementation did not reuse the implementation of timechart, in order to circumvent some existing bugs. A follow-up PR will merge their implementations, as chart is essentially a superset of timechart in terms of functionality.
Future work items
- timechart and chart
- bins (after Fix bins on time-related fields #4612)
Check List
- --signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.