Collaborator

@yuancu yuancu commented Oct 16, 2025

Description

The chart command returns an aggregation result in a two-dimensional table format.

Work items:

  • support span
  • support limit, limit=top x, limit=bottom x
  • support useother, otherstr
  • correct limit behavior with non-cumulative aggregation functions (min, max, avg, etc.); fixed in "Fix timechart OTHER category aggregation for non-cumulative functions" (#4594)
  • support usenull, nullstr
  • support non-string fields as column split
  • add integration tests
  • add explain tests
  • add a doc
  • Add a brief walk-through of the implementation
  • Anonymizer & test

Related Issues

Resolves #399

Implementation Walk-through

Ideally, chart should pivot the result into a two-dimensional table. E.g. for the following table:

a b val
m x 3
m y 4

| chart avg(val) by a, b should make it a table like this:

a x y
m 3 4

However, it seems dynamic pivoting is not supported in SQL/Calcite (see the original discussion in #3965 (comment)). Therefore, the result table of the implemented chart command looks like this:

a b avg(val)
m x 3
m y 4

The pivoting can be performed in the front-end.
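As a rough illustration of what that front-end pivot could look like, here is a minimal sketch in plain Java. The class `ChartPivotSketch`, its `Row` type, and the `pivot` method are hypothetical names made up for this example; they are not part of this PR or of any OpenSearch API. It assumes the response rows arrive as [row-split, col-split, aggregation] triples.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side helper; not part of this PR. */
final class ChartPivotSketch {

  /** One row of the long-form chart result: [row-split, col-split, aggregation]. */
  static final class Row {
    final String rowSplit;
    final String colSplit;
    final double agg;

    Row(String rowSplit, String colSplit, double agg) {
      this.rowSplit = rowSplit;
      this.colSplit = colSplit;
      this.agg = agg;
    }
  }

  /** Pivots long-form rows into rowSplit -> (colSplit -> aggregation). */
  static Map<String, Map<String, Double>> pivot(List<Row> rows) {
    Map<String, Map<String, Double>> table = new LinkedHashMap<>();
    for (Row r : rows) {
      // Each distinct col-split value becomes a column of its row-split line.
      table.computeIfAbsent(r.rowSplit, k -> new LinkedHashMap<>()).put(r.colSplit, r.agg);
    }
    return table;
  }

  public static void main(String[] args) {
    List<Row> rows = List.of(new Row("m", "x", 3), new Row("m", "y", 4));
    System.out.println(pivot(rows)); // {m={x=3.0, y=4.0}}
  }
}
```

Combinations that are absent from the response simply stay absent from the inner maps, which matches the long-form result shown above.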

The above operation is equivalent to stats avg(val) by a, b. This holds when parameters like usenull, useother, and limit are not involved in the result.

When these parameters are involved, the chart command finds the top-N categories of b, aggregates the remaining categories into an OTHER category, and aggregates rows whose b is null into a "NULL" category. This leads to the following implementation:

  1. normal aggregation based on a, b (equivalent to stats agg_func by a, b)
  2. find out the top-N categories (unique values of column b) by aggregating on the above aggregation results
    1. aggregate on b
    2. sort on aggregation results
    3. number the rows
  3. left join the ranked results with the original aggregation
  4. keep rows whose row number is no greater than the limit, categorizing the rest to OTHER or NULL
  5. aggregate again, because rows categorized into OTHER or NULL need to be merged
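The steps above can be sketched on in-memory rows as follows. This is only an illustration under stated assumptions, not the Calcite plan built in this PR: it hard-codes a sum aggregation with top-N ordering, uses the literal labels "OTHER" and "NULL", and treats limit=0 as "no limit". The class `ChartTopNSketch`, its `Row` type, and the `chart` method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** In-memory sketch of the five steps for sum(); every name here is hypothetical. */
final class ChartTopNSketch {

  static final class Row {
    final String a; // row split
    final String b; // column split, may be null
    final long val;

    Row(String a, String b, long val) {
      this.a = a;
      this.b = b;
      this.val = val;
    }
  }

  /** Returns [a, label] -> sum(val), where label is b, "OTHER", or "NULL". */
  static Map<List<String>, Long> chart(List<Row> input, int limit) {
    // Step 1: normal aggregation by (a, b).
    Map<List<String>, Long> aggregated = new LinkedHashMap<>();
    for (Row r : input) {
      aggregated.merge(Arrays.asList(r.a, r.b), r.val, Long::sum);
    }

    // Step 2: aggregate on b alone (skipping null b), sort descending, number the rows.
    Map<String, Long> grandTotal = new LinkedHashMap<>();
    aggregated.forEach((key, sum) -> {
      if (key.get(1) != null) {
        grandTotal.merge(key.get(1), sum, Long::sum);
      }
    });
    List<String> ordered = grandTotal.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
    Map<String, Integer> rowNumber = new HashMap<>();
    for (int i = 0; i < ordered.size(); i++) {
      rowNumber.put(ordered.get(i), i + 1);
    }

    // Steps 3-5: look up each row's rank (the left join), relabel, aggregate again.
    Map<List<String>, Long> result = new LinkedHashMap<>();
    aggregated.forEach((key, sum) -> {
      String b = key.get(1);
      String label;
      if (b == null) {
        label = "NULL"; // a null b has no match in the ranked side of the join
      } else if (limit == 0 || rowNumber.get(b) <= limit) {
        label = b;
      } else {
        label = "OTHER";
      }
      result.merge(Arrays.asList(key.get(0), label), sum, Long::sum);
    });
    return result;
  }

  public static void main(String[] args) {
    List<Row> rows = List.of(
        new Row("m", "x", 3), new Row("m", "y", 4), new Row("m", "z", 1),
        new Row("n", "x", 5), new Row("n", null, 2));
    System.out.println(chart(rows, 1));
  }
}
```

With limit 1, only x (grand total 8) keeps its own label, so the y and z rows of m are merged into a single OTHER row by the final aggregation.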

Note:

This implementation does not reuse the timechart implementation, in order to avoid some existing bugs there. A follow-up PR will merge the two implementations, since chart is essentially a superset of timechart in terms of functionality.

Future work items

  • support multiple aggregation functions (Left as a TODO in the future: the output will be messy when multiple aggregations are involved because the results are not pivoted.)
  • unify implementation of timechart and chart
  • support more bin options like bins (after "Fix bins on time-related fields", #4612)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuancu yuancu added the feature label Oct 16, 2025
@yuancu yuancu force-pushed the issues/399 branch 2 times, most recently from 8297023 to 6b8934e Compare October 24, 2025 06:12
@yuancu yuancu marked this pull request as ready for review October 24, 2025 08:56
@yuancu yuancu marked this pull request as draft October 28, 2025 14:38
@yuancu yuancu marked this pull request as ready for review October 29, 2025 01:58
@yuancu yuancu changed the title from "WIP: Support chart command in PPL" to "Support chart command in PPL" Oct 29, 2025
yuancu added 23 commits October 30, 2025 10:26
Signed-off-by: Yuanchun Shen <[email protected]>

# Conflicts:
#	core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
#	integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
Collaborator Author

@yuancu yuancu left a comment

Code explanation.


// Convert the column split to string if necessary: column split was supposed to be pivoted to
// column names. This guarantees that its type compatibility with useother and usenull
RexNode colSplit = relBuilder.field(1);
Collaborator Author

The fields are [row-split, col-split, aggregation] now

Comment on lines +2085 to +2091
if (!SqlTypeUtil.isCharacter(colSplit.getType())) {
  colSplit =
      relBuilder.alias(
          context.rexBuilder.makeCast(
              UserDefinedFunctionUtils.NULLABLE_STRING, colSplit, true, true),
          columSplitName);
}
Collaborator Author

Convert the column split to a string so that its values can serve as column labels once pivoted. This also guarantees that its type is compatible with nullstr and otherstr.

// 1: column-split, 2: agg
relBuilder.project(relBuilder.field(1), relBuilder.field(2));
// Make sure that rows who don't have a column split not interfere grand total calculation
relBuilder.filter(relBuilder.isNotNull(relBuilder.field(0)));
Collaborator Author

testChartWithNullAndLimit covers this case. Without this filter, rows that have no column split value would also be numbered (ranked) whenever their aggregation result is large.

Comment on lines +2105 to +2110
// Apply sorting: for MIN/EARLIEST, reverse the top/bottom logic
boolean smallestFirst =
    aggFunction == BuiltinFunctionName.MIN || aggFunction == BuiltinFunctionName.EARLIEST;
if (config.top != smallestFirst) {
  grandTotal = relBuilder.desc(grandTotal);
}
Collaborator Author

See explanations in #4594
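For readers without #4594 at hand, the condition `config.top != smallestFirst` in the snippet above is an exclusive-or, and the small sketch below just enumerates its four cases. The class `SortDirectionSketch` and the method `sortDescending` are hypothetical names for illustration; the sketch assumes, as the snippet suggests, that the grand totals are sorted ascending unless `desc` is applied.

```java
/** Enumerates the sort direction chosen for each (top, smallestFirst) pair. */
final class SortDirectionSketch {

  /**
   * @param top true for limit=top N, false for limit=bottom N
   * @param smallestFirst true for functions such as MIN or EARLIEST
   * @return true when the grand totals are sorted in descending order
   */
  static boolean sortDescending(boolean top, boolean smallestFirst) {
    return top != smallestFirst;
  }

  public static void main(String[] args) {
    boolean[] flags = {true, false};
    for (boolean top : flags) {
      for (boolean smallestFirst : flags) {
        System.out.printf(
            "top=%b smallestFirst=%b -> %s%n",
            top, smallestFirst, sortDescending(top, smallestFirst) ? "DESC" : "ASC");
      }
    }
  }
}
```

So limit=top with sum or count sorts descending and keeps the largest totals, while limit=top with min sorts ascending and keeps the categories with the smallest minimum.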

Comment on lines +2127 to +2132
relBuilder.push(aggregated);
relBuilder.push(ranked);

// on column-split = group key
relBuilder.join(
    JoinRelType.LEFT, relBuilder.equals(relBuilder.field(2, 0, 1), relBuilder.field(2, 1, 0)));
Collaborator Author

aggregated: [row-split, col-split, aggregation]
ranked: [col-split, grand-total, row-number]

Comment on lines +2169 to +2171
relBuilder.aggregate(
    relBuilder.groupKey(relBuilder.field(0), relBuilder.field(1)),
    buildAggCall(context.relBuilder, aggFunction, relBuilder.field(2)).as(aggFieldName));
Collaborator Author

Final aggregation: to merge values in the OTHER categories.

* **limit**: optional. Specifies the number of distinct values to display when using column split.

* Default: 10
* Syntax: ``limit=(top|bottom) <number>`` or ``limit=<number>`` (defaults to top)
Collaborator

Explain the definition of top in the doc. Is it stats dc by col | sort -dc, +col?

Collaborator Author

It's to keep the top-K categories (distinct column splits).

E.g.

  • chart limit=1 count() by a b keeps the single b category with the most rows.
  • chart limit=bottom3 sum(value) by a b keeps the three b categories with the smallest sums of value.
  • chart limit=top2 min(value) by a b keeps the two b categories whose per-category minimum values are the smallest.
  • chart limit=bottom2 min(value) by a b keeps the two b categories whose per-category minimum values are the largest.

* Set to 0 to show all distinct values without any limit.
* Only applies when column split presents (by 2 fields or over...by... coexists).

* **useother**: optional. Controls whether to create an "OTHER" category for distinct column values beyond the limit.
Collaborator

column -> column_split

Collaborator Author

Fixed

* When set to true, distinct values beyond the limit are grouped into an "OTHER" category.
* Only applies when using column split and when there are more distinct column values than the limit.

* **usenull**: optional. Controls whether to include null values as a separate category.
Collaborator

Change the doc to make this clear:

  • usenull=true only applies to column_split
  • row_split should always be a non-null value.

Collaborator Author

@yuancu yuancu Oct 31, 2025

Fixed. row_split can actually contain null; it is handled in the same manner as in normal aggregations like stats count() by a, b when column a contains null values.

Notes
=====

* The column split field in the result will become strings so that they are compatible with ``nullstr`` and ``otherstr`` and can be used as column names once pivoted.
Collaborator

The column split field in the result will become strings ->
The fields generated by column splitting are converted to strings

Collaborator Author

Fixed. Thanks for the suggestion!

Comment on lines +144 to +152
os> source=accounts | chart limit=1 count() over gender by age
fetched rows / total rows = 3/3
+--------+-------+---------+
| gender | age   | count() |
|--------+-------+---------|
| M      | OTHER | 2       |
| M      | 33    | 1       |
| F      | OTHER | 1       |
+--------+-------+---------+
Collaborator

I would expect the result to include another row:
F 33 0

Then the pivot table would be:
gender 33 OTHER
M 1 2
F 0 1

Collaborator Author

@yuancu yuancu Oct 31, 2025

Since we are not really doing pivoting, I think it's better to omit empty groups? This avoids a large sparse response and reduces traffic. Besides, other aggregations also don't return results for empty buckets.

timechart also claims to omit those buckets:

Only combinations with actual data are included in the results - empty combinations are omitted rather than showing null or zero values.

If the front end wants those groups back, it can easily fill the missing combinations with null or 0.
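A minimal sketch of such a client-side fill, assuming the response has already been read into a map keyed by [row-split, col-split]. The class `FillMissingSketch` and the method `fillMissing` are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical client-side helper that densifies a sparse chart result. */
final class FillMissingSketch {

  /** Returns a copy of chart with every (row-split, col-split) pair present. */
  static Map<List<String>, Long> fillMissing(Map<List<String>, Long> chart, long fill) {
    Set<String> rowKeys = new LinkedHashSet<>();
    Set<String> colKeys = new LinkedHashSet<>();
    for (List<String> key : chart.keySet()) {
      rowKeys.add(key.get(0));
      colKeys.add(key.get(1));
    }
    Map<List<String>, Long> dense = new LinkedHashMap<>();
    for (String row : rowKeys) {
      for (String col : colKeys) {
        List<String> key = List.of(row, col);
        // Keep the returned value when present, otherwise use the fill value.
        dense.put(key, chart.getOrDefault(key, fill));
      }
    }
    return dense;
  }

  public static void main(String[] args) {
    Map<List<String>, Long> chart = new LinkedHashMap<>();
    chart.put(List.of("M", "OTHER"), 2L);
    chart.put(List.of("M", "33"), 1L);
    chart.put(List.of("F", "OTHER"), 1L);
    System.out.println(fillMissing(chart, 0L));
  }
}
```

Applied to the gender/age result above, this adds the single missing (F, 33) combination with the fill value 0 and leaves the three returned rows unchanged.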


PPL query::

os> source=accounts | chart limit=top 1 useother=true otherstr='minor_gender' count() over state by gender
Collaborator

syntax should be limit=top10

Collaborator Author

Fixed. Thanks for double checking


PPL query::

os> source=accounts | chart usenull=true nullstr='employer not specified' count() over firstname by employer
Collaborator

Add an example that demonstrates converting column_split to a string.

Collaborator Author

It's actually covered by example 3 and 4. Updated their descriptions.

Development

Successfully merging this pull request may close these issues.

[RFC] PPL Chart Command