@konstantinb
Contributor

@konstantinb konstantinb commented Aug 7, 2025

What changes were proposed in this pull request?

HIVE-29084: Proposing changes to ASTConverter's logic of tableAlias assignment for Lateral View Queries

Why are the changes needed?

Before these changes, ASTConverter assigned the base table's alias as the tableAlias of every column in the query tree. Technically, LV columns belong to a "separate" table participating in an implicit join. As a result, PPD processing treated filters that combine base table columns and LV columns as conditions on the columns of a single table.
The following condition:

} else if (!chAlias.equalsIgnoreCase(alias)) {

caused these expressions to be treated as "pushable candidates". The subsequent processing logic has no knowledge of how to optimize, convert, or process such expressions, so they are ultimately discarded during the LateralViewJoinerPPD.removeAllCandidates() call.
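
To make the failure mode concrete, here is a minimal sketch of an alias-based candidate check. It is not Hive's code: `AliasCheckSketch`, `ColumnRef`, and `singleAliasOrNull` are hypothetical names, and only the `equalsIgnoreCase` comparison mirrors the condition quoted above.

```java
import java.util.Arrays;
import java.util.List;

final class AliasCheckSketch {
    // Hypothetical stand-in for a column reference inside a filter expression.
    static final class ColumnRef {
        final String tableAlias;
        final String column;
        ColumnRef(String tableAlias, String column) {
            this.tableAlias = tableAlias;
            this.column = column;
        }
    }

    // Returns the alias shared by all columns, or null when the expression
    // references more than one alias and is therefore not a pushdown candidate.
    static String singleAliasOrNull(List<ColumnRef> cols) {
        String alias = null;
        for (ColumnRef c : cols) {
            String chAlias = c.tableAlias;
            if (alias == null) {
                alias = chAlias;
            } else if (!chAlias.equalsIgnoreCase(alias)) {
                return null; // mixed aliases: leave the filter in place
            }
        }
        return alias;
    }

    public static void main(String[] args) {
        // Base column and LV column carrying the same alias: looks like one table.
        System.out.println(singleAliasOrNull(Arrays.asList(
                new ColumnRef("t", "key"), new ColumnRef("t", "col"))));
        // With distinct aliases the expression is recognized as spanning two tables.
        System.out.println(singleAliasOrNull(Arrays.asList(
                new ColumnRef("t", "key"), new ColumnRef("lv", "col"))));
    }
}
```

In this sketch, the first call reports a single alias, which is how a predicate such as `t.key = '333' OR lv.col = '86'` could be mistaken for a single-table filter when both columns carry the base table's alias.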

A very simple query that reproduces the bug is:

SELECT t.key, t.value, lv.col
FROM (SELECT '238' AS key, 'val_238' AS value) t
LATERAL VIEW explode(array('238', '86', '311')) lv AS col
WHERE t.key = '333' OR lv.col = '86'
ORDER BY t.key, lv.col;
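
The effect of losing the filter can be illustrated with a small simulation of the query above. This is only a sketch of the query's semantics in plain Java, with hypothetical names (`LateralViewQuerySketch`, `where`, `run`); it does not invoke Hive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LateralViewQuerySketch {
    // The WHERE clause of the query: t.key = '333' OR lv.col = '86'.
    static boolean where(String key, String col) {
        return key.equals("333") || col.equals("86");
    }

    // Simulates the lateral view: the single base row is paired with each
    // element produced by explode(array('238', '86', '311')).
    static List<String> run(boolean applyFilter) {
        String key = "238";
        String value = "val_238";
        List<String> rows = new ArrayList<>();
        for (String col : Arrays.asList("238", "86", "311")) {
            if (!applyFilter || where(key, col)) {
                rows.add(key + "\t" + value + "\t" + col);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(run(true).size());  // prints 1 (filter applied)
        System.out.println(run(false).size()); // prints 3 (filter dropped)
    }
}
```

With the filter applied, only the row whose `col` is `'86'` survives. If the filter is discarded, all three exploded rows are returned.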

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Tested locally primarily with TestMiniLlapLocalCliDriver
  • Applied the same patch to a custom Hive implementation based on Hive 4.0.1 and confirmed the accuracy of the results of impacted queries after the fix

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@konstantinb konstantinb marked this pull request as ready for review August 12, 2025 22:05
@konstantinb
Contributor Author

@zabetak I'd greatly appreciate it if you could take a second look at this PR.

@zabetak
Member

zabetak commented Sep 1, 2025

@konstantinb I will check tomorrow. Apologies for the delay but I was off for some time.

Comment on lines 608 to 611
// Create schema that preserves base table columns with original alias,
// but gives new UDTF columns the unique lateral view alias
int baseFieldCount = tableFunctionSource.schema.size();
List<RelDataTypeField> allOutputFields = tfs.getRowType().getFieldList();
Member

From the syntax definition, a LATERAL VIEW is a virtual table with a user-defined table alias. Conceptually, every column that is in the output of the lateral view has the same table alias so I would expect that all columns in the same schema should have the same alias.

For all conversions inside the ASTConverter, we should distinguish the input schema(s) from the output schema. Both are very important for correctly and unambiguously constructing the AST/SQL query. For the lateral view case, the input and output schemas are somewhat mixed together, and maybe they shouldn't be. Some code inside the createASTLateralView method operates on the input schema and some on the output schema. In other words, up to a certain point in the code, I think we could use the schema as is from the input/source, and once we are done we could simply generate the output (new) schema using a new (generated) table alias. The idea is outlined in the comment below.
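
As an illustration only, one reading of this input/output separation could look like the sketch below. The names `LateralViewSchemaSketch`, `Col`, `outputSchema`, and the alias `lv_out` are hypothetical and do not come from the ASTConverter code or from the reviewer's outline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LateralViewSchemaSketch {
    // Hypothetical column descriptor: a table alias plus a column name.
    static final class Col {
        final String tableAlias;
        final String name;
        Col(String tableAlias, String name) {
            this.tableAlias = tableAlias;
            this.name = name;
        }
        @Override
        public String toString() {
            return tableAlias + "." + name;
        }
    }

    // Builds the output schema: every output column, whether it came from the
    // input or from the UDTF, is labelled with the same newly generated alias.
    // The input list is only read, never modified.
    static List<Col> outputSchema(List<Col> input, List<String> udtfColumns, String newAlias) {
        List<Col> out = new ArrayList<>();
        for (Col c : input) {
            out.add(new Col(newAlias, c.name));
        }
        for (String name : udtfColumns) {
            out.add(new Col(newAlias, name));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Col> input = Arrays.asList(new Col("t", "key"), new Col("t", "value"));
        // prints [lv_out.key, lv_out.value, lv_out.col]
        System.out.println(outputSchema(input, Arrays.asList("col"), "lv_out"));
    }
}
```

The point of the sketch is that the input schema keeps its original aliases while the AST is being built, and the single generated alias appears only on the schema handed to the next conversion step.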

LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
ORDER BY first_val, second_val;
Member

I guess the choice between SORT_QUERY_RESULTS and explicit ORDER BY in the query is somewhat subjective.
Both can avoid test flakiness and each has its own advantages & disadvantages.

Putting an ORDER BY in every query makes the tests more verbose and expands their scope. The plans will have more operators, EXPLAIN outputs will contain more info than strictly necessary, and potentially more rules will match/apply and affect the output plan. On the positive side, it is a native way to enforce sorted output and avoid potential test flakiness.

SORT_QUERY_RESULTS applies to all queries inside the file and is a post-processing step completely independent of query execution. Test inputs and outputs are less verbose, and the sorting does not interfere with query execution or the actual testing scope.

Personally, for this case I feel that SORT_QUERY_RESULTS is the better choice, but I don't feel that strongly about it. I am OK with accepting the ORDER BY approach if you prefer that. However, the test file currently contains both SORT_QUERY_RESULTS and ORDER BY clauses, so we should remove one of them. I leave the final choice to you.
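
The post-processing idea can be sketched as follows. This is not how Hive's test driver implements SORT_QUERY_RESULTS; `SortedOutputSketch`, `sortLines`, and `sameResult` are hypothetical names used only to show why sorting the output lines after execution makes the comparison order-independent without touching the query plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SortedOutputSketch {
    // Returns a sorted copy of the result lines, leaving the input untouched.
    static List<String> sortLines(List<String> lines) {
        List<String> copy = new ArrayList<>(lines);
        Collections.sort(copy);
        return copy;
    }

    // Two runs match if they contain the same lines, regardless of arrival order.
    static boolean sameResult(List<String> expected, List<String> actual) {
        return sortLines(expected).equals(sortLines(actual));
    }

    public static void main(String[] args) {
        List<String> run1 = Arrays.asList("238\t86", "238\t311");
        List<String> run2 = Arrays.asList("238\t311", "238\t86");
        System.out.println(sameResult(run1, run2)); // prints true
    }
}
```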

- clear separation between input & output schemas during LV AST conversion
- simplified and minimized test queries
…ccurately show result accuracy after the fix
@konstantinb konstantinb changed the title from "HIVE-29084: ensuring different tableAlias values between the base table and LV columns to avoid dropping filters during PPD" to "HIVE-29084: use nextAlias for the output schema of LV columns after AST Conversion" Sep 8, 2025
@konstantinb konstantinb requested a review from zabetak September 9, 2025 03:08
Member

@zabetak zabetak left a comment

@konstantinb I am waiting for a final confirmation from your side regarding the changes that I pushed, but from my side everything looks good and this PR is ready to go in!

@konstantinb
Contributor Author

> @konstantinb I am waiting a final confirmation from your side regarding the changes that I pushed but from my side everything looks good and this PR is ready to go in!

@zabetak thank you very much for the modifications. I was sure I had already removed the no longer used schema constructor; my apologies.

Your refactoring is almost a 1:1 match with an intermediate working variant I had; I was concerned that the signature change of createASTLateralView() plus an early return in convertSource() might be harder to follow and could be frowned upon. I am fully comfortable with your refactoring, thank you!

@zabetak zabetak merged commit 3b3c1cf into apache:master Sep 15, 2025
4 checks passed
@zabetak
Member

zabetak commented Sep 15, 2025

@konstantinb Many thanks for the PR and your thorough analysis and explanations here and under the JIRA ticket.
