@hqbhoho hqbhoho commented Sep 30, 2025

Description

Currently, uniqueIdSymbol comes from the target table. In INSERT scenarios, unique_id is always null, which causes the entire dataset to be processed by a single task, severely limiting parallelism and performance.

// Assign a unique id to every target table row
Symbol uniqueIdSymbol = symbolAllocator.newSymbol("unique_id", BIGINT);
RelationPlan planWithUniqueId = new RelationPlan(
       new AssignUniqueId(idAllocator.getNextId(), targetTablePlan.getRoot(), uniqueIdSymbol),
       mergeAnalysis.getTargetTableScope(),
       targetTablePlan.getFieldMappings(),
       outerContext);
// Mark distinct combinations of the unique_id value and the case_number
Symbol isDistinctSymbol = symbolAllocator.newSymbol("is_distinct", BOOLEAN);
MarkDistinctNode markDistinctNode = new MarkDistinctNode(idAllocator.getNextId(), project, isDistinctSymbol, ImmutableList.of(uniqueIdSymbol, caseNumberSymbol));
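
To illustrate the skew described above, here is a minimal, self-contained sketch. It assumes rows are distributed across tasks by a hash of the MarkDistinct keys (unique_id, case_number) and that all inserted rows share the same case_number. The class `NullKeySkewSketch` and its methods are hypothetical and use `Objects.hash` as a stand-in; this is not the engine's actual partitioning function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class NullKeySkewSketch {
    // Hypothetical stand-in for hash partitioning on (unique_id, case_number).
    // Objects.hash accepts null elements, so a null unique_id is a valid key.
    static int partitionFor(Long uniqueId, int caseNumber, int partitionCount) {
        return Math.floorMod(Objects.hash(uniqueId, caseNumber), partitionCount);
    }

    // Counts how many distinct partitions receive at least one row.
    static int partitionsUsed(List<Long> uniqueIds, int caseNumber, int partitionCount) {
        Set<Integer> used = new HashSet<>();
        for (Long id : uniqueIds) {
            used.add(partitionFor(id, caseNumber, partitionCount));
        }
        return used.size();
    }

    // Builds the ids 0, 1, ..., count - 1, modelling rows that each have their own unique id.
    static List<Long> distinctIds(int count) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < count; i++) {
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        int partitionCount = 8;
        int caseNumber = 0;

        // INSERT scenario: every row has unique_id == null.
        List<Long> nullIds = Collections.nCopies(1000, (Long) null);
        System.out.println("null unique_id: " + partitionsUsed(nullIds, caseNumber, partitionCount) + " partition(s) used");

        // Rows that each carry a distinct unique_id spread across partitions.
        System.out.println("distinct unique_id: " + partitionsUsed(distinctIds(1000), caseNumber, partitionCount) + " partition(s) used");
    }
}
```

Under these assumptions, every row with a null unique_id produces the same key and therefore lands in the same partition, while rows with distinct ids are spread across all of them.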

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text:

## General
* Improve performance of MERGE for MarkDistinct. ({issue}`26759`)

@cla-bot cla-bot bot added the cla-signed label Sep 30, 2025
@hqbhoho hqbhoho marked this pull request as draft September 30, 2025 08:21