[BUG] PPL rename command breaks dedup operation - returns null values #4563

@alexey-temnikov

Query Information

PPL Command/Query:

source=test-rename-bug 
| rename status as http_status 
| dedup http_status 
| fields http_status

Expected Result:
The query should return deduplicated values from the status field under the renamed alias http_status:

{
  "schema": [{"name": "http_status", "type": "string"}],
  "datarows": [["200"], ["500"], ["404"]],
  "total": 3,
  "size": 3
}

Actual Result:
The query returns only null values:

{
  "schema": [{"name": "http_status", "type": "string"}],
  "datarows": [[null]],
  "total": 1,
  "size": 1
}

Dataset Information

Dataset/Schema Type: Custom (simple test schema)

Index Mapping:

{
  "mappings": {
    "properties": {
      "status": { "type": "keyword" },
      "service": { "type": "keyword" },
      "value": { "type": "integer" }
    }
  }
}

Sample Data:

{"status":"200","service":"api","value":100}
{"status":"500","service":"web","value":200}
{"status":"200","service":"db","value":150}
{"status":"404","service":"api","value":50}
{"status":"500","service":"api","value":75}

Bug Description

Issue Summary:
When using the rename command to alias a field, subsequent dedup operations on the renamed field fail and return null values instead of the actual deduplicated data. This affects all field types (keyword, text, numeric, nested), not just nested fields.

Steps to Reproduce:

  1. Create an index with any field (e.g., status as keyword type)
  2. Insert documents with duplicate values in that field
  3. Execute: source=<index> | rename <field> as <alias> | dedup <alias> | fields <alias>
  4. Observe that the result contains only null values
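
The query template in step 3 can be generated for any index, field, and alias. The sketch below builds the query string and a JSON request body of the form `{"query": "..."}`, which is the shape the PPL endpoint of the OpenSearch SQL plugin (`POST /_plugins/_ppl`) accepts. The class and method names are hypothetical, and the JSON escaping is deliberately minimal (backslashes and double quotes only).

```java
class PplRepro {
  // Fills in the template from step 3 of the reproduction.
  static String buildQuery(String index, String field, String alias) {
    return String.format(
        "source=%s | rename %s as %s | dedup %s | fields %s",
        index, field, alias, alias, alias);
  }

  // Wraps the query in a JSON body; escapes only backslashes and double quotes.
  static String toRequestBody(String query) {
    String escaped = query.replace("\\", "\\\\").replace("\"", "\\\"");
    return "{\"query\": \"" + escaped + "\"}";
  }

  public static void main(String[] args) {
    String query = buildQuery("test-rename-bug", "status", "http_status");
    System.out.println(toRequestBody(query));
  }
}
```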

Comparison:

Working (without rename):

source=test-rename-bug | dedup status | fields status

Returns: ["200"], ["500"], ["404"]

Failing (with rename):

source=test-rename-bug | rename status as http_status | dedup http_status | fields http_status

Returns: [null]

Working (rename without dedup):

source=test-rename-bug | rename status as http_status | fields http_status

Returns: All 5 documents with correct values

Impact:
This bug makes it impossible to dedup on a renamed field, which is a common pattern in data transformation and analysis pipelines. Users must either keep the original field name until after the dedup or give up the rename.

Environment Information

OpenSearch Version: 3.4.0-SNAPSHOT

Additional Details:

Root Cause Analysis

Execution Plan Analysis

Using the _explain endpoint reveals the issue:

Working Query Plan (without rename):

Physical: CalciteEnumerableIndexScan(
  PushDownContext=[[PROJECT->[status], FILTER->IS NOT NULL($0)]]
)

Failing Query Plan (with rename):

Logical: LogicalProject(http_status=[$2])
Physical: CalciteEnumerableIndexScan(
  PushDownContext=[[PROJECT->[status], FILTER->IS NOT NULL($0)]]
)

The physical plan correctly pushes down the original field name (status) to OpenSearch, but the logical plan references the renamed field name (http_status).

Code-Level Root Cause

Disclaimer: This is a preliminary analysis and requires further investigation.

The bug is in OpenSearchDedupPushdownRule.java:

// Line 57-58
final List<String> fieldNameList = projectWithWindow.getInput().getRowType().getFieldNames();
List<Integer> selectColumns = PlanUtils.getSelectColumns(windows.getFirst().partitionKeys);
String fieldName = fieldNameList.get(selectColumns.getFirst());

// Line 60
CalciteLogicalIndexScan newScan = scan.pushDownCollapse(finalOutput, fieldName);

Problem: After a rename operation, projectWithWindow.getInput().getRowType().getFieldNames() returns the renamed field name (e.g., "http_status"), not the original field name (e.g., "status").

This renamed field name is then passed to pushDownCollapse() in CalciteLogicalIndexScan.java:

public CalciteLogicalIndexScan pushDownCollapse(Project finalOutput, String fieldName) {
  ExprType fieldType = osIndex.getFieldTypes().get(fieldName);
  if (fieldType == null) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Cannot pushdown the dedup '{}' due to it is not a index field", fieldName);
    }
    return null;  // Fails silently
  }
  // ...
}

Problem: osIndex.getFieldTypes() only contains the original field names from the OpenSearch index mapping, not any renamed aliases. When it looks up "http_status", it returns null, causing the dedup pushdown optimization to fail silently.
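
The failed lookup can be reproduced in isolation. This is a simplified model only: a plain `Map` stands in for `osIndex.getFieldTypes()`, populated from the index mapping in this issue, and `canPushDownCollapse` is a hypothetical helper that mirrors just the null check in `pushDownCollapse()`.

```java
import java.util.Map;

class FieldTypeLookup {
  // Stand-in for osIndex.getFieldTypes(): keyed by index mapping names only.
  static final Map<String, String> FIELD_TYPES = Map.of(
      "status", "keyword",
      "service", "keyword",
      "value", "integer");

  // Mirrors the null check: pushdown proceeds only for known index fields.
  static boolean canPushDownCollapse(String fieldName) {
    return FIELD_TYPES.get(fieldName) != null;
  }

  public static void main(String[] args) {
    System.out.println(canPushDownCollapse("status"));      // original name
    System.out.println(canPushDownCollapse("http_status")); // renamed alias
  }
}
```

The original name is found, while the alias is not, which corresponds to the branch that logs at debug level and returns null.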

Without the pushdown optimization, the dedup operation falls back to a less efficient execution path that doesn't properly handle the renamed field, resulting in null values in the output.

Tentative Proposed Fix

Disclaimer: This is a preliminary analysis and requires further investigation.

Option 1: Resolve Renamed Field to Original Field Name

Modify OpenSearchDedupPushdownRule.java to get the original field name from the scan's row type instead of the renamed row type:

// Instead of:
String fieldName = fieldNameList.get(selectColumns.getFirst());

// Use the scan's original field names:
final List<String> originalFieldNames = scan.getRowType().getFieldNames();
String fieldName = originalFieldNames.get(selectColumns.getFirst());
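
One caveat for Option 1: the column index taken from the partition keys refers to the input of `projectWithWindow`, which may not line up with the scan's row type if an intermediate project reorders or drops columns (the `LogicalProject(http_status=[$2])` line above suggests `status` sits at position 2 in the scan). The sketch below models mapping the index back through such a project before looking up the name. It is a standalone illustration, not Calcite code: `RenameResolver`, `resolveOriginal`, and the column order of the scan are assumptions.

```java
import java.util.List;

class RenameResolver {
  // scanFields: field names in the scan's row type.
  // projectInputRefs: for each output column of the rename project,
  //   the index of the scan column it reads (e.g. http_status=[$2] -> 2).
  // projectColumn: the dedup key's index in the project's output.
  static String resolveOriginal(
      List<String> scanFields, List<Integer> projectInputRefs, int projectColumn) {
    int scanIndex = projectInputRefs.get(projectColumn);
    return scanFields.get(scanIndex);
  }

  public static void main(String[] args) {
    // Assumed scan column order; only the position of "status" comes from the plan.
    List<String> scanFields = List.of("service", "value", "status");
    List<Integer> projectInputRefs = List.of(2); // http_status=[$2]

    System.out.println(scanFields.get(0)); // naive lookup by project index
    System.out.println(resolveOriginal(scanFields, projectInputRefs, 0));
  }
}
```

Under these assumptions the naive lookup picks `service`, while the mapped lookup picks `status`, so the real fix likely needs to translate the index through the project rather than reuse it directly.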

Option 2: Use ExprType's getOriginalPath()

The codebase already has a mechanism to track original field paths via ExprType.getOriginalPath(). The dedup pushdown rule could leverage this to resolve renamed fields back to their original names.

Workaround

There is no direct workaround: dedup on a renamed field always returns null. The only alternative is to reorder the pipeline so that dedup runs on the original field name before the rename:

# Workaround: Dedup first, then rename
source=test-rename-bug 
| dedup status 
| rename status as http_status 
| fields http_status

However, this workaround may not be suitable for all use cases, especially when the dedup field needs to be computed or transformed before deduplication.

Additional Testing

Test Case 1: Simple Keyword Field

source=test-rename-bug | rename status as http_status | dedup http_status | fields http_status

Result: ❌ Returns [null]

Test Case 2: Nested Field (OTEL Schema)

source=otel-v1-apm-span-000001 
| rename `span.attributes.http@status_code` as my_precious 
| dedup my_precious 
| fields my_precious

Result: ❌ Returns [null]

Test Case 3: Rename + Fields (No Dedup)

source=test-rename-bug | rename status as http_status | fields http_status

Result: ✅ Works correctly, returns all 5 documents with proper values

Test Case 4: Dedup + Rename (Reversed Order)

source=test-rename-bug | dedup status | rename status as http_status | fields http_status

Result: ✅ Works correctly, returns deduplicated values

Related Issues

Labels: PPL (Piped processing language), bug (Something isn't working)
