@AKHIL-149
Contributor

Closes #63314

Description

This PR fixes a critical bug where pivot_table() produces corrupted output with duplicate index values when processing large datasets under Python 3.14.

Problem

When pivoting ~100,000 rows in Python 3.14, the result contained only ~33,334 unique index values instead of 100,000, with duplicate index entries.

Root Cause

The compress_group_index function in pandas/core/sorting.py relies on Int64HashTable.get_labels_groupby(), which produces incorrect results under Python 3.14. The cause is likely a change in hashtable or dictionary behavior, possibly related to free-threading support (PEP 703) or other Python 3.14 changes; this has not been confirmed.

Solution

Modified compress_group_index to:

  • Detect Python 3.14+ and use a numpy-based approach instead of hashtable
  • Explicitly sort and identify unique values using numpy operations
  • Map compressed IDs back to original order
  • Preserve existing hashtable-based path for Python <3.14
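
The numpy-based steps listed above can be sketched roughly as follows. This is an illustration only, not the diff from this PR: the function name compress_group_index_numpy is hypothetical, and the return convention (compressed IDs first, observed group IDs second) is assumed to mirror compress_group_index.

```python
import numpy as np


def compress_group_index_numpy(group_index, sort=True):
    """Hypothetical numpy-only stand-in; returns (comp_ids, obs_group_ids)."""
    group_index = np.asarray(group_index, dtype=np.int64)

    # np.unique returns the sorted distinct values, the position of each
    # value's first occurrence, and the inverse map from rows to uniques.
    uniques, first_idx, inverse = np.unique(
        group_index, return_index=True, return_inverse=True
    )
    inverse = inverse.ravel()

    if sort:
        # Sorted order: the inverse map already is the compressed ID.
        return inverse.astype(np.int64), uniques

    # Unsorted: renumber groups by order of first appearance.
    order = np.argsort(first_idx, kind="stable")
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[inverse].astype(np.int64), uniques[order]
```

In both modes, obs_group_ids[comp_ids] reproduces the original group_index, which is the property the pivot path depends on.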

Changes

  • pandas/core/sorting.py: Updated compress_group_index() function to handle Python 3.14+
  • pandas/tests/reshape/test_pivot.py: Added regression test test_pivot_table_large_dataset_no_duplicates()

Testing

Added test_pivot_table_large_dataset_no_duplicates() which:

  • Tests with 10,000 unique indices × 3 metrics (30,000 rows)
  • Verifies no duplicate indices in result
  • Ensures correct row count and index values
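
A check along these lines could look like the sketch below. It is not the test added in this PR: the helper name check_pivot_has_unique_index and the column names id, metric and value are made up for illustration.

```python
import numpy as np
import pandas as pd


def check_pivot_has_unique_index(n_ids=10_000, metrics=("a", "b", "c")):
    """Build n_ids x len(metrics) long-format rows and pivot them (illustrative)."""
    df = pd.DataFrame(
        {
            # Each id appears once per metric.
            "id": np.repeat(np.arange(n_ids), len(metrics)),
            "metric": np.tile(list(metrics), n_ids),
            "value": np.arange(n_ids * len(metrics), dtype="float64"),
        }
    )
    result = df.pivot_table(index="id", columns="metric", values="value")

    # No duplicate index entries, one row per id, and the ids in order.
    assert result.index.is_unique
    assert len(result) == n_ids
    assert result.index.tolist() == list(range(n_ids))
    return result
```

With the defaults this builds the 30,000-row case described above; raising n_ids gets closer to the size reported in the issue.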

The fix has been tested to ensure backward compatibility with Python <3.14.

Checklist

This commit addresses issue GH#63314 where pivot_table operations
on large datasets produce corrupted output with duplicate index
values when running on Python 3.14.

The root cause appears to be changes in Python 3.14's hashtable
implementation or dictionary behavior. The compress_group_index
function was relying on Int64HashTable.get_labels_groupby() which
produces incorrect results for large datasets in Python 3.14.

The fix uses a numpy-based approach for Python 3.14+ that:
- Explicitly sorts the group_index when needed
- Uses numpy operations to identify unique values
- Maps compressed IDs back to original order
- Preserves the existing hashtable-based path for older Python versions

Added regression test to ensure pivot_table correctly handles
large datasets without producing duplicate indices.
- Break long lines to comply with 88 character limit
- Use list comprehension instead of append in loop
- Improve code readability with multi-line formatting
@mroeschke
Member

Thanks for the pull request, but I suspect this PR was heavily AI-generated, as the fix is too specific. The project also discourages these types of AI pull requests, so closing.

@mroeschke mroeschke closed this Dec 9, 2025
@AKHIL-149
Contributor Author

Hi @mroeschke,

I understand your concern, but I'd like to clarify my approach here.

I spent time reading through the issue, looking at the pandas codebase (particularly the pivot_table and compress_group_index implementations), and researching Python 3.14 changes. The issue description mentioned the problem only occurs in Python 3.14 with large datasets, so I focused on what changed in that version that could affect grouping operations.

I know the fix is specific to Python 3.14, but that's because the issue itself is specific to that version. I based the approach on:

  1. The issue reporter's observation that it works in 3.13 but fails in 3.14
  2. The fact that ~1/3 of indices were appearing (suggesting a grouping/hashing problem)
  3. Looking at the compress_group_index function which has both a fast path and hashtable path

I'm open to feedback on the approach. If there's a better way to fix this or if I missed something in my analysis, I'd appreciate guidance. Should I investigate a different area of the code?

Thanks for reviewing.

@AKHIL-149 AKHIL-149 deleted the fix-pivot-table-python314 branch December 10, 2025 05:17
Development

Successfully merging this pull request may close these issues.

BUG: large pivot_table has incorrect output with Python 3.14
