The optimized code achieves a 1071% speedup by replacing slow pandas `.iloc[]` operations with fast NumPy array indexing. Here are the key optimizations:
**1. NumPy Array Access Instead of .iloc[]**
- **Original**: Used `right.iloc[i][right_on]` and `left.iloc[i]` for data access; each call builds a pandas Series for the row, which carries heavy per-call overhead
- **Optimized**: Converted DataFrames to NumPy arrays (`left.values`, `right.values`) and used direct array indexing like `right_values[i, right_on_idx]`
- **Impact**: The line profiler shows `right.iloc[right_idx]` took 60.4% of total time in the original (8.32s), while the equivalent NumPy operations are barely visible in the optimized version
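As a minimal sketch of this change, the snippet below reads the join key both ways and checks that the results agree. The sample frame and its column names are hypothetical; only `right`, `right_on`, `right_values`, and `right_on_idx` follow the names used in this description.

```python
import pandas as pd

# Hypothetical sample data; the real frames are not shown in this commit message.
right = pd.DataFrame({"key": [10, 20, 30], "score": [1, 2, 3]})
right_on = "key"

# Slow path: every .iloc[i] builds a pandas Series before the column lookup.
slow_keys = [right.iloc[i][right_on] for i in range(len(right))]

# Fast path: convert to a NumPy array once, then index with plain integers.
right_values = right.values
right_on_idx = right.columns.get_loc(right_on)
fast_keys = [right_values[i, right_on_idx] for i in range(len(right))]
```

Both lists hold the same keys; the difference is only in how much work each lookup does.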
**2. Pre-computed Column Index Mappings**
- **Original**: Accessed columns by name repeatedly: `left_row[col]` and `right_row[col]`
- **Optimized**: Pre-computed column-to-index mappings (`left_col_indices`, `right_col_indices`) and used direct array indexing: `left_values[i, left_col_indices[col]]`
- **Impact**: Eliminates repeated column name lookups and leverages NumPy's optimized indexing
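One way such a mapping might look is sketched below. The frame contents and the `row_as_dict` helper are assumptions made for illustration; `left_values` and `left_col_indices` mirror the names mentioned above.

```python
import pandas as pd

# Hypothetical sample data with mixed dtypes.
left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Convert once; mixed int/str columns produce an object-dtype array.
left_values = left.values

# Build the column-name -> position mapping a single time, outside any loop.
left_col_indices = {col: idx for idx, col in enumerate(left.columns)}

def row_as_dict(i):
    """Hypothetical helper: rebuild row i using only integer array indexing."""
    return {col: left_values[i, left_col_indices[col]] for col in left.columns}

row0 = row_as_dict(0)
```

The dictionary lookup replaces a label lookup on a Series, so the per-row cost is one plain `dict` access plus one array index per column.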
**3. Direct Column Index Lookup**
- **Original**: Accessed join columns through pandas Series indexing
- **Optimized**: Used `columns.get_loc()` to get integer indices upfront, enabling direct NumPy array access
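A caveat worth keeping in mind: `Index.get_loc()` returns a single integer only when the label is unique; with duplicate column names it returns a slice or a boolean mask instead. The sketch below guards against that case. The `resolve_column_index` helper is hypothetical and not part of this commit.

```python
import numpy as np
import pandas as pd

def resolve_column_index(df, col):
    """Hypothetical helper: return the integer position of a uniquely named column."""
    loc = df.columns.get_loc(col)
    if not isinstance(loc, (int, np.integer)):
        # Duplicate labels make get_loc return a slice or a boolean mask.
        raise ValueError(f"column {col!r} is not unique")
    return int(loc)

left = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
left_on_idx = resolve_column_index(left, "value")

# Direct array access using the index resolved up front.
first_value = left.values[0, left_on_idx]
```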
**Why This Works:**
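- **NumPy vs Pandas**: Both index in constant time, but every pandas `.iloc[]` call pays fixed overhead for indexer validation, type checking, and Series creation, while NumPy integer indexing reads almost directly from the underlying buffer
- **Memory Layout**: A single NumPy array stores its data contiguously in memory, so repeated element access avoids the extra indirection of going through the DataFrame on each call
- **Reduced Object Creation**: The original created a pandas Series object for each row access; the optimized version works directly with scalar values (note that `.values` on columns of mixed dtypes yields an `object` array, so the elements are only true primitives when the dtypes are homogeneous)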
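The object-creation point can be checked directly. This sketch uses hypothetical sample frames; it shows what each access returns and how the dtype of `.values` depends on the column types.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frames: one all-numeric, one with mixed dtypes.
numeric = pd.DataFrame({"id": [1, 2], "value": [0.5, 1.5]})
mixed = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# .iloc[i] allocates a new Series on every call.
row = numeric.iloc[0]

# Integer indexing on the converted array returns a scalar.
numeric_values = numeric.values
cell = numeric_values[0, 1]

# int + float columns upcast to float64; int + str columns fall back to object.
numeric_dtype = numeric_values.dtype
mixed_dtype = mixed.values.dtype
```

Because `.values` builds a new array for multi-dtype frames, the conversion should happen once before the loop rather than inside it.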
**Test Case Performance:**
The optimizations are most effective for:
- **Large datasets**: `test_large_scale_many_duplicates` shows a 753% speedup; the more data is accessed, the greater the NumPy advantage
- **Many matches**: Cases with frequent `.iloc[]` calls benefit most from the NumPy conversion
- **Cartesian products**: When duplicate keys create many row combinations, the NumPy indexing advantage compounds
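To illustrate why duplicate keys multiply the number of row accesses, here is a small inner-join sketch built on NumPy indexing. It is an assumption-laden illustration, not the function from this commit: `merge_sketch`, the sample frames, and the dictionary-based key lookup are all hypothetical.

```python
import pandas as pd

def merge_sketch(left, right, left_on, right_on):
    """Hypothetical inner join using NumPy indexing; not the code from this commit."""
    left_values, right_values = left.values, right.values
    left_on_idx = left.columns.get_loc(left_on)
    right_on_idx = right.columns.get_loc(right_on)

    # Map each right-side key to every row position where it occurs.
    right_positions = {}
    for j in range(len(right_values)):
        right_positions.setdefault(right_values[j, right_on_idx], []).append(j)

    rows = []
    for i in range(len(left_values)):
        # Each duplicate on the right adds one output row per matching left row.
        for j in right_positions.get(left_values[i, left_on_idx], []):
            rows.append(list(left_values[i]) + list(right_values[j]))
    return pd.DataFrame(rows, columns=list(left.columns) + list(right.columns))

left = pd.DataFrame({"k": [1, 1, 2], "a": [10, 11, 12]})
right = pd.DataFrame({"key": [1, 1, 3], "b": [20, 21, 22]})
result = merge_sketch(left, right, "k", "key")
```

Here two left rows and two right rows share key `1`, so the join emits four rows from only three input rows per side; every one of those combinations is a row access that would otherwise go through `.iloc[]`.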
The optimization maintains identical functionality while dramatically reducing the computational overhead of data access operations.