
Commit fc5f788

⚡️ Speed up function dataframe_merge by 1,072%
The optimized code achieves a 1071% speedup by replacing slow pandas `.iloc[]` operations with fast NumPy array indexing. Here are the key optimizations:

**1. NumPy Array Access Instead of .iloc[]**
- **Original**: used `right.iloc[i][right_on]` and `left.iloc[i]` for data access, which are extremely slow pandas operations.
- **Optimized**: converted the DataFrames to NumPy arrays (`left.values`, `right.values`) and used direct array indexing such as `right_values[i, right_on_idx]` (see the sketch below).
- **Impact**: the line profiler shows `right.iloc[right_idx]` took 60.4% of total time in the original (8.32s), while the equivalent NumPy operations are barely visible in the optimized version.

**2. Pre-computed Column Index Mappings**
- **Original**: accessed columns by name repeatedly: `left_row[col]` and `right_row[col]`.
- **Optimized**: pre-computed column-to-index mappings (`left_col_indices`, `right_col_indices`) and used direct array indexing: `left_values[i, left_col_indices[col]]`.
- **Impact**: eliminates repeated column-name lookups and leverages NumPy's optimized indexing.

**3. Direct Column Index Lookup**
- **Original**: accessed the join columns through pandas Series indexing.
- **Optimized**: used `columns.get_loc()` to get integer indices upfront, enabling direct NumPy array access.

**Why This Works:**
- **NumPy vs pandas**: NumPy arrays provide O(1) direct memory access, while pandas `.iloc[]` has significant overhead for type checking, alignment, and Series creation.
- **Memory layout**: NumPy arrays store data contiguously in memory, enabling faster access patterns.
- **Reduced object creation**: the original created a pandas Series object for each row access; the optimized version works directly with primitive values.

**Test Case Performance:** the optimizations are most effective for:
- **Large datasets**: `test_large_scale_many_duplicates` shows a 753% speedup; the more data accessed, the greater the NumPy advantage.
- **Many matches**: cases with frequent `.iloc[]` calls benefit most from the NumPy conversion.
- **Cartesian products**: when duplicate keys create many row combinations, the NumPy indexing advantage compounds.

The optimization maintains identical functionality while dramatically reducing the computational overhead of data access operations.
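As a rough illustration of the `.iloc[]` vs NumPy indexing gap described above, here is a minimal micro-benchmark sketch. It is not part of the commit; the DataFrame size, column names, and variable names are illustrative assumptions, and absolute timings will vary by machine and pandas version.

```python
# Hypothetical micro-benchmark (not from this commit) contrasting per-row
# .iloc[] access with direct indexing into the DataFrame's underlying NumPy array.
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": np.arange(100_000), "val": np.random.rand(100_000)})

# Slow path: each .iloc[i] call constructs a pandas Series for the row.
start = time.perf_counter()
total = 0.0
for i in range(len(df)):
    total += df.iloc[i]["val"]
iloc_seconds = time.perf_counter() - start

# Fast path: grab the NumPy array once and index it with integer positions,
# mirroring the left_values / right_on_idx pattern used in the optimized merge.
values = df.values
val_idx = df.columns.get_loc("val")
start = time.perf_counter()
total = 0.0
for i in range(len(values)):
    total += values[i, val_idx]
numpy_seconds = time.perf_counter() - start

print(f".iloc loop:       {iloc_seconds:.3f} s")
print(f"NumPy array loop: {numpy_seconds:.3f} s")
```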
1 parent 9b951ff commit fc5f788

File tree

1 file changed: +24 -9 lines changed


src/numpy_pandas/dataframe_operations.py

Lines changed: 24 additions & 9 deletions
@@ -34,26 +34,38 @@ def groupby_mean(df: pd.DataFrame, group_col: str, value_col: str) -> dict[Any,
 def dataframe_merge(
     left: pd.DataFrame, right: pd.DataFrame, left_on: str, right_on: str
 ) -> pd.DataFrame:
-    result_data = []
+    # Use numpy for fast access to data and zip view for columns
     left_cols = list(left.columns)
     right_cols = [col for col in right.columns if col != right_on]
+
+    left_on_idx = left.columns.get_loc(left_on)
+    right_on_idx = right.columns.get_loc(right_on)
+
+    left_values = left.values
+    right_values = right.values
+
+    # Build right_dict using numpy array for fast data lookups
     right_dict = {}
-    for i in range(len(right)):
-        key = right.iloc[i][right_on]
+    for i in range(len(right_values)):
+        key = right_values[i, right_on_idx]
         if key not in right_dict:
             right_dict[key] = []
         right_dict[key].append(i)
-    for i in range(len(left)):
-        left_row = left.iloc[i]
-        key = left_row[left_on]
+
+    result_data = []
+    # Precompute col->index for faster access
+    left_col_indices = {col: idx for idx, col in enumerate(left_cols)}
+    right_col_indices = {col: idx for idx, col in enumerate(right.columns)}
+    for i in range(len(left_values)):
+        key = left_values[i, left_on_idx]
         if key in right_dict:
             for right_idx in right_dict[key]:
-                right_row = right.iloc[right_idx]
                 new_row = {}
+                # Use numpy fast value access
                 for col in left_cols:
-                    new_row[col] = left_row[col]
+                    new_row[col] = left_values[i, left_col_indices[col]]
                 for col in right_cols:
-                    new_row[col] = right_row[col]
+                    new_row[col] = right_values[right_idx, right_col_indices[col]]
                 result_data.append(new_row)
     return pd.DataFrame(result_data)

@@ -66,14 +78,17 @@ def pivot_table(

         def agg_func(values):
             return sum(values) / len(values)
+
     elif aggfunc == "sum":

         def agg_func(values):
             return sum(values)
+
     elif aggfunc == "count":

         def agg_func(values):
             return len(values)
+
     else:
         raise ValueError(f"Unsupported aggregation function: {aggfunc}")
     grouped_data = {}
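For reference, a minimal usage sketch of the function touched by this diff. The import path assumes `src/` is on `PYTHONPATH`; the sample data and the inner-join/Cartesian-product behaviour are inferred from the loop structure shown above, not taken from the repository's tests.

```python
# Hypothetical usage sketch; assumes src/ is on PYTHONPATH so that
# numpy_pandas.dataframe_operations is importable.
import pandas as pd

from numpy_pandas.dataframe_operations import dataframe_merge

left = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "c", "d"]})
right = pd.DataFrame({"id": [2, 2, 3], "score": [10, 20, 30]})

# Duplicate keys on both sides yield the Cartesian product of matching rows,
# matching the nested-loop structure in the diff above.
merged = dataframe_merge(left, right, left_on="id", right_on="id")
print(merged)
```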
