Commit 4d11be8

⚡️ Speed up function pivot_table by 2,181%
The optimization achieves a **2,181% speedup** by eliminating the most expensive operation in the original code: repeatedly calling `df.iloc[i]` to access DataFrame rows.

**Key Optimization: Vectorized Column Extraction**

The critical change replaces the inefficient row-by-row DataFrame access:

```python
# Original: expensive row access (71.1% of total time)
for i in range(len(df)):
    row = df.iloc[i]  # This line alone took 244ms of the 344ms total
    index_val = row[index]
    column_val = row[columns]
    value = row[values]
```

with direct NumPy array extraction and `zip` iteration:

```python
# Optimized: extract entire columns as arrays once
index_arr = df[index].values      # 2.4ms
columns_arr = df[columns].values  # 1.3ms
values_arr = df[values].values    # 1.3ms

# Then iterate over the arrays directly
for index_val, column_val, value in zip(index_arr, columns_arr, values_arr):
    ...
```

**Why This Works**

1. **`DataFrame.iloc[i]` is extremely slow**: it creates a new Series object for every row access and carries significant pandas indexing overhead.
2. **Array access is fast**: NumPy arrays provide direct memory access with minimal per-element overhead.
3. **Bulk extraction is efficient**: pulling entire columns at once leverages pandas' optimized column operations.

**Performance Impact by Test Case**

The optimization pays off across all test scenarios:

- **Large-scale tests see massive gains**: 3,543–6,406% speedup for datasets with 1,000+ rows
- **Medium datasets (100–900 rows)**: 1,560–5,350% speedup
- **Small datasets**: 57–129% speedup
- **Edge cases**: generally 19–92% faster, though very small inputs (single row, empty) show minimal or slightly negative impact due to the fixed cost of array extraction

The optimization is particularly effective for inputs with many rows, since it eliminates the per-row DataFrame access overhead and lets the algorithm scale much better with dataset size.
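To make the gap concrete, here is a minimal micro-benchmark sketch. It is not part of the commit: the column names, data sizes, and random data are invented for illustration, and absolute timings will vary by machine.

```python
# Micro-benchmark sketch (not from the commit): contrasts per-row
# df.iloc[i] access with bulk column extraction plus zip iteration.
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east", "west"], size=10_000),
    "sales": rng.random(10_000),
})

def per_row():
    # Each df.iloc[i] materializes a new Series, so the pandas
    # indexing machinery runs once per row.
    out = []
    for i in range(len(df)):
        row = df.iloc[i]
        out.append((row["region"], row["sales"]))
    return out

def vectorized():
    # One bulk .values extraction per column, then cheap zip
    # iteration over plain NumPy arrays.
    regions = df["region"].values
    sales = df["sales"].values
    return list(zip(regions, sales))

print("per-row:   ", timeit.timeit(per_row, number=1))
print("vectorized:", timeit.timeit(vectorized, number=1))
```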
1 parent 9b951ff commit 4d11be8

File tree

1 file changed: +13 −5 lines changed


src/numpy_pandas/dataframe_operations.py

Lines changed: 13 additions & 5 deletions
```diff
@@ -62,31 +62,39 @@ def pivot_table(
     df: pd.DataFrame, index: str, columns: str, values: str, aggfunc: str = "mean"
 ) -> dict[Any, dict[Any, float]]:
     result = {}
+    # Define aggregation function
     if aggfunc == "mean":
 
         def agg_func(values):
             return sum(values) / len(values)
+
     elif aggfunc == "sum":
 
         def agg_func(values):
             return sum(values)
+
     elif aggfunc == "count":
 
         def agg_func(values):
             return len(values)
+
     else:
         raise ValueError(f"Unsupported aggregation function: {aggfunc}")
+
+    # Vectorized extraction of columns for faster row iteration
+    index_arr = df[index].values
+    columns_arr = df[columns].values
+    values_arr = df[values].values
+
+    # Populate grouped_data directly using arrays, avoiding DataFrame row objects
     grouped_data = {}
-    for i in range(len(df)):
-        row = df.iloc[i]
-        index_val = row[index]
-        column_val = row[columns]
-        value = row[values]
+    for index_val, column_val, value in zip(index_arr, columns_arr, values_arr):
         if index_val not in grouped_data:
             grouped_data[index_val] = {}
         if column_val not in grouped_data[index_val]:
             grouped_data[index_val][column_val] = []
         grouped_data[index_val][column_val].append(value)
+
     for index_val in grouped_data:
         result[index_val] = {}
         for column_val in grouped_data[index_val]:
```
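For reference, a hypothetical usage sketch of the function after this change. The import path is inferred from the file path shown in this diff, and the data and expected output are invented for illustration.

```python
# Hypothetical usage sketch: data and expected output are invented;
# the import path is assumed from src/numpy_pandas/dataframe_operations.py.
import pandas as pd

from numpy_pandas.dataframe_operations import pivot_table

df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "product": ["a", "b", "a", "a", "b"],
    "sales": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# Rows keyed by region, columns by product, mean of sales (the default aggfunc).
result = pivot_table(df, index="region", columns="product", values="sales")
# {'north': {'a': 10.0, 'b': 20.0}, 'south': {'a': 40.0, 'b': 40.0}}
print(result)
```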
