Description
Current Behaviour
Profiling the supplied 40,000-row-by-16-column (5 MB) pd.DataFrame consumes over 16 GB of memory on a 16-core machine.
This is a simplified, shrunken version of a real dataset twice this size, which crashes out-of-memory on my 32 GB computer.
The problem appears to be that the adfuller function (the augmented Dickey-Fuller test used in time-series mode) requires a large amount of transient memory for certain data patterns, and that this memory is requested simultaneously by each of the 16 workers.
Expected Behaviour
I'd expect memory usage to be somewhat less extreme :-)
A workaround is to pass pool_size=1, which limits peak memory to about 1/16th.
I suggest it may be worth explicitly serializing the adfuller call so that it is only executed by one worker at a time, regardless of pool size.
The workaround is fine for me and I don't need anything; this is just a report in case you're not aware of this issue.
Data Description
The synthetic dataset below reproduces the problem.
Code that reproduces the bug
```python
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

# 16 identical columns, each a 0..1 ramp of length 10 tiled 4,000 times
vals = np.linspace(0, 1, num=10)
s = np.hstack([vals] * 4000)
df = pd.DataFrame([s] * 16).T
print(df.shape)  # (40000, 16)

# tsmode=True enables time-series profiling (which runs adfuller);
# lazy=False forces the whole report to be computed immediately
ProfileReport(df, tsmode=True, lazy=False)
```
pandas-profiling version
v4.16.1
Dependencies
pandas==2.2.2
numpy==1.26.4
statsmodels==0.14.2
ydata_profiling==4.16.1
OS
Ubuntu
Checklist
- There is not yet another bug report for this issue in the issue tracker
- The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
- The issue has not been resolved by the entries listed under Common Issues.