
Extreme memory usage for certain data patterns #1749

Description

Current Behaviour

Profiling the supplied 40,000-row-by-16-column (~5 MB) pd.DataFrame consumes 16+ GB of memory on a 16-core machine.

This is simplified and shrunk from an actual dataset of twice the size, which crashes out-of-memory on my 32 GB computer.
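For scale, a quick back-of-the-envelope check (assuming float64 columns) shows the frame's nominal size is about 5 MB, so a 16+ GB peak is a blow-up of three orders of magnitude:

```python
import numpy as np

# Nominal in-memory size of a 40,000 x 16 float64 frame
rows, cols = 40_000, 16
nbytes = rows * cols * np.dtype(np.float64).itemsize
print(f"{nbytes / 1e6:.2f} MB")   # 5.12 MB

# Ratio against the observed 16 GB peak
print(f"{16e9 / nbytes:.0f}x")    # 3125x
```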

The problem seems to be that the adfuller function requires a significant amount of transient memory for certain data patterns, and that this memory is requested simultaneously in each of the 16 worker threads.

Expected Behaviour

I'd expect memory usage to be somewhat less extreme :-)

A workaround is to pass pool_size=1, which limits peak memory to about 1/16th.

I suggest explicitly serializing the adfuller test so that it is called by only one thread at a time, regardless of pool size.
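The suggested serialization could be sketched with a plain lock around the memory-heavy call. This is a hypothetical illustration, not ydata-profiling's actual internals; the names run_adfuller and worker are made up, and a trivial sum stands in for the real statsmodels call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Gate the memory-hungry step behind a lock so only one thread
# allocates its transient buffers at a time.
_adf_lock = threading.Lock()

def run_adfuller(series):
    with _adf_lock:             # serialize just this step
        return sum(series)      # stand-in for the real adfuller call

def worker(series):
    # ...other per-column work can still run fully in parallel...
    return run_adfuller(series)

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(worker, [[1, 2, 3]] * 16))
print(results)  # [6, 6, 6, ...] — one result per column
```

The rest of each column's analysis keeps the parallel speedup; only the peak-memory step is single-file.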

The workaround is fine for me, and I don't need anything further; this is just a report in case you're not aware of this issue.

Data Description

The synthetic dataset below reproduces the problem.

Code that reproduces the bug

import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

# Build a length-10 ramp and tile it 4000 times -> 40,000-sample series
vals = np.linspace(0, 1, num=10)
s = np.hstack([vals] * 4000)

# 16 identical columns, one per core
df = pd.DataFrame([s] * 16).T
print(df.shape)  # (40000, 16)

# Time-series mode runs the adfuller test per column
ProfileReport(df, tsmode=True, lazy=False)
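For reference, the synthetic series is just a 10-point ramp tiled 4000 times, identical across all 16 columns; a quick structural check (independent of the profiler) confirms this:

```python
import numpy as np
import pandas as pd

vals = np.linspace(0, 1, num=10)       # 10-point ramp, 0.0 .. 1.0
s = np.hstack([vals] * 4000)           # tiled to 40,000 samples
df = pd.DataFrame([s] * 16).T          # 16 identical columns

assert df.shape == (40000, 16)
# The series is exactly periodic with period 10
assert np.array_equal(s[:10], s[10:20])
# Every row holds a single distinct value across all 16 columns
assert df.nunique(axis=1).max() == 1
```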

pandas-profiling version

v4.16.1

Dependencies

pandas==2.2.2
numpy==1.26.4
statsmodels==0.14.2
ydata_profiling==4.16.1

OS

Ubuntu

Checklist

  • There is not yet another bug report for this issue in the issue tracker
  • The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
  • The issue has not been resolved by the entries listed under Common Issues.

Labels: code quality 📈Improvements to the quality of the code base
