
pd.DataFrame.copy() leaks in pandas 2.0.3 #868

@TendouArisu


Issue Description:
Hello.
I have found a memory leak in pd.DataFrame.copy() in pandas 2.0.3. There are related discussions on GitHub, including #54352 and #55008. In this repository, metagpt/tools/libs/data_preprocess.py and metagpt/tools/libs/feature_engineering.py both use the affected API, and other files may use it as well.
Reproducible Example in pandas 2.0.3
The leak is slow but clearly noticeable: leaving an application running overnight exhausts the memory of a 32 GB system and crashes the application.

import pandas as pd
import numpy as np
from uuid import uuid4

index_length = 10_000
column_length = 100

index = list(range(index_length))
columns = [uuid4() for _ in range(column_length)]
data = np.random.random((index_length, column_length))
df = pd.DataFrame(data=data, index=index, columns=columns)

while True:
    # This leaks
    df2 = df.copy()
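
Since the loop above never terminates, a bounded check may be easier to run in CI or a notebook. The following is only a rough sketch: `measure_growth` is a hypothetical helper (not part of pandas or this repository) that uses the standard-library `tracemalloc` module to report how much traced memory is still held after repeating an operation. It assumes the leaked allocations are visible to `tracemalloc`; memory allocated outside Python's allocators would not show up, in which case watching the process RSS is the safer measurement.

```python
import gc
import tracemalloc


def measure_growth(operation, iterations=1_000, warmup=10):
    """Return the bytes of traced memory still held after repeated calls.

    `operation` is any zero-argument callable; its return value is discarded
    on every iteration, so a non-leaking operation should report a value
    close to zero.
    """
    # Warm-up calls let one-time caches fill before measurement starts.
    for _ in range(warmup):
        operation()
    gc.collect()

    tracemalloc.start()
    baseline, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        operation()
    gc.collect()  # drop anything that is merely awaiting collection
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


# Hypothetical usage with the DataFrame built in the example above:
#     growth = measure_growth(df.copy, iterations=1_000)
#     print(f"retained after 1000 copies: {growth / 1e6:.1f} MB")
```

A result that keeps increasing as `iterations` is raised would point to retained memory, while a roughly constant value would suggest the copies are being released.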

Suggestion
I recommend upgrading to a pandas release later than 2.0.3, or exploring other ways to avoid the leak when copying data frames.
Any other workarounds or solutions would be greatly appreciated.
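
Until the dependency is changed, one possible stopgap is a startup check that warns when the version reported here is installed. This is a minimal sketch using only the standard library; `installed_version` and `warn_if_reported_version` are hypothetical names, and the check flags only 2.0.3 because that is the only version this report covers. Other releases may or may not be affected.

```python
import warnings
from importlib import metadata

# Assumption: only the version named in this issue is flagged.
REPORTED_LEAKY_VERSION = "2.0.3"


def installed_version(package="pandas"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def warn_if_reported_version(version):
    """Warn and return True when `version` matches the reported release."""
    if version == REPORTED_LEAKY_VERSION:
        warnings.warn(
            f"pandas {version} is reported to leak memory in DataFrame.copy()",
            RuntimeWarning,
        )
        return True
    return False


# Hypothetical usage at application startup:
#     warn_if_reported_version(installed_version("pandas"))
```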
Thank you!
