Issue Description:
Hello.
I have found a memory leak in pd.DataFrame.copy() in pandas 2.0.3. There are related discussions on GitHub, including #54352 and #55008. In this repository, metagpt/tools/libs/data_preprocess.py and metagpt/tools/libs/feature_engineering.py both use the affected API, and other files may use it as well.
Reproducible Example in pandas 2.0.3
The leak is slow but clearly noticeable: leaving an application running overnight exhausts the memory of a 32 GB system and crashes the application.
import pandas as pd
import numpy as np
from uuid import uuid4

index_length = 10_000
column_length = 100

index = list(range(index_length))
# Column labels are UUID objects, not strings
columns = [uuid4() for _ in range(column_length)]
data = np.random.random((index_length, column_length))
df = pd.DataFrame(data=data, index=index, columns=columns)

# Runs until interrupted; watch the process memory grow
while True:
    # This leaks
    df2 = df.copy()
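Because the loop above never terminates, a bounded variant may be easier to use when comparing pandas versions. The following is only a rough sketch: `traced_growth` and `check_dataframe_copy` are hypothetical helper names, not part of pandas or this repository. It relies on `tracemalloc`, which only sees allocations made through Python's allocators, so watching the process's resident memory may still be the more reliable signal.

```python
import gc
import tracemalloc


def traced_growth(make_copy, iterations=200):
    """Return the net change in traced memory (bytes) after calling
    make_copy() `iterations` times and discarding each result."""
    gc.collect()
    tracemalloc.start()
    try:
        make_copy()  # warm-up so one-time caches are not counted as growth
        gc.collect()
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            make_copy()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


def check_dataframe_copy(iterations=200):
    """Apply traced_growth to the DataFrame from the example above."""
    # Imported lazily so traced_growth stays usable without pandas installed.
    from uuid import uuid4

    import numpy as np
    import pandas as pd

    columns = [uuid4() for _ in range(100)]
    df = pd.DataFrame(np.random.random((10_000, 100)), columns=columns)
    return traced_growth(df.copy, iterations)
```

Calling `check_dataframe_copy()` under two different pandas versions and comparing the returned byte counts should give a quick indication of whether the growth goes away; a value that keeps increasing with `iterations` suggests the copies are not being released.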
Suggestion
I would recommend upgrading to a pandas version newer than 2.0.3, or exploring other ways to avoid the memory leak when copying a DataFrame.
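Until the dependency is changed, one possible stopgap is to warn at startup when the installed pandas is the version reported here. This is a minimal sketch under that assumption: `REPORTED_LEAKY_VERSION`, `is_reported_leaky`, and `warn_if_reported_leaky` are hypothetical names, and the check only flags 2.0.3 because that is the only version this report covers.

```python
import warnings
from importlib.metadata import PackageNotFoundError, version

# Assumption: only the version named in this report is flagged.
REPORTED_LEAKY_VERSION = "2.0.3"


def is_reported_leaky(installed):
    """Return True if the version string matches the reported version."""
    return installed.strip() == REPORTED_LEAKY_VERSION


def warn_if_reported_leaky(package="pandas"):
    """Emit a warning if the installed package is the reported version.

    Returns True when a warning was issued, False otherwise
    (including when the package is not installed at all).
    """
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    if is_reported_leaky(installed):
        warnings.warn(
            f"{package} {installed} is reported to leak memory in "
            "DataFrame.copy(); consider installing a different version.",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_reported_leaky()` once during application startup would surface the problem in logs without changing any behavior.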
Any other workarounds or solutions would be greatly appreciated.
Thank you!