-
-
Notifications
You must be signed in to change notification settings - Fork 18.9k
fix : skip memoizing 0,1,True,False in sanitization function #62226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix : skip memoizing 0,1,True,False in sanitization function #62226
Conversation
Perf impact? This is here for a reason. |
This is the commit that introduces the change I dont't understand why this was added, but reverting solves the below issue from pandas._libs.parsers import sanitize_objects
values = np.array([1,"NA",True],dtype=object)
print("Values before sanitization:",values)
sanitize_objects(values,na_values={"NA"})
print("Values after sanitization:",values) output Values before sanitization: [1 'NA' True]
Values after sanitization: [1 nan 1] Eventhough the sanitization parts works fine (NA->nan), it is converting True to 1 and that is due to the memo |
The commit message was "memoize objects when reading from file to reduce memory footprint". So removing it will likely balloon memory footprint. Instead of removing it, might be more effective to just check for 0, 1, True, False explicitly and let other values be memoized? |
Thanks, now i understand about memory footprint. skipping memoization just for those 4 values might not be a good approach, because what if the data contains only mixture of those 4 values? it can blew up the memory I have included type of the value too in memo key, which will solve this |
pandas/_libs/parsers.pyx
Outdated
@@ -2129,12 +2129,13 @@ def sanitize_objects(ndarray[object] values, set na_values) -> int: | |||
|
|||
for i in range(n): | |||
val = values[i] | |||
memo_key = (val, type(val)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perf impact? I suspect this hashing is slower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
measured performance by sanitizing 5 million values
import time
import numpy as np
import contextlib
@contextlib.contextmanager
def log_duration(title):
start = time.perf_counter()
try:
yield
finally:
end = time.perf_counter()
print(f"{title}: {end - start:.4f} seconds")
million = 10**6
def get_synthetic_data():
int_values = np.array([i for i in range(million)], dtype=object)
float_values = np.array([float(i) for i in range(million)], dtype=object)
str_values = np.array([str(i) for i in range(million)], dtype=object)
bool_values = np.array([i % 2 == 0 for i in range(million)], dtype=object)
none_values = np.array(["NA" for _ in range(million)], dtype=object)
mixed_values = np.empty(million * 5, dtype=object)
mixed_values[0::5] = int_values
mixed_values[1::5] = float_values
mixed_values[2::5] = str_values
mixed_values[3::5] = bool_values
mixed_values[4::5] = none_values
np.random.seed(42)
np.random.shuffle(mixed_values)
return mixed_values
values = get_synthetic_data()
with log_duration("sanitize_objects_old"):
sanitize_objects_old(values,na_values={"NA"})
values = get_synthetic_data()
with log_duration("sanitize_objects_include_type_in_memo_key"):
sanitize_objects(values,na_values={"NA"})
values = get_synthetic_data()
with log_duration("sanitize_objects_skip_bool"):
sanitize_objects_skip_bool(values,na_values={"NA"})
Output
sanitize_objects_old: 1.5880 seconds
sanitize_objects_include_type_in_memo_key: 1.9344 seconds
sanitize_objects_skip_bool: 1.6926 seconds
yes you are right, using type in memo key is 20% slower, skipping those 4 values seems a better option, python automatically uses references for same int and bool values, so skipping will not increase memory footprint even if the data is mixture of only those 4 values
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.