
Conversation

vignesh14052002

@jbrockmendel
Member

Perf impact? This is here for a reason.

@vignesh14052002
Author

This is the commit that introduced the change:
6c31cab

I don't understand why this was added, but reverting it solves the issue below:

import numpy as np
from pandas._libs.parsers import sanitize_objects

values = np.array([1, "NA", True], dtype=object)
print("Values before sanitization:", values)
sanitize_objects(values, na_values={"NA"})
print("Values after sanitization:", values)

Output

Values before sanitization: [1 'NA' True]
Values after sanitization: [1 nan 1]

Even though the sanitization part works fine (NA -> nan), it also converts True to 1, and that is due to the memo.
I have some issues setting up the environment to run performance tests.
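The True -> 1 collapse can be reproduced in pure Python. This is a minimal sketch of what a value-keyed memo does (illustrative, not the actual Cython code): since True == 1 and hash(True) == hash(1), a dict keyed on the value alone cannot tell them apart.

```python
import math

# Sketch of a value-keyed memo, as in the current sanitize_objects:
# True and 1 share a hash and compare equal, so True hits the dict
# entry that was stored for 1.
memo = {}
na_values = {"NA"}
out = []
for val in [1, "NA", True]:
    if val in na_values:
        out.append(math.nan)
    elif val in memo:
        out.append(memo[val])  # True looks up the entry stored for 1
    else:
        memo[val] = val
        out.append(val)
print(out)  # [1, nan, 1]
```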

@jbrockmendel
Member

I don't understand why this was added, but reverting it solves the issue below

The commit message was "memoize objects when reading from file to reduce memory footprint". So removing it will likely balloon memory footprint. Instead of removing it, might be more effective to just check for 0, 1, True, False explicitly and let other values be memoized?
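The suggested special-casing might look roughly like this (a hypothetical sketch with illustrative names, not the actual pandas change, and returning a list rather than mutating an array in place): everything is memoized except 0/1/True/False, which a value-keyed memo would otherwise collapse into each other.

```python
def sanitize_skip_bools(values, na_values):
    # Hypothetical sketch of the reviewer's suggestion: skip the memo
    # for 0/1/True/False and memoize all other values.
    memo = {}
    out = []
    for val in values:
        if val in na_values:
            out.append(float("nan"))
        elif type(val) in (int, bool) and val in (0, 1):
            out.append(val)  # leave 0/1/True/False untouched, no memo entry
        elif val in memo:
            out.append(memo[val])
        else:
            memo[val] = val
            out.append(val)
    return out

print(sanitize_skip_bools([1, "NA", True], {"NA"}))  # [1, nan, True]
```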

@vignesh14052002
Author

Thanks, now I understand the memory-footprint concern. Skipping memoization just for those 4 values might not be a good approach, because what if the data contains only a mixture of those 4 values? It could blow up the memory.

I have included the type of the value in the memo key too, which solves this.
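A sketch of the type-in-key approach (illustrative, not the Cython diff itself): keying the memo on (val, type(val)) keeps 1 and True apart, because the tuples differ in their second element.

```python
import math

# Memo keyed on (value, type): (1, int) != (True, bool), so True
# no longer hits the entry stored for 1.
memo = {}
na_values = {"NA"}
out = []
for val in [1, "NA", True]:
    if val in na_values:
        out.append(math.nan)
        continue
    key = (val, type(val))
    if key in memo:
        out.append(memo[key])
    else:
        memo[key] = val
        out.append(val)
print(out)  # [1, nan, True]
```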

@vignesh14052002 vignesh14052002 changed the title fix : remove memo usage in sanitization function fix : include datatype in memo key in sanitization function Sep 3, 2025
@@ -2129,12 +2129,13 @@ def sanitize_objects(ndarray[object] values, set na_values) -> int:

for i in range(n):
val = values[i]
memo_key = (val, type(val))
Member

Perf impact? I suspect this hashing is slower

Author

Measured performance by sanitizing 5 million values:

import time
import numpy as np
import contextlib

@contextlib.contextmanager
def log_duration(title):
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print(f"{title}: {end - start:.4f} seconds")

million = 10**6

def get_synthetic_data():
    int_values = np.array([i for i in range(million)], dtype=object)
    float_values = np.array([float(i) for i in range(million)], dtype=object)
    str_values = np.array([str(i) for i in range(million)], dtype=object)
    bool_values = np.array([i % 2 == 0 for i in range(million)], dtype=object)
    none_values = np.array(["NA" for _ in range(million)], dtype=object)

    mixed_values = np.empty(million * 5, dtype=object)
    mixed_values[0::5] = int_values
    mixed_values[1::5] = float_values
    mixed_values[2::5] = str_values
    mixed_values[3::5] = bool_values
    mixed_values[4::5] = none_values
    np.random.seed(42)
    np.random.shuffle(mixed_values)
    return mixed_values

values = get_synthetic_data()

with log_duration("sanitize_objects_old"):
    sanitize_objects_old(values, na_values={"NA"})

values = get_synthetic_data()

with log_duration("sanitize_objects_include_type_in_memo_key"):
    sanitize_objects(values, na_values={"NA"})

values = get_synthetic_data()

with log_duration("sanitize_objects_skip_bool"):
    sanitize_objects_skip_bool(values, na_values={"NA"})

Output

sanitize_objects_old: 1.5880 seconds
sanitize_objects_include_type_in_memo_key: 1.9344 seconds
sanitize_objects_skip_bool: 1.6926 seconds

Yes, you are right: using the type in the memo key is ~20% slower, so skipping those 4 values seems the better option. CPython caches small ints and uses singleton objects for True and False, so skipping memoization will not increase the memory footprint even if the data is a mixture of only those 4 values.
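The object-sharing claim can be checked directly. Note this is a CPython implementation detail, not a language guarantee: ints in [-5, 256] are cached and True/False are singletons, so equal values constructed independently are the very same object, and skipping the memo for them creates no duplicates.

```python
# Construct the values at runtime so no shared literal is involved.
a = int("200")
b = int("20" + "0")
print(a is b)        # True under CPython's small-int cache
print((2 > 1) is True)  # True: bools are singletons
```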

@vignesh14052002 vignesh14052002 changed the title fix : include datatype in memo key in sanitization function fix : skip memoizing 0,1,True,False in sanitization function Sep 11, 2025

Successfully merging this pull request may close these issues.

BUG: Datatypes not preserved on pd.read_excel