
Conversation

vignesh14052002

@jbrockmendel
Member

Perf impact? This is here for a reason.

@vignesh14052002
Author

This is the commit that introduced the change:
6c31cab

I don't understand why this was added, but reverting it solves the issue below:

import numpy as np
from pandas._libs.parsers import sanitize_objects

values = np.array([1, "NA", True], dtype=object)
print("Values before sanitization:", values)
sanitize_objects(values, na_values={"NA"})
print("Values after sanitization:", values)

Output

Values before sanitization: [1 'NA' True]
Values after sanitization: [1 nan 1]

Even though the sanitization part works fine (NA -> nan), it also converts True to 1, and that is due to the memo.
I have some issues setting up the environment to run performance tests.
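The True -> 1 collapse can be reproduced in pure Python. This is a minimal sketch of what a value-keyed memo does (illustrative, not the actual Cython code): since True == 1 and hash(True) == hash(1), a dict keyed on the value alone cannot tell them apart.

```python
import math

# Sketch of a value-keyed memo, as in the current sanitize_objects:
# True and 1 share a hash and compare equal, so True hits the dict
# entry that was stored for 1.
memo = {}
na_values = {"NA"}
out = []
for val in [1, "NA", True]:
    if val in na_values:
        out.append(math.nan)
    elif val in memo:
        out.append(memo[val])  # True looks up the entry stored for 1
    else:
        memo[val] = val
        out.append(val)
print(out)  # [1, nan, 1]
```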

@jbrockmendel
Member

I don't understand why this was added, but reverting it solves the issue below

The commit message was "memoize objects when reading from file to reduce memory footprint". So removing it will likely balloon memory footprint. Instead of removing it, might be more effective to just check for 0, 1, True, False explicitly and let other values be memoized?
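The suggested special-casing might look roughly like this (a hypothetical sketch with illustrative names, not the actual pandas change, and returning a list rather than mutating an array in place): everything is memoized except 0/1/True/False, which a value-keyed memo would otherwise collapse into each other.

```python
def sanitize_skip_bools(values, na_values):
    # Hypothetical sketch of the reviewer's suggestion: skip the memo
    # for 0/1/True/False and memoize all other values.
    memo = {}
    out = []
    for val in values:
        if val in na_values:
            out.append(float("nan"))
        elif type(val) in (int, bool) and val in (0, 1):
            out.append(val)  # leave 0/1/True/False untouched, no memo entry
        elif val in memo:
            out.append(memo[val])
        else:
            memo[val] = val
            out.append(val)
    return out

print(sanitize_skip_bools([1, "NA", True], {"NA"}))  # [1, nan, True]
```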

@vignesh14052002
Author

Thanks, now I understand the memory-footprint concern. Skipping memoization just for those 4 values might not be a good approach, because what if the data contains only a mixture of those 4 values? It could blow up the memory.

I have included the type of the value in the memo key too, which solves this.
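A sketch of the type-in-key approach (illustrative, not the Cython diff itself): keying the memo on (val, type(val)) keeps 1 and True apart, because the tuples differ in their second element.

```python
import math

# Memo keyed on (value, type): (1, int) != (True, bool), so True
# no longer hits the entry stored for 1.
memo = {}
na_values = {"NA"}
out = []
for val in [1, "NA", True]:
    if val in na_values:
        out.append(math.nan)
        continue
    key = (val, type(val))
    if key in memo:
        out.append(memo[key])
    else:
        memo[key] = val
        out.append(val)
print(out)  # [1, nan, True]
```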

@vignesh14052002 vignesh14052002 changed the title fix : remove memo usage in sanitization function fix : include datatype in memo key in sanitization function Sep 3, 2025
@@ -2129,12 +2129,13 @@ def sanitize_objects(ndarray[object] values, set na_values) -> int:

for i in range(n):
val = values[i]
memo_key = (val, type(val))
Member

Perf impact? I suspect this hashing is slower

Author

Measured performance by sanitizing 5 million values:

import time
import numpy as np
import contextlib

@contextlib.contextmanager
def log_duration(title):
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print(f"{title}: {end - start:.4f} seconds")

million = 10**6

def get_synthetic_data():
    int_values = np.array([i for i in range(million)], dtype=object)
    float_values = np.array([float(i) for i in range(million)], dtype=object)
    str_values = np.array([str(i) for i in range(million)], dtype=object)
    bool_values = np.array([i % 2 == 0 for i in range(million)], dtype=object)
    none_values = np.array(["NA" for _ in range(million)], dtype=object)

    mixed_values = np.empty(million * 5, dtype=object)
    mixed_values[0::5] = int_values
    mixed_values[1::5] = float_values
    mixed_values[2::5] = str_values
    mixed_values[3::5] = bool_values
    mixed_values[4::5] = none_values
    np.random.seed(42)
    np.random.shuffle(mixed_values)
    return mixed_values

values = get_synthetic_data()

with log_duration("sanitize_objects_old"):
    sanitize_objects_old(values, na_values={"NA"})

values = get_synthetic_data()

with log_duration("sanitize_objects_include_type_in_memo_key"):
    sanitize_objects(values, na_values={"NA"})

values = get_synthetic_data()

with log_duration("sanitize_objects_skip_bool"):
    sanitize_objects_skip_bool(values, na_values={"NA"})

Output

sanitize_objects_old: 1.5880 seconds
sanitize_objects_include_type_in_memo_key: 1.9344 seconds
sanitize_objects_skip_bool: 1.6926 seconds

Yes, you are right: using the type in the memo key is ~20% slower, so skipping those 4 values seems the better option. CPython caches small ints and uses singleton objects for True and False, so skipping memoization will not increase the memory footprint even if the data is a mixture of only those 4 values.
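The object-sharing claim can be checked directly. Note this is a CPython implementation detail, not a language guarantee: ints in [-5, 256] are cached and True/False are singletons, so equal values constructed independently are the very same object, and skipping the memo for them creates no duplicates.

```python
# Construct the values at runtime so no shared literal is involved.
a = int("200")
b = int("20" + "0")
print(a is b)        # True under CPython's small-int cache
print((2 > 1) is True)  # True: bools are singletons
```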

@vignesh14052002 vignesh14052002 changed the title fix : include datatype in memo key in sanitization function fix : skip memoizing 0,1,True,False in sanitization function Sep 11, 2025

Successfully merging this pull request may close these issues.

BUG: Datatypes not preserved on pd.read_excel