15 changes: 11 additions & 4 deletions pandas/core/algorithms.py
@@ -522,11 +522,18 @@ def isin(comps: ListLike, values: ListLike) -> npt.NDArray[np.bool_]:
         if (
             len(values) > 0
             and values.dtype.kind in "iufcb"
-            and not is_signed_integer_dtype(comps)
-            and not is_dtype_equal(values, comps)
+            # If the dtypes differ and either side is unsigned integer,
+            # prefer object dtype to avoid unsafe upcast to float64 that
+            # can lose precision for large 64-bit integers.
Member

Does this change the performance when values and comps are both integer like and fit within an integer type?

Contributor Author

The fix takes the conservative approach: when signed and unsigned integer dtypes are mixed, values is converted to object dtype. This preserves exact integer equality, but it may be slower than a numeric-only path on very large arrays.
The trade-off favors correctness over speed in the rare case of very large arrays with mixed signed/unsigned integers.
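
To illustrate why the float64 upcast is unsafe, here is a minimal sketch in plain Python (no pandas or NumPy), using the two integers from the regression test in this PR. Python's `float` is an IEEE 754 double, so it serves here as a stand-in for a float64 cast; the helper names are hypothetical and only for illustration.

```python
# The two values from the regression test: they differ by 86, but
# float64 values in this range are spaced 256 apart.
signed_val = 1378774140726870442    # fits in int64
unsigned_val = 1378774140726870528  # fits in uint64


def float64_equal(a: int, b: int) -> bool:
    """Compare after rounding both sides to doubles (stand-in for a float64 upcast)."""
    return float(a) == float(b)


def exact_equal(a: int, b: int) -> bool:
    """Compare as exact Python integers (stand-in for the object-dtype path)."""
    return a == b


collides_as_float = float64_equal(signed_val, unsigned_val)
equal_as_object = exact_equal(signed_val, unsigned_val)
print(collides_as_float, equal_as_object)  # True False
```

Both integers round to the same double, so a float64 comparison reports a match, while the exact comparison does not.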

Contributor Author

An alternative with better performance would be to remove the earlier asymmetric object-conversion block and add a fast, safe numeric path that handles signed/unsigned mixes without converting to object.

The idea is to use masked uint64 lookups, which avoid float casts and preserve performance. The fast path would go after the comps_array extraction and before the common-type coercion: map the int64 and uint64 values into the uint64 space and perform the hashtable lookups on uint64.

This would involve changes roughly around lines algorithms.py+6-16. I would rerun the new tests afterward to verify behavior. I'm not sure whether this would break anything else. Should I try it?
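
A rough sketch of the masked-lookup idea, in plain Python rather than the actual pandas internals. It assumes `comps` holds int64 values and `values` holds uint64 values; the function name is hypothetical, and a Python `set` stands in for the uint64 hashtable.

```python
from typing import Iterable, List

UINT64_MASK = (1 << 64) - 1  # keeps the low 64 bits, i.e. the uint64 bit pattern


def isin_signed_vs_unsigned(comps: Iterable[int], values: Iterable[int]) -> List[bool]:
    """Membership of int64 `comps` in uint64 `values` without any float cast."""
    # Stand-in for building a uint64 hashtable from `values`.
    lookup = {v & UINT64_MASK for v in values}
    result = []
    for c in comps:
        # A negative int64 can never equal a uint64. It must be masked out,
        # otherwise its two's-complement bit pattern could alias a large
        # uint64 (for example, -1 has the same bits as 2**64 - 1).
        result.append(c >= 0 and (c & UINT64_MASK) in lookup)
    return result


comps = [1378774140726870442, -1, 5]
values = [1378774140726870528, 2**64 - 1, 5]
matches = isin_signed_vs_unsigned(comps, values)
print(matches)  # [False, False, True]
```

The `c >= 0` check plays the role of the mask: only non-negative signed values are reinterpreted as uint64 and looked up, so no precision is lost and no object conversion is needed.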

+            and (not is_dtype_equal(values, comps))
+            and (
+                (not is_signed_integer_dtype(comps))
+                or (not is_signed_integer_dtype(values))
+            )
         ):
-            # GH#46485 Use object to avoid upcast to float64 later
-            # TODO: Share with _find_common_type_compat
+            # GH#46485: Use object to avoid upcast to float64 later
+            # Ensure symmetric behavior when mixing signed and unsigned
+            # integer dtypes.
             values = construct_1d_object_array_from_listlike(orig_values)
 
     elif isinstance(values, ABCMultiIndex):
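
The effect of the changed condition can be checked with a simplified model. In this sketch dtypes are plain strings and the helpers are hypothetical stand-ins for `is_signed_integer_dtype` and `is_dtype_equal`. The `len(values) > 0` and `values.dtype.kind in "iufcb"` checks are assumed to hold and are left out.

```python
from itertools import product

SIGNED_INTS = {"int8", "int16", "int32", "int64"}


def is_signed_int(dtype: str) -> bool:
    """Hypothetical stand-in for is_signed_integer_dtype."""
    return dtype in SIGNED_INTS


def old_condition(comps: str, values: str) -> bool:
    # Before the change: only the dtype of comps is inspected.
    return (not is_signed_int(comps)) and values != comps


def new_condition(comps: str, values: str) -> bool:
    # After the change: either side failing the signed-integer check triggers
    # the object conversion.
    return values != comps and (
        (not is_signed_int(comps)) or (not is_signed_int(values))
    )


dtypes = ["int64", "uint64", "float64"]
changed = [
    (c, v)
    for c, v in product(dtypes, repeat=2)
    if old_condition(c, v) != new_condition(c, v)
]
print(changed)  # [('int64', 'uint64'), ('int64', 'float64')]
```

In this model the new condition adds the int64-vs-uint64 case that the regression test targets. It also fires for signed-integer comps with float values, which may be worth keeping in mind for the performance question raised above.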
13 changes: 13 additions & 0 deletions pandas/tests/series/methods/test_isin.py
@@ -267,3 +267,16 @@ def test_isin_filtering_on_iterable(data, isin):
     expected_result = Series([True, True, False])
 
     tm.assert_series_equal(result, expected_result)
+
+
+def test_isin_int64_vs_uint64_mismatch():
+    # Regression test for mixing signed int64 Series with uint64 values
+    # Ensure we do not implicitly upcast to float64 and return incorrect True
+    # related to GH# (user report)
+    ser = Series([1378774140726870442], dtype=np.int64)
+    vals = [np.uint64(1378774140726870528)]
+
+    res = ser.isin(vals)
+    # different values -> should be False
+    expected = Series([False])
+    tm.assert_series_equal(res, expected)