API: reconsider returning read-only arrays from DataFrame/Series .array/.values/__array__ #63099

@jorisvandenbossche

Description

Context: during the implementation of the Copy-on-Write feature (#48998), there was the idea to make returned arrays read-only for APIs that return underlying arrays (.values, to_numpy(), __array__).

This was initially only done for numpy arrays (the first two PRs), and recently also for columns backed by ExtensionArrays (both when returning an EA (.values / .array) and when returning the EA as a numpy array (to_numpy(), __array__)):

The idea behind returning a read-only array is as follows: with Copy-on-Write, the guarantee we provide is that mutating one pandas object (Series, DataFrame) doesn't update another pandas object (whose data is shared as an implementation detail). But users can still easily obtain a numpy array that is a view on the underlying data, and mutate it. At that point, we have no control over how the mutation propagates: it might update more objects than just the one from which the user obtained the array, for example other Series/DataFrames that share data with this object under CoW.

Example to illustrate this:

# creating a dataframe and a derived dataframe through some operation
# (that in this case didn't need to copy)
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df2 = df.sort_values(by="a").reset_index()

# getting a column and mutating this -> CoW gets triggered and only `ser` is changed, not `df`
>>> ser = df["a"]
>>> ser[0] = 100
>>> ser
0    100
1      2
2      3
Name: a, dtype: int64
>>> df
   a  b
0  1  4
1  2  5
2  3  6

# however, when the code mutates the numpy array it got from the series (or dataframe)
# (through .values, or np.asarray(ser), etc), then even the derived `df2` is silently mutated
>>> ser = df["a"]
>>> arr = ser.values
>>> arr.flags.writeable = True  # <-- this is now needed because we made .values readonly
>>> arr[0] = 100
>>> df2
   index    a  b
0      0  100  4
1      1    2  5
2      2    3  6

Right now, because the returned arrays are read-only, I have to include arr.flags.writeable = True to make this work (otherwise arr[0] = 100 in the above example would raise an error about the array being read-only).
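
As a minimal sketch of the mechanism in plain numpy (this is not the pandas implementation; `base` and `view` are illustrative names standing in for the data held by pandas and the array handed to the user): a read-only view rejects writes, and flipping the flag back lets the write reach the shared buffer.

```python
import numpy as np

# `base` stands in for the buffer owned by a pandas object (illustrative name)
base = np.array([1, 2, 3])

# `view` stands in for the array returned to the user: same memory, marked read-only
view = base.view()
view.flags.writeable = False

# writing through the read-only view raises a ValueError
try:
    view[0] = 100
except ValueError as err:
    error_message = str(err)
else:
    error_message = None

# opting back in is possible because the owning array is still writeable;
# from here on, the write goes straight into the shared buffer
view.flags.writeable = True
view[0] = 100
```

After the last line, `base[0]` is 100 as well, since `view` and `base` share memory. The read-only flag is therefore a guard against accidental mutation, not a hard barrier.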

But if we didn't make the returned arrays read-only, this would work without the flag change, and such mutations of the underlying numpy array would propagate unpredictably to other pandas Series/DataFrame objects.
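
For code that really wants a writeable array, one option that does not depend on the read-only flag is to ask for an explicit copy. A minimal sketch, reusing the data from the example above and relying on the `copy=True` argument of `Series.to_numpy()` to get an independent array:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = df.sort_values(by="a").reset_index()

# an explicit copy owns its own memory, so it is writeable
# and mutating it cannot reach `df` or `df2`
arr = df["a"].to_numpy(copy=True)
arr[0] = 100

original_a = df["a"].tolist()
derived_a = df2["a"].tolist()
```

Here `original_a` and `derived_a` still hold the original values, at the cost of one extra copy of the column.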
