Commit b7e0aca

Merge branch 'main' into bug-25611
2 parents: c5f2de3 + 6537afe
23 files changed: +157 −124 lines

.github/workflows/docbuild-and-upload.yml

Lines changed: 0 additions & 2 deletions

@@ -57,8 +57,6 @@ jobs:
         run: python web/pandas_web.py web/pandas --target-path=web/build

       - name: Build documentation
-        # TEMP don't let errors fail the build until all string dtype changes are fixed
-        continue-on-error: true
         run: doc/make.py --warnings-are-errors

       - name: Build the interactive terminal

doc/source/user_guide/basics.rst

Lines changed: 1 addition & 1 deletion

@@ -590,7 +590,7 @@ arguments. The special value ``all`` can also be used:
 .. ipython:: python

-    frame.describe(include=["object"])
+    frame.describe(include=["str"])
     frame.describe(include=["number"])
     frame.describe(include="all")

doc/source/user_guide/io.rst

Lines changed: 27 additions & 36 deletions

@@ -5228,33 +5228,32 @@ languages easy. Parquet can use a variety of compression techniques to shrink th
 while still maintaining good read performance.

 Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
-dtypes, including extension dtypes such as datetime with tz.
+dtypes, including extension dtypes such as datetime with timezone.

 Several caveats.

 * Duplicate column names and non-string columns names are not supported.
-* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
-  indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can
-  force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
+* The DataFrame index is written as separate column(s) when it is a non-default range index.
+  This extra column can cause problems for non-pandas consumers that are not expecting it. You can
+  force including or omitting indexes with the ``index`` argument.
 * Index level names, if specified, must be strings.
 * In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
-* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
-* Non supported types include ``Interval`` and actual Python object types. These will raise a helpful error message
-  on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0.
+* The ``pyarrow`` engine supports the ``Period`` and ``Interval`` dtypes. ``fastparquet`` does not support those.
+* Non supported types include actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
 * The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data
-  type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols,
+  type (this can also work for external extension types, requiring the extension type to implement the needed protocols,
   see the :ref:`extension types documentation <extending.extension.arrow>`).

 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
-then ``pyarrow`` is tried, and falling back to ``fastparquet``.
+then ``pyarrow`` is used when installed, falling back to ``fastparquet``.

 See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.

 .. note::

-   These engines are very similar and should read/write nearly identical parquet format files.
-   ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+   These engines are very similar and should read/write nearly identical parquet format files for most cases.
    These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).

 .. ipython:: python

@@ -5280,24 +5279,21 @@ Write to a parquet file.

 .. ipython:: python

-    df.to_parquet("example_pa.parquet", engine="pyarrow")
-    df.to_parquet("example_fp.parquet", engine="fastparquet")
+    # specify engine="pyarrow" or engine="fastparquet" to use a specific engine
+    df.to_parquet("example.parquet")

 Read from a parquet file.

 .. ipython:: python

-    result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
-    result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
-
+    result = pd.read_parquet("example.parquet")
     result.dtypes

 By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame.

 .. ipython:: python

-    result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow")
-
+    result = pd.read_parquet("example.parquet", dtype_backend="pyarrow")
     result.dtypes

 .. note::

@@ -5309,41 +5305,36 @@ Read only certain columns of a parquet file.

 .. ipython:: python

-    result = pd.read_parquet(
-        "example_fp.parquet",
-        engine="fastparquet",
-        columns=["a", "b"],
-    )
-    result = pd.read_parquet(
-        "example_pa.parquet",
-        engine="pyarrow",
-        columns=["a", "b"],
-    )
+    result = pd.read_parquet("example.parquet", columns=["a", "b"])
     result.dtypes


 .. ipython:: python
    :suppress:

-    os.remove("example_pa.parquet")
-    os.remove("example_fp.parquet")
+    os.remove("example.parquet")


 Handling indexes
 ''''''''''''''''

 Serializing a ``DataFrame`` to parquet may include the implicit index as one or
-more columns in the output file. Thus, this code:
+more columns in the output file. For example, this code:

 .. ipython:: python

-    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
     df.to_parquet("test.parquet", engine="pyarrow")

-creates a parquet file with *three* columns if you use ``pyarrow`` for serialization:
-``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the
-index `may or may not <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`_
-be written to the file.
+creates a parquet file with *three* columns (``a``, ``b``, and
+``__index_level_0__`` when using the ``pyarrow`` engine, or ``index``, ``a``,
+and ``b`` when using the ``fastparquet`` engine) because the index in this case
+is not a default range index. In general, the index *may or may not* be written
+to the file (see the
+`preserve_index keyword for pyarrow <https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes>`__
+or the
+`write_index keyword for fastparquet <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`__
+to check the default behaviour).

 This unexpected extra column causes some databases like Amazon Redshift to reject
 the file, because that column doesn't exist in the target table.
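The ``auto`` engine resolution described above (an explicit engine wins; otherwise try ``pyarrow`` first, then ``fastparquet``) can be sketched in plain Python. ``resolve_parquet_engine`` is a hypothetical helper for illustration, not the pandas implementation, and it ignores the ``pd.options.io.parquet.engine`` option for simplicity:

```python
import importlib.util


def resolve_parquet_engine(engine: str = "auto") -> str:
    """Pick a parquet engine name following the documented order (sketch only)."""
    if engine != "auto":
        # An explicitly requested engine is used as-is.
        return engine
    # "auto": prefer pyarrow when importable, otherwise fall back to fastparquet.
    for candidate in ("pyarrow", "fastparquet"):
        if importlib.util.find_spec(candidate) is not None:
            return candidate
    raise ImportError(
        "Unable to find a usable engine; install pyarrow or fastparquet"
    )
```

An explicitly passed engine is returned unchanged, e.g. ``resolve_parquet_engine("fastparquet")`` gives ``"fastparquet"`` regardless of what is installed.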

doc/source/whatsnew/v0.13.0.rst

Lines changed: 2 additions & 2 deletions

@@ -184,7 +184,7 @@ API changes
 .. ipython:: python
    :okwarning:

-    dfc.loc[0]['A'] = 1111
+    dfc.loc[0]['B'] = 1111

 ::

@@ -198,7 +198,7 @@ API changes

 .. ipython:: python

-    dfc.loc[0, 'A'] = 11
+    dfc.loc[0, 'B'] = 1111
     dfc

 - ``Panel.reindex`` has the following call signature ``Panel.reindex(items=None, major_axis=None, minor_axis=None, **kwargs)``

doc/source/whatsnew/v0.15.0.rst

Lines changed: 38 additions & 9 deletions

@@ -1025,20 +1025,49 @@ Other:
 - :func:`describe` on mixed-types DataFrames is more flexible. Type-based column filtering is now possible via the ``include``/``exclude`` arguments.
   See the :ref:`docs <basics.describe>` (:issue:`8164`).

-  .. ipython:: python
+  .. code-block:: python

-     df = pd.DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
-                        'catB': ['a', 'b', 'c', 'd'] * 6,
-                        'numC': np.arange(24),
-                        'numD': np.arange(24.) + .5})
-     df.describe(include=["object"])
-     df.describe(include=["number", "object"], exclude=["float"])
+     >>> df = pd.DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
+     ...                    'catB': ['a', 'b', 'c', 'd'] * 6,
+     ...                    'numC': np.arange(24),
+     ...                    'numD': np.arange(24.) + .5})
+     >>> df.describe(include=["object"])
+            catA catB
+     count    24   24
+     unique    2    4
+     top     foo    a
+     freq     16    6
+     >>> df.describe(include=["number", "object"], exclude=["float"])
+            catA catB       numC
+     count    24   24  24.000000
+     unique    2    4        NaN
+     top     foo    a        NaN
+     freq     16    6        NaN
+     mean    NaN  NaN  11.500000
+     std     NaN  NaN   7.071068
+     min     NaN  NaN   0.000000
+     25%     NaN  NaN   5.750000
+     50%     NaN  NaN  11.500000
+     75%     NaN  NaN  17.250000
+     max     NaN  NaN  23.000000

   Requesting all columns is possible with the shorthand 'all'

-  .. ipython:: python
+  .. code-block:: python

-     df.describe(include='all')
+     >>> df.describe(include='all')
+            catA catB       numC       numD
+     count    24   24  24.000000  24.000000
+     unique    2    4        NaN        NaN
+     top     foo    a        NaN        NaN
+     freq     16    6        NaN        NaN
+     mean    NaN  NaN  11.500000  12.000000
+     std     NaN  NaN   7.071068   7.071068
+     min     NaN  NaN   0.000000   0.500000
+     25%     NaN  NaN   5.750000   6.250000
+     50%     NaN  NaN  11.500000  12.000000
+     75%     NaN  NaN  17.250000  17.750000
+     max     NaN  NaN  23.000000  23.500000

   Without those arguments, ``describe`` will behave as before, including only numerical columns or, if none are, only categorical columns. See also the :ref:`docs <basics.describe>`

doc/source/whatsnew/v3.0.0.rst

Lines changed: 2 additions & 0 deletions

@@ -414,6 +414,7 @@ Other API changes
 - Index set operations (like union or intersection) will now ignore the dtype of
   an empty ``RangeIndex`` or empty ``Index`` with object dtype when determining
   the dtype of the resulting Index (:issue:`60797`)
+- Comparison operations between :class:`Index` and :class:`Series` now consistently return :class:`Series` regardless of which object is on the left or right (:issue:`36759`)
 - Numpy functions like ``np.isinf`` that return a bool dtype when called on a :class:`Index` object now return a bool-dtype :class:`Index` instead of ``np.ndarray`` (:issue:`52676`)

 .. ---------------------------------------------------------------------------

@@ -718,6 +719,7 @@ Datetimelike
 Timedelta
 ^^^^^^^^^
 - Accuracy improvement in :meth:`Timedelta.to_pytimedelta` to round microseconds consistently for large nanosecond based Timedelta (:issue:`57841`)
+- Bug in :class:`Timedelta` constructor failing to raise when passed an invalid keyword (:issue:`53801`)
 - Bug in :meth:`DataFrame.cumsum` which was raising ``IndexError`` if dtype is ``timedelta64[ns]`` (:issue:`57956`)

 Timezones

pandas/_libs/tslibs/timedeltas.pyx

Lines changed: 14 additions & 10 deletions

@@ -2006,6 +2006,20 @@ class Timedelta(_Timedelta):
                          "milliseconds", "microseconds", "nanoseconds"}

     def __new__(cls, object value=_no_input, unit=None, **kwargs):
+        unsupported_kwargs = set(kwargs)
+        unsupported_kwargs.difference_update(cls._req_any_kwargs_new)
+        if unsupported_kwargs or (
+            value is _no_input and
+            not cls._req_any_kwargs_new.intersection(kwargs)
+        ):
+            raise ValueError(
+                # GH#53801
+                "cannot construct a Timedelta from the passed arguments, "
+                "allowed keywords are "
+                "[weeks, days, hours, minutes, seconds, "
+                "milliseconds, microseconds, nanoseconds]"
+            )
+
         if value is _no_input:
             if not len(kwargs):
                 raise ValueError("cannot construct a Timedelta without a "

@@ -2014,16 +2028,6 @@ class Timedelta(_Timedelta):

         kwargs = {key: _to_py_int_float(kwargs[key]) for key in kwargs}

-        unsupported_kwargs = set(kwargs)
-        unsupported_kwargs.difference_update(cls._req_any_kwargs_new)
-        if unsupported_kwargs or not cls._req_any_kwargs_new.intersection(kwargs):
-            raise ValueError(
-                "cannot construct a Timedelta from the passed arguments, "
-                "allowed keywords are "
-                "[weeks, days, hours, minutes, seconds, "
-                "milliseconds, microseconds, nanoseconds]"
-            )
-
         # GH43764, convert any input to nanoseconds first and then
         # create the timedelta. This ensures that any potential
         # nanosecond contributions from kwargs parsed as floats
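The effect of the move above is that keyword validation now runs before any value handling, so an invalid keyword raises even when a positional value is also passed (the GH#53801 bug). A pure-Python sketch of the reordered check — ``check_timedelta_args`` and ``_ALLOWED`` are illustrative names standing in for ``__new__`` and ``_req_any_kwargs_new``, not the pandas source:

```python
_ALLOWED = {"weeks", "days", "hours", "minutes", "seconds",
            "milliseconds", "microseconds", "nanoseconds"}
_no_input = object()  # sentinel distinguishing "no value passed" from None


def check_timedelta_args(value=_no_input, **kwargs):
    # Raise early, before value is inspected: either an unknown keyword was
    # passed, or neither a value nor any usable keyword was given.
    unsupported = set(kwargs) - _ALLOWED
    if unsupported or (value is _no_input and not _ALLOWED & set(kwargs)):
        raise ValueError(
            "cannot construct a Timedelta from the passed arguments, "
            "allowed keywords are [weeks, days, hours, minutes, seconds, "
            "milliseconds, microseconds, nanoseconds]"
        )
```

With the old ordering, ``check_timedelta_args(1, bogus=2)`` would have been reached only on the keyword-only path and silently accepted; with the check up front it raises ``ValueError`` immediately.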

pandas/core/arrays/categorical.py

Lines changed: 5 additions & 5 deletions

@@ -794,28 +794,28 @@ def categories(self) -> Index:

     >>> ser = pd.Series(["a", "b", "c", "a"], dtype="category")
     >>> ser.cat.categories
-    Index(['a', 'b', 'c'], dtype='object')
+    Index(['a', 'b', 'c'], dtype='str')

     >>> raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
     >>> ser = pd.Series(raw_cat)
     >>> ser.cat.categories
-    Index(['b', 'c', 'd'], dtype='object')
+    Index(['b', 'c', 'd'], dtype='str')

     For :class:`pandas.Categorical`:

     >>> cat = pd.Categorical(["a", "b"], ordered=True)
     >>> cat.categories
-    Index(['a', 'b'], dtype='object')
+    Index(['a', 'b'], dtype='str')

     For :class:`pandas.CategoricalIndex`:

     >>> ci = pd.CategoricalIndex(["a", "c", "b", "a", "c", "b"])
     >>> ci.categories
-    Index(['a', 'b', 'c'], dtype='object')
+    Index(['a', 'b', 'c'], dtype='str')

     >>> ci = pd.CategoricalIndex(["a", "c"], categories=["c", "b", "a"])
     >>> ci.categories
-    Index(['c', 'b', 'a'], dtype='object')
+    Index(['c', 'b', 'a'], dtype='str')
     """
     return self.dtype.categories

pandas/core/arrays/datetimelike.py

Lines changed: 9 additions & 2 deletions

@@ -1486,7 +1486,8 @@ def __rsub__(self, other):
             # GH#19959 datetime - datetime is well-defined as timedelta,
             # but any other type - datetime is not well-defined.
             raise TypeError(
-                f"cannot subtract {type(self).__name__} from {type(other).__name__}"
+                f"cannot subtract {type(self).__name__} from "
+                f"{type(other).__name__}[{other.dtype}]"
             )
         elif isinstance(self.dtype, PeriodDtype) and lib.is_np_dtype(other_dtype, "m"):
             # TODO: Can we simplify/generalize these cases at all?

@@ -1495,8 +1496,14 @@ def __rsub__(self, other):
             self = cast("TimedeltaArray", self)
             return (-self) + other

+        flipped = self - other
+        if flipped.dtype.kind == "M":
+            # GH#59571 give a more helpful exception message
+            raise TypeError(
+                f"cannot subtract {type(self).__name__} from {type(other).__name__}"
+            )
         # We get here with e.g. datetime objects
-        return -(self - other)
+        return -flipped

     def __iadd__(self, other) -> Self:
         result = self + other
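The pattern in the ``__rsub__`` hunk above — compute the forward subtraction once, refuse to negate a datetime-kind result, otherwise return its negation — can be illustrated with a small standalone function. ``reflected_subtract`` and its duck-typed ``dtype.kind`` probe are illustrative assumptions, not the pandas code:

```python
def reflected_subtract(self_obj, other):
    # Sketch of the __rsub__ fallthrough: "other - self" is computed as
    # -(self - other). A datetime-kind result (numpy kind code "M") cannot
    # be meaningfully negated, so raise a clear TypeError up front instead
    # of letting the negation fail with a confusing error.
    flipped = self_obj - other
    kind = getattr(getattr(flipped, "dtype", None), "kind", None)
    if kind == "M":
        raise TypeError(
            f"cannot subtract {type(self_obj).__name__} "
            f"from {type(other).__name__}"
        )
    return -flipped


# Plain numbers have no dtype, so the fallthrough simply negates:
print(reflected_subtract(5, 2))  # → -3, i.e. 2 - 5
```

Computing ``flipped`` once also avoids performing the subtraction twice, which the original ``-(self - other)`` form made easy to overlook.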

pandas/core/dtypes/dtypes.py

Lines changed: 1 addition & 1 deletion

@@ -647,7 +647,7 @@ def categories(self) -> Index:
     --------
     >>> cat_type = pd.CategoricalDtype(categories=["a", "b"], ordered=True)
     >>> cat_type.categories
-    Index(['a', 'b'], dtype='object')
+    Index(['a', 'b'], dtype='str')
     """
     return self._categories
