
Commit 8436f62

Author: Shashwat1001
Merge branch 'doc-broadcasting-clarity' of https://github.com/Shashwat1001/pandas into doc-broadcasting-clarity
2 parents 924e19b + 363688e

File tree: 85 files changed, +729 additions, -405 deletions

.github/workflows/unit-tests.yml

Lines changed: 9 additions & 18 deletions

@@ -30,7 +30,7 @@ jobs:
         env_file: [actions-310.yaml, actions-311.yaml, actions-312.yaml, actions-313.yaml]
         # Prevent the include jobs from overriding other jobs
         pattern: [""]
-        pandas_future_infer_string: ["0"]
+        pandas_future_infer_string: ["1"]
         include:
           - name: "Downstream Compat"
             env_file: actions-311-downstream_compat.yaml
@@ -45,6 +45,10 @@ jobs:
             env_file: actions-313-freethreading.yaml
             pattern: "not slow and not network and not single_cpu"
             platform: ubuntu-24.04
+          - name: "Without PyArrow"
+            env_file: actions-312.yaml
+            pattern: "not slow and not network and not single_cpu"
+            platform: ubuntu-24.04
           - name: "Locale: it_IT"
             env_file: actions-311.yaml
             pattern: "not slow and not network and not single_cpu"
@@ -67,18 +71,9 @@ jobs:
            # It will be temporarily activated during tests with locale.setlocale
            extra_loc: "zh_CN"
            platform: ubuntu-24.04
-          - name: "Future infer strings"
+          - name: "Past no infer strings"
            env_file: actions-312.yaml
-            pandas_future_infer_string: "1"
-            platform: ubuntu-24.04
-          - name: "Future infer strings (without pyarrow)"
-            env_file: actions-311.yaml
-            pandas_future_infer_string: "1"
-            platform: ubuntu-24.04
-          - name: "Pypy"
-            env_file: actions-pypy-39.yaml
-            pattern: "not slow and not network and not single_cpu"
-            test_args: "--max-worker-restart 0"
+            pandas_future_infer_string: "0"
            platform: ubuntu-24.04
          - name: "Numpy Dev"
            env_file: actions-311-numpydev.yaml
@@ -88,7 +83,6 @@ jobs:
          - name: "Pyarrow Nightly"
            env_file: actions-311-pyarrownightly.yaml
            pattern: "not slow and not network and not single_cpu"
-            pandas_future_infer_string: "1"
            platform: ubuntu-24.04
      fail-fast: false
    name: ${{ matrix.name || format('{0} {1}', matrix.platform, matrix.env_file) }}
@@ -97,13 +91,13 @@ jobs:
      LANG: ${{ matrix.lang || 'C.UTF-8' }}
      LC_ALL: ${{ matrix.lc_all || '' }}
      PANDAS_CI: '1'
-      PANDAS_FUTURE_INFER_STRING: ${{ matrix.pandas_future_infer_string || '0' }}
+      PANDAS_FUTURE_INFER_STRING: ${{ matrix.pandas_future_infer_string || '1' }}
      TEST_ARGS: ${{ matrix.test_args || '' }}
      PYTEST_WORKERS: 'auto'
      PYTEST_TARGET: ${{ matrix.pytest_target || 'pandas' }}
      # Clipboard tests
      QT_QPA_PLATFORM: offscreen
-      REMOVE_PYARROW: ${{ matrix.name == 'Future infer strings (without pyarrow)' && '1' || '0' }}
+      REMOVE_PYARROW: ${{ matrix.name == 'Without PyArrow' && '1' || '0' }}
    concurrency:
      # https://github.community/t/concurrecy-not-work-for-push/183068/7
      group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}-${{ matrix.platform }}
@@ -169,12 +163,9 @@ jobs:
        with:
          # xref https://github.com/cython/cython/issues/6870
          werror: ${{ matrix.name != 'Freethreading' }}
-          # TODO: Re-enable once Pypy has Pypy 3.10 on conda-forge
-          if: ${{ matrix.name != 'Pypy' }}

      - name: Test (not single_cpu)
        uses: ./.github/actions/run-tests
-        if: ${{ matrix.name != 'Pypy' }}
        env:
          # Set pattern to not single_cpu if not already set
          PATTERN: ${{ env.PATTERN == '' && 'not single_cpu' || matrix.pattern }}
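The net effect is that every matrix job now defaults to the future string-inference behaviour, with a single "Past no infer strings" job keeping the legacy path covered. A minimal sketch of what the flag changes at the Python level (hedged: the variable must be set before pandas is imported, and the exact dtype repr depends on the pandas version):

    import os

    # Opt in to the future default before importing pandas; this is
    # what PANDAS_FUTURE_INFER_STRING: '1' does for the CI jobs.
    os.environ["PANDAS_FUTURE_INFER_STRING"] = "1"

    import pandas as pd

    s = pd.Series(["apple", "banana"])
    print(s.dtype)  # expected: a string dtype rather than object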

.github/workflows/wheels.yml

Lines changed: 0 additions & 1 deletion

@@ -101,7 +101,6 @@ jobs:
        - [macos-14, macosx_arm64]
        - [windows-2022, win_amd64]
        - [windows-11-arm, win_arm64]
-        # TODO: support PyPy?
        python: [["cp310", "3.10"], ["cp311", "3.11"], ["cp312", "3.12"], ["cp313", "3.13"], ["cp313t", "3.13"]]
        include:
        # Build Pyodide wheels and upload them to Anaconda.org

README.md

Lines changed: 1 addition & 1 deletion

@@ -175,7 +175,7 @@ All contributions, bug reports, bug fixes, documentation improvements, enhanceme

 A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**.

-If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out.
+If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?q=is%3Aissue%20state%3Aopen%20label%3ADocs%20sort%3Aupdated-desc) and [good first issue](https://github.com/pandas-dev/pandas/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22%20sort%3Aupdated-desc) where you could start out.

 You can also triage issues which may include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions. If you would like to start triaging issues, one easy way to get started is to [subscribe to pandas on CodeTriage](https://www.codetriage.com/pandas-dev/pandas).

ci/code_checks.sh

Lines changed: 4 additions & 1 deletion

@@ -58,7 +58,9 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then

     MSG='Python and Cython Doctests' ; echo "$MSG"
     python -c 'import pandas as pd; pd.test(run_doctests=True)'
-    RET=$(($RET + $?)) ; echo "$MSG" "DONE"
+    # TEMP don't let doctests fail the build until all string dtype changes are fixed
+    # RET=$(($RET + $?)) ; echo "$MSG" "DONE"
+    echo "$MSG" "DONE"

 fi

@@ -72,6 +74,7 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
     -i "pandas.Series.dt PR01" `# Accessors are implemented as classes, but we do not document the Parameters section` \
     -i "pandas.Period.freq GL08" \
     -i "pandas.Period.ordinal GL08" \
+    -i "pandas.errors.IncompatibleFrequency SA01,SS06,EX01" \
     -i "pandas.core.groupby.DataFrameGroupBy.plot PR02" \
     -i "pandas.core.groupby.SeriesGroupBy.plot PR02" \
     -i "pandas.core.resample.Resampler.quantile PR01,PR07" \

ci/deps/actions-pypy-39.yaml

Lines changed: 0 additions & 26 deletions
This file was deleted.

doc/source/reference/testing.rst

Lines changed: 1 addition & 0 deletions

@@ -36,6 +36,7 @@ Exceptions and warnings
    errors.DuplicateLabelError
    errors.EmptyDataError
    errors.IncompatibilityWarning
+   errors.IncompatibleFrequency
    errors.IndexingError
    errors.InvalidColumnName
    errors.InvalidComparison
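For context, a hedged sketch of when the newly documented exception surfaces: Period arithmetic across mismatched frequencies raises it (the exact message text varies by version):

    import pandas as pd

    p_month = pd.Period("2024-01", freq="M")
    p_day = pd.Period("2024-01-15", freq="D")

    try:
        # Subtracting Periods with different frequencies is invalid.
        p_month - p_day
    except pd.errors.IncompatibleFrequency as err:
        print(f"IncompatibleFrequency: {err}")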

doc/source/user_guide/basics.rst

Lines changed: 1 addition & 1 deletion

@@ -592,7 +592,7 @@ arguments. The special value ``all`` can also be used:

 .. ipython:: python

-   frame.describe(include=["object"])
+   frame.describe(include=["str"])
    frame.describe(include=["number"])
    frame.describe(include="all")
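This tracks the string-inference switch above: text columns now carry a str dtype, so they no longer match include=["object"]. A small sketch, assuming string inference is enabled (the frame here is a stand-in, not the one built earlier in the user guide):

    import pandas as pd

    frame = pd.DataFrame({"name": ["a", "b", "c"], "value": [1.0, 2.0, 3.0]})

    # With string inference enabled, "name" is str dtype, so it is
    # selected via include=["str"] rather than include=["object"].
    print(frame.describe(include=["str"]))
    print(frame.describe(include=["number"]))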

doc/source/user_guide/indexing.rst

Lines changed: 2 additions & 2 deletions

@@ -700,7 +700,7 @@ to have different probabilities, you can pass the ``sample`` function sampling w

    s = pd.Series([0, 1, 2, 3, 4, 5])
    example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
-   s.sample(n=3, weights=example_weights)
+   s.sample(n=2, weights=example_weights)

    # Weights will be re-normalized automatically
    example_weights2 = [0.5, 0, 0, 0, 0, 0]
@@ -714,7 +714,7 @@ as a string.

    df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
                        'weight_column': [0.5, 0.4, 0.1, 0]})
-   df2.sample(n=3, weights='weight_column')
+   df2.sample(n=2, weights='weight_column')

 ``sample`` also allows users to sample columns instead of rows using the ``axis`` argument.
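A runnable sketch of the weighted sampling these snippets document (the random_state seed is an illustrative addition for reproducibility):

    import pandas as pd

    s = pd.Series([0, 1, 2, 3, 4, 5])
    example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
    # Zero-weight rows can never be drawn; the remaining weights
    # are re-normalized to sum to 1 automatically.
    print(s.sample(n=2, weights=example_weights, random_state=42))

    df2 = pd.DataFrame({"col1": [9, 8, 7, 6],
                        "weight_column": [0.5, 0.4, 0.1, 0]})
    # For DataFrames, a column name can stand in for explicit weights.
    print(df2.sample(n=2, weights="weight_column", random_state=42))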

doc/source/user_guide/io.rst

Lines changed: 27 additions & 36 deletions

@@ -5228,33 +5228,32 @@ languages easy. Parquet can use a variety of compression techniques to shrink th
 while still maintaining good read performance.

 Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
-dtypes, including extension dtypes such as datetime with tz.
+dtypes, including extension dtypes such as datetime with timezone.

 Several caveats.

 * Duplicate column names and non-string columns names are not supported.
-* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
-  indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can
-  force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
+* The DataFrame index is written as separate column(s) when it is a non-default range index.
+  This extra column can cause problems for non-pandas consumers that are not expecting it. You can
+  force including or omitting indexes with the ``index`` argument.
 * Index level names, if specified, must be strings.
 * In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
-* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
-* Non supported types include ``Interval`` and actual Python object types. These will raise a helpful error message
-  on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0.
+* The ``pyarrow`` engine supports the ``Period`` and ``Interval`` dtypes. ``fastparquet`` does not support those.
+* Non supported types include actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
 * The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data
-  type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols,
+  type (this can also work for external extension types, requiring the extension type to implement the needed protocols,
   see the :ref:`extension types documentation <extending.extension.arrow>`).

 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
-then ``pyarrow`` is tried, and falling back to ``fastparquet``.
+then ``pyarrow`` is used when installed, and falling back to ``fastparquet``.

 See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.

 .. note::

-   These engines are very similar and should read/write nearly identical parquet format files.
-   ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+   These engines are very similar and should read/write nearly identical parquet format files for most cases.
    These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).

 .. ipython:: python
@@ -5280,24 +5279,21 @@ Write to a parquet file.

 .. ipython:: python

-   df.to_parquet("example_pa.parquet", engine="pyarrow")
-   df.to_parquet("example_fp.parquet", engine="fastparquet")
+   # specify engine="pyarrow" or engine="fastparquet" to use a specific engine
+   df.to_parquet("example.parquet")

 Read from a parquet file.

 .. ipython:: python

-   result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
-
+   result = pd.read_parquet("example.parquet")
    result.dtypes

 By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame.

 .. ipython:: python

-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow")
-
+   result = pd.read_parquet("example.parquet", dtype_backend="pyarrow")
    result.dtypes

 .. note::
@@ -5309,41 +5305,36 @@ Read only certain columns of a parquet file.

 .. ipython:: python

-   result = pd.read_parquet(
-       "example_fp.parquet",
-       engine="fastparquet",
-       columns=["a", "b"],
-   )
-   result = pd.read_parquet(
-       "example_pa.parquet",
-       engine="pyarrow",
-       columns=["a", "b"],
-   )
+   result = pd.read_parquet("example.parquet", columns=["a", "b"])
    result.dtypes


 .. ipython:: python
    :suppress:

-   os.remove("example_pa.parquet")
-   os.remove("example_fp.parquet")
+   os.remove("example.parquet")


 Handling indexes
 ''''''''''''''''

 Serializing a ``DataFrame`` to parquet may include the implicit index as one or
-more columns in the output file. Thus, this code:
+more columns in the output file. For example, this code:

 .. ipython:: python

-   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
    df.to_parquet("test.parquet", engine="pyarrow")

-creates a parquet file with *three* columns if you use ``pyarrow`` for serialization:
-``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the
-index `may or may not <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`_
-be written to the file.
+creates a parquet file with *three* columns (``a``, ``b``, and
+``__index_level_0__`` when using the ``pyarrow`` engine, or ``index``, ``a``,
+and ``b`` when using the ``fastparquet`` engine) because the index in this case
+is not a default range index. In general, the index *may or may not* be written
+to the file (see the
+`preserve_index keyword for pyarrow <https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes>`__
+or the
+`write_index keyword for fastparquet <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`__
+to check the default behaviour).

 This unexpected extra column causes some databases like Amazon Redshift to reject
 the file, because that column doesn't exist in the target table.
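A hedged sketch of the index round-trip the rewritten passage describes, using pyarrow to inspect the written schema (the pq.read_schema inspection is an illustrative addition, not part of the guide):

    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])

    # Non-default index: stored as an extra column by default.
    df.to_parquet("test.parquet", engine="pyarrow")
    print(pq.read_schema("test.parquet").names)  # ['a', 'b', '__index_level_0__']

    # Force the index to be omitted, regardless of engine.
    df.to_parquet("test.parquet", engine="pyarrow", index=False)
    print(pq.read_schema("test.parquet").names)  # ['a', 'b']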

doc/source/user_guide/timeseries.rst

Lines changed: 1 addition & 1 deletion

@@ -2541,7 +2541,7 @@ Fold is supported only for constructing from naive ``datetime.datetime``
 or for constructing from components (see below). Only ``dateutil`` timezones are supported
 (see `dateutil documentation <https://dateutil.readthedocs.io/en/stable/tz.html#dateutil.tz.enfold>`__
 for ``dateutil`` methods that deal with ambiguous datetimes) as ``pytz``
-timezones do not support fold (see `pytz documentation <http://pytz.sourceforge.net/index.html>`__
+timezones do not support fold (see `pytz documentation <https://pythonhosted.org/pytz/>`__
 for details on how ``pytz`` deals with ambiguous datetimes). To localize an ambiguous datetime
 with ``pytz``, please use :meth:`Timestamp.tz_localize`. In general, we recommend to rely
 on :meth:`Timestamp.tz_localize` when localizing ambiguous datetimes if you need direct
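For context, a small sketch of the Timestamp.tz_localize approach the passage recommends (the date and zone are illustrative; 01:30 occurs twice on the night US/Eastern leaves DST):

    import pandas as pd

    ts = pd.Timestamp("2024-11-03 01:30:00")  # naive wall-clock time

    # ambiguous=True resolves to the first (DST, UTC-4) occurrence,
    # ambiguous=False to the second (standard time, UTC-5) occurrence.
    print(ts.tz_localize("US/Eastern", ambiguous=True))
    print(ts.tz_localize("US/Eastern", ambiguous=False))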
