DOC: Simplify pandas theme footer #61843

656 changes: 656 additions & 0 deletions .cursor/rules/contributing-to-codebase.mdc

Large diffs are not rendered by default.

14 changes: 14 additions & 0 deletions .cursor/rules/copy-on-write-mechanism.mdc
@@ -0,0 +1,14 @@
---
alwaysApply: true
---
Copy on write
Copy on Write (CoW) is a mechanism that simplifies the indexing API and improves performance by avoiding copies where possible. CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. An explanation of how to use Copy on Write efficiently can be found here.
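
A minimal sketch of this behaviour, assuming a pandas 2.x install where the option has to be enabled explicitly (Copy on Write becomes the default in pandas 3.0):

import pandas as pd

# Enable Copy on Write explicitly (pandas 2.x; default behaviour in pandas 3.0)
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3]})
derived = df["a"]          # derived Series: behaves as a copy under CoW

derived.iloc[0] = 100      # writing triggers a copy, so df stays untouched
print(df["a"].tolist())    # [1, 2, 3]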

Reference tracking
To be able to determine if we have to make a copy when writing into a DataFrame, we have to know whether its values are shared with another DataFrame. pandas internally keeps track of all Blocks that share values with another Block, so it can tell when a copy needs to be triggered. The reference tracking mechanism is implemented at the Block level.

We use a custom reference tracker object, BlockValuesRefs, that keeps track of every Block whose values share memory with each other. Each reference is held through a weak reference. Every pair of blocks that share some memory should point to the same BlockValuesRefs object. If one block goes out of scope, the reference to that block dies. As a consequence, the reference tracker object always knows how many blocks are alive and sharing memory.

Whenever a DataFrame or Series object shares data with another object, each of those objects must have its own BlockManager and Block objects. In other words, one Block instance (held by a DataFrame, though not necessarily for intermediate objects) should only ever be used for a single DataFrame/Series object. When you want to use the same Block for another object, you can create a shallow copy of the Block instance with block.copy(deep=False), which creates a new Block instance with the same underlying values and correctly sets up the references.

Before writing into the values, we can ask the reference tracking object whether another block that shares data with us is still alive; if there is, we trigger a copy before writing.
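
The following is only an illustrative sketch of the mechanism, not pandas' actual BlockValuesRefs implementation: weak references let the tracker know which blocks are still alive without keeping them alive itself.

import weakref

class ReferenceTracker:
    """Toy stand-in for BlockValuesRefs: tracks blocks that share memory."""

    def __init__(self):
        self._refs = []

    def add_reference(self, block):
        # Hold the block only through a weak reference, so the tracker
        # never keeps a block alive on its own.
        self._refs.append(weakref.ref(block))

    def has_other_reference(self):
        # Blocks that went out of scope resolve to None and drop out,
        # so only blocks that are still alive are counted.
        alive = [ref for ref in self._refs if ref() is not None]
        return len(alive) > 1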
212 changes: 212 additions & 0 deletions .cursor/rules/creating-developement-environment.mdc
@@ -0,0 +1,212 @@
---
description: whenever the agent needs to create a development environment
alwaysApply: false
---
Creating a development environment
To test out code changes, you’ll need to build pandas from source, which requires a C/C++ compiler and a Python environment. If you’re only making documentation changes, you can skip to contributing to the documentation, but if you skip creating the development environment, you won’t be able to build the documentation locally before pushing your changes. It’s also recommended to install the pre-commit hooks.

Step 1: install a C compiler
How to do this will depend on your platform. If you choose to use Docker or GitPod in the next step, then you can skip this step.

Windows

You will need Build Tools for Visual Studio 2022.

Note

You DO NOT need to install Visual Studio 2022. You only need “Build Tools for Visual Studio 2022” found by scrolling down to “All downloads” -> “Tools for Visual Studio”. In the installer, select the “Desktop development with C++” workload.

If you encounter an error indicating cl.exe is not found when building with Meson, reopen the installer and also select the optional component MSVC v142 - VS 2019 C++ x64/x86 build tools in the right pane for installation.

Alternatively, you can install the necessary components on the command line using vs_BuildTools.exe

Alternatively, you could use WSL (the Windows Subsystem for Linux) and consult the Linux instructions below.

macOS

To use the conda-based compilers, you will need to install the Developer Tools using xcode-select --install.

If you prefer to use a different compiler, general information can be found here: https://devguide.python.org/setup/#macos

Linux

For Linux-based conda installations, you won’t have to install any additional components outside of the conda environment. The instructions below are only needed if your setup isn’t based on conda environments.

Some Linux distributions will come with a pre-installed C compiler. To find out which compilers (and versions) are installed on your system:

# for Debian/Ubuntu:
dpkg --list | grep compiler
# for Red Hat/RHEL/CentOS/Fedora:
yum list installed | grep -i --color compiler
GCC (GNU Compiler Collection) is a widely used compiler that supports C and a number of other languages. If GCC is listed as an installed compiler, nothing more is required.
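
If GCC appears in that list, you can also confirm which version is on your PATH:

# Confirm the compiler is available and check its version
gcc --version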

If no C compiler is installed, or you wish to upgrade, or you’re using a different Linux distribution, consult your favorite search engine for compiler installation/update instructions.

Let us know if you have any difficulties by opening an issue or reaching out on our contributor community Slack.

Step 2: create an isolated environment
Before we begin, please:

Make sure that you have cloned the repository

cd to the pandas source directory you just created with the clone command

Option 1: using conda (recommended)
Install miniforge to get conda

Create and activate the pandas-dev conda environment using the following commands:

conda env create --file environment.yml
conda activate pandas-dev
Option 2: using pip
You’ll need to have at least the minimum Python version that pandas supports. You also need to have setuptools 51.0.0 or later to build pandas.
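
A quick way to check both (a sketch; assumes python points at the interpreter you plan to use):

# Check the Python version and the installed setuptools version
python --version
python -c "import setuptools; print(setuptools.__version__)"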

Unix/macOS with virtualenv

# Create a virtual environment
# Use an ENV_DIR of your choice. We'll use ~/virtualenvs/pandas-dev
# Any parent directories should already exist
python3 -m venv ~/virtualenvs/pandas-dev

# Activate the virtualenv
. ~/virtualenvs/pandas-dev/bin/activate

# Install the build dependencies
python -m pip install -r requirements-dev.txt
Unix/macOS with pyenv

Consult the docs for setting up pyenv here.

# Create a virtual environment
# Use an ENV_DIR of your choice. We'll use ~/.pyenv/versions/pandas-dev
pyenv virtualenv <version> <name-to-give-it>

# For instance:
pyenv virtualenv 3.10 pandas-dev

# Activate the virtualenv
pyenv activate pandas-dev

# Now install the build dependencies in the cloned pandas repo
python -m pip install -r requirements-dev.txt
Windows

Below is a brief overview of how to set up a virtual environment with PowerShell under Windows. For details, please refer to the official virtualenv user guide.

Use an ENV_DIR of your choice. We’ll use ~\virtualenvs\pandas-dev, where ~ is the folder pointed to by either the $env:USERPROFILE (PowerShell) or %USERPROFILE% (cmd.exe) environment variable. Any parent directories should already exist.

# Create a virtual environment
python -m venv $env:USERPROFILE\virtualenvs\pandas-dev

# Activate the virtualenv. Use activate.bat for cmd.exe
~\virtualenvs\pandas-dev\Scripts\Activate.ps1

# Install the build dependencies
python -m pip install -r requirements-dev.txt
Option 3: using Docker
pandas provides a Dockerfile in the root directory to build a Docker image with a full pandas development environment.

Docker Commands

Build the Docker image:

# Build the image
docker build -t pandas-dev .
Run Container:

# Run a container and bind your local repo to the container
# This command assumes you are running from your local repo
# but if not alter ${PWD} to match your local repo path
docker run -it --rm -v ${PWD}:/home/pandas pandas-dev
Even easier, you can integrate Docker with the following IDEs:

Visual Studio Code

You can use the Dockerfile to launch a remote session with Visual Studio Code, a popular free IDE, using the .devcontainer.json file. See https://code.visualstudio.com/docs/remote/containers for details.

PyCharm (Professional)

Enable Docker support and use the Services tool window to build and manage images as well as run and interact with containers. See https://www.jetbrains.com/help/pycharm/docker.html for details.

Option 4: using Gitpod
Gitpod is an open-source platform that automatically creates the correct development environment right in your browser, reducing the need to install local development environments and deal with incompatible dependencies.

If you are a Windows user, are unfamiliar with using the command line, or are building pandas for the first time, it is often faster to build with Gitpod. Here are the in-depth instructions for building pandas with Gitpod.

Step 3: build and install pandas
There are currently two supported ways of building pandas: pip/meson and setuptools (setup.py). Historically, pandas only supported using setuptools to build. However, this method requires a lot of convoluted code in setup.py and also has many issues when compiling pandas in parallel due to limitations in setuptools.

The newer build system invokes the meson backend through pip (via a PEP 517 build). It automatically uses all available cores on your CPU and avoids the need for manual rebuilds by rebuilding automatically whenever pandas is imported (with an editable install).

For these reasons, you should compile pandas with meson. Because the meson build system is newer, you may find bugs/minor issues as it matures. You can report these bugs here.

To compile pandas with meson, run:

# Build and install pandas
# By default, this will print verbose output
# showing the "rebuild" taking place on import (see section below for explanation)
# If you do not want to see this, omit everything after --no-build-isolation
python -m pip install -ve . --no-build-isolation -Ceditable-verbose=true
Note

The version number is pulled from the latest repository tag. Be sure to fetch the latest tags from upstream before building:

# set the upstream repository, if not done already, and fetch the latest tags
git remote add upstream https://github.com/pandas-dev/pandas.git
git fetch upstream --tags
Build options

It is possible to pass options from the pip frontend to the meson backend if you would like to configure your install. Occasionally, you’ll want to use this to adjust the build directory and/or toggle debug/optimization levels.

You can pass a build directory to pandas by appending -Cbuilddir="your builddir here" to your pip command. This option allows you to configure where meson stores your built C extensions, and allows for fast rebuilds.

Sometimes it might be useful to compile pandas with debugging symbols when debugging C extensions. Appending -Csetup-args="-Ddebug=true" will do the trick.

With pip, it is possible to chain together multiple config settings. For example, specifying both a build directory and building with debug symbols would look like -Cbuilddir="your builddir here" -Csetup-args="-Dbuildtype=debug".
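
Put together, an editable install that sets both options might look like this (the directory name is just an example):

# Editable install with a custom build directory and debug symbols
python -m pip install -ve . --no-build-isolation -Cbuilddir="builddir" -Csetup-args="-Dbuildtype=debug"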

Compiling pandas with setup.py

Note

This method of compiling pandas will be deprecated and removed very soon, as the meson backend matures.

To compile pandas with setuptools, run:

python setup.py develop
Note

If pandas is already installed (via meson), you have to uninstall it first:

python -m pip uninstall pandas
This is because python setup.py develop will not uninstall the loader script that meson-python uses to import the extension from the build folder, which may cause errors such as a FileNotFoundError to be raised.

Note

You will need to repeat this step each time the C extensions change, for example if you modified any file in pandas/_libs or if you did a fetch and merge from upstream/main.

Checking the build

At this point you should be able to import pandas from your locally built version:

$ python
>>> import pandas
>>> print(pandas.__version__) # note: the exact output may differ
2.0.0.dev0+880.g2b9e661fbb.dirty
At this point you may want to try running the test suite.
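
For example, running a single test module (the path below is just one example; any module under pandas/tests works):

# Run a small slice of the test suite against the locally built pandas
python -m pytest pandas/tests/frame/test_constructors.py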

Keeping up to date with the latest build

When building pandas with meson, importing pandas will automatically trigger a rebuild whenever C/Cython files have been modified. By default, no output will be produced by this rebuild (the import will just take longer). If you would like to see meson’s output when importing pandas, you can set the environment variable MESONPY_EDITABLE_VERBOSE. For example, this would be:

# On Linux/macOS
MESONPY_EDITABLE_VERBOSE=1 python

# Windows
set MESONPY_EDITABLE_VERBOSE=1 # Only need to set this once per session
python
If you would like to see this verbose output every time, you can set the editable-verbose config setting to true like so:

python -m pip install -ve . -Ceditable-verbose=true
Tip

If you ever find yourself wondering whether setuptools or meson was used to build your pandas, you can check the value of pandas._built_with_meson, which will be True if meson was used to compile pandas.
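
For example:

$ python
>>> import pandas
>>> pandas._built_with_meson
True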
46 changes: 46 additions & 0 deletions .cursor/rules/debugging-c-extentions.mdc
@@ -0,0 +1,46 @@
---
description: whenever you want to debug C extensions
alwaysApply: false
---
Debugging C extensions
pandas uses Cython and C/C++ extension modules to optimize performance. Unfortunately, the standard Python debugger does not allow you to step into these extensions. Cython extensions can be debugged with the Cython debugger and C/C++ extensions can be debugged using the tools shipped with your platform’s compiler.

For Python developers with limited or no C/C++ experience, this can seem a daunting task. Core developer Will Ayd has written a three-part blog series to help guide you from the standard Python debugger into these other tools:

Fundamental Python Debugging Part 1 - Python

Fundamental Python Debugging Part 2 - Python Extensions

Fundamental Python Debugging Part 3 - Cython Extensions

Debugging locally
By default, building pandas from source will generate a release build. To generate a development build, you can type:

pip install -ve . --no-build-isolation -Cbuilddir="debug" -Csetup-args="-Dbuildtype=debug"
Note

conda environments update CFLAGS/CPPFLAGS with flags that are geared towards generating releases, which may work against usage in a development environment. If using conda, you should unset these environment variables via export CFLAGS= and export CPPFLAGS=
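
For example, before building inside a conda environment:

# Clear the release-oriented compiler flags set by the conda environment
export CFLAGS=
export CPPFLAGS=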

By specifying builddir="debug", all of the targets will be built and placed in the debug directory relative to the project root. This helps keep your debug and release artifacts separate; you are of course free to choose a different directory name or omit it altogether if you do not care to separate build types.

Using Docker
To simplify the debugging process, pandas has created a Docker image with a debug build of Python and the gdb/Cython debuggers pre-installed. You may either docker pull pandas/pandas-debug to get access to this image or build it from the tooling/debug folder locally.

You can then mount your pandas repository into this image via:

docker run --rm -it -w /data -v ${PWD}:/data pandas/pandas-debug
Inside the image, you can use meson to build/install pandas and place the build artifacts into a debug folder using a command such as the following:

python -m pip install -ve . --no-build-isolation -Cbuilddir="debug" -Csetup-args="-Dbuildtype=debug"
If you plan to use cygdb, the files required by that application are placed within the build folder, so you first have to cd to the build folder and then start that application.

cd debug
cygdb
Within the debugger, you can use cygdb commands to navigate Cython extensions.
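
A few typical commands inside the debugger might look like the following (the breakpoint target is only an illustrative example):

# Inside cygdb
cy break pandas._libs.lib.is_integer   # set a breakpoint in a Cython function
cy run                                 # start the program under the debugger
cy bt                                  # print a Cython-level backtrace
cy next                                # step over the next Cython line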

Editor support
The meson build system generates a compilation database automatically and places it in the build directory. Many language servers and IDEs can use this information to provide code-completion, go-to-definition and error checking support as you type.

How each language server/IDE chooses to look for the compilation database may vary. When in doubt, you may want to create a symlink at the root of the project that points to the compilation database in your build directory. Assuming you used debug as your directory name, you can run:

ln -s debug/compile_commands.json .