Materials for Robust Pandas Tutorial, EuroPython, Prague, 2023.
We will explore ways to make our data analyses and transformations in Pandas robust and production-ready. We will see how advanced group-by, resample and rolling aggregations work on large time series of weather data. (As a bonus, you will learn about the Prague climate.) We will use type annotations and schema validation with the Pandera library to make our code more readable and robust. We will also show the potential of property-based testing using the Hypothesis package, with strategies generated from Pandera schemas. We will show how to avoid issues with time zones when working with time series data. By the end of the tutorial, you will have a deeper understanding of advanced Pandas aggregations and be able to write robust, production-ready Pandas code.
Two data sources are used in this workshop:
Please prepare a Python environment that you can use during the workshop. We will work in Jupyter Notebook as well as in an editor or IDE of your choice. We recommend Visual Studio Code or PyCharm.
Note: All the instructions below are for Unix-like systems (Linux, macOS, WSL on Windows).
If you want or need to work in native Windows cmd or PowerShell, you will need to adapt the commands accordingly.
We cannot provide support for Windows native environments.
Clone the repository with git:
git clone https://github.com/coobas/robust-pandas-workshop.git
or using the gh client:
gh repo clone coobas/robust-pandas-workshop
We have included both requirements.txt and environment.yml files so that you can create a Python environment
using pip or conda, respectively.
Python version 3.10+ is required.
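If you are unsure which interpreter your environment will use, a minimal sketch like the following can confirm the version requirement. The helper name `python_is_recent_enough` is invented for this illustration and is not part of the workshop code.

```python
import sys

# Minimum version stated in this README.
MINIMUM_VERSION = (3, 10)


def python_is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_VERSION) -> bool:
    """Return True if the (major, minor) part of version_info is at least `minimum`.

    Hypothetical helper, written only for this example.
    """
    return tuple(version_info[:2]) >= minimum


# Report the result for the interpreter that runs this snippet.
status = "OK" if python_is_recent_enough() else "too old for this workshop"
print(f"Python {sys.version.split()[0]}: {status}")
```

Run it with the same `python` executable that you plan to use for creating the environment.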
First, cd into the repository directory:
cd robust-pandas-workshop
Then create and activate the environment with venv and pip:
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
or with conda:
conda env create -f environment.yml -n "robust-pandas-workshop"
conda activate robust-pandas-workshop
Code for this workshop is in the weatherlyser package in this repository.
Before you work with it, whether in Jupyter notebooks, in your IDE, or when running tests,
Python needs to know where to find it.
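A quick way to see whether Python can already find the package is to ask the import system directly. This is only a sketch: the helper `package_is_importable` is a name made up for this example, and it handles top-level package names only.

```python
import importlib.util


def package_is_importable(name: str) -> bool:
    """Return True if Python can locate a top-level package or module called `name`.

    Hypothetical helper for this example; it does not import the package,
    it only asks the import system whether it can be found.
    """
    return importlib.util.find_spec(name) is not None


# Prints True once the weatherlyser package is visible to Python.
print("weatherlyser importable:", package_is_importable("weatherlyser"))
```

If this prints `False`, Python cannot see the package from your current environment and working directory.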
Either set the PYTHONPATH environment variable to the repository directory:
export PYTHONPATH=$PWD
(this of course assumes your current directory is the repository root)
or, which is more robust, install the package in editable mode:
pip install -e .
Follow the instructions therein and, if you do not have one, create a free Deepnote account.
All materials that we will use during the workshop are in Jupyter notebooks.
- Introduction
- First data exploration
- Type annotations and dataframe models
- Data loading
- Time zones
- Hypothesis testing
- Grouping, resampling and aggregations
- Windowing and differences
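As a small taste of the time zone, resampling and windowing topics listed above, here is a hedged sketch using plain pandas. The temperature values are invented for illustration and do not come from the workshop data sources; the notebooks use the real data and go much further.

```python
import pandas as pd

# Invented data: 48 hourly temperature readings with UTC timestamps.
index = pd.date_range(
    "2023-07-01", periods=48, freq=pd.Timedelta(hours=1), tz="UTC"
)
temperature = pd.Series(
    [15.0 + (i % 24) * 0.5 for i in range(48)],
    index=index,
    name="temperature",
)

# Convert to local Prague time before aggregating by calendar day.
local = temperature.tz_convert("Europe/Prague")

# Daily means: the bins follow midnight in the index's own time zone,
# so the UTC and Prague results cover different sets of days.
daily_mean_utc = temperature.resample("D").mean()
daily_mean_local = local.resample("D").mean()

# Time-based rolling window: mean over the last 3 hours of readings.
rolling_mean = local.rolling("3h").mean()

# Difference between consecutive readings (the first value is NaN).
hourly_change = local.diff()

print(daily_mean_utc)
print(daily_mean_local)
```

Comparing the two printed daily series shows why the time zone of the index matters: the same 48 readings are split into calendar days differently in UTC and in Prague local time.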
Visual Studio Code or PyCharm Professional users can work with notebooks directly in their IDE; this is the recommended way. You can also use Jupyter Lab, which will be installed in your environment and also provides an IDE-like environment with an editor and a command line.
The tests directory contains tests for the weatherlyser package.
We will use the tests throughout the workshop to test our code.
It is also a good idea to run the tests to check whether your installation is working correctly.
To run tests, use pytest:
pytest
The mypy static type checker is configured to check the weatherlyser and tests folders.
You can run it with:
mypy