- Overview
- Documentation
- Example Use Cases
- Requirements
- Features
- Diagrams
- Installation
- Usage
- FAQ
- Citation
- Contributing
- Code of Conduct
- License
This tool converts individual patient records into structured time-interval feature vectors, making them suitable for filtering, aggregation, and assembly into a data matrix D for binary classification machine learning tasks.
The full API documentation for pat2vec is automatically generated and hosted on GitHub Pages.
View the Live Documentation
Compute summary statistics (e.g., the mean of n variables) for each unique patient, resulting in one row per patient. This is ideal for models requiring a single representation per individual.
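As a rough illustration of the one-row-per-patient idea, the sketch below averages each variable per patient using only the standard library. The record layout, the variable names, and the `summarise_per_patient` helper are all hypothetical and are not part of the pat2vec API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (patient id, variable name, value).
records = [
    ("P001", "haemoglobin", 130.0),
    ("P001", "haemoglobin", 120.0),
    ("P001", "creatinine", 80.0),
    ("P002", "haemoglobin", 141.0),
]


def summarise_per_patient(rows):
    """Return {patient_id: {"<variable>_mean": mean value}}, one entry per patient."""
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, variable, value in rows:
        grouped[patient_id][variable].append(value)
    return {
        patient_id: {f"{name}_mean": mean(values) for name, values in variables.items()}
        for patient_id, variables in grouped.items()
    }


summary = summarise_per_patient(records)
print(summary["P001"])
```

Each key of `summary` corresponds to one row of the resulting patient-level table.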
Generate a monthly time series for each patient that includes:
- Biochemistry results
- Demographic attributes
- MedCAT-derived clinical text annotations
The time series spans up to 25 years retrospectively, aligned to each patient's diagnosis date, enabling a consistent retrospective view across varying start times.
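One way to picture this alignment is to count calendar months backwards from the diagnosis date and bucket each observation by that offset. The sketch below is an assumption about how such binning could work, not pat2vec's implementation; `months_before` and `monthly_series` are hypothetical helpers, and the mean is just one possible per-bin aggregate.

```python
from datetime import date


def months_before(event: date, anchor: date) -> int:
    """Whole calendar months from event back to anchor (0 = same month)."""
    return (anchor.year - event.year) * 12 + (anchor.month - event.month)


def monthly_series(events, diagnosis: date, years_back: int = 25):
    """Bucket (date, value) pairs into monthly bins counted back from diagnosis.

    Index 0 is the diagnosis month, index 1 the month before, and so on.
    Events after the diagnosis date or older than the look-back are dropped.
    Empty bins hold None; non-empty bins hold the mean of their values.
    """
    n_bins = years_back * 12
    bins = [[] for _ in range(n_bins)]
    for when, value in events:
        offset = months_before(when, diagnosis)
        if when <= diagnosis and offset < n_bins:
            bins[offset].append(value)
    return [sum(b) / len(b) if b else None for b in bins]


diagnosis = date(2020, 6, 15)
events = [
    (date(2020, 6, 1), 5.0),   # diagnosis month
    (date(2020, 5, 20), 7.0),  # one month back
    (date(2020, 5, 2), 9.0),   # one month back
    (date(2020, 7, 1), 1.0),   # after diagnosis, ignored
    (date(1990, 1, 1), 3.0),   # more than 25 years back, ignored
]
series = monthly_series(events, diagnosis)
print(len(series), series[:3])
```

Because every patient's series is indexed by offset from their own diagnosis date, series from patients diagnosed in different years line up position by position.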
Core Services:
- CogStack: An operational instance for data retrieval. The required client libraries are now bundled with this project.
- Elasticsearch: The backend for CogStack.
- MedCAT: For medical concept annotation.
Local Setup:
- Python: Version 3.10 or higher.
- Virtual Environment: Requires the `python3-venv` package (or equivalent for your OS).
- For all other Python packages, see `requirements.txt`.
pat2vec offers a flexible suite of tools for processing and analyzing patient data.
Patient Processing
- Single & Batch Processing: Process individual patients for detailed analysis or run large batches for cohort-level studies.
Cohort Management
- Cohort Search & Creation: Define and build patient cohorts using flexible search criteria.
- Automated Control Matching: Automatically generate random control groups for case-control studies.
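To make the idea of random control selection concrete, here is a minimal sketch that draws a fixed number of controls per case from the non-case population. It only illustrates the general technique. pat2vec's own matching logic may differ, and `sample_controls` and its parameters are hypothetical.

```python
import random


def sample_controls(case_ids, candidate_ids, ratio=1, seed=None):
    """Randomly draw `ratio` controls per case from candidates that are not cases."""
    cases = set(case_ids)
    # Sort the pool so that a given seed always yields the same sample.
    pool = sorted(set(candidate_ids) - cases)
    needed = ratio * len(cases)
    if needed > len(pool):
        raise ValueError(f"need {needed} controls but only {len(pool)} candidates")
    return random.Random(seed).sample(pool, needed)


cases = ["P1", "P2"]
candidates = [f"P{i}" for i in range(1, 11)]
controls = sample_controls(cases, candidates, ratio=2, seed=0)
print(controls)
```

Passing a `seed` keeps the control group reproducible between runs, which helps when a study needs to be re-run or audited.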
Flexible Feature Engineering
- Modular Feature Selection: Choose from a wide range of feature extractors to build a custom feature space tailored to your research question.
- Temporal Windowing: Define precise time windows for data extraction relative to a key event (e.g., diagnosis date), including look-back and look-forward periods.
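A look-back/look-forward window around a key event can be sketched as a simple date-range filter. The function and parameter names below are hypothetical and do not reflect pat2vec's configuration options. The window is treated as inclusive at both ends, which is an assumption.

```python
from datetime import date, timedelta


def in_window(event_date, anchor, look_back_days, look_forward_days):
    """True if event_date falls within [anchor - look_back, anchor + look_forward]."""
    start = anchor - timedelta(days=look_back_days)
    end = anchor + timedelta(days=look_forward_days)
    return start <= event_date <= end


anchor = date(2021, 3, 1)  # e.g. a diagnosis date
events = [date(2020, 2, 29), date(2020, 3, 1), date(2021, 3, 31), date(2021, 4, 1)]

# Keep events from 365 days before to 30 days after the anchor.
kept = [d for d in events if in_window(d, anchor, look_back_days=365, look_forward_days=30)]
print(kept)
```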
This project includes a collection of diagrams illustrating the system architecture, data pipelines, and feature extraction workflows. You can view the Mermaid definitions or the rendered diagrams below.
| Diagram | Mermaid | Image |
|---|---|---|
| System Architecture | assets/system_architecture.mmd | |
| Configuration | assets/config.mmd | |
| Diagram | Mermaid | Image |
|---|---|---|
| Data Pipeline | assets/data_pipeline.mmd | |
| Main Batch Processing | assets/main_batch.mmd | |
| Example Ingestion | assets/example_ingestion.mmd | |
| Diagram | Mermaid | Image |
|---|---|---|
| Methods Annotation | assets/methods_annotation.mmd | |
| Post-Processing Build Methods | assets/post_processing_build_methods.mmd | |
| Post-Processing Anonymisation | assets/post_processing_anonymisation_high_level.mmd | |
| Diagram | Mermaid | Image |
|---|---|---|
| Ethnicity Abstractor | assets/ethnicity_abstractor.mmd | |
| Get BMI | assets/get_bmi.mmd | |
| Get Demographics | assets/get_demographics.mmd | |
| Get Diagnostics | assets/get_diagnostics.mmd | |
| Get Drugs | assets/get_drugs.mmd | |
| Get Smoking | assets/get_smoking.mmd | |
| Get News | assets/get_news.mmd | |
| Get Dummy Data Cohort Searcher | assets/get_dummy_data_cohort_searcher.mmd | |
| Get Method Bloods | assets/get_method_bloods.mmd | |
| Get Method Patient Annotations | assets/get_method_pat_annotations.mmd | |
| Get Treatment Docs (No Terms Fuzzy) | assets/get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy.mmd | |
Once pat2vec is installed, you can use it as a library in your Python projects.
- Install the package: `pip install pat2vec`
- Install all optional dependencies (for full functionality): `pip install pat2vec[all]`
The following instructions are for setting up a development environment from the source code.
- Clone the repository: Navigate to the directory where you want to store your projects. It's recommended to have a parent directory to hold `pat2vec` and its related assets.
  `git clone https://github.com/SamoraHunter/pat2vec.git`
- Run the installation script: Navigate into the cloned repository and run the batch script. This will create a Python virtual environment, install dependencies from `requirements.txt`, and set up a Jupyter kernel.
  `cd pat2vec`
  `install.bat`
  The script accepts several flags to customize the installation:
  - `/p` or `/proxy`: Use if you are behind a corporate proxy.
  - `/dev`: Installs development dependencies (e.g., `pytest`, `sphinx`).
  - `/a` or `/all`: Installs all optional feature dependencies.
  - `/f` or `/force`: Removes any existing virtual environment for a clean install.
  - `/no-clone`: Skips cloning the `snomed_methods` helper repository.
- Activate the environment: To use the installed packages, activate the virtual environment:
  `pat2vec_env\Scripts\activate`
- Set up for your IDE/Notebook: If you are using an IDE like VS Code or a Jupyter Notebook, make sure to select the `pat2vec_env` kernel to run your code.
- Post-Installation Setup: The script sets up the Python environment, but you must manually arrange other project assets. In the parent directory of your `pat2vec` clone, you will need to:
  - Clone the helper repository: `git clone https://github.com/SamoraHunter/snomed_methods.git`
  - Add MedCAT model: Create a `medcat_models` directory and copy your MedCAT model pack (`.zip`) into it.
  - Add credentials: Create a `credentials.py` file. You can use `pat2vec/pat2vec/config/credentials_template.py` as a starting point.

Your final directory structure should look like the one described in the Usage section.
The `install_pat2vec.sh` script is the recommended way to set up a development environment on Unix-like systems. It automates the full setup, including:
- Creating a Python virtual environment (`pat2vec_env`).
- Installing Python dependencies (including development and testing tools).
- Cloning the `snomed_methods` helper repository.
- Creating required directories and template files (e.g., for MedCAT models and credentials).

To install, clone the repository and navigate into it. Then grant execution permissions and run the script, which must be run from within the `pat2vec` directory:
```shell
chmod +x install_pat2vec.sh
./install_pat2vec.sh
```
The script supports several options:
- `--proxy`: Use if you are behind a corporate proxy that mirrors Python packages.
- `--dev`: Installs development dependencies (e.g., `pytest`, `nbmake`) for running tests.
- `--all`: Installs all optional feature dependencies.
- `--force`: Removes any existing virtual environment and performs a clean installation.
- `--no-clone`: Skips cloning the `snomed_methods` repository if you already have it.
For example, to install for development behind a proxy:
```shell
./install_pat2vec.sh --proxy --dev
```
The script creates a directory structure in the parent folder of pat2vec. After running it, you must perform two manual steps:
- Place MedCAT model: Copy your model pack into the medcat_models directory created by the script.
- Populate credentials: Edit the credentials.py file created by the script and fill in your details.
Finally, activate the environment to begin working:
`source pat2vec_env/bin/activate`
This guide outlines the steps to run a pat2vec analysis after completing the installation.
Before running an analysis, ensure your project directory is set up correctly. If you used the install_pat2vec.sh script, much of this is done for you.
- Populate `credentials.py`: In the parent directory of your `pat2vec` clone, edit `credentials.py` with your Elasticsearch credentials.
- Add MedCAT Model: Copy your MedCAT model pack (`.zip`) into the `medcat_models` directory.
Your final directory structure should look like this:
your_project_folder/
├── credentials.py              # <-- Populated with your credentials
├── medcat_models/
│   └── your_model.zip          # <-- Your MedCAT model pack
├── snomed_methods/             # <-- Cloned helper repository
└── pat2vec/                    # <-- This repository
    ├── notebooks/
    │   └── example_usage.ipynb
    └── ...
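The layout above can be sanity-checked before starting a run. The following is a small sketch using only `pathlib`. The `missing_assets` helper is hypothetical and not shipped with pat2vec, and the demonstration builds a throwaway directory rather than touching a real project.

```python
import tempfile
from pathlib import Path


def missing_assets(project_root):
    """Return the expected project assets that are absent under project_root."""
    root = Path(project_root)
    problems = []
    if not (root / "credentials.py").is_file():
        problems.append("credentials.py")
    models = root / "medcat_models"
    if not (models.is_dir() and any(models.glob("*.zip"))):
        problems.append("medcat_models/*.zip")
    for name in ("snomed_methods", "pat2vec"):
        if not (root / name).is_dir():
            problems.append(f"{name}/")
    return problems


# Demonstration on a temporary directory standing in for your_project_folder/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_assets(root)
    (root / "credentials.py").touch()
    (root / "medcat_models").mkdir()
    (root / "medcat_models" / "your_model.zip").touch()
    (root / "snomed_methods").mkdir()
    (root / "pat2vec").mkdir()
    after = missing_assets(root)

print(before)
print(after)
```

In practice you would call `missing_assets` with the path to your own project folder and fix anything it reports.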
Create a CSV file containing your patient cohort. This file must include:
- A column named `client_idcode` with unique patient identifiers.
- Any other relevant columns, such as a diagnosis date for aligning time series data.
Place this file in an accessible location, such as a new data folder inside pat2vec/notebooks/.
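Before running the pipeline, it can help to confirm that the cohort file meets these requirements. The sketch below reads a CSV with the standard `csv` module and checks that `client_idcode` is present, non-empty, and unique. The `load_cohort_ids` helper and the `diagnosis_date` column name are hypothetical. Only `client_idcode` comes from the requirement above.

```python
import csv
import io
from collections import Counter


def load_cohort_ids(csv_file, id_column="client_idcode"):
    """Read patient identifiers from an open CSV file and validate them."""
    reader = csv.DictReader(csv_file)
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"cohort file has no '{id_column}' column")
    ids = [(row[id_column] or "").strip() for row in reader]
    if any(not patient_id for patient_id in ids):
        raise ValueError(f"empty value found in '{id_column}'")
    duplicates = sorted(pid for pid, count in Counter(ids).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate identifiers: {duplicates}")
    return ids


# In-memory stand-in for a cohort CSV; diagnosis_date is an illustrative column.
example = io.StringIO(
    "client_idcode,diagnosis_date\n"
    "P001,2020-06-15\n"
    "P002,2019-01-03\n"
)
ids = load_cohort_ids(example)
print(ids)
```

For a file on disk, open it with `open(path, newline="")` and pass the file object to `load_cohort_ids`.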
The example_usage.ipynb notebook provides a template for running the pipeline.
- Open the Notebook: Navigate to `pat2vec/notebooks/` and open `example_usage.ipynb`.
- Select the Kernel: Ensure the `pat2vec_env` Jupyter kernel is active.
- Configure the Analysis: In the notebook, locate the `config_class`. This object controls all parameters for your run. You will need to set:
  - Paths to your input cohort CSV and output directories.
  - The list of features to extract.
  - Time windows for data extraction (look-back/look-forward periods).
- Run the Pipeline: Execute the cells in the notebook to process your data.

Note: When working with real patient data, ensure the `testing` flag in the `config_class` is set to `False`.
This project uses Sphinx to generate documentation from the source code's docstrings.
- Install development dependencies: If you haven't already, run the installation script with the `--dev` flag to install Sphinx and its extensions: `./install_pat2vec.sh --dev`
- Activate the virtual environment: `source pat2vec_env/bin/activate`
- Build the HTML documentation: Navigate to the `docs/` directory and use the provided `Makefile`: `cd docs`, then `make html`
- View the documentation: The generated files will be in `docs/build/html/`. You can open the main page in your browser: `open docs/build/html/index.html`
For answers to common questions, troubleshooting tips, and more detailed explanations of project concepts, please see our Frequently Asked Questions page.
If you use pat2vec in your research, please cite it. This helps to credit the work and allows others to find the tool.
@software{hunter_pat2vec_2024,
author = {Hunter, Samora},
title = {pat2vec: A tool for transforming EHR data into feature vectors for machine learning},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/SamoraHunter/pat2vec}}
}
Contributions are welcome! Please see the contributing guidelines for more information.
This project and everyone participating in it is governed by a Code of Conduct. By participating, you are expected to uphold this code. Please report any unacceptable behavior.
This project is licensed under the MIT License. See the LICENSE file for details.



