Overview

This tool converts individual patient records into structured time-interval feature vectors, making them suitable for filtering, aggregation, and assembly into a data matrix D for binary classification machine learning tasks.

Documentation

The full API documentation for pat2vec is automatically generated and hosted on GitHub Pages.

📖 View the Live Documentation

Example Use Cases

1. Patient-Level Aggregation

Compute summary statistics (e.g., the mean of n variables) for each unique patient, resulting in one row per patient. This is ideal for models requiring a single representation per individual.
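As a rough illustration of this use case, the sketch below collapses several time-interval rows per patient into one row of per-feature means. It uses only the standard library. The function name, the feature names (hba1c, bmi), and the values are hypothetical; pat2vec's own aggregation utilities may work differently.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_patient(rows, id_key="client_idcode"):
    """Collapse many interval rows into one row of per-feature means per patient."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for name, value in row.items():
            # Skip the identifier itself and any missing measurements.
            if name != id_key and value is not None:
                grouped[row[id_key]][name].append(value)
    return {
        pid: {name: mean(values) for name, values in feats.items()}
        for pid, feats in grouped.items()
    }


# Hypothetical interval-level rows (illustrative values only).
interval_rows = [
    {"client_idcode": "P001", "hba1c": 40.0, "bmi": 24.0},
    {"client_idcode": "P001", "hba1c": 44.0, "bmi": 26.0},
    {"client_idcode": "P002", "hba1c": 50.0, "bmi": None},
]
patient_rows = aggregate_per_patient(interval_rows)
```

Features with no recorded values for a patient are simply absent from that patient's row, so a downstream step would need to decide how to impute or drop them.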

2. Longitudinal Time Series Construction

Generate a monthly time series for each patient that includes:

Biochemistry results
Demographic attributes
MedCAT-derived clinical text annotations

Each series looks back up to 25 years from the patient's diagnosis date. Aligning on the diagnosis date rather than the calendar gives a consistent retrospective view even though patients are diagnosed at different times.
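To make the alignment concrete, here is a minimal sketch that generates calendar-month windows counting back from a diagnosis date. The helper names shift_months and monthly_windows are illustrative assumptions; how pat2vec builds its intervals internally is not shown here.

```python
from datetime import date


def shift_months(d, months):
    """Return the first day of the month that is `months` away from d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def monthly_windows(diagnosis_date, years_back):
    """List half-open (start, end) month windows, oldest first.

    The windows cover the `years_back` years before the diagnosis month.
    """
    n_months = years_back * 12
    return [
        (shift_months(diagnosis_date, -k), shift_months(diagnosis_date, -k + 1))
        for k in range(n_months, 0, -1)
    ]


# A one-year look-back for a hypothetical diagnosis date.
windows = monthly_windows(date(2020, 3, 15), years_back=1)
```

Because every patient's windows are counted relative to their own diagnosis month, window k means the same thing (k months before diagnosis) for all patients, regardless of calendar date.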

Requirements

Core Services:

CogStack: An operational instance for data retrieval. The required client libraries are now bundled with this project.
Elasticsearch: The backend for CogStack.
MedCAT: For medical concept annotation.

Local Setup:

Python: Version 3.10 or higher.
Virtual Environment: Requires the python3-venv package (or equivalent for your OS).
For all other Python packages, see requirements.txt.
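
As a quick pre-flight check for the local setup, the following sketch tests an interpreter version against the 3.10 minimum stated above and looks for the standard-library venv module. The function names are illustrative and not part of pat2vec.

```python
import importlib.util
import sys

MINIMUM_PYTHON = (3, 10)  # minimum version stated in the requirements above


def meets_minimum(version_info, minimum=MINIMUM_PYTHON):
    """Return True if a (major, minor, ...) version tuple satisfies the minimum."""
    return tuple(version_info[:2]) >= minimum


def venv_available():
    """Return True if the standard-library venv module can be located."""
    return importlib.util.find_spec("venv") is not None


if __name__ == "__main__":
    print("Python version OK:", meets_minimum(sys.version_info))
    print("venv module found:", venv_available())
```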

Features

pat2vec offers a flexible suite of tools for processing and analyzing patient data.

Patient Processing

Single & Batch Processing: Process individual patients for detailed analysis or run large batches for cohort-level studies.

Cohort Management

Cohort Search & Creation: Define and build patient cohorts using flexible search criteria.
Automated Control Matching: Automatically generate random control groups for case-control studies.
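
The idea behind random control selection can be sketched as follows. This is an assumption-laden illustration using simple random sampling without replacement; sample_controls and its parameters are hypothetical names, and pat2vec's actual matching logic may apply further criteria.

```python
import random


def sample_controls(case_ids, candidate_ids, n_per_case=1, seed=0):
    """Draw random controls from candidates that are not themselves cases."""
    cases = set(case_ids)
    # Sort the pool so that a given seed always produces the same draw.
    pool = sorted(set(candidate_ids) - cases)
    n_needed = n_per_case * len(cases)
    if n_needed > len(pool):
        raise ValueError("not enough candidates to draw the requested controls")
    rng = random.Random(seed)
    return rng.sample(pool, n_needed)


# Hypothetical identifiers: two cases, four eligible controls.
controls = sample_controls(
    case_ids=["P1", "P2"],
    candidate_ids=["P1", "P2", "P3", "P4", "P5", "P6"],
    n_per_case=2,
    seed=42,
)
```

Fixing the seed makes the control group reproducible between runs, which is usually desirable when a study needs to be re-run or audited.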

Flexible Feature Engineering

Modular Feature Selection: Choose from a wide range of feature extractors to build a custom feature space tailored to your research question.
Temporal Windowing: Define precise time windows for data extraction relative to a key event (e.g., diagnosis date), including look-back and look-forward periods.
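
A look-back/look-forward window around a key event can be expressed in a few lines. The sketch below is illustrative only: extraction_window and in_window are hypothetical helpers, and the day-based units are an assumption rather than pat2vec's configuration format.

```python
from datetime import date, timedelta


def extraction_window(event_date, lookback_days, lookforward_days):
    """Return the inclusive (start, end) dates surrounding a key event."""
    start = event_date - timedelta(days=lookback_days)
    end = event_date + timedelta(days=lookforward_days)
    return start, end


def in_window(record_date, window):
    """Return True if a record date falls inside the inclusive window."""
    start, end = window
    return start <= record_date <= end


# 30 days before and 7 days after a hypothetical diagnosis date.
window = extraction_window(date(2021, 6, 1), lookback_days=30, lookforward_days=7)
```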

📊 Diagrams

This project includes a collection of diagrams illustrating the system architecture, data pipelines, and feature extraction workflows. You can view the Mermaid definitions or the rendered diagrams below.

📂 System Architecture & Configuration

| Diagram | Mermaid | Image |
| --- | --- | --- |
| System Architecture | assets/system_architecture.mmd | |
| Configuration | assets/config.mmd | |

🛠️ Data Pipelines

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Data Pipeline | assets/data_pipeline.mmd | |
| Main Batch Processing | assets/main_batch.mmd | |
| Example Ingestion | assets/example_ingestion.mmd | |

🧩 Methods & Post-Processing

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Methods Annotation | assets/methods_annotation.mmd | |
| Post-Processing Build Methods | assets/post_processing_build_methods.mmd | |
| Post-Processing Anonymisation | assets/post_processing_anonymisation_high_level.mmd | |

🔍 Feature Extraction

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Ethnicity Abstractor | assets/ethnicity_abstractor.mmd | |
| Get BMI | assets/get_bmi.mmd | |
| Get Demographics | assets/get_demographics.mmd | |
| Get Diagnostics | assets/get_diagnostics.mmd | |
| Get Drugs | assets/get_drugs.mmd | |
| Get Smoking | assets/get_smoking.mmd | |
| Get News | assets/get_news.mmd | |
| Get Dummy Data Cohort Searcher | assets/get_dummy_data_cohort_searcher.mmd | |
| Get Method Bloods | assets/get_method_bloods.mmd | |
| Get Method Patient Annotations | assets/get_method_pat_annotations.mmd | |
| Get Treatment Docs (No Terms Fuzzy) | assets/get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy.mmd | |

Installation

From PyPI (Recommended for Users)

Once pat2vec is installed, you can use it as a library in your Python projects.

Install the package:
```
pip install pat2vec
```
Install all optional dependencies (for full functionality):
```
pip install "pat2vec[all]"
```
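
To confirm that the install succeeded, one option is to ask Python's packaging metadata for the installed version. This is a small sketch using the standard-library importlib.metadata module; the helper name installed_version is illustrative, and the distribution name pat2vec is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pat2vec")
    print("pat2vec version:", version if version else "not installed")
```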

From Source (For Developers/Contributors)

The following instructions are for setting up a development environment from the source code.

Windows

Clone the repository: Navigate to the directory where you want to store your projects. It's recommended to have a parent directory to hold pat2vec and its related assets.
```
git clone https://github.com/SamoraHunter/pat2vec.git
```
Run the installation script: Navigate into the cloned repository and run the batch script. This will create a Python virtual environment, install dependencies from requirements.txt, and set up a Jupyter kernel.
```
cd pat2vec
install.bat
```
The script accepts several flags to customize the installation:
- /p or /proxy: Use if you are behind a corporate proxy.
- /dev: Installs development dependencies (e.g., pytest, sphinx).
- /a or /all: Installs all optional feature dependencies.
- /f or /force: Removes any existing virtual environment for a clean install.
- /no-clone: Skips cloning the snomed_methods helper repository.
Activate the environment: To use the installed packages, activate the virtual environment:
```
pat2vec_env\Scripts\activate
```
Set up for your IDE/Notebook: If you are using an IDE like VS Code or a Jupyter Notebook, make sure to select the pat2vec_env kernel to run your code.
Post-Installation Setup: The script sets up the Python environment, but you must manually arrange other project assets. In the parent directory of your pat2vec clone, you will need to:
- Clone the helper repository:
```
git clone https://github.com/SamoraHunter/snomed_methods.git
```
- Add MedCAT model: Create a medcat_models directory and copy your MedCAT model pack (.zip) into it.
- Add credentials: Create a credentials.py file. You can use pat2vec/pat2vec/config/credentials_template.py as a starting point.
Your final directory structure should look like the one described in the Usage section.

Unix/Linux

The install_pat2vec.sh script is the recommended way to set up a development environment on Unix-like systems. It automates the full setup, including:

Creating a Python virtual environment (pat2vec_env).
Installing Python dependencies (including development and testing tools).
Cloning the snomed_methods helper repository.
Creating required directories and template files (e.g., for MedCAT models and credentials).

To install, clone the repository and navigate into it, then grant execute permission and run the script. The script must be run from within the pat2vec directory.

```shell
chmod +x install_pat2vec.sh
./install_pat2vec.sh
```

The script supports several options:
-   `--proxy`: Use if you are behind a corporate proxy that mirrors Python packages.
-   `--dev`: Installs development dependencies (e.g., `pytest`, `nbmake`) for running tests.
-   `--all`: Installs all optional feature dependencies.
-   `--force`: Removes any existing virtual environment and performs a clean installation.
-   `--no-clone`: Skips cloning the `snomed_methods` repository if you already have it.

For example, to install for development behind a proxy:
```shell
./install_pat2vec.sh --proxy --dev
```

The script creates a directory structure in the parent folder of pat2vec. After it finishes, you must perform two manual steps:
- Place MedCAT model: Copy your model pack into the medcat_models directory created by the script.
- Populate credentials: Edit the credentials.py file created by the script and fill in your details.

Finally, activate the environment to begin working by running `source pat2vec_env/bin/activate`.

Usage

This guide outlines the steps to run a pat2vec analysis after completing the installation.

1. Finalise Project Setup

Before running an analysis, ensure your project directory is set up correctly. If you used the install_pat2vec.sh script, much of this is done for you.

Populate credentials.py: In the parent directory of your pat2vec clone, edit credentials.py with your Elasticsearch credentials.
Add MedCAT Model: Copy your MedCAT model pack (.zip) into the medcat_models directory.

Your final directory structure should look like this:

your_project_folder/
├── credentials.py              # <-- Populated with your credentials
├── medcat_models/
│   └── your_model.zip          # <-- Your MedCAT model pack
├── snomed_methods/             # <-- Cloned helper repository
└── pat2vec/                    # <-- This repository
    ├── notebooks/
    │   └── example_usage.ipynb
    └── ...
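
The layout above can be checked programmatically before starting an analysis. The sketch below is an illustrative helper, not part of pat2vec: it looks for the four top-level entries shown in the tree and for at least one .zip file inside medcat_models.

```python
from pathlib import Path

# Top-level entries expected in the project folder, per the tree above.
EXPECTED_ENTRIES = ["credentials.py", "medcat_models", "snomed_methods", "pat2vec"]


def missing_entries(project_root):
    """Return the expected entries that are absent under project_root."""
    root = Path(project_root)
    missing = [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
    models = root / "medcat_models"
    # A model directory without any model pack is reported as well.
    if models.is_dir() and not any(models.glob("*.zip")):
        missing.append("medcat_models/*.zip")
    return missing


if __name__ == "__main__":
    # Assumes the script is run from inside your_project_folder.
    print("Missing:", missing_entries(".") or "nothing")
```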

2. Prepare Input Data

Create a CSV file containing your patient cohort. This file must include:

A column named client_idcode with unique patient identifiers.
Any other relevant columns, such as a diagnosis date for aligning time series data.

Place this file in an accessible location, such as a new data folder inside pat2vec/notebooks/.
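
Before running the pipeline, it can help to validate the cohort file. The following sketch checks that the client_idcode column exists and that its values are unique. The function name read_cohort_ids and the diagnosis_date column are hypothetical; only client_idcode is required by the text above.

```python
import csv
import io
from collections import Counter


def read_cohort_ids(handle, id_column="client_idcode"):
    """Return patient IDs from an open CSV, checking the column exists and is unique."""
    reader = csv.DictReader(handle)
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"cohort CSV is missing the '{id_column}' column")
    ids = [row[id_column].strip() for row in reader]
    repeated = sorted(pid for pid, n in Counter(ids).items() if n > 1)
    if repeated:
        raise ValueError(f"duplicate patient identifiers: {repeated}")
    return ids


# Illustrative cohort; 'diagnosis_date' is a hypothetical column name.
example_csv = io.StringIO(
    "client_idcode,diagnosis_date\n"
    "P001,2019-04-12\n"
    "P002,2021-11-03\n"
)
cohort_ids = read_cohort_ids(example_csv)
```

For a file on disk, the same function can be called as `with open(path, newline="") as fh: read_cohort_ids(fh)`.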

3. Configure and Run

The example_usage.ipynb notebook provides a template for running the pipeline.

Open the Notebook: Navigate to pat2vec/notebooks/ and open example_usage.ipynb.
Select the Kernel: Ensure the pat2vec_env Jupyter kernel is active.
Configure the Analysis: In the notebook, locate the config_class. This object controls all parameters for your run. You will need to set:
- Paths to your input cohort CSV and output directories.
- The list of features to extract.
- Time windows for data extraction (look-back/look-forward periods).
Run the Pipeline: Execute the cells in the notebook to process your data.

Note: When working with real patient data, ensure the testing flag in the config_class is set to False.
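
The kinds of values described above can be pictured with a small stand-in object. The RunSettings dataclass below is purely hypothetical and is not the real config_class API: its field names, defaults, and the feature names in the example are assumptions chosen to mirror the list of settings in this section. Consult the notebook for the actual parameter names.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RunSettings:
    """Hypothetical stand-in for values set on config_class (not the real API)."""

    cohort_csv: Path
    output_dir: Path
    features: list = field(default_factory=list)
    lookback_days: int = 0
    lookforward_days: int = 0
    testing: bool = True  # assumed default for this sketch only

    def problems(self):
        """Return human-readable issues; an empty list means none were found."""
        issues = []
        if not self.features:
            issues.append("no features selected")
        if self.lookback_days < 0 or self.lookforward_days < 0:
            issues.append("time windows must be non-negative")
        elif self.lookback_days == 0 and self.lookforward_days == 0:
            issues.append("time window is empty")
        if self.testing:
            issues.append("testing flag is True; set it to False for real patient data")
        return issues


# Example with hypothetical paths and feature names.
settings = RunSettings(
    cohort_csv=Path("data/cohort.csv"),
    output_dir=Path("outputs"),
    features=["bloods", "demographics"],
    lookback_days=365,
    testing=False,
)
issues = settings.problems()
```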

Building the Documentation

This project uses Sphinx to generate documentation from the source code's docstrings.

Install development dependencies: If you haven't already, run the installation script with the --dev flag to install Sphinx and its extensions.
```
./install_pat2vec.sh --dev
```
Activate the virtual environment:
```
source pat2vec_env/bin/activate
```
Build the HTML documentation: Navigate to the docs/ directory and use the provided Makefile.
```
cd docs
make html
```
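View the documentation: The generated files will be in docs/build/html/. If you are still in the docs/ directory from the previous step, open the main page in your browser (open is the macOS command; on Linux, xdg-open is the usual equivalent):
```
open build/html/index.html
```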
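
As a platform-independent alternative for viewing the built pages, Python's standard-library webbrowser module can open the index page. This is a minimal sketch; the helper names are illustrative, and the path assumes the docs/build/html/ output location described above.

```python
import webbrowser
from pathlib import Path


def docs_index_uri(repo_root):
    """Build a file:// URI for the generated index page under the repository root."""
    index = Path(repo_root).resolve() / "docs" / "build" / "html" / "index.html"
    return index.as_uri()


def open_docs(repo_root="."):
    """Open the generated documentation in the default browser."""
    return webbrowser.open(docs_index_uri(repo_root))
```

Calling `open_docs()` from the repository root should open the page in the default browser.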

FAQ

For answers to common questions, troubleshooting tips, and more detailed explanations of project concepts, please see our Frequently Asked Questions page.

Frequently Asked Questions

Citation

If you use pat2vec in your research, please cite it. This helps to credit the work and allows others to find the tool.

@software{hunter_pat2vec_2024,
  author = {Hunter, Samora},
  title = {pat2vec: A tool for transforming EHR data into feature vectors for machine learning},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SamoraHunter/pat2vec}}
}

Contributing

Contributions are welcome! Please see the contributing guidelines for more information.

Code of Conduct

This project and everyone participating in it is governed by a Code of Conduct. By participating, you are expected to uphold this code. Please report any unacceptable behavior.

License

This project is licensed under the MIT License. See the LICENSE file for details.
