SamoraHunter/pat2vec

Documentation Status License: MIT Python 3.10+

Overview

This tool converts individual patient records into structured time-interval feature vectors, making them suitable for filtering, aggregation, and assembly into a data matrix D for binary classification machine learning tasks.

Documentation

The full API documentation for pat2vec is automatically generated and hosted on GitHub Pages.

📖 View the Live Documentation

Example Use Cases

1. Patient-Level Aggregation

Compute summary statistics (e.g., the mean of n variables) for each unique patient, resulting in one row per patient. This is ideal for models requiring a single representation per individual.
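
As a rough illustration of this kind of aggregation (not pat2vec's own implementation), the sketch below collapses several per-interval rows into one row of per-variable means for each patient. The `client_idcode` column name comes from the input format described under Usage; the feature names and values are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-interval feature rows: several rows per patient.
rows = [
    {"client_idcode": "P001", "creatinine": 80.0, "bmi": 24.0},
    {"client_idcode": "P001", "creatinine": 90.0, "bmi": 26.0},
    {"client_idcode": "P002", "creatinine": 70.0, "bmi": 30.0},
]


def aggregate_by_patient(rows, id_col="client_idcode"):
    """Return one entry per patient holding the mean of each variable."""
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for name, value in row.items():
            # Skip the identifier column and missing measurements.
            if name != id_col and value is not None:
                values[row[id_col]][name].append(value)
    return {
        patient: {name: mean(vals) for name, vals in variables.items()}
        for patient, variables in values.items()
    }


patient_level = aggregate_by_patient(rows)
print(patient_level["P001"])  # {'creatinine': 85.0, 'bmi': 25.0}
```

The result has exactly one entry per unique patient identifier, which is the shape a model expecting a single representation per individual would consume.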

2. Longitudinal Time Series Construction

Generate a monthly time series for each patient that includes:

  • Biochemistry results
  • Demographic attributes
  • MedCat-derived clinical text annotations

The time series spans up to 25 years retrospectively, aligned to each patient's diagnosis date, enabling a consistent retrospective view across varying start times.
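
To make the alignment concrete, here is a minimal sketch of how month-long windows could be laid out backwards from a diagnosis date. It is an assumption-laden illustration rather than the library's code: the helper names `shift_months` and `monthly_windows` are hypothetical, and days are clamped to 28 to sidestep month-length differences.

```python
from datetime import date


def shift_months(d, months):
    """Move a date by a whole number of months, clamping the day to 28."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def monthly_windows(anchor, years_back):
    """Return (start, end) month-long windows, oldest first, ending at the anchor."""
    n_months = years_back * 12
    edges = [shift_months(anchor, -k) for k in range(n_months, -1, -1)]
    return list(zip(edges[:-1], edges[1:]))


diagnosis_date = date(2020, 6, 15)  # hypothetical diagnosis date
windows = monthly_windows(diagnosis_date, years_back=25)
print(len(windows))  # 300
print(windows[0])    # (datetime.date(1995, 6, 15), datetime.date(1995, 7, 15))
print(windows[-1])   # (datetime.date(2020, 5, 15), datetime.date(2020, 6, 15))
```

Because every patient's windows are counted back from their own diagnosis date, window *k* means the same relative period for all patients, regardless of the calendar date on which each record starts.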

Requirements

Core Services:

  • CogStack: An operational instance for data retrieval. The required client libraries are now bundled with this project.
  • Elasticsearch: The backend for CogStack.
  • MedCAT: For medical concept annotation.

Local Setup:

  • Python: Version 3.10 or higher.
  • Virtual Environment: Requires the python3-venv package (or equivalent for your OS).
  • For all other Python packages, see requirements.txt.

Features

pat2vec offers a flexible suite of tools for processing and analyzing patient data.

Patient Processing

  • Single & Batch Processing: Process individual patients for detailed analysis or run large batches for cohort-level studies.

Cohort Management

  • Cohort Search & Creation: Define and build patient cohorts using flexible search criteria.
  • Automated Control Matching: Automatically generate random control groups for case-control studies.

Flexible Feature Engineering

  • Modular Feature Selection: Choose from a wide range of feature extractors to build a custom feature space tailored to your research question.
  • Temporal Windowing: Define precise time windows for data extraction relative to a key event (e.g., diagnosis date), including look-back and look-forward periods.
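
The idea behind temporal windowing can be sketched in a few lines of plain Python. This is only an illustration under assumed names (`in_window`, `look_back_days`, `look_forward_days` are hypothetical and are not pat2vec configuration options): an event is kept if it falls between the anchor date minus the look-back period and the anchor date plus the look-forward period.

```python
from datetime import date, timedelta


def in_window(event_date, anchor, look_back_days, look_forward_days):
    """True if event_date lies within [anchor - look_back, anchor + look_forward]."""
    start = anchor - timedelta(days=look_back_days)
    end = anchor + timedelta(days=look_forward_days)
    return start <= event_date <= end


anchor = date(2021, 3, 1)  # hypothetical key event, e.g. a diagnosis date
events = [date(2020, 1, 10), date(2021, 2, 20), date(2021, 3, 15), date(2021, 6, 1)]

selected = [
    e for e in events
    if in_window(e, anchor, look_back_days=30, look_forward_days=30)
]
print(selected)  # [datetime.date(2021, 2, 20), datetime.date(2021, 3, 15)]
```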

📊 Diagrams

This project includes a collection of diagrams illustrating the system architecture, data pipelines, and feature extraction workflows. You can view the Mermaid definitions or the rendered diagrams below.

📂 System Architecture & Configuration

| Diagram | Mermaid | Image |
| --- | --- | --- |
| System Architecture | assets/system_architecture.mmd | System Architecture |
| Configuration | assets/config.mmd | Configuration |

🛠️ Data Pipelines

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Data Pipeline | assets/data_pipeline.mmd | Data Pipeline |
| Main Batch Processing | assets/main_batch.mmd | Main Batch |
| Example Ingestion | assets/example_ingestion.mmd | Example Ingestion |

🧩 Methods & Post-Processing

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Methods Annotation | assets/methods_annotation.mmd | Methods Annotation |
| Post-Processing Build Methods | assets/post_processing_build_methods.mmd | Post-Processing Build Methods |
| Post-Processing Anonymisation | assets/post_processing_anonymisation_high_level.mmd | Post-Processing Anonymisation |

🔍 Feature Extraction

| Diagram | Mermaid | Image |
| --- | --- | --- |
| Ethnicity Abstractor | assets/ethnicity_abstractor.mmd | Ethnicity Abstractor |
| Get BMI | assets/get_bmi.mmd | Get BMI |
| Get Demographics | assets/get_demographics.mmd | Get Demographics |
| Get Diagnostics | assets/get_diagnostics.mmd | Get Diagnostics |
| Get Drugs | assets/get_drugs.mmd | Get Drugs |
| Get Smoking | assets/get_smoking.mmd | Get Smoking |
| Get News | assets/get_news.mmd | Get News |
| Get Dummy Data Cohort Searcher | assets/get_dummy_data_cohort_searcher.mmd | Get Dummy Data Cohort Searcher |
| Get Method Bloods | assets/get_method_bloods.mmd | Get Method Bloods |
| Get Method Patient Annotations | assets/get_method_pat_annotations.mmd | Get Method Patient Annotations |
| Get Treatment Docs (No Terms Fuzzy) | assets/get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy.mmd | Get Treatment Docs (No Terms Fuzzy) |

Installation

From PyPI (Recommended for Users)

Once pat2vec is installed, you can use it as a library in your Python projects.

  1. Install the package:

    pip install pat2vec
  2. Install all optional dependencies (for full functionality):

    pip install "pat2vec[all]"

From Source (For Developers/Contributors)

The following instructions are for setting up a development environment from the source code.

Windows

  1. Clone the repository: Navigate to the directory where you want to store your projects. It's recommended to have a parent directory to hold pat2vec and its related assets.

    git clone https://github.com/SamoraHunter/pat2vec.git
  2. Run the installation script: Navigate into the cloned repository and run the batch script. This will create a Python virtual environment, install dependencies from requirements.txt, and set up a Jupyter kernel.

    cd pat2vec
    install.bat

    The script accepts several flags to customize the installation:

    • /p or /proxy: Use if you are behind a corporate proxy.
    • /dev: Installs development dependencies (e.g., pytest, sphinx).
    • /a or /all: Installs all optional feature dependencies.
    • /f or /force: Removes any existing virtual environment for a clean install.
    • /no-clone: Skips cloning the snomed_methods helper repository.
  3. Activate the environment: To use the installed packages, activate the virtual environment:

    pat2vec_env\Scripts\activate
  4. Set up for your IDE/Notebook: If you are using an IDE like VS Code or a Jupyter Notebook, make sure to select the pat2vec_env kernel to run your code.

  5. Post-Installation Setup: The script sets up the Python environment, but you must manually arrange other project assets. In the parent directory of your pat2vec clone, you will need to:

    • Clone the helper repository:
      git clone https://github.com/SamoraHunter/snomed_methods.git
    • Add MedCAT model: Create a medcat_models directory and copy your MedCAT model pack (.zip) into it.
    • Add credentials: Create a credentials.py file. You can use pat2vec/pat2vec/config/credentials_template.py as a starting point.

    Your final directory structure should look like the one described in the Usage section.

Unix/Linux

The install_pat2vec.sh script is the recommended way to set up a development environment on Unix-like systems. It automates the full setup, including:

  • Creating a Python virtual environment (pat2vec_env).
  • Installing Python dependencies (including development and testing tools).
  • Cloning the snomed_methods helper repository.
  • Creating required directories and template files (e.g., for MedCAT models and credentials).

To install, clone the repository and navigate into it, then grant execution permissions and run the script. The script must be run from within the pat2vec directory.

```shell
chmod +x install_pat2vec.sh
./install_pat2vec.sh
```

The script supports several options:
-   `--proxy`: Use if you are behind a corporate proxy that mirrors Python packages.
-   `--dev`: Installs development dependencies (e.g., `pytest`, `nbmake`) for running tests.
-   `--all`: Installs all optional feature dependencies.
-   `--force`: Removes any existing virtual environment and performs a clean installation.
-   `--no-clone`: Skips cloning the `snomed_methods` repository if you already have it.

For example, to install for development behind a proxy:
```shell
./install_pat2vec.sh --proxy --dev
```

After running the script, you must perform two manual steps. The script creates a directory structure in the parent folder of pat2vec.

-   Place MedCAT model: Copy your model pack into the `medcat_models` directory created by the script.
-   Populate credentials: Edit the `credentials.py` file created by the script and fill in your details.

Finally, activate the environment to begin working:

    source pat2vec_env/bin/activate

Usage

This guide outlines the steps to run a pat2vec analysis after completing the installation.

1. Finalise Project Setup

Before running an analysis, ensure your project directory is set up correctly. If you used the install_pat2vec.sh script, much of this is done for you.

  1. Populate credentials.py: In the parent directory of your pat2vec clone, edit credentials.py with your Elasticsearch credentials.
  2. Add MedCAT Model: Copy your MedCAT model pack (.zip) into the medcat_models directory.

Your final directory structure should look like this:

your_project_folder/
├── credentials.py              # <-- Populated with your credentials
├── medcat_models/
│   └── your_model.zip          # <-- Your MedCAT model pack
├── snomed_methods/             # <-- Cloned helper repository
└── pat2vec/                    # <-- This repository
    ├── notebooks/
    │   └── example_usage.ipynb
    └── ...
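
If you want a quick sanity check of this layout before opening the notebook, something like the following sketch may help. It is not part of pat2vec; the function name `missing_project_assets` is hypothetical, and it only checks the top-level entries listed above plus the presence of at least one `.zip` model pack.

```python
from pathlib import Path


def missing_project_assets(project_root):
    """Return the expected top-level entries that are absent under project_root."""
    root = Path(project_root)
    expected = ["credentials.py", "medcat_models", "snomed_methods", "pat2vec"]
    missing = [name for name in expected if not (root / name).exists()]
    # The model pack name is site-specific, so only check that some .zip is present.
    models_dir = root / "medcat_models"
    if models_dir.is_dir() and not list(models_dir.glob("*.zip")):
        missing.append("medcat_models/*.zip")
    return missing


if __name__ == "__main__":
    # Assumes the script is run from your_project_folder/.
    for item in missing_project_assets("."):
        print(f"missing: {item}")
```

An empty result means every entry from the tree was found. It does not verify that the credentials are populated or that the model pack is valid.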

2. Prepare Input Data

Create a CSV file containing your patient cohort. This file must include:

  • A column named client_idcode with unique patient identifiers.
  • Any other relevant columns, such as a diagnosis date for aligning time series data.

Place this file in an accessible location, such as a new data folder inside pat2vec/notebooks/.
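
Before running the pipeline, it can be worth checking that the cohort file meets these requirements. The sketch below is a hypothetical helper, not a pat2vec function: it confirms that a `client_idcode` column exists and that its values are unique. The `diagnosis_date` column name in the sample data is purely illustrative.

```python
import csv
import io
from collections import Counter


def validate_cohort(csv_file, id_col="client_idcode"):
    """Return the patient IDs in a cohort CSV, raising on a missing column or duplicates."""
    reader = csv.DictReader(csv_file)
    if id_col not in (reader.fieldnames or []):
        raise ValueError(f"cohort file has no '{id_col}' column")
    ids = [row[id_col] for row in reader]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate patient identifiers: {duplicates}")
    return ids


# Hypothetical cohort held in memory for the example.
example = io.StringIO(
    "client_idcode,diagnosis_date\n"
    "P001,2019-04-02\n"
    "P002,2020-11-17\n"
)
print(validate_cohort(example))  # ['P001', 'P002']
```

For a file on disk, pass an open handle instead, for example `with open(path, newline="") as f: validate_cohort(f)`.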

3. Configure and Run

The example_usage.ipynb notebook provides a template for running the pipeline.

  1. Open the Notebook: Navigate to pat2vec/notebooks/ and open example_usage.ipynb.
  2. Select the Kernel: Ensure the pat2vec_env Jupyter kernel is active.
  3. Configure the Analysis: In the notebook, locate the config_class. This object controls all parameters for your run. You will need to set:
    • Paths to your input cohort CSV and output directories.
    • The list of features to extract.
    • Time windows for data extraction (look-back/look-forward periods).
  4. Run the Pipeline: Execute the cells in the notebook to process your data.

Note: When working with real patient data, ensure the testing flag in the config_class is set to False.

Building the Documentation

This project uses Sphinx to generate documentation from the source code's docstrings.

  1. Install development dependencies: If you haven't already, run the installation script with the --dev flag to install Sphinx and its extensions.

    ./install_pat2vec.sh --dev
  2. Activate the virtual environment:

    source pat2vec_env/bin/activate
  3. Build the HTML documentation: Navigate to the docs/ directory and use the provided Makefile.

    cd docs
    make html
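  4. View the documentation: The generated files will be in docs/build/html/. From the docs/ directory you entered in the previous step, open the main page in your browser (open is the macOS command; on most Linux desktops, use xdg-open instead):

    open build/html/index.html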

FAQ

For answers to common questions, troubleshooting tips, and more detailed explanations of project concepts, please see our Frequently Asked Questions page.

Citation

If you use pat2vec in your research, please cite it. This helps to credit the work and allows others to find the tool.

@software{hunter_pat2vec_2024,
  author = {Hunter, Samora},
  title = {pat2vec: A tool for transforming EHR data into feature vectors for machine learning},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SamoraHunter/pat2vec}}
}

Contributing

Contributions are welcome! Please see the contributing guidelines for more information.

Code of Conduct

This project and everyone participating in it is governed by a Code of Conduct. By participating, you are expected to uphold this code. Please report any unacceptable behavior.

License

This project is licensed under the MIT License; see the LICENSE file for details.
