This repository contains the code to reproduce the analyses from the paper: Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction.
The pipeline integrates Named Entity Recognition (NER), Relation Extraction (RE), and XGBoost-based phenotype prediction using Snakemake workflows.
- Python ≥3.8
- Snakemake ≥6.0
- Mamba or Conda ≥4.9
- CUDA-capable GPU (recommended for training)
- ~50GB free disk space for full pipeline
- 16GB+ RAM recommended
- NCBI API key (optional, speeds up genome downloads)
- InterProScan ≥5.0 (for protein annotation)
- Download and extract to your system
- Update the path in `ip.smk` at the line containing `interproscan.sh`
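The path edit can also be scripted. Below is a sketch against a stand-in file so the command is safe to try; the `/old/path` and `/opt/interproscan-5.65` paths are placeholders, not the repository's actual contents:

```bash
# Demonstrate the edit on a stand-in file; in practice, run the sed line
# against ip.smk with the path to your real InterProScan install.
printf '    "/old/path/interproscan.sh -i {input} -o {output}"\n' > ip_demo.smk
sed -i 's|/old/path/interproscan.sh|/opt/interproscan-5.65/interproscan.sh|' ip_demo.smk
grep interproscan.sh ip_demo.smk
```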
Before running any pipeline, adjust config.yaml to match your setup.
| Parameter | Description | Default/Example |
|---|---|---|
| `dataset` | Corpus identifier (determines output directories) | `1108` |
| `cuda_devices` | GPU devices for training | `[0]` |
| `input_file` | Path to manually annotated training data | `label/project-10-at-2025-08-21-21-08-cb43bf25.json` |
| `ner_epochs` | Training epochs for NER models | `15` |
| `rel_epochs` | Training epochs for RE models | `25` |
| `ner_test` | Test split ratio for NER | `0.2` |
| `rel_test` | Test split ratio for RE | `0.3` |
| `cutoff_prediction` | Confidence threshold for predictions | `0.50` |
| `model` | Model size (base/large) | `"large"` |
| `seed` | Random seed for reproducibility | `97` |
| `output_path` | Base output directory | `/pfs/work9/workspace/scratch/tu_kmpaj01-link` |
| `pmc_parquet_file` | PMC corpus data file | `snakemake_PMC/output/data/pmc_filtered.parquet` |
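Put together, a minimal `config.yaml` using the defaults above might look like this (illustrative only; substitute your own paths):

```yaml
dataset: 1108
cuda_devices: [0]
input_file: label/project-10-at-2025-08-21-21-08-cb43bf25.json
ner_epochs: 15
rel_epochs: 25
ner_test: 0.2
rel_test: 0.3
cutoff_prediction: 0.50
model: "large"
seed: 97
output_path: /pfs/work9/workspace/scratch/tu_kmpaj01-link
pmc_parquet_file: snakemake_PMC/output/data/pmc_filtered.parquet
```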
Entity and relation labels are also defined in `config.yaml`:

```yaml
ner_labels: [STRAIN, SPECIES, ISOLATE, COMPOUND, MEDIUM, ORGANISM, PHENOTYPE, EFFECT, DISEASE]
rel_labels:
  - STRAIN-ISOLATE:INHABITS
  - STRAIN-MEDIUM:GROWS_ON
  - STRAIN-PHENOTYPE:PRESENTS
  - STRAIN-ORGANISM:INHABITS
  - STRAIN-COMPOUND:RESISTS
  # ... (16 total relationship types)
```

Create the main environments:
```bash
# Create primary environment
mamba env create -f envs/nlp4pheno.yml

# Create PMC corpus processing environment
mamba env create -f envs/pmc.yml
```

Note: Additional environments (`pytorch.yml`, `xgb.yml`) are created automatically by Snakemake when needed.
Activate the main environment for running Snakemake:
```bash
conda activate nlp4pheno
```

Test your setup:
```bash
# Check Snakemake
snakemake --version

# Check available environments
conda env list

# Test configuration parsing (dry run)
snakemake -n -s ner.smk
```

- Use the code in `snakemake_PMC/` to download files (requires the `pmc` environment):
```bash
conda activate pmc
snakemake --cores 20 --use-conda --executor slurm -s snakemake_PMC/Snakefile
```

- Prepare corpus files using the `scripts/make_test_corpus.py` script. Files are saved to the `corpus{dataset}/` directory (where `{dataset}` is defined in `config.yaml`).
The manually annotated dataset is provided in `label/project-10-at-2025-08-21-21-08-cb43bf25.json` (Label Studio JSON format).
Format Requirements:
- Label Studio JSON export format
- Must contain annotations for all 9 entity types: `STRAIN`, `SPECIES`, `ISOLATE`, `COMPOUND`, `MEDIUM`, `ORGANISM`, `PHENOTYPE`, `EFFECT`, `DISEASE`
- Annotations should include entity spans and relationship labels
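As an illustration of the expected structure, the snippet below writes a one-task example; the field names follow the standard Label Studio export layout, while the text, offsets, and labels are invented:

```bash
# Minimal example of a Label Studio-style export: one task with two
# entity spans. The real annotation file in this repository is
# label/project-10-at-2025-08-21-21-08-cb43bf25.json.
cat > example_task.json <<'EOF'
[
  {
    "data": {"text": "Strain ABC123 grows on LB medium."},
    "annotations": [
      {
        "result": [
          {"value": {"start": 7, "end": 13, "labels": ["STRAIN"]},
           "type": "labels"},
          {"value": {"start": 23, "end": 32, "labels": ["MEDIUM"]},
           "type": "labels"}
        ]
      }
    ]
  }
]
EOF
python3 -m json.tool example_task.json > /dev/null && echo "valid JSON"
```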
Corpus files are automatically generated from the PMC parquet data during the NER prediction step. The `ner_pred.smk` pipeline will:
- Read from the `pmc_parquet_file` specified in `config.yaml`
- Generate numbered text files (e.g., `0.txt`, `1.txt`) in the `corpus{dataset}/` directory
- Process files in chunks for efficient prediction
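For instance, with `dataset: 1108` the layout the prediction step works with looks like this (the sentences are invented placeholders):

```bash
# Sketch of the numbered-corpus layout: one plain-text chunk per file,
# named 0.txt, 1.txt, ... inside corpus{dataset}/ (here dataset=1108).
mkdir -p corpus1108
printf 'Strain ABC123 grows on LB medium.\n' > corpus1108/0.txt
printf 'Isolate XYZ-9 is resistant to ampicillin.\n' > corpus1108/1.txt
ls corpus1108
```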
Train NER models for each entity type (STRAIN, SPECIES, PHENOTYPE, etc.):
```bash
snakemake --cores 20 --use-conda -s ner.smk
```

Requirements: 8 GB of GPU memory recommended

Output:
- `NER/`: Data splits for training and testing
- `NER_output/`: Trained models, metrics, and Nervaluate evaluation results
Train models to predict relationships between entities:
```bash
rm -rf REL*
snakemake --cores 20 --use-conda -s rel.smk
```

Requirements: 8 GB of GPU memory recommended

Output:
- `REL/`: Data for training and testing
- `REL_output/`: Trained models and metrics
Apply trained NER models to the PMC corpus:
```bash
snakemake --cores 20 --use-conda -s ner_pred.smk
```

Process: Runs the STRAIN model on the entire corpus, then applies the other NER models to strain-containing sentences. Only sentences containing both STRAIN and phenotype entities are kept.
Requirements: GPU recommended
Output: Saved to the directory specified in `config.yaml`
Apply trained RE models to extract relationships:
```bash
snakemake --cores 20 --use-conda -s rel_pred.smk
```

Process: Analyzes sentences containing STRAIN and phenotype entities to extract relationships.
Requirements: GPU recommended
- Optionally create a `.ncbi_api_key` file in the root directory with your NCBI API key (recommended to speed up downloads)
- Adjust the InterProScan installation path in the pipeline
- Run genome download and annotation:

```bash
snakemake --cores 20 --use-conda -s ip.smk
```

Output will be saved to the `assemblies_{dataset}/` directory.
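Creating the key file is a one-liner; the value below is a fake placeholder (real keys come from your NCBI account settings):

```bash
# Store the NCBI API key in the repository root; substitute your own key
# for the dummy value, and keep the file readable only by you.
echo "0123456789abcdef0123456789abcdef0123" > .ncbi_api_key
chmod 600 .ncbi_api_key
```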
Run XGBoost models for phenotype prediction based on protein domains:
```bash
snakemake --cores 40 --use-conda -s xgboost.smk
```

Output:
- `xgboost/annotations{dataset}/binary/binary.pkl`: Main results file
- `xgboost/seqfiles{dataset}/`: Sequences for evolution analysis
- `xgboost/features{dataset}/`: Feature matrices and importance scores
- Analysis notebook: `analyze_xgboost_tidy.ipynb`
```
NER/                        # Training data splits
├── {ENTITY}/
│   ├── train.json          # Training set
│   ├── dev.json            # Validation set
│   └── test.json           # Test set
NER_output/                 # Model outputs
├── {ENTITY}/
│   ├── model.safetensors   # Trained model
│   ├── all_results.json    # Training metrics
│   ├── overall_results.json # Nervaluate results
│   └── test_predictions.txt # Test predictions
└── aggregated_eval.tsv     # Combined metrics
REL/                        # Training data
├── {RELATION}/
│   ├── train.json
│   ├── dev.json
│   └── test.json
REL_output/                 # Model outputs
├── {RELATION}/
│   ├── model.safetensors
│   └── all_results.json
└── all_metrics.tsv
corpus{dataset}/            # Input corpus files
preds{dataset}/             # NER/RE predictions
├── NER_output/
└── REL_output/
assemblies_{dataset}/       # Genome data
├── {strain}/
│   ├── genomic.fna         # Genome sequence
│   └── protein.faa         # Protein sequences
xgboost/                    # ML outputs
├── annotations{dataset}/
├── features{dataset}/
└── seqfiles{dataset}/
```
Analyze selective pressure on important protein domains:
```bash
snakemake --cores 20 --use-conda -s evolution.smk
```

Analysis notebook: `analyze_evolution.ipynb`
```
PMC Corpus Creation
│
├── Manual Annotations (Label Studio)
│    │
│    ├── 1. NER Training (ner.smk) ────────────┐
│    │                                         │
│    └── 2. RE Training (rel.smk)              │
│         │                                    │
│         └── 3. NER Prediction (ner_pred.smk) ─┴─┐
│              │                                  │
│              └── 4. RE Prediction (rel_pred.smk)
│                   │
│                   ├── 5. Genome Download & Annotation (ip.smk)
│                   │    │
│                   │    └── 6. XGBoost Phenotype Prediction (xgboost.smk)
│                   │         │
│                   │         └── 7. Evolution Analysis (evolution.smk)
```
Dependencies:
- Steps 3-4: Require trained models from steps 1-2
- Step 5: Can run independently after step 4
- Step 6: Requires outputs from steps 4-5
- Step 7: Requires outputs from step 6
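The dependency order above can be captured in a small driver script. A sketch follows; `run_all.sh` is a hypothetical helper, not part of the repository, and the `--cores` values simply mirror the examples in this README:

```bash
# Write a driver script chaining the seven steps in dependency order,
# then syntax-check it (it is not executed here, since a real run needs
# the trained models and data from the earlier sections).
cat > run_all.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
snakemake --cores 20 --use-conda -s ner.smk        # 1. NER training
snakemake --cores 20 --use-conda -s rel.smk        # 2. RE training
snakemake --cores 20 --use-conda -s ner_pred.smk   # 3. NER prediction
snakemake --cores 20 --use-conda -s rel_pred.smk   # 4. RE prediction
snakemake --cores 20 --use-conda -s ip.smk         # 5. Genomes + InterProScan
snakemake --cores 40 --use-conda -s xgboost.smk    # 6. XGBoost prediction
snakemake --cores 20 --use-conda -s evolution.smk  # 7. Evolution analysis
EOF
chmod +x run_all.sh
bash -n run_all.sh && echo "syntax ok"
```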
If you use this pipeline, please cite:
```bibtex
@article{nlp4pheno2024,
  title={Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction},
  author={[Authors]},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.12.07.627346},
  url={https://doi.org/10.1101/2024.12.07.627346}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.