feat!(prepro): multi-pathogen refactor, apply segment-reference ordering #5800

anna-parker · 2026-01-06T11:31:00Z

partially resolves #5663 and #5664

Overview

I decided to split the prepro changes required for #5799 out into a single PR.

This change allows multiple references per segment. Prepro assigns each sequence to the correct reference within a segment and returns the aligned (and unaligned) sequences under the key (of type SequenceName) expected by the backend.

Prepro attempts to align each sequence to one reference per segment. If a sequence can be aligned to multiple references, it chooses the reference with the highest nextclade alignment or nextclade sort score.

If multiple sequences within a submission align to the same segment (including when they align to different references of that segment), the submission will error.
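
The selection rule described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the Hit type, its fields and the assign_references function are hypothetical names, and the score stands in for whichever nextclade alignment or sort score prepro actually compares.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate alignment of a submitted sequence (hypothetical type)."""

    sequence_id: str
    segment: str
    reference: str
    score: float


def assign_references(hits: list[Hit]) -> dict[str, Hit]:
    """Return the winning hit per segment; raise if two sequences share a segment."""
    # For each (sequence, segment) pair keep only the highest-scoring reference.
    best: dict[tuple[str, str], Hit] = {}
    for hit in hits:
        key = (hit.sequence_id, hit.segment)
        if key not in best or hit.score > best[key].score:
            best[key] = hit

    # Group the winners by segment to detect several sequences on one segment.
    by_segment: dict[str, list[Hit]] = defaultdict(list)
    for hit in best.values():
        by_segment[hit.segment].append(hit)

    assigned: dict[str, Hit] = {}
    for segment, segment_hits in by_segment.items():
        if len(segment_hits) > 1:
            ids = sorted(h.sequence_id for h in segment_hits)
            raise ValueError(f"Multiple sequences {ids} align to segment {segment!r}")
        assigned[segment] = segment_hits[0]
    return assigned
```

Note that two sequences hitting different references of the same segment still raise, which matches the behaviour described in the overview.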

Changes

The yaml config is changed to reflect the segment-reference hierarchy (see breaking changes below). This config is used to create a list of processed NextcladeSequenceDataset objects for each reference.
Improved typing: the SequenceName type (the name of a processed sequence as expected by the backend) is introduced to distinguish it from SegmentName. For example, if the segment L has references A and B, then the SegmentName is L but the SequenceName will be L_A.
Removal of the useFirstSegment config option: perSegment metadata fields will always be assigned to the results of the reference they best align to.
The ASSIGNED_SEGMENT field is removed and replaced with ASSIGNED_REFERENCE, which is now a perSegment field.

Breaking changes

The prepro config must be changed from

configFile: 
   nextclade_sequence_and_datasets:
    - name: CV-A16 # This does not work yet with multi-segment organisms: https://github.com/loculus-project/loculus/issues/5663
      nextclade_dataset_name: enpen/enterovirus/cv-a16
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
      gene_prefix: "CV-A16-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: CV-A10
      nextclade_dataset_name: enpen/enterovirus/cv-a10
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
      gene_prefix: "CV-A10-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]

to (note we can add more segments with a variable number of references):

configFile:
  segments:
    - name: main
      references:
        - reference: CV-A16
          nextclade_dataset_name: enpen/enterovirus/cv-a16
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
        - reference: CV-A10
          nextclade_dataset_name: enpen/enterovirus/cv-a10
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: seg2
      genes: ...
      references:
        - ...
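
For deployments with an existing config, the change of shape could be mechanised along these lines. This is a sketch under assumptions, not a script shipped with this PR: migrate_prepro_config is a hypothetical helper that works on already-parsed YAML, it places every old entry under one segment (named "main" here, as in the example above), and it drops gene_prefix because the new example no longer sets it.

```python
from typing import Any

# Keys copied unchanged from an old dataset entry to a new reference entry.
CARRIED_KEYS = ("nextclade_dataset_name", "accepted_sort_matches", "genes")


def migrate_prepro_config(old: dict[str, Any], segment_name: str = "main") -> dict[str, Any]:
    """Convert the old flat dataset list into the segment -> references layout."""
    references = []
    for entry in old.get("nextclade_sequence_and_datasets", []):
        # The old `name` becomes the `reference` key of the new format.
        reference: dict[str, Any] = {"reference": entry["name"]}
        for key in CARRIED_KEYS:
            if key in entry:
                reference[key] = entry[key]
        references.append(reference)
    # Assumption: a single-segment organism, so all references share one segment.
    return {"segments": [{"name": segment_name, "references": references}]}
```

Multi-segment organisms would need a mapping from old entries to segments, which the old format does not carry, so that case has to be migrated by hand.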

TODO:

check dummy pipeline -> confirmed this works without issues
rename segment to sequence when corresponding to name of segment/reference
change ASSIGNED_SEGMENT to ASSIGNED_REFERENCE and make this per segment
get rid of useFirstSegment
fix error when multiple segments of the same type exist in the multi-reference case
All necessary documentation has been adapted.
The implemented feature is covered by appropriate, automated tests.
Any manual testing that has been done is documented (i.e. what exactly was tested?) -> confirmed that all current organisms are unaffected by this change, including EVs; integration tests also all pass

🚀 Preview: https://prepro-multipath.loculus.org

claude · 2026-01-07T14:06:56Z

PR Review: Multi-pathogen refactor with segment-reference ordering

Summary

This PR introduces a significant architectural change to support multiple references per segment in the preprocessing pipeline. The implementation is well-structured and handles the complexity of multi-reference/multi-segment scenarios effectively. However, there are several areas that need attention before merging.

Critical Issues

1. Exception Handling - config.py:170-172

def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
    datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
    if len(datasets) > 1:
        raise Exception  # ❌ Generic exception
    return datasets[0]  # ❌ IndexError if list is empty

Issues:

Raises generic Exception instead of a specific exception type
Will raise IndexError if no datasets match (empty list)
No error message provided

Recommendation:

def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
    datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
    if len(datasets) == 0:
        raise ValueError(f"No dataset found with name: {name}")
    if len(datasets) > 1:
        raise ValueError(f"Multiple datasets found with name: {name}, expected exactly one")
    return datasets[0]

2. Type Safety - config.py:159

references: list[Reference] | list[None] = segment.references or [None]

Issue: The type annotation list[Reference] | list[None] is misleading. This actually creates a list containing None elements, not an empty list.

Recommendation:

# More explicit approach
references = segment.references if segment.references else [None]
# Or with better typing
references: list[Reference | None] = segment.references if segment.references else [None]

3. Missing Documentation - Breaking Changes

The PR description mentions breaking changes to the config format, but there's no migration guide or deprecation warning in the code. Users upgrading will encounter runtime errors without clear guidance.

Recommendation:

Add a validation check in Config.finalize() that detects the old config format and provides a helpful error message with migration instructions
Consider adding a migration script in maintenance-scripts/
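
One possible shape for such a check, assuming the raw config is available as a parsed dict before model validation. The function name check_config_format and the message wording are illustrative only; the old top-level key is taken from the PR description.

```python
from typing import Any

# Top-level key used by the old config format, per the PR description.
OLD_FORMAT_KEY = "nextclade_sequence_and_datasets"


def check_config_format(raw_config: dict[str, Any]) -> None:
    """Raise a descriptive error if the config still uses the old flat layout."""
    if OLD_FORMAT_KEY in raw_config:
        raise ValueError(
            f"The '{OLD_FORMAT_KEY}' key is no longer supported. "
            "Move each entry under 'segments: [{name: ..., references: [...]}]' "
            "and rename its 'name' key to 'reference'."
        )
```

Failing early with the offending key named should be easier to act on than a validation error raised deeper in the pipeline.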

Code Quality Issues

4. Incomplete TODO - config.py:151

# TODO: this should be a suffix in future
ds.gene_prefix = ds.reference if multi_reference else None

Concern: This is marked as a future change but directly impacts the current implementation. The comment suggests the current behavior (prefix) is not ideal, but there's no issue filed to track this.

Recommendation:

Either implement as suffix now (if it's the correct design), or
File a GitHub issue to track this technical debt and reference it in the comment

5. Complex Naming Logic - config.py:175-187

The set_sequence_name function has complex logic with multiple cases. While the match statement is clear, the different naming conventions could be confusing.

Example outputs:

Single segment, multi-reference: "CV-A16" (just reference name)
Multi-segment, multi-reference: "L-A" (segment-reference)
Single segment, single reference: "main" (just segment name)

Recommendation:

Add comprehensive docstring with examples for each case
Consider adding unit tests specifically for this function

6. Gene Prefix/Suffix Confusion

In nextclade.py:90-91:

def create_gene_name(gene: str, gene_prefix: str | None) -> str:
    return gene_prefix + "-" + gene if gene_prefix else gene

This adds the reference name as a prefix with a dash, but the parameter is called gene_prefix and the TODO says it should be a suffix. This naming is confusing.

Current output: "CV-A16-VP4" (reference-gene)
If it were a suffix: "VP4-CV-A16" (gene-reference)

Recommendation:

Clarify the intended design and update naming accordingly
Ensure consistency across the codebase

Best Practices & Improvements

7. Code Duplication in Tests

The test config files have been updated correctly, but there's potential for helper functions to reduce duplication in test setup.

8. Schema Validation - values.schema.json

The JSON schema has been updated appropriately, but the additionalProperties: false on line 634 might be too strict if future fields need to be added to Reference.

Recommendation: Consider using JSON Schema $ref for better maintainability.

9. Logging Improvements - nextclade.py

Good use of debug logging throughout, but some error cases could benefit from more context:

Line 193: When nextclade sort fails, log the command that failed
Line 749: When nextclade run fails, capture and log stderr
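
A generic wrapper along these lines would cover both cases. It is a sketch, not the code in nextclade.py: run_logged is a hypothetical helper, and the actual nextclade command lines are not shown here.

```python
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)


def run_logged(command: list[str]) -> subprocess.CompletedProcess:
    """Run a command in list form; on failure log the command and its stderr."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        logger.error(
            "Command failed with exit code %d: %s\nstderr: %s",
            result.returncode,
            shlex.join(command),  # shell-quoted, for the log message only
            result.stderr.strip(),
        )
        raise subprocess.CalledProcessError(
            result.returncode, command, output=result.stdout, stderr=result.stderr
        )
    return result
```

The command is still passed to subprocess.run as a list, so the shell-quoted string is never executed.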

Security Considerations

10. Command Injection Prevention

Good use of list-form subprocess calls throughout (e.g., lines 167-186 in nextclade.py). However, ensure user-provided config values (dataset names, URLs) are validated before being passed to subprocess.

Current state: ✅ Safe - using list form correctly
Recommendation: Add validation for config values that are URL-like or could contain shell metacharacters
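
A conservative allow-list check for dataset names could look like this. The permitted character set is an assumption derived from the names in this PR (letters, digits, ".", "_", "-" and "/"), and validate_dataset_name is a hypothetical helper.

```python
import re

# Assumed allow-list: must start with a letter or digit, then letters, digits, ".", "_", "-", "/".
_SAFE_DATASET_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/-]*")


def validate_dataset_name(name: str) -> str:
    """Return the name unchanged if it looks safe, otherwise raise ValueError."""
    if not _SAFE_DATASET_NAME.fullmatch(name) or ".." in name:
        raise ValueError(f"Invalid nextclade dataset name: {name!r}")
    return name
```

Rejecting ".." also rules out path traversal if the name is ever used to build a file path.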

Performance Considerations

11. Sequential Processing

In enrich_with_nextclade (nextclade.py:642-785), sequences are processed sequentially for each dataset. For organisms with many references, this could be slow.

Recommendation: Consider parallelizing nextclade runs for different datasets (if system resources allow).
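
If this turns out to matter, a thread pool is probably enough, assuming most of the time is spent waiting on external nextclade processes. In the sketch below, run_one is a hypothetical callable that processes a single dataset; it is not a function from this PR.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor


def run_datasets_in_parallel(
    dataset_names: list[str],
    run_one: Callable[[str], object],
    max_workers: int = 4,
) -> dict[str, object]:
    """Run `run_one` for every dataset concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises worker exceptions here.
        results = pool.map(run_one, dataset_names)
        return dict(zip(dataset_names, results))
```

max_workers should stay small if each nextclade run is itself multi-threaded.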

12. Dictionary Lookups

The get_dataset_by_name method uses list comprehension every time it's called. For frequent calls, consider caching or using a dictionary.

Recommendation:

@property
def dataset_by_name(self) -> dict[str, NextcladeSequenceAndDataset]:
    return {ds.name: ds for ds in self.nextclade_sequence_and_datasets}
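
Note that a plain property rebuilds the dictionary on every access. A cached variant could look like the sketch below, using stand-in dataclasses rather than the real config models (which may need a different caching mechanism); Dataset, Config and the datasets field are hypothetical names.

```python
from dataclasses import dataclass, field
from functools import cached_property


@dataclass
class Dataset:
    """Hypothetical stand-in for NextcladeSequenceAndDataset."""

    name: str


@dataclass
class Config:
    """Hypothetical stand-in for the prepro config."""

    datasets: list[Dataset] = field(default_factory=list)

    @cached_property
    def dataset_by_name(self) -> dict[str, Dataset]:
        # Built once on first access; duplicates are rejected instead of overwritten.
        index: dict[str, Dataset] = {}
        for ds in self.datasets:
            if ds.name in index:
                raise ValueError(f"Duplicate dataset name: {ds.name}")
            index[ds.name] = ds
        return index
```

The cache is not invalidated if datasets is modified afterwards, so this only fits a config that is immutable after loading.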

Test Coverage

✅ Good: Tests have been updated for the new config format
✅ Good: Multi-pathogen config test case added
❌ Missing:

Tests for error cases (duplicate dataset names, missing datasets)
Tests for the new ASSIGNED_REFERENCE metadata field
Edge cases: empty segments list, empty references list
Tests for get_dataset_by_name failure modes

Recommendation: Add tests for error paths and edge cases before merging.
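
As a starting point, the get_dataset_by_name failure modes could be covered as sketched here. The Dataset and Config classes are minimal stand-ins that reuse the method recommended in Issue 1; the real tests would use the project's own config models and test runner.

```python
import unittest
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for NextcladeSequenceAndDataset."""

    name: str


@dataclass
class Config:
    """Hypothetical stand-in carrying the recommended lookup method."""

    nextclade_sequence_and_datasets: list[Dataset] = field(default_factory=list)

    def get_dataset_by_name(self, name: str) -> Dataset:
        datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
        if len(datasets) == 0:
            raise ValueError(f"No dataset found with name: {name}")
        if len(datasets) > 1:
            raise ValueError(f"Multiple datasets found with name: {name}, expected exactly one")
        return datasets[0]


class GetDatasetByNameTests(unittest.TestCase):
    def test_returns_unique_match(self) -> None:
        config = Config([Dataset("CV-A16"), Dataset("CV-A10")])
        self.assertEqual(config.get_dataset_by_name("CV-A10").name, "CV-A10")

    def test_missing_dataset_raises(self) -> None:
        with self.assertRaisesRegex(ValueError, "No dataset found"):
            Config([]).get_dataset_by_name("CV-A16")

    def test_duplicate_names_raise(self) -> None:
        config = Config([Dataset("CV-A16"), Dataset("CV-A16")])
        with self.assertRaisesRegex(ValueError, "Multiple datasets found"):
            config.get_dataset_by_name("CV-A16")
```

Saved as a module, these run with python -m unittest.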

Documentation

13. Updated Docs

✅ The PR updates:

backend/docs/organismWithSuborganisms.md - Shows new config format
Helm chart config docs
Values schema

❌ Missing:

Migration guide for existing deployments
Explanation of when to use multi-reference vs multi-segment
Performance implications of multi-reference setup

Breaking Changes Validation

The breaking changes are well-documented in the PR description. However:

Template Changes: The removal of useFirstSegment in _preprocessingFromValues.tpl is good, but ensure all Helm chart users are aware
Metadata Field Change: ASSIGNED_SEGMENT → ASSIGNED_REFERENCE - this affects any downstream tools using this field

Recommendation:

Add a breaking change notice to CHANGELOG
Consider adding a deprecation period if feasible

Positive Aspects

✨ Excellent:

Clean separation of SegmentName vs SequenceName type aliases
Well-structured config hierarchy (Segment → Reference)
Comprehensive test config updates
Good use of type hints throughout
Proper handling of alignment failures for different references

Summary of Recommendations

Must Fix Before Merge:

1. Fix exception handling in get_dataset_by_name (Issue #1)
2. Add test coverage for error paths
3. Add migration guide for config format changes

Should Fix Before Merge:

4. Clarify gene prefix/suffix design and resolve TODO
5. Improve type annotation for references list (Issue #2)
6. Add validation for old config format with helpful error message

Nice to Have:

7. Add caching for get_dataset_by_name
8. Enhanced logging for subprocess failures
9. Consider parallelization for multi-dataset processing

Verdict

This is a well-architected change that properly handles the complexity of multi-pathogen support. The core logic is sound and the config structure makes sense. However, the critical issues around exception handling and the missing migration guidance need to be addressed before merging.

Recommendation: Request changes - address critical issues #1-3 before merge.

Great work on tackling this complex refactor! The segment-reference hierarchy is much cleaner than the previous approach. 🚀

theosanderson · 2026-01-07T14:34:56Z

preprocessing/nextclade/tests/multi_pathogen_config.yaml

+    - reference: ebola-zaire
+      nextclade_dataset_name: ebola-dataset/ebola-zaire
+      accepted_sort_matches: [ebola-zaire]
+      genes: [VP24EbolaZaire, LEbolaZaire]


Huh, this is ugly - have we figured out the pathway forward on this (I haven't!)

hmm... which parts do you find ugly?

accepted_sort_matches: [ebola-zaire] is actually not required (it is the default value). Also, gene names like VP24EbolaZaire are just custom for this test dataset (for loculus the gene name will be VP24), so that I can check my algorithm is using the right dataset for genes

ah it was the VP24EbolaZaire - ok sounds good - maybe we should comment that somewhere (or maybe we already do)

theosanderson · 2026-01-07T14:53:41Z

kubernetes/loculus/values.schema.json

+                              "groups": ["nextcladeSegment"],
+                              "docsIncludePrefix": false,
+                              "type": "string",
+                              "description": "Name of the reference to use for alignment - defaults to 'singleReference'."


IMO we should not default to anything and should force users to supply this name (which could be singleReference).

ah - yes - sorry! we can make it required

theosanderson · 2026-01-07T15:26:16Z

backend/docs/organismWithSuborganisms.md

+          segments:
+            - name: main
+              references:
+              - reference: CV-A16


Suggested change

-              - reference: CV-A16
+              - reference_name: CV-A16

theosanderson

I'm not very familiar with this code, so I'm not super confident. But I've read through and spotted some issues, and those have been resolved and I don't personally see others. And I've clicked a bit and the fact that the preview still seems to work well is encouraging! Thanks so much for the work. Splitting it out in this way definitely makes sense. If anyone else wants to review that's welcome too.

anna-parker · 2026-01-07T16:58:22Z

@theosanderson thanks! I'm currently adding tests for a multi-reference, multi-segment case - I will wait till those are configured before merging

anna-parker changed the title ~~Prepro multipath~~ feat!(prepro): multi-pathogen refactor, apply segment-reference ordering Jan 6, 2026

Base automatically changed from prepro_pydantic to main January 6, 2026 21:29

anna-parker added 14 commits January 7, 2026 09:27

use pydantic for the config

a39e43c

testing

c29524a

fixup

9ccd8d0

feat(prepro): set order to segment/reference

21335ab

update schema

32f85d2

fix tests

9425ea7

fixup

ad3d541

fix more tests

f4d3675

lint

fa0c763

fix

5862be3

fix error message for multi-segment, multi-reference if multiple sequences align to the same segment

76bcc19

fixup

73e7baf

update config with suggestions from slack

0903ed5

correct use of ASSIGNED_SEGMENT, get rid of useFirstSegment and rename segment to sequence when mislabeled

1b1c121

anna-parker force-pushed the prepro_multipath branch from 30dfd17 to 1b1c121 Compare January 7, 2026 08:29

anna-parker added 4 commits January 7, 2026 09:35

fix function

d47e63a

actually this is correct

d38d09c

format

fa1b2c7

typing

a77f198

anna-parker added the preview Triggers a deployment to argocd label Jan 7, 2026

anna-parker added 2 commits January 7, 2026 09:45

fix silly mypy warning

1e12d8c

make ASSIGNED_REFERENCE per segment

69c2ddf

anna-parker requested review from corneliusroemer and theosanderson January 7, 2026 10:32

theosanderson marked this pull request as ready for review January 7, 2026 14:05