@anna-parker anna-parker commented Jan 6, 2026

partially resolves #5663 and #5664

Overview

I decided to split out the prepro changes required for #5799 into 1 PR.

This change allows multiple references per segment. Prepro assigns each sequence to the correct reference within a segment and returns the aligned (and unaligned) sequences with the key (of type SequenceName) expected by the backend.

Prepro attempts to align each sequence to one reference per segment. If a sequence can be aligned to multiple references, it chooses the reference with the highest nextclade alignment or nextclade sort score.

If multiple sequences within a submission align to the same segment (even if they align to different references of that segment), the submission will error.
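
The assignment rules described above can be sketched in a few lines of Python. This is an illustration only, not the code in this PR: the `Hit` record, the `assign_references` function and the single numeric `score` are hypothetical, and the sketch assumes that each submitted sequence ends up with exactly one best hit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate alignment of a submitted sequence (hypothetical structure)."""

    sequence_id: str  # identifier of the submitted sequence
    segment: str  # segment the reference belongs to, e.g. "L"
    reference: str  # reference within that segment, e.g. "A"
    score: float  # alignment or sort score; higher is better


def assign_references(hits: list[Hit]) -> dict[str, Hit]:
    """Return the winning hit per segment, erroring on ambiguous submissions."""
    # Step 1: for each submitted sequence keep only its highest-scoring hit.
    best_per_sequence: dict[str, Hit] = {}
    for hit in hits:
        current = best_per_sequence.get(hit.sequence_id)
        if current is None or hit.score > current.score:
            best_per_sequence[hit.sequence_id] = hit

    # Step 2: group the winning hits by segment.
    by_segment: dict[str, list[Hit]] = defaultdict(list)
    for hit in best_per_sequence.values():
        by_segment[hit.segment].append(hit)

    # Step 3: more than one sequence per segment is an error, even when the
    # sequences matched different references of that segment.
    assigned: dict[str, Hit] = {}
    for segment, segment_hits in by_segment.items():
        if len(segment_hits) > 1:
            ids = sorted(h.sequence_id for h in segment_hits)
            raise ValueError(f"Multiple sequences align to segment {segment}: {ids}")
        assigned[segment] = segment_hits[0]
    return assigned
```

Keeping the "pick the best reference" step separate from the "one sequence per segment" check makes it easy to report which submitted sequences collided.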

Changes

  1. The yaml config is changed to reflect the segment-reference hierarchy (see breaking changes below). This config is used to create a list of processed NextcladeSequenceDataset objects for each reference.
  2. Improved typing: introduction of the SequenceName type (the name of a processed sequence as expected by the backend) to distinguish it from SegmentName. For example, if the segment L has references A and B, then the SegmentName is L but the SequenceName will be L_A.
  3. Removal of the useFirstSegment config option: perSegment metadata fields will always be assigned to the results of the reference they best align to.
  4. The ASSIGNED_SEGMENT field is removed and replaced with ASSIGNED_REFERENCE, which is now a perSegment field.

Breaking changes

The prepro config must be changed from

configFile:
  nextclade_sequence_and_datasets:
    - name: CV-A16 # This does not work yet with multi-segment organisms: https://github.com/loculus-project/loculus/issues/5663
      nextclade_dataset_name: enpen/enterovirus/cv-a16
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
      gene_prefix: "CV-A16-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: CV-A10
      nextclade_dataset_name: enpen/enterovirus/cv-a10
      accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
      gene_prefix: "CV-A10-"
      genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]

to (note we can add more segments with a variable number of references):

configFile:
  segments:
    - name: main
      references:
        - reference: CV-A16
          nextclade_dataset_name: enpen/enterovirus/cv-a16
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/cva16", "community/hodcroftlab/enterovirus/enterovirus/linked/CV-A16"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
        - reference: CV-A10
          nextclade_dataset_name: enpen/enterovirus/cv-a10
          accepted_sort_matches: ["community/hodcroftlab/enterovirus/enterovirus/linked/CV-A10"]
          genes: ["VP4", "VP2", "VP3", "VP1", "2A", "2B", "2C", "3A", "3B", "3C", "3D"]
    - name: seg2
      genes: ...
      references:
        - ...
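
For existing deployments, the mapping from the old layout to the new one is mostly mechanical. The following is a rough sketch, assuming the content under configFile has been loaded into a plain dict and that all old entries belong to a single segment. `migrate_prepro_config` is a hypothetical helper and not part of this PR; it uses the `reference` key exactly as written in the description above.

```python
from typing import Any

OLD_KEY = "nextclade_sequence_and_datasets"


def migrate_prepro_config(old: dict[str, Any], segment_name: str = "main") -> dict[str, Any]:
    """Convert the old flat dataset list into the segment/reference hierarchy.

    Assumption: every old entry becomes a reference of one segment.
    """
    references = []
    for entry in old[OLD_KEY]:
        # `name` becomes `reference`; `gene_prefix` is dropped because the
        # new format shown in the description does not contain it.
        rest = {k: v for k, v in entry.items() if k not in ("name", "gene_prefix")}
        references.append({"reference": entry["name"], **rest})

    # Copy every unrelated key unchanged and add the new hierarchy.
    migrated = {k: v for k, v in old.items() if k != OLD_KEY}
    migrated["segments"] = [{"name": segment_name, "references": references}]
    return migrated
```

Multi-segment organisms would need a manual decision about which reference belongs to which segment, so the sketch deliberately covers only the single-segment case.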

TODO:

  • check dummy pipeline -> confirmed this works without issues
  • rename segment to sequence when corresponding to name of segment/reference
  • change ASSIGNED_SEGMENT to ASSIGNED_REFERENCE and make this per segment
  • get rid of useFirstSegment
  • fix the error raised when multiple segments of the same type exist in the multi-reference case
  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?) -> confirmed that all current organisms are unaffected by this change, including EVs; integration tests also all pass

🚀 Preview: https://prepro-multipath.loculus.org

@anna-parker anna-parker changed the title from "Prepro multipath" to "feat!(prepro): multi-pathogen refactor, apply segment-reference ordering" Jan 6, 2026
Base automatically changed from prepro_pydantic to main January 6, 2026 21:29
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Jan 7, 2026
@theosanderson theosanderson marked this pull request as ready for review January 7, 2026 14:05
claude bot commented Jan 7, 2026

PR Review: Multi-pathogen refactor with segment-reference ordering

Summary

This PR introduces a significant architectural change to support multiple references per segment in the preprocessing pipeline. The implementation is well-structured and handles the complexity of multi-reference/multi-segment scenarios effectively. However, there are several areas that need attention before merging.

Critical Issues

1. Exception Handling - config.py:170-172

def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
    datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
    if len(datasets) > 1:
        raise Exception  # ❌ Generic exception
    return datasets[0]  # ❌ IndexError if list is empty

Issues:

  • Raises generic Exception instead of a specific exception type
  • Will raise IndexError if no datasets match (empty list)
  • No error message provided

Recommendation:

def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
    datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
    if len(datasets) == 0:
        raise ValueError(f"No dataset found with name: {name}")
    if len(datasets) > 1:
        raise ValueError(f"Multiple datasets found with name: {name}, expected exactly one")
    return datasets[0]

2. Type Safety - config.py:159

references: list[Reference] | list[None] = segment.references or [None]

Issue: The type annotation list[Reference] | list[None] is misleading. This actually creates a list containing None elements, not an empty list.

Recommendation:

# More explicit approach
references = segment.references if segment.references else [None]
# Or with better typing
references: list[Reference | None] = segment.references if segment.references else [None]

3. Missing Documentation - Breaking Changes

The PR description mentions breaking changes to the config format, but there's no migration guide or deprecation warning in the code. Users upgrading will encounter runtime errors without clear guidance.

Recommendation:

  • Add a validation check in Config.finalize() that detects the old config format and provides a helpful error message with migration instructions
  • Consider adding a migration script in maintenance-scripts/
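
A check of this kind could look roughly like the sketch below. It assumes the raw YAML mapping is available before model validation; `check_config_format` is a hypothetical function name, and the key names are taken from the PR description.

```python
from typing import Any

OLD_KEY = "nextclade_sequence_and_datasets"
NEW_KEY = "segments"


def check_config_format(raw: dict[str, Any]) -> None:
    """Fail early with migration hints if the old config layout is detected."""
    if OLD_KEY in raw and NEW_KEY not in raw:
        raise ValueError(
            f"The '{OLD_KEY}' key is no longer supported. "
            f"Move each entry under '{NEW_KEY}' -> 'references', "
            "rename 'name' to 'reference', and remove 'gene_prefix'."
        )
```

Raising before any further parsing means users see a migration hint instead of a generic validation error.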

Code Quality Issues

4. Incomplete TODO - config.py:151

# TODO: this should be a suffix in future
ds.gene_prefix = ds.reference if multi_reference else None

Concern: This is marked as a future change but directly impacts the current implementation. The comment suggests the current behavior (prefix) is not ideal, but there's no issue filed to track this.

Recommendation:

  • Either implement as suffix now (if it's the correct design), or
  • File a GitHub issue to track this technical debt and reference it in the comment

5. Complex Naming Logic - config.py:175-187

The set_sequence_name function has complex logic with multiple cases. While the match statement is clear, the different naming conventions could be confusing.

Example outputs:

  • Single segment, multi-reference: "CV-A16" (just reference name)
  • Multi-segment, multi-reference: "L-A" (segment-reference)
  • Single segment, single reference: "main" (just segment name)

Recommendation:

  • Add comprehensive docstring with examples for each case
  • Consider adding unit tests specifically for this function

6. Gene Prefix/Suffix Confusion

In nextclade.py:90-91:

def create_gene_name(gene: str, gene_prefix: str | None) -> str:
    return gene_prefix + "-" + gene if gene_prefix else gene

This adds the reference name as a prefix with a dash, but the parameter is called gene_prefix and the TODO says it should be a suffix. This naming is confusing.

Current output: "CV-A16-VP4" (reference-gene)
If it were a suffix: "VP4-CV-A16" (gene-reference)

Recommendation:

  • Clarify the intended design and update naming accordingly
  • Ensure consistency across the codebase

Best Practices & Improvements

7. Code Duplication in Tests

The test config files have been updated correctly, but there's potential for helper functions to reduce duplication in test setup.

8. Schema Validation - values.schema.json

The JSON schema has been updated appropriately, but the additionalProperties: false on line 634 might be too strict if future fields need to be added to Reference.

Recommendation: Consider using JSON Schema $ref for better maintainability.

9. Logging Improvements - nextclade.py

Good use of debug logging throughout, but some error cases could benefit from more context:

  • Line 193: When nextclade sort fails, log the command that failed
  • Line 749: When nextclade run fails, capture and log stderr
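
One way to get both the command and stderr into the log is a small wrapper around `subprocess.run`. This is a generic sketch, not the code in nextclade.py; `run_logged` is a hypothetical helper.

```python
import logging
import shlex
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_logged(command: list[str]) -> subprocess.CompletedProcess[str]:
    """Run a command in list form and log the command and stderr on failure."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        logger.error(
            "Command failed (exit %d): %s\nstderr: %s",
            result.returncode,
            shlex.join(command),  # shell-quoted, for readable logs only
            result.stderr.strip(),
        )
        raise subprocess.CalledProcessError(
            result.returncode, command, output=result.stdout, stderr=result.stderr
        )
    return result


if __name__ == "__main__":
    # Harmless demo: run the current interpreter instead of nextclade.
    print(run_logged([sys.executable, "--version"]).stdout.strip())
```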

Security Considerations

10. Command Injection Prevention

Good use of list-form subprocess calls throughout (e.g., lines 167-186 in nextclade.py). However, ensure user-provided config values (dataset names, URLs) are validated before being passed to subprocess.

Current state: ✅ Safe - using list form correctly
Recommendation: Add validation for config values that are URL-like or could contain shell metacharacters
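
A minimal sketch of such validation, under the assumption that dataset names only need letters, digits and the characters `.`, `_`, `/`, `-`. Both helper names and the allowed character set are illustrative and would need checking against real dataset names.

```python
import re
from urllib.parse import urlparse

# Assumed allow-list: letters, digits, dot, underscore, slash, hyphen.
_DATASET_NAME = re.compile(r"[A-Za-z0-9._/-]+")


def is_safe_dataset_name(name: str) -> bool:
    """Reject empty names, path traversal, option-like names and metacharacters."""
    if name.startswith("-") or ".." in name:
        return False
    return _DATASET_NAME.fullmatch(name) is not None


def is_http_url(value: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
```

Rejecting names that start with `-` matters even with list-form subprocess calls, because such a value could otherwise be read as a command-line option.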

Performance Considerations

11. Sequential Processing

In enrich_with_nextclade (nextclade.py:642-785), sequences are processed sequentially for each dataset. For organisms with many references, this could be slow.

Recommendation: Consider parallelizing nextclade runs for different datasets (if system resources allow).
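
A possible shape for this, sketched with a thread pool from the standard library. Threads are a reasonable fit here on the assumption that each run mostly waits on an external nextclade process. `run_for_all_datasets` and its `run_one` callback are hypothetical names.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from typing import TypeVar

T = TypeVar("T")


def run_for_all_datasets(
    dataset_names: list[str],
    run_one: Callable[[str], T],
    max_workers: int = 4,
) -> dict[str, T]:
    """Call run_one for every dataset concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises the first
        # exception when its result is consumed.
        results = pool.map(run_one, dataset_names)
        return dict(zip(dataset_names, results))
```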

12. Dictionary Lookups

The get_dataset_by_name method uses list comprehension every time it's called. For frequent calls, consider caching or using a dictionary.

Recommendation:

@property
def dataset_by_name(self) -> dict[str, NextcladeSequenceAndDataset]:
    return {ds.name: ds for ds in self.nextclade_sequence_and_datasets}
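
Note that a plain `@property` like the one above rebuilds the dictionary on every access, so it does not cache anything by itself. A sketch with `functools.cached_property` follows. It uses simplified stand-in classes rather than the PR's models, and the cached dictionary goes stale if the dataset list is modified afterwards.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class Dataset:
    """Stand-in for NextcladeSequenceAndDataset (only the name matters here)."""

    name: str


class Config:
    """Simplified stand-in for the prepro config object."""

    def __init__(self, datasets: list[Dataset]) -> None:
        self.nextclade_sequence_and_datasets = datasets

    @cached_property
    def dataset_by_name(self) -> dict[str, Dataset]:
        # Built once on first access; duplicates are rejected at that point.
        lookup: dict[str, Dataset] = {}
        for ds in self.nextclade_sequence_and_datasets:
            if ds.name in lookup:
                raise ValueError(f"Multiple datasets found with name: {ds.name}")
            lookup[ds.name] = ds
        return lookup

    def get_dataset_by_name(self, name: str) -> Dataset:
        try:
            return self.dataset_by_name[name]
        except KeyError:
            raise ValueError(f"No dataset found with name: {name}") from None
```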

Test Coverage

Good: Tests have been updated for the new config format
Good: Multi-pathogen config test case added
Missing:

  • Tests for error cases (duplicate dataset names, missing datasets)
  • Tests for the new ASSIGNED_REFERENCE metadata field
  • Edge cases: empty segments list, empty references list
  • Tests for get_dataset_by_name failure modes

Recommendation: Add tests for error paths and edge cases before merging.
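
As an illustration of what such error-path tests could look like, the sketch below tests a stand-in lookup function that follows the recommendation from issue 1. The function signature and the use of plain strings as datasets are simplifications, not the PR's code.

```python
import unittest


def get_dataset_by_name(names: list[str], name: str) -> str:
    """Stand-in lookup that returns the single matching name or raises."""
    matches = [n for n in names if n == name]
    if len(matches) == 0:
        raise ValueError(f"No dataset found with name: {name}")
    if len(matches) > 1:
        raise ValueError(f"Multiple datasets found with name: {name}, expected exactly one")
    return matches[0]


class GetDatasetByNameTests(unittest.TestCase):
    def test_returns_single_match(self) -> None:
        self.assertEqual(get_dataset_by_name(["CV-A16", "CV-A10"], "CV-A10"), "CV-A10")

    def test_missing_dataset_raises(self) -> None:
        with self.assertRaisesRegex(ValueError, "No dataset found"):
            get_dataset_by_name(["CV-A16"], "CV-A10")

    def test_duplicate_dataset_raises(self) -> None:
        with self.assertRaisesRegex(ValueError, "Multiple datasets found"):
            get_dataset_by_name(["CV-A16", "CV-A16"], "CV-A16")

    def test_empty_list_raises(self) -> None:
        with self.assertRaises(ValueError):
            get_dataset_by_name([], "CV-A16")
```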

Documentation

13. Updated Docs

✅ The PR updates:

  • backend/docs/organismWithSuborganisms.md - Shows new config format
  • Helm chart config docs
  • Values schema

Missing:

  • Migration guide for existing deployments
  • Explanation of when to use multi-reference vs multi-segment
  • Performance implications of multi-reference setup

Breaking Changes Validation

The breaking changes are well-documented in the PR description. However:

  1. Template Changes: The removal of useFirstSegment in _preprocessingFromValues.tpl is good, but ensure all Helm chart users are aware
  2. Metadata Field Change: ASSIGNED_SEGMENT → ASSIGNED_REFERENCE - this affects any downstream tools using this field

Recommendation:

  • Add a breaking change notice to CHANGELOG
  • Consider adding a deprecation period if feasible

Positive Aspects

Excellent:

  • Clean separation of SegmentName vs SequenceName type aliases
  • Well-structured config hierarchy (Segment → Reference)
  • Comprehensive test config updates
  • Good use of type hints throughout
  • Proper handling of alignment failures for different references

Summary of Recommendations

Must Fix Before Merge:

  1. Fix exception handling in get_dataset_by_name (Issue #1)
  2. Add test coverage for error paths
  3. Add migration guide for config format changes

Should Fix Before Merge:

  4. Clarify gene prefix/suffix design and resolve TODO
  5. Improve type annotation for references list (Issue #2)
  6. Add validation for old config format with helpful error message

Nice to Have:

  7. Add caching for get_dataset_by_name
  8. Enhanced logging for subprocess failures
  9. Consider parallelization for multi-dataset processing

Verdict

This is a well-architected change that properly handles the complexity of multi-pathogen support. The core logic is sound and the config structure makes sense. However, the critical issues around exception handling and the missing migration guidance need to be addressed before merging.

Recommendation: Request changes - address critical issues #1-3 before merge.


Great work on tackling this complex refactor! The segment-reference hierarchy is much cleaner than the previous approach. 🚀

- reference: ebola-zaire
  nextclade_dataset_name: ebola-dataset/ebola-zaire
  accepted_sort_matches: [ebola-zaire]
  genes: [VP24EbolaZaire, LEbolaZaire]
Member

Huh, this is ugly - have we figured out the pathway forward on this (I haven't!)

Contributor Author

hmm... which parts do you find ugly?

Contributor Author

accepted_sort_matches: [ebola-zaire] is actually not required (it is a default value). Also, the gene names like VP24EbolaZaire are just custom for this dataset (for loculus the gene name will be VP24), so that I can ensure my algorithm is using the right dataset for genes

Member

ah it was the VP24EbolaZaire - ok sounds good - maybe we should comment that somewhere (or maybe we already do)

"groups": ["nextcladeSegment"],
"docsIncludePrefix": false,
"type": "string",
"description": "Name of the reference to use for alignment - defaults to 'singleReference'."
Member

IMO we should not default to anything and should force users to supply this name (which could be singleReference).

Contributor Author

ah - yes - sorry! we can make it required

segments:
  - name: main
    references:
      - reference: CV-A16
Member

Suggested change
-      - reference: CV-A16
+      - reference_name: CV-A16

Member

and below

Member

@theosanderson theosanderson left a comment

I'm not very familiar with this code, so I'm not super confident. But I've read through and spotted some issues, and those have been resolved and I don't personally see others. And I've clicked a bit and the fact that the preview still seems to work well is encouraging! Thanks so much for the work. Splitting it out in this way definitely makes sense. If anyone else wants to review that's welcome too.

@anna-parker
Contributor Author

@theosanderson thanks! I'm currently adding tests for a multi-reference, multi-segment case - I will wait till those are configured before merging

Development

Successfully merging this pull request may close these issues.

Multi Path: Multi Segment - issue with ASSIGNED_SEGMENT

3 participants