-
Notifications
You must be signed in to change notification settings - Fork 9
feat!(prepro): multi-pathogen refactor, apply segment-reference ordering #5800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ences align to the same segment
…e segment to sequence when mislabeled
30dfd17 to
1b1c121
Compare
PR Review: Multi-pathogen refactor with segment-reference orderingSummaryThis PR introduces a significant architectural change to support multiple references per segment in the preprocessing pipeline. The implementation is well-structured and handles the complexity of multi-reference/multi-segment scenarios effectively. However, there are several areas that need attention before merging. Critical Issues1. Exception Handling - config.py:170-172def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
if len(datasets) > 1:
raise Exception # ❌ Generic exception
return datasets[0] # ❌ IndexError if list is emptyIssues:
Recommendation: def get_dataset_by_name(self, name: str) -> NextcladeSequenceAndDataset:
datasets = [ds for ds in self.nextclade_sequence_and_datasets if ds.name == name]
if len(datasets) == 0:
raise ValueError(f"No dataset found with name: {name}")
if len(datasets) > 1:
raise ValueError(f"Multiple datasets found with name: {name}, expected exactly one")
return datasets[0]2. Type Safety - config.py:159references: list[Reference] | list[None] = segment.references or [None]Issue: The type annotation Recommendation: # More explicit approach
references = segment.references if segment.references else [None]
# Or with better typing
references: list[Reference | None] = segment.references if segment.references else [None]3. Missing Documentation - Breaking ChangesThe PR description mentions breaking changes to the config format, but there's no migration guide or deprecation warning in the code. Users upgrading will encounter runtime errors without clear guidance. Recommendation:
Code Quality Issues4. Incomplete TODO - config.py:151# TODO: this should be a suffix in future
ds.gene_prefix = ds.reference if multi_reference else NoneConcern: This is marked as a future change but directly impacts the current implementation. The comment suggests the current behavior (prefix) is not ideal, but there's no issue filed to track this. Recommendation:
5. Complex Naming Logic - config.py:175-187The Example outputs:
Recommendation:
6. Gene Prefix/Suffix ConfusionIn def create_gene_name(gene: str, gene_prefix: str | None) -> str:
return gene_prefix + "-" + gene if gene_prefix else geneThis adds the reference name as a prefix with a dash, but the parameter is called Current output: Recommendation:
Best Practices & Improvements7. Code Duplication in TestsThe test config files have been updated correctly, but there's potential for helper functions to reduce duplication in test setup. 8. Schema Validation - values.schema.jsonThe JSON schema has been updated appropriately, but the Recommendation: Consider using JSON Schema 9. Logging Improvements - nextclade.pyGood use of debug logging throughout, but some error cases could benefit from more context:
Security Considerations10. Command Injection PreventionGood use of list-form subprocess calls throughout (e.g., lines 167-186 in nextclade.py). However, ensure user-provided config values (dataset names, URLs) are validated before being passed to subprocess. Current state: ✅ Safe - using list form correctly Performance Considerations11. Sequential ProcessingIn Recommendation: Consider parallelizing nextclade runs for different datasets (if system resources allow). 12. Dictionary LookupsThe Recommendation: @property
def dataset_by_name(self) -> dict[str, NextcladeSequenceAndDataset]:
return {ds.name: ds for ds in self.nextclade_sequence_and_datasets}Test Coverage✅ Good: Tests have been updated for the new config format
Recommendation: Add tests for error paths and edge cases before merging. Documentation13. Updated Docs✅ The PR updates:
❌ Missing:
Breaking Changes ValidationThe breaking changes are well-documented in the PR description. However:
Recommendation:
Positive Aspects✨ Excellent:
Summary of RecommendationsMust Fix Before Merge:
Should Fix Before Merge: Nice to Have: VerdictThis is a well-architected change that properly handles the complexity of multi-pathogen support. The core logic is sound and the config structure makes sense. However, the critical issues around exception handling and the missing migration guidance need to be addressed before merging. Recommendation: Request changes - address critical issues #1-3 before merge. Great work on tackling this complex refactor! The segment-reference hierarchy is much cleaner than the previous approach. 🚀 |
| - reference: ebola-zaire | ||
| nextclade_dataset_name: ebola-dataset/ebola-zaire | ||
| accepted_sort_matches: [ebola-zaire] | ||
| genes: [VP24EbolaZaire, LEbolaZaire] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, this is ugly - have we figured out the pathway forward on this (I haven't!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... which parts do you find ugly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accepted_sort_matches: [ebola-zaire] is actually not required (it is a default value), also the gene names VP24EbolaZaire are just custom for this dataset (for loculus the gene name will be VP24) so that I can ensure my algorithm is using the right dataset for genes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah it was the VP24EbolaZaire - ok sounds good - maybe we should comment that somewhere (or maybe we already do)
| "groups": ["nextcladeSegment"], | ||
| "docsIncludePrefix": false, | ||
| "type": "string", | ||
| "description": "Name of the reference to use for alignment - defaults to 'singleReference'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah - yes - sorry! we can make it required
| segments: | ||
| - name: main | ||
| references: | ||
| - reference: CV-A16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - reference: CV-A16 | |
| - reference_name: CV-A16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and below
theosanderson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with this code, so I'm not super confident. But I've read through and spotted some issues, and those have been resolved and I don't personally see others. And I've clicked a bit and the fact that the preview still seems to work well is encouraging! Thanks so much for the work. Splitting it out in this way definitely makes sense. If anyone else wants to review that's welcome too.
|
@theosanderson thanks! I'm currently adding tests for a multi-reference, multi-segment case - I will wait till those are configured before merging |

partially resolves #5663 and #5664
Overview
I decided to split out the prepro changes required for #5799 into 1 PR.
This change allows multiple
referencespersegment, prepro will assign sequences to the correct reference within a segment and return the aligned (and unaligned) sequences with the key (of type:SequenceName) expected by the backend.Prepro attempts to aligns each sequence to one reference per segment, if a sequence can be aligned to multiple references it chooses the reference with the highest
nextclade alignmentornextclade sortscore.If multiple sequences within a submission align to the same segment (also if they align to different references of the same segment) the submission will error.
Changes
NextcladeSequenceDatasetobjects for each reference.SequenceNametype (name of processed sequence as expected by the backend) to distinguish betweenSegmentNameobjects. For example if the segmentLhas referencesAandB, then theSegmentNameisLbut theSequenceNamewill beL_A.useFirstSegmentconfig option -perSegmentmetadata fields will always be assigned to results of the reference they best align to.ASSIGNED_SEGMENTfield is removed and replaced withASSIGNED_REFERENCE-this is now aperSegmentfield.Breaking changes
The prepro config must be changed from
to (note we can add more segments with a variable number of references):
TODO:
segment/referenceASSIGNED_SEGMENTtoASSIGNED_REFERENCEand make this per segmentuseFirstSegment🚀 Preview: https://prepro-multipath.loculus.org