Description
Describe the bug
When handling a report, MegaQC loops over each data value and checks whether a matching SampleDataType already exists. However, it matches only on data_id and ignores data_section. If multiple report types (data sections) reuse the same data_id, the existing SampleDataType is reused even when its data_section is wrong for the incoming report.
This becomes problematic if you want to query for historic results based on data_section.
This is due to this code, which:
- Checks sample_data_type to see if the field's name has been seen before
- If it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).

But the check in the first step will reuse any entry matching d_key, even if section does not match.
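The lookup described above can be sketched in plain Python. This is not MegaQC's actual code: the list-of-dicts table and the names get_or_create_by_id, get_or_create_by_key, and ingest are hypothetical, and matching on the section-qualified data_key is only one possible fix.

```python
# Simplified stand-in for the sample_data_type table: a list of dict rows.
# All function names here are illustrative, not MegaQC's real API.

def _create(table, section, d_key):
    """Append a new type row, building data_key as described in the issue."""
    row = {
        "sample_data_type_id": len(table),
        "data_id": d_key,
        "data_section": section,
        "data_key": "{}__{}".format(section, d_key),
    }
    table.append(row)
    return row


def get_or_create_by_id(table, section, d_key):
    """Reported behavior: match on data_id only, ignoring the section."""
    for row in table:
        if row["data_id"] == d_key:
            return row  # reused even if row["data_section"] != section
    return _create(table, section, d_key)


def get_or_create_by_key(table, section, d_key):
    """Possible fix (an assumption): match on the section-qualified key."""
    data_key = "{}__{}".format(section, d_key)
    for row in table:
        if row["data_key"] == data_key:
            return row
    return _create(table, section, d_key)


def ingest(lookup):
    """Feed the two reports from this issue through a lookup function."""
    table = []
    reports = [
        ("Pipeline_A_Result-plot", ["patient_id", "variant_count"]),
        ("Pipeline_B_Result-plot", ["patient_id", "pvalue"]),
    ]
    used = [
        lookup(table, section, key)["sample_data_type_id"]
        for section, keys in reports
        for key in keys
    ]
    return table, used


buggy_table, buggy_ids = ingest(get_or_create_by_id)
fixed_table, fixed_ids = ingest(get_or_create_by_key)
```

With the data_id-only lookup, the second report's patient_id resolves to the row created for Pipeline A. With the section-qualified lookup it gets its own row.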
To Reproduce
Here is a barebones multiqc_config and set of report files that can reveal the issue.
multiqc_config.yaml
custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"
A_report.csv (generated by Pipeline A)
sample_id,patient_id,variant_count
sample_1,patient_1,10
B_report.csv (generated by Pipeline B)
sample_id,patient_id,pvalue
sample_2,patient_2,0.0001
Steps:
- Run pipeline A and submit its data to MegaQC
- Run pipeline B and submit its data to MegaQC

MegaQC erroneously records patient_id as coming only from Pipeline_A_Result, even though for the second report it comes from Pipeline_B_Result.
Specifically, the sample_data and sample_data_type tables will look like
sample_data_type
| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
sample_data
| sample_data_id | report_id | sample_data_type_id | sample_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |
(*) NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
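To illustrate why this matters when querying by data_section, the sketch below loads the rows reported in this issue into an in-memory SQLite database. The schema is a simplified assumption (the schema column is omitted and all values are stored as text), and values_for_section is a hypothetical helper, not part of MegaQC.

```python
import sqlite3

# In-memory database with a simplified version of the two tables (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT,
        data_section TEXT,
        data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER,
        sample_data_type_id INTEGER,
        sample_id INTEGER,
        value TEXT
    );
    """
)

# Rows exactly as reported in this issue (current, incorrect state).
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # points at the Pipeline A type row
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Return (data_id, value) pairs whose type row belongs to `section`."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON t.sample_data_type_id = d.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
```

Here section_b contains only the pvalue entry, while patient_2 is returned under Pipeline_A_Result-plot alongside Pipeline A's own values.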
Expected behavior
data_id='patient_id' should appear in two separate sample_data_type rows, once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot':
| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
System
- MegaQC: 0.3.0