Bug report - sample_data sometimes references incorrect sample_data_type entries #530

**Describe the bug**

When handling a report, MegaQC loops over each data value and checks whether a matching `SampleDataType` already exists. However, the check compares only `data_id` and ignores `data_section`. As a result, if multiple report types (data sections) use the same `data_id`, MegaQC reuses the existing `SampleDataType` even when its `data_section` is wrong for the incoming report.

**This becomes problematic if you want to query for historic results based on data_section.**

This is due to [this code](https://github.com/MultiQC/MegaQC/blob/8d287d6e2bc46b0fbad2a0436b0769005a18e828/megaqc/api/utils.py#L159-L171), which:
1) Checks `sample_data_type` to see whether the field's name (`data_id`) has been seen before.
2) If it has NOT been seen before, creates a new entry with `data_key = "{}__{}".format(section, d_key)`.

But step (1) reuses any existing entry whose `data_id` matches `d_key`, even if its `data_section` does not match `section`.
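
The two steps above can be simulated without a database. This is only a sketch of the lookup logic, not the actual MegaQC code: the `SampleDataType` dataclass, `get_or_create`, `ingest`, and the `match_section` flag are hypothetical names introduced for illustration. With `match_section=False` the lookup compares `data_id` only (the behaviour described in this report); with `match_section=True` it also compares `data_section`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleDataType:
    # Hypothetical stand-in for a row of the sample_data_type table.
    data_id: str
    data_section: str
    data_key: str


def get_or_create(types, section, d_key, match_section):
    """Return an existing type row for d_key, or create one.

    match_section=False mimics the reported behaviour (data_id only);
    match_section=True also requires data_section to match.
    """
    for t in types:
        if t.data_id == d_key and (not match_section or t.data_section == section):
            return t
    new = SampleDataType(d_key, section, "{}__{}".format(section, d_key))
    types.append(new)
    return new


def ingest(match_section):
    """Submit the Pipeline A report, then the Pipeline B report."""
    types = []
    reports = [
        ("Pipeline_A_Result-plot", ["patient_id", "variant_count"]),
        ("Pipeline_B_Result-plot", ["patient_id", "pvalue"]),
    ]
    resolved = []  # (incoming section, type row that was chosen)
    for section, keys in reports:
        for d_key in keys:
            chosen = get_or_create(types, section, d_key, match_section)
            resolved.append((section, chosen))
    return types, resolved


buggy_types, buggy_resolved = ingest(match_section=False)
fixed_types, fixed_resolved = ingest(match_section=True)

for section, chosen in buggy_resolved:
    flag = "" if chosen.data_section == section else "  <-- wrong section"
    print(section, "->", chosen.data_key, flag)
```

In the `data_id`-only run, the `patient_id` value from the Pipeline B report resolves to the Pipeline A type row, and only three type rows are created. Matching on both fields creates a fourth row for `Pipeline_B_Result-plot__patient_id`.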

**To Reproduce**

Here is a bare-bones `multiqc_config.yaml` and a set of report files that reproduce the issue.

`multiqc_config.yaml`
```yaml
custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"
```

`A_report.csv` (generated by Pipeline A)
```csv
sample_id,patient_id,variant_count
sample_1,patient_1,10
```

`B_report.csv` (generated by Pipeline B)
```csv
sample_id,patient_id,pvalue
sample_2,patient_2,0.0001
```

Steps:
1) Run pipeline A and submit its data to MegaQC.
2) Run pipeline B and submit its data to MegaQC.

MegaQC erroneously records every `patient_id` value as coming from `Pipeline_A_Result`, even though one of them comes from `Pipeline_B_Result`.

Specifically, the `sample_data` and `sample_data_type` tables will look like this:

`sample_data_type`

| sample_data_type_id | data_id | data_section | data_key | schema |
| --------------------- | -------- | ------------- | --------- | -------- |
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

`sample_data`

| sample_data_id | report_id | sample_data_type_id | sample_id | value |
| ---------------- | --------- | --------------------- | ---------- | ----- |
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |

> \* NOTE: `sample_data_type_id=0` refers to `data_section=Pipeline_A_Result-plot`, even though this value actually came from `Pipeline_B_Result`.
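
To show why this matters when querying historic results by `data_section`, here is a small sketch that loads the two tables above into an in-memory SQLite database and filters by section. It is an illustration only: the simplified table definitions (the `schema` column is omitted) and the `values_for_section` helper are assumptions, not the real MegaQC schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT,
        data_section TEXT,
        data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER,
        sample_data_type_id INTEGER,
        sample_id INTEGER,
        value TEXT
    );
    -- Rows copied from the tables in this report (observed state).
    INSERT INTO sample_data_type VALUES
        (0, 'patient_id', 'Pipeline_A_Result-plot', 'Pipeline_A_Result-plot__patient_id'),
        (1, 'variant_count', 'Pipeline_A_Result-plot', 'Pipeline_A_Result-plot__variant_count'),
        (2, 'pvalue', 'Pipeline_B_Result-plot', 'Pipeline_B_Result-plot__pvalue');
    INSERT INTO sample_data VALUES
        (0, 0, 0, 0, 'patient_1'),
        (1, 0, 1, 0, '10'),
        (2, 1, 0, 1, 'patient_2'),
        (3, 1, 2, 1, '0.0001');
    """
)


def values_for_section(section):
    """Hypothetical helper: all (data_id, value) pairs stored under a section."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON t.sample_data_type_id = d.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
print("A:", section_a)
print("B:", section_b)
```

With the observed rows, the Pipeline A query also returns `patient_2`, and the Pipeline B query returns only the `pvalue` row, so `patient_2` cannot be found by filtering on its actual section.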

**Expected behavior**

`data_id='patient_id'` should appear in two separate `sample_data_type` rows, once with `data_section='Pipeline_A_Result-plot'` and once with `data_section='Pipeline_B_Result-plot'`:

| sample_data_type_id | data_id | data_section | data_key | schema |
| --------------------- | -------- | ------------- | --------- | -------- |
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |

**System**
- MegaQC: 0.3.0
