Description
"I completely agree that as it is this diff is not very usable, but I do think we can gain some benefit from regression testing (as this is basically a regression test we just confirm that SILO behaves the same on the same datasets from what I see we don't analyze the correctness of the queries).
We use csv-diff in the pathoplexus regression tests as this gives us a human readable output of the differences between two tsv or csv files (https://github.com/pathoplexus/pathoplexus/blob/main/data-integrity-tests/regression-testing/Snakefile#L173). For example:
176 rows changed

  submissionId: AB371719.1
    clade: "outgroup" => "unassigned"

  submissionId: AB371722.1
    clade: "outgroup" => "unassigned"
I think it would be awesome to do something similar here.
I would suggest we change the output files to TSV (a format SILO can produce), store them compressed to save space, and decompress them only when comparing. That way we can produce clear, human-readable test results."
Originally posted by @anna-parker in #87 (comment)
The main issue is that the files are quite large when uncompressed (from several MB to around 10 MB, depending on the organism), so it's reasonable to keep them compressed.
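To make the proposal concrete, here is a minimal sketch of the "store compressed, decompress only when comparing" idea using just the Python standard library. It is not csv-diff itself, only a stand-in that prints output in a similar style. The function names (`load_tsv_gz`, `diff_rows`, `format_diff`) are hypothetical, and it assumes each TSV has a header row and a column that uniquely identifies a row.

```python
import csv
import gzip


def load_tsv_gz(path, key):
    """Read a gzip-compressed TSV into {key value: row dict}.

    The file is decompressed on the fly, so nothing uncompressed
    has to be stored in the repository.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Return added keys, removed keys, and per-column changes for shared keys."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        cols = {
            col: (old[k].get(col), new[k].get(col))
            for col in old[k].keys() | new[k].keys()
            if old[k].get(col) != new[k].get(col)
        }
        if cols:
            changed[k] = cols
    return added, removed, changed


def format_diff(key, added, removed, changed):
    """Render the diff as human-readable text, loosely modelled on csv-diff."""
    lines = []
    if changed:
        lines.append(f"{len(changed)} row(s) changed")
        for k, cols in changed.items():
            lines.append(f"  {key}: {k}")
            for col, (before, after) in sorted(cols.items()):
                lines.append(f'    {col}: "{before}" => "{after}"')
    if added:
        lines.append(f"{len(added)} row(s) added")
        lines.extend(f"  {key}: {k}" for k in added)
    if removed:
        lines.append(f"{len(removed)} row(s) removed")
        lines.extend(f"  {key}: {k}" for k in removed)
    return "\n".join(lines)
```

A regression test could then load the stored expected file and the freshly produced one (for example, hypothetical `expected.tsv.gz` and `actual.tsv.gz`), call `diff_rows`, and fail with the `format_diff` text as the message whenever any of the three results is non-empty.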