[Bug] --k-score seq:96,prof:80 is not passed correctly to mmseqs search

TLDR: I think `search.py` should pass `--k-score seq:96,prof:80` without the single quotes. With the quotes, the value is ignored by `mmseqs prefilter`, which falls back to its default parameter. This changes the resulting MSA.

## Problem
When we don't pass a sensitivity, ColabFold adds the `--k-score 'seq:96,prof:80'` flag to `mmseqs search` (https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py#L122). `mmseqs search` then passes this on to `mmseqs prefilter`. According to the docs
```
--k-score TWIN               k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]
```
the value must be a TWIN, not a STR. Since we are passing a quoted STR, `mmseqs prefilter` ignores it and uses the default, `seq:2147483647,prof:2147483647`. Here are the first few lines of the logs:

```
INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1 

Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 30ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80'
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80' 

prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/7040578792334987330/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3 

Index version: 16
Generated by:  18.8cc5c
ScoreMatrix:  VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 375G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
76.316176 k-mers per position
82364 DB matches per sequence
0 overflows
```

You can see the script passes `--k-score 'seq:96,prof:80'` (with the quotes) to `mmseqs search`, but `mmseqs prefilter` ignores it and runs with the default `--k-score seq:2147483647,prof:2147483647`. As a result, the k-mer similarity threshold is 122, not 96. If I modify `search.py` to drop the single quotes, here are the logs:
```
INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1 

Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 37ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80 

prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/6171524400515947378/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:96,prof:80 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3 

Index version: 16
Generated by:  18.8cc5c
ScoreMatrix:  VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 392G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
1881.948529 k-mers per position
1868877 DB matches per sequence
0 overflows
```
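Now the flag seems to be passed correctly: `mmseqs prefilter` receives `--k-score seq:96,prof:80` and the k-mer similarity threshold is 96. We also get a different output: the number of DB matches per sequence goes from 82364 to 1868877.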
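One possible explanation for why the quotes matter, sketched below in Python. This assumes `search.py` launches `mmseqs` from an argument list without going through a shell, which I have not verified. In that case nothing strips the quotes, so the child process receives them as part of the value. The echo child and the `received_args` helper are made up for illustration and are not part of ColabFold or MMseqs2.

```python
import shlex
import subprocess
import sys

# Tiny child program that prints the arguments it actually received.
ECHO_ARGS = "import sys; print(sys.argv[1:])"


def received_args(args):
    """Run the echo child with `args` (no shell) and return what it printed."""
    out = subprocess.run(
        [sys.executable, "-c", ECHO_ARGS, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return out.stdout.strip()


# Quotes embedded in a list element reach the child verbatim.
quoted = received_args(["--k-score", "'seq:96,prof:80'"])

# Without embedded quotes the child sees the bare value.
bare = received_args(["--k-score", "seq:96,prof:80"])

# A POSIX shell would strip the quotes before exec; shlex.split mimics that.
shell_view = shlex.split("--k-score 'seq:96,prof:80'")

print("quoted list element:", quoted)
print("bare list element:  ", bare)
print("after shell parsing:", shell_view)
```

If that assumption holds, the quoted form is only needed when the command is typed into a shell, and the list element in `search.py` should be the bare `seq:96,prof:80`.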

[Bug] --k-score seq:96,prof:80 is not passed correctly to mmseqs search #804
