[Bug] --k-score seq:96,prof:80 is not passed correctly to mmseqs search #804

@imathur1

TL;DR: In search.py, I think --k-score 'seq:96,prof:80' should be passed without the single quotes. With the quotes, mmseqs prefilter ignores the value and falls back to its default, which changes the resulting MSA.

Problem

When no sensitivity is passed, ColabFold adds the --k-score 'seq:96,prof:80' flag to mmseqs search (https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py#L122), and mmseqs search forwards it to mmseqs prefilter. According to the docs:

--k-score TWIN               k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

the value must be a TWIN, not a STR. Since we are passing a quoted STR, mmseqs prefilter appears to ignore the flag and use the default, seq:2147483647,prof:2147483647. Here are the first few lines of the logs:

INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1 

Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 30ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80'
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80' 

prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/7040578792334987330/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3 

Index version: 16
Generated by:  18.8cc5c
ScoreMatrix:  VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 375G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
76.316176 k-mers per position
82364 DB matches per sequence
0 overflows
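
One way to see why the quoted value could be rejected: if the command is launched from Python as an argument list with no shell in between (an assumption about how search.py invokes mmseqs, consistent with the quotes showing up in the INFO line above), nothing strips the quotes, so the program receives them as part of the value. The sketch below uses a small Python child process as a stand-in for mmseqs; it only illustrates argument passing and is not ColabFold code.

```python
import shlex
import subprocess
import sys

# Tiny child program standing in for mmseqs (hypothetical stand-in): it prints
# its first argument exactly as the operating system delivered it.
ECHO_ARG = "import sys; print(sys.argv[1])"


def received_value(token: str) -> str:
    """Run the child with `token` as a single argv entry, without a shell."""
    result = subprocess.run(
        [sys.executable, "-c", ECHO_ARG, token],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Quotes embedded in a list element are not stripped: the child sees them.
quoted = received_value("'seq:96,prof:80'")
# Without the embedded quotes the child sees the plain value.
plain = received_value("seq:96,prof:80")
# A POSIX shell (emulated here with shlex) would have removed the quotes.
shell_tokens = shlex.split("--k-score 'seq:96,prof:80'")

print(repr(quoted))
print(repr(plain))
print(shell_tokens)
```

If this is what happens in search.py, dropping the quotes from the list element should be enough, since quoting is only needed when a shell parses the command line.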

The script passes --k-score 'seq:96,prof:80' (with quotes) to mmseqs search, but mmseqs prefilter ignores it and runs with the default --k-score seq:2147483647,prof:2147483647. The k-mer similarity threshold is 122, not 96. If I modify search.py to drop the single quotes, these are the logs:

INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1 

Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 37ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80 

prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/6171524400515947378/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:96,prof:80 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3 

Index version: 16
Generated by:  18.8cc5c
ScoreMatrix:  VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 392G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
1881.948529 k-mers per position
1868877 DB matches per sequence
0 overflows

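Now the flag appears to be passed correctly: prefilter receives --k-score seq:96,prof:80 and reports a k-mer similarity threshold of 96. The output also differs: DB matches per sequence go from 82364 to 1868877, and k-mers per position from 76.316176 to 1881.948529.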
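
To compare the two runs without reading the logs by eye, the relevant lines can be pulled out with a few regular expressions. This is a minimal sketch: the log excerpts are copied from the prefilter output in this report, and `prefilter_stats` is a hypothetical helper, not part of ColabFold or MMseqs2.

```python
import re

# Excerpts copied from the two prefilter logs in this report.
LOG_QUOTED = """\
k-mer similarity threshold: 122
76.316176 k-mers per position
82364 DB matches per sequence
"""
LOG_UNQUOTED = """\
k-mer similarity threshold: 96
1881.948529 k-mers per position
1868877 DB matches per sequence
"""


def prefilter_stats(log_text: str) -> dict:
    """Extract the threshold and match statistics from prefilter log text."""
    threshold = re.search(r"k-mer similarity threshold:\s*(\d+)", log_text)
    kmers = re.search(r"([\d.]+) k-mers per position", log_text)
    matches = re.search(r"(\d+) DB matches per sequence", log_text)
    return {
        "threshold": int(threshold.group(1)),
        "kmers_per_position": float(kmers.group(1)),
        "db_matches_per_sequence": int(matches.group(1)),
    }


with_quotes = prefilter_stats(LOG_QUOTED)
without_quotes = prefilter_stats(LOG_UNQUOTED)
ratio = (
    without_quotes["db_matches_per_sequence"]
    / with_quotes["db_matches_per_sequence"]
)

print(with_quotes)
print(without_quotes)
print(f"{ratio:.1f}x more DB matches per sequence without the quotes")
```

A check like this could be run against a fresh log after changing search.py to confirm that the reported threshold is 96 rather than 122.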
