Description
TLDR: I think search.py should not wrap the value in --k-score 'seq:96,prof:80' in single quotes. With the quotes, mmseqs prefilter ignores the value and falls back to its default, which changes the MSA result.
Problem
When no sensitivity is passed, ColabFold adds the --k-score 'seq:96,prof:80' flag to mmseqs search (https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py#L122), and mmseqs search forwards it to mmseqs prefilter. According to the docs:
--k-score TWIN k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]
the value must be a TWIN, not a STR. Since we are passing a STR (the quotes are part of the value), mmseqs prefilter ignores the flag and uses the default, seq:2147483647,prof:2147483647. Here are the first few lines of the logs:
INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 30ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80'
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score 'seq:96,prof:80'
prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/7040578792334987330/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3
Index version: 16
Generated by: 18.8cc5c
ScoreMatrix: VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 375G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1
k-mer similarity threshold: 122
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
76.316176 k-mers per position
82364 DB matches per sequence
0 overflows
As the logs show, the script passes --k-score 'seq:96,prof:80' (with the single quotes) to mmseqs search, but mmseqs prefilter ignores it and runs with the default --k-score seq:2147483647,prof:2147483647. As a result, the k-mer similarity threshold is 122, not 96. If I modify search.py so that it no longer passes the single quotes, the logs look like this:
INFO:__main__:Running mmseqs createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
createdb ishaan_tmp/query.fas ishaan_tmp/qdb --shuffle 0 --dbtype 1
Converting sequences
[
Time for merging to qdb_h: 0h 0m 0s 0ms
Time for merging to qdb: 0h 0m 0s 0ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 37ms
INFO:__main__:Running mmseqs search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80
Create directory ishaan_tmp/tmp
search ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db ishaan_tmp/res ishaan_tmp/tmp --threads 180 --num-iterations 3 --db-load-mode 2 -a -e 0.1 --max-seqs 10000 --prefilter-mode 0 --k-score seq:96,prof:80
prefilter ishaan_tmp/qdb /home/ec2-user/ColabFold/uniref30_2302_db.idx ishaan_tmp/tmp/6171524400515947378/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 0 --target-search-mode 0 --k-score seq:96,prof:80 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 180 --compressed 0 -v 3
Index version: 16
Generated by: 18.8cc5c
ScoreMatrix: VTML80.out
Query database size: 2 type: Aminoacid
Estimated memory consumption: 392G
Target database size: 36293491 type: Aminoacid
Process prefiltering step 1 of 1
k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 2
Target db start 1 to 36293491
[=================================================================] 2 0s 0ms
1881.948529 k-mers per position
1868877 DB matches per sequence
0 overflows
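Now the flag seems to be passed correctly: prefilter receives --k-score seq:96,prof:80 and the k-mer similarity threshold is 96. The output also differs: DB matches per sequence goes from 82364 to 1868877, and k-mers per position from 76.316176 to 1881.948529.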
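For illustration, here is a minimal sketch of why the quotes might reach mmseqs literally. It assumes search.py launches mmseqs through an argument list without a shell (I have not confirmed this for every code path). The helper child_argv is hypothetical and only echoes back what a child process receives; it does not call mmseqs.

```python
import json
import shlex
import subprocess
import sys


def child_argv(args):
    """Hypothetical helper: start a child process from an argument list
    (no shell involved) and return the arguments it actually received."""
    echo = "import json, sys; print(json.dumps(sys.argv[1:]))"
    out = subprocess.check_output([sys.executable, "-c", echo, *args])
    return json.loads(out)


# Quoted form: without a shell, the single quotes are part of the value.
quoted = child_argv(["--k-score", "'seq:96,prof:80'"])

# Unquoted form: the child receives the bare value.
unquoted = child_argv(["--k-score", "seq:96,prof:80"])

# A POSIX shell would strip the quotes before the program ever sees them,
# which is why the quoted form works when typed at a prompt.
via_shell = shlex.split("--k-score 'seq:96,prof:80'")

print(quoted)
print(unquoted)
print(via_shell)
```

If that assumption holds, the quoted value arrives with the quote characters included, which would explain why it is not parsed as seq:96,prof:80. Dropping the quotes in the argument list would be the smallest fix.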