all-vs-all kmer sharing and mash distances for a set of proteins #185

@stubrown

Hello Mash authors. I am working with Mash for protein clustering. We have a large database of 8 million proteins, and our current clustering method relies on several stages of all-vs-all BLAST comparisons.

I have implemented a Mash distance to quantify the dispersion of proteins within a cluster. At present I choose a centroid for the cluster through a complex process that relies on BLAST E-values computed elsewhere in the workflow, and then run Mash with this central protein against a FASTA file of all the other proteins. This yields a 'distance to the center' value for each protein, which serves as a measure of cluster dispersion.
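To make the 'distance to the center' idea concrete, here is a minimal Python sketch. It assumes the published Mash distance formula, D = -(1/k) ln(2j / (1 + j)) for a Jaccard index j, and for simplicity it computes the exact Jaccard index over full k-mer sets rather than a MinHash estimate as Mash does. The function names, the k value, and the toy sequences are illustrative only and are not part of my actual workflow.

```python
import math

def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index of two k-mer sets (Mash estimates this from sketches)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mash_distance(j: float, k: int) -> float:
    """Mash distance from a Jaccard index, clamped to [0, 1] here."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: maximal distance
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))

def dispersion(centroid: str, members: list, k: int = 3):
    """Distance of every member to the centroid, plus the mean distance."""
    c = kmers(centroid, k)
    dists = [mash_distance(jaccard(c, kmers(m, k)), k) for m in members]
    return dists, sum(dists) / len(dists)

# Toy example: one member identical to the centroid, one sharing no k-mers.
dists, mean_dist = dispersion("MKTAYIAKQR", ["MKTAYIAKQR", "WWWWWWWWWW"], k=3)
```

The mean is only one possible summary; the per-protein distances could equally feed a maximum or a percentile.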

Intuitively, it seems to me that it should be possible to build a single hash for a set of proteins and then extract, in one operation, all of the pairwise k-mer sharing counts, efficiently producing a matrix of distances for all comparisons among the proteins in the set. I would expect this to scale to millions of proteins much better than pairwise all-vs-all BLAST. The matrix could then be used as input for a clustering algorithm.
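What I have in mind could be sketched roughly as follows. This is an illustration under my own assumptions, not a description of how Mash works internally: each protein gets a bottom-s MinHash sketch (using BLAKE2b from Python's `hashlib` as a stand-in hash function), a single inverted index maps each hash to the proteins containing it, and the shared-hash counts for all pairs are read off that index. The names `sketch` and `shared_hash_counts` and the parameters `k` and `s` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def sketch(seq: str, k: int = 3, s: int = 64) -> set:
    """Bottom-s sketch: the s smallest 64-bit hashes of the sequence's k-mers."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }
    return set(sorted(hashes)[:s])

def shared_hash_counts(sketches: dict) -> dict:
    """Shared sketch-hash counts for every pair that shares at least one hash."""
    # One inverted index for the whole set: hash -> names of proteins containing it.
    index = defaultdict(list)
    for name, sk in sketches.items():
        for h in sk:
            index[h].append(name)
    # Each hash contributes one shared hit to every pair of proteins that hold it.
    counts = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1
    return dict(counts)

sketches = {
    "a": sketch("MKTAYIAKQR"),
    "b": sketch("MKTAYIAKQR"),
    "c": sketch("WWWWWWWWWW"),
}
counts = shared_hash_counts(sketches)
```

Only pairs that share at least one hash appear in the result, so the output is a sparse matrix rather than a full one. A hash present in very many proteins still generates a number of pairs quadratic in that count, so such hashes would probably need to be capped or filtered. Turning the shared counts into Jaccard estimates and distances would be a further step.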

Your thoughts on this would be very helpful.
