snacc (sequence non-alignment compression & comparison) is a program implementing the normalized compression distance (NCD) specifically for biological data. These distances can be used for clustering, or to rapidly infer phylogenies for large sets of genomes.
To install snacc directly you may use:
virtualenv env --python=python3.6 # optional, but recommended to create a clean environment
source env/bin/activate # if using virtualenv, activate the env
pip install git+https://github.com/SweetiePi/snaccWe recommend you create a conda environment for snacc and install through conda.
snacc requires Python 3.6, so create a conda environment with the right Python version:
conda create --name snacc python=3.6And then activate the environment and install snacc:
source activate snacc
conda install -c asweeten snaccWhen inside the snacc conda environment, you can verify correct isntallation by running snacc -h.
- Most basic usage
snacc [folder with sequences] -o [output name]- Intermediate: customize number of threads and compression algorithm
snacc -d [folder with sequences] -o [output name] -n 24 -c gzip
- Full control
snacc \
--directory [folder with sequences] \
--output [output name] \
--num-threads 24 \
--compression lz4 \
--fast-mode True \
--reverse-compliment False- Analysis time: 2018-10-14 15:18:17.257619
- Analysis duration: 0:00:26.383997
- Compression method: lz4
- Reverse complement: False
- Burrows-Wheeler transform: False
- Output filepath: test.csv
- Python: 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
- snacc: 0.0.1
- scikit-learn: 0.20.0
- py-lz4framed: 0.12.0
- umap-learn: 0.3.5
- /test_dataset/mysteryGenome_1.fasta
- /test_dataset/mysteryGenome_2.fasta
