- We are using Python 2.7. You will only need the folder `BorreliaPipeline` to run the pipeline. This can be done with the following `git` commands:
git init
git remote add origin https://github.com/WGS-TB/MLST
git fetch origin
git checkout origin/master -- BorreliaPipeline
- SRA toolkit: To download sample reads. Please download the toolkit here: https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/. You will need `fastq-dump` to download the sample reads. Our samples are downloaded using `fastq-dump` v2.8.2. Copy `fastq-dump` into your bin folder (ours: /usr/local/bin) or provide the path to `fastq-dump` to the `download_samples.py` script. The Python script calls the `fastq-dump` command to download sample reads.
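As a rough sketch (the accession below is one of the project's samples mentioned later in this README, and the output directory is only an illustration; `download_samples.py` runs the equivalent commands for you), a single sample could be fetched manually with:

```
# download paired reads for one accession into its own folder under data/
fastq-dump --split-files -O data/SRR2034333 SRR2034333
```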
- Bowtie v0.12.7: For read mapping. Please download it here: https://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/. You will need both `bowtie` and `bowtie-build`; copy these into your bin folder or provide the path to the folder containing `bowtie` and `bowtie-build` to the `mapSamples.py` script. The Python script calls the `bowtie` and `bowtie-build` commands to map reads.
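For reference, a minimal Bowtie 0.12.7 workflow looks like the sketch below (file names are placeholders; `mapSamples.py` builds and runs the actual commands for you):

```
# build an index from a reference FASTA, then map paired-end reads to SAM
bowtie-build reference.fasta refIndex
bowtie -S -p 2 refIndex -1 sample_1.fastq -2 sample_2.fastq sample.sam
```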
- CPLEX Python API: For solving ILPs. You will need an academic license to download CPLEX. Once you have installed CPLEX, follow the instructions here to install the CPLEX-Python modules: https://www.ibm.com/support/knowledgecenter/SSSA5P_12.5.1/ilog.odms.cplex.help/CPLEX/GettingStarted/topics/set_up/Python_setup.html
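A quick, hedged way to confirm the CPLEX-Python modules are visible to your interpreter (nothing project-specific is assumed here) is:

```
# should print the installed CPLEX version; an error-free "import cplex" alone is also a good sign
python -c "import cplex; print(cplex.Cplex().get_version())"
```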
- ART: For simulating reads. Please download it here: https://www.niehs.nih.gov/research/resources/software/biostatistics/art/. You will need `art_illumina`; copy this into your bin folder or provide the path to `art_illumina` to `run_sim.py`. This script runs the allele simulation.
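As an illustration only (the reference file, read length, fragment size, and coverage below are placeholder values; the simulation scripts set their own parameters), `art_illumina` is typically invoked like this:

```
# simulate 30x paired-end 76bp reads from a FASTA reference
art_illumina -i allele.fasta -p -l 76 -f 30 -m 200 -s 10 -o simulated_reads
```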
- Samtools v1.3.1: For extracting information from SAM files. Please download it here: https://sourceforge.net/projects/samtools/files/samtools/1.3.1/. You will need `samtools`; copy it into your bin folder or provide the path to `samtools` to `mapSamples.py`. The Python script calls the `samtools` command while extracting information from SAM files after mapping.
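For orientation (the file name is a placeholder; `mapSamples.py` issues the real commands), pulling the mapped reads out of a SAM file with samtools looks like:

```
# keep only reads that mapped (flag 4 marks unmapped reads)
samtools view -S -F 4 sample.sam > sample_mapped.sam
```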
- In computing edit distances, we used the package at https://github.com/aflc/editdistance, and we hereby acknowledge Hiroyuki Tanaka; the algorithm was proposed by Heikki Hyyrö, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers" (2001). Possible Python packages you will need: numpy, pandas, matplotlib, linecache, argparse. Please use `pip install` to install any of these packages you do not already have. See the example below.
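A possible one-liner for the third-party packages (linecache and argparse ship with Python's standard library, so they do not need pip), followed by a quick sanity check of the edit-distance package:

```
pip install numpy pandas matplotlib editdistance
# Levenshtein distance between ACGT and AGGT; should print 1
python -c "import editdistance; print(editdistance.eval('ACGT', 'AGGT'))"
```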
- We also use the package PyEMD for EMD calculations, and we hereby acknowledge Gary Doran for his PyEMD package: https://github.com/garydoranjr/pyemd.
- We are not sure if we have covered all the Python packages here; please let us know if we missed any!
- The required scripts and files are in the `pipeline` folder in the `BorreliaPipeline` folder (except for the sample reads, as those files are big).
- Run `python download_samples.py [-h] [-c CMD]` in the `pipeline` folder to download the samples' reads (the downloaded samples are based on SRR_Acc_List.txt). `download_samples.py` has an optional argument `-c`, which sets the path to the `fastq-dump` command needed to download the reads for the Borrelia samples. By default, it assumes `fastq-dump` is in your bin folder. After running this script, it will create a `data` folder that contains a folder for each sample. See the example below.
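For example (the path is illustrative; point `-c` at wherever your `fastq-dump` binary lives, or omit it if `fastq-dump` is already in your bin folder):

```
python download_samples.py -c /usr/local/bin/fastq-dump
```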
- Since we are using Bowtie to map the sample reads, run `python mapSamples.py [-h] [-c NUMOFCORES] [-b BOWTIE] [-s SAMTOOLS]` to map the sample reads. `-c` specifies the number of cores for running Bowtie; the default is 4. `-b` specifies the path to the folder containing the two commands `bowtie` and `bowtie-build`; the default assumes these two commands are in the user's bin folder. `-s` specifies the path to `samtools`; the default assumes the command is in the user's bin folder. A working example: `python mapSamples.py -c 2 -b ../bowtieFolder/ -s ../../samtools`. The script will create a `variantsAndProp` folder containing a folder for each sample, where each of these folders contains a list of `sampleX_geneY_paired_reads.txt` text files; each text file contains the information needed for the first stage of our ILP pipeline. This information is extracted from the `.sam` files.
- To solve the allele diversity problem, run the following script in the `pipeline` folder:
`python alleleDiversity.py [-h] [-s SAMPLE]`
If you want to run on a specific sample, you can indicate it using `-s`; otherwise it will run on all samples in `variantsAndProp`. For example, you could run `python alleleDiversity.py -s SRR2034333`. The script will print some intermediate results to the screen so you can keep track of progress, and it will create 8 CSV files (one per locus) in each sample folder in `variantsAndProp`, named `geneX_proportions.csv`, which contain the alleles and their proportions identified at geneX.
- To solve the global strain diversity problem, run the following script:
`python globalStrainDiversity.py [-h] [-o OUTPUT] [-oc OBJECTIVECOMPONENT] [-timelim TIMELIMIT] [-g GAP] [-pathToDistMat PTDM]`
Specify `-o`, the folder name in which the results will be stored; the default is strainsAndProp. The `-oc` argument selects the formulation of the objective function: 'all' includes all components of the objective function, while 'noPropAndErr' omits the proportion and error components. `-g` specifies the relative gap (in %) at which to stop the solver; the default is 5%. `-timelim` specifies the maximum time limit in seconds before stopping the solver; the default is 600 seconds. The solver stops only when both the `-g` and `-timelim` conditions are satisfied: for example, with `-g` 5 and `-timelim` 600, if the relative gap is still above 5% after the 600 seconds have passed, the program will not stop until a gap of 5% is reached. `-pathToDistMat` is the name of the folder containing the distance matrices for each gene; this folder is provided and named 'editDist'.
The script will print some intermediate results so you can keep track of progress. It will create a new folder in your current directory with the name specified by `-o`. The folder contains a list of `sampleX_strainsAndProportions.csv` files, where sampleX corresponds to a particular sample, and each CSV file contains the strains and their proportions identified in sampleX.
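For instance, a run that simply spells out the documented defaults (adjust the values and folder names for your own analysis) would be:

```
python globalStrainDiversity.py -o strainsAndProp -oc all -g 5 -timelim 600 -pathToDistMat editDist
```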
- The required scripts and files to run the simulation are in the `alleleSimulation` folder in the `BorreliaPipeline` folder.
- In the `alleleSimulation` folder, you only have to run
`python run_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-e EDITDIST] [-fp FULLPAPER] [-b BOWTIE] [-s SAMTOOLS] [-a ART]`
`-i` specifies the number of simulations on each gene; the default is 40. `-f` specifies the name of the folder in which to store the results; the default is simulation_results. `-c` specifies the coverage to test at; the default is 30. `-b` specifies the path to the folder containing the `bowtie` and `bowtie-build` commands; the default assumes both commands are in the user's bin folder. `-s` specifies the path to `samtools`; the default assumes it is in your bin folder. `-a` specifies the path to `art_illumina`; the default assumes it is in the user's bin folder. Setting `-fp` to True runs a 2D parameter experiment (coverage and editDist), which is presented in our paper. The results folder will contain .csv and .png files representing different statistics.
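For example (the tool paths are illustrative; all of them can be omitted if the tools are in your bin folder):

```
python run_sim.py -i 40 -f simulation_results -c 30 -b ../bowtieFolder/ -s ../../samtools -a /usr/local/bin/art_illumina
```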
- You will need to install `kallisto`; the link is provided here: https://pachterlab.github.io/kallisto/download. To run the simulation using kallisto, run
`python run_kallisto_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-a ART] [-k KAL] [-fp FULLPAPER]`
The instructions are similar to `run_sim.py`, with `-k` referring to the path to `kallisto`; the default assumes it is in the user's bin folder.
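For example (the kallisto path is illustrative, and the other values mirror the `run_sim.py` defaults):

```
python run_kallisto_sim.py -i 40 -c 30 -k /usr/local/bin/kallisto
```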
- To run the novel allele experiment, run
`python run_loo_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-b BOWTIE] [-s SAMTOOLS] [-a ART]`
The instructions are the same as for `run_sim.py`.
- In the `BorreliaPipeline` folder, `strainAndFullPipeline_sim` contains the scripts and files to run the SDP simulation (with no reads involved) and the full-pipeline simulation for our method. `strainEST_compare` contains the scripts to run the full-pipeline simulation that compares against strainEST. The link to install strainEST is here: https://github.com/compmetagen/strainest. Details about running these simulations are given in separate README files in each of these folders.
- A sample script to compute the MST is in the `pipeline` folder and can be altered for your own analysis.