6. Input and output data
These options are always mandatory: --contigs and --reference. --contigs should point to the path of a FASTA file with one or more DNA sequences and --reference should point to the path of a FASTA file with one or more amino acid sequences. It doesn't matter whether the sequences were aligned or not since gap characters (-) are removed before the alignment step.
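To illustrate why pre-aligned input is harmless, the sketch below strips gap characters from the sequence lines of FASTA text while leaving identifier lines untouched. It is not Patchwork's own code: the function name `strip_gaps` is hypothetical and only mirrors the behaviour described above.

```python
def strip_gaps(fasta_text: str, gap: str = "-") -> str:
    """Remove gap characters from sequence lines; keep identifier lines as-is."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # identifier line: may legitimately contain "-"
        else:
            cleaned.append(line.replace(gap, ""))
    return "\n".join(cleaned)


aligned = ">D_melanogaster@16S\nAT--GC-A\n"
print(strip_gaps(aligned))
```

An aligned and an unaligned copy of the same sequence come out identical after this step.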
Each sequence identifier line should be annotated with a species name in the following format: <species><delimiter><rest>. The default species delimiter is an at sign @, so the format becomes <species>@<rest>. For example, D_melanogaster@16S is a valid sequence identifier format.
To add the species name D_melanogaster to every sequence in a file named contigs.fa, you can use sed:
sed -i 's/^>/>D_melanogaster@/' contigs.fa
Tip: remove -i to preview the results before committing to the replacement.
Reverting to the unannotated format is as easy as typing the following on your command line:
sed -i 's/^>D_melanogaster@/>/' contigs.fa
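Before running Patchwork, it can help to check that every identifier follows the <species><delimiter><rest> format. The following is a minimal sketch, not part of Patchwork: the helper `split_identifier` is hypothetical, and it assumes the identifier is split at the first occurrence of the delimiter.

```python
def split_identifier(header: str, delimiter: str = "@"):
    """Split a FASTA identifier line into (species, rest).

    Returns None when the delimiter is missing or the species part is empty,
    i.e. when the identifier is not annotated.
    Assumption: the split happens at the FIRST delimiter in the line.
    """
    name = header.lstrip(">").strip()
    species, found, rest = name.partition(delimiter)
    if not found or not species:
        return None
    return species, rest


for header in (">D_melanogaster@16S", ">contig_1"):
    print(header, "->", split_identifier(header))
```

Identifiers for which the function returns None still need a species name added, for example with the sed command shown above.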
If you provide a directory as an input, Patchwork will look for all of the FASTA files within that directory (but not in the subdirectories). These are the filetype extensions that are recognized by Patchwork: .fa, .fas, .fasta, .fna, .faa, .fsa, .ffn, and .frn.
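To preview which files would be picked up from a directory, the lookup described above can be approximated as follows. This is a sketch under stated assumptions rather than Patchwork's implementation: `find_fasta_files` is a hypothetical helper, and matching extensions case-sensitively is an assumption.

```python
from pathlib import Path

# Extensions listed in the paragraph above.
FASTA_EXTENSIONS = {".fa", ".fas", ".fasta", ".fna", ".faa", ".fsa", ".ffn", ".frn"}


def find_fasta_files(directory):
    """Return FASTA files directly inside `directory`, ignoring subdirectories."""
    return sorted(
        path
        for path in Path(directory).iterdir()  # iterdir() does not recurse
        if path.is_file() and path.suffix in FASTA_EXTENSIONS
    )


# List the FASTA files in the current working directory.
for path in find_fasta_files("."):
    print(path.name)
```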
If you do not specify an output directory with the --output-dir flag, the output files are saved to a folder called patchwork_output. Here is an overview of the output directory:
patchwork_output/
├── database.dmnd
├── diamond_blastx.log
├── diamond_makedb.log
├── diamond_out
│ ├── ...
│ └── SEQUENCE_NAME.tsv
├── dna_query_sequences
│ ├── ...
│ └── SEQUENCE_NAME.fas
├── plots
│ ├── percent_identity.png
│ └── query_coverage.png
├── query_sequences
│ ├── ...
│ └── SEQUENCE_NAME.fas
├── sequence_stats
│ ├── average.csv
│ └── statistics.csv
├── trimmed_alignments.txt
└── untrimmed_alignments.txt
Here is a short description of each of these files and directories:
- database.dmnd is the database file that was created by DIAMOND
- diamond_blastx.log contains the DIAMOND output in plain text
- diamond_makedb.log contains the DIAMOND output for the database construction in plain text
- diamond_out contains the results for each individual reference sequence
- dna_query_sequences contains all the merged, non-translated (i.e. DNA) query sequences in FASTA format
- plots contains basic visualizations of result quality (plots/percent_identity.png and plots/query_coverage.png)
- query_sequences contains all the merged and translated query sequences in FASTA format
- sequence_stats contains basic statistics for each individual search (sequence_stats/statistics.csv) and for all searches combined (sequence_stats/average.csv)
- trimmed_alignments.txt and untrimmed_alignments.txt contain a visual representation of each alignment (trimmed and untrimmed, respectively)
The following is an example of the structure of a [trimmed_|untrimmed_]alignments.txt file:
1. -----------------------------------------------------------------------------
Reference ID: Helobdella_robusta@366936at33208_6412_0:004149
Reference Length: 294
Query Length: 294
Contigs: 320
Matches: 193
Mismatches: 101
Deletions: 0
Occupancy: 1.0
seq: 1 VEEYEKLERIGEGTYGVVYKAKNVKTNTLVALKRGRFDNEEEGVPGTAIREISLLEALEH 60
||||| ||||||| | |||| | ||||| | | |||| | ||| || | |
ref: 1 MQKYEKLEKIGEGTYGTVFKAKNRETQEIVALKRVRLDDDDEGVPSSALREICLLKELNH 60
seq: 61 PNIVTLQDVIETEKKIYLVFEYLTMDLKKYMDALNGELPPDTVKTFLFQLLRGLAYCHAR 120
||| | || ||| ||||| ||||| | ||| ||||| | ||| ||| || |
ref: 61 KNIVRLCDVLHSEKKLTLVFEYSDQDLKKYFDSCNGEIDPDTVKSFMYQLLKGLAFCHGR 120
seq: 121 RILHRDLKPQNLLINKNGELKLADFGLARAFGVPVRCYTHEVVTLWYRAPEVLLQDKLYT 180
|||||||||||||||||||||||||||||| ||||| |||||||| | || |||
ref: 121 NVLHRDLKPQNLLINKNGELKLADFGLARAFGIPVRCYSAEVVTLWYRPPDVLFGAKLYS 180
seq: 181 TSIDLWSVGCIFGELANAGRPLWPGNDISDECKNIIKLLGTPTDDTWPEGYQLSQLKPYP 240
|||| || |||| ||||||||| |||| | | | ||||||| |||| || ||||
ref: 181 TSIDMWSAGCIFAELANAGRPLFPGNDVDDQLKRIFKLLGTPTEDTWPGFTQLPEYKPYP 240
seq: 241 LFESLTEKLQIVPFIENNFTSFLLRLLTYNPQKRITASDALNHPYFSELNANVK 294
| | | ||||| || || || | | | | |||| ||| |
ref: 241 LYPSSTNWLQIVPKLNSKGRDLLLSLLVCNPSQRMGADDSMKHSYFSEMNANLK 294
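The summary fields at the top of each record are plain "Key: value" lines, so they are easy to read programmatically. The sketch below is illustrative only: `parse_summary` is a hypothetical helper, it assumes each field sits on its own line, and the identity fraction it computes (matches divided by matches plus mismatches plus deletions) is a simple derived figure, not necessarily the percent identity that Patchwork reports.

```python
def parse_summary(block: str) -> dict:
    """Collect the 'Key: value' lines of one alignment record into a dict."""
    summary = {}
    for line in block.splitlines():
        key, sep, value = line.partition(": ")
        # Skip alignment rows such as 'seq: 1 VEEY...' and 'ref: 1 MQKY...',
        # as well as lines without a 'Key: value' shape (e.g. the '|||' rows).
        if not sep or key.strip() in ("seq", "ref"):
            continue
        summary[key.strip()] = value.strip()
    return summary


record = """\
Reference ID: Helobdella_robusta@366936at33208_6412_0:004149
Reference Length: 294
Query Length: 294
Contigs: 320
Matches: 193
Mismatches: 101
Deletions: 0
Occupancy: 1.0
"""

summary = parse_summary(record)
matches = int(summary["Matches"])
mismatches = int(summary["Mismatches"])
deletions = int(summary["Deletions"])
identity = matches / (matches + mismatches + deletions)
print(f"{summary['Reference ID']}: {identity:.1%} identical positions")
```

In the example record, matches, mismatches, and deletions add up to the reference length of 294.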