|
17 | 17 | { |
18 | 18 | "cell_type": "code", |
19 | 19 | "execution_count": 3, |
20 | | - "metadata": { |
21 | | - "collapsed": false |
22 | | - }, |
| 20 | + "metadata": {}, |
23 | 21 | "outputs": [ |
24 | 22 | { |
25 | 23 | "name": "stdout", |
|
59 | 57 | "metadata": {}, |
60 | 58 | "source": [ |
61 | 59 | "### 2. Quality-check your raw (and dirty) reads\n", |
62 | | - "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n", |
63 | | - "\n", |
64 | | - "#### a) Prepare text-file with file-paths\n", |
65 | | - "Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n", |
66 | | - "\n", |
67 | | - "<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n", |
68 | | - "\n", |
69 | | - " for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n", |
70 | | - " \n", |
71 | | - " for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n", |
72 | | - " \n", |
73 | | - "After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:" |
74 | | - ] |
75 | | - }, |
76 | | - { |
77 | | - "cell_type": "code", |
78 | | - "execution_count": 15, |
79 | | - "metadata": { |
80 | | - "collapsed": false, |
81 | | - "scrolled": true |
82 | | - }, |
83 | | - "outputs": [ |
84 | | - { |
85 | | - "name": "stdout", |
86 | | - "output_type": "stream", |
87 | | - "text": [ |
88 | | - "../../data/raw/fastq/1061_R1.fastq\n", |
89 | | - "../../data/raw/fastq/1063_R1.fastq\n", |
90 | | - "../../data/raw/fastq/1064_R1.fastq\n", |
91 | | - "../../data/raw/fastq/1065_R1.fastq\n", |
92 | | - "../../data/raw/fastq/1068_R1.fastq\n", |
93 | | - " ... \n", |
94 | | - "../../data/raw/fastq/1140_R2.fastq\n", |
95 | | - "../../data/raw/fastq/1164_R2.fastq\n", |
96 | | - "../../data/raw/fastq/1165_R2.fastq\n", |
97 | | - "../../data/raw/fastq/1166_R2.fastq\n", |
98 | | - "../../data/raw/fastq/1167_R2.fastq\n" |
99 | | - ] |
100 | | - } |
101 | | - ], |
102 | | - "source": [ |
103 | | - "%%bash\n", |
104 | | - "head -n 5 ../../data/processed/raw_fastq_file_list.txt\n", |
105 | | - "echo ' ... '\n", |
106 | | - "tail -n 5 ../../data/processed/raw_fastq_file_list.txt" |
107 | | - ] |
108 | | - }, |
109 | | - { |
110 | | - "cell_type": "markdown", |
111 | | - "metadata": {}, |
112 | | - "source": [ |
113 | | - "#### b) Run `fastqc` for quality check" |
114 | | - ] |
115 | | - }, |
116 | | - { |
117 | | - "cell_type": "markdown", |
118 | | - "metadata": {}, |
119 | | - "source": [ |
120 | | - "Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n", |
121 | | - "\n", |
122 | | - " mkdir ../../data/processed/fastqc_results/raw\n", |
123 | | - " fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt) \n", |
| 60 | + "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. The `secapr quality_check` function does this in a single command:\n", |
124 | 61 | "\n", |
125 | | - "\n", |
126 | | - "`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)." |
127 | | - ] |
128 | | - }, |
129 | | - { |
130 | | - "cell_type": "markdown", |
131 | | - "metadata": {}, |
132 | | - "source": [ |
133 | | - "#### 3. Visualize results" |
134 | | - ] |
135 | | - }, |
136 | | - { |
137 | | - "cell_type": "markdown", |
138 | | - "metadata": {}, |
139 | | - "source": [ |
140 | | - "Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n", |
141 | | - "\n", |
142 | | - " Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n", |
143 | | - "\n", |
144 | | - "This is what the quality check results look like for the uncleaned reads: " |
| 62 | + "`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`" |
145 | 63 | ] |
146 | 64 | }, |
147 | 65 | { |
148 | 66 | "cell_type": "code", |
149 | 67 | "execution_count": 20, |
150 | 68 | "metadata": { |
151 | | - "collapsed": false, |
152 | 69 | "scrolled": true |
153 | 70 | }, |
154 | 71 | "outputs": [ |
|
211 | 128 | "cell_type": "code", |
212 | 129 | "execution_count": 21, |
213 | 130 | "metadata": { |
214 | | - "collapsed": false, |
215 | 131 | "scrolled": true |
216 | 132 | }, |
217 | 133 | "outputs": [ |
|
323 | 239 | { |
324 | 240 | "cell_type": "code", |
325 | 241 | "execution_count": 16, |
326 | | - "metadata": { |
327 | | - "collapsed": false |
328 | | - }, |
| 242 | + "metadata": {}, |
329 | 243 | "outputs": [ |
330 | 244 | { |
331 | 245 | "name": "stdout", |
|
395 | 309 | "Let's run the script as in this example command:\n", |
396 | 310 | "<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n", |
397 | 311 | "\n", |
398 | | - " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n", |
| 312 | + " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n", |
399 | 313 | " \n", |
400 | 314 | "`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample." |
401 | 315 | ] |
|
405 | 319 | "metadata": {}, |
406 | 320 | "source": [ |
407 | 321 | "#### c) Check quality of the results\n", |
408 | | - "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n", |
409 | | - "\n", |
410 | | - " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n", |
411 | | - " \n", |
412 | | - " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n", |
| 322 | + "After cleaning the reads with `secapr clean_reads` using default settings, we run the quality tests again on all cleaned files, just as we did above for the raw reads.\n", |
413 | 323 | "\n", |
414 | | - " mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n", |
415 | | - " fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n", |
416 | | - " \n", |
417 | | - "Finally we run our R-script which plots the results:" |
| 324 | + "`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`" |
418 | 325 | ] |
419 | 326 | }, |
420 | 327 | { |
421 | 328 | "cell_type": "code", |
422 | 329 | "execution_count": 30, |
423 | | - "metadata": { |
424 | | - "collapsed": false |
425 | | - }, |
| 330 | + "metadata": {}, |
426 | 331 | "outputs": [ |
427 | 332 | { |
428 | 333 | "name": "stdout", |
|
480 | 385 | "source": [ |
481 | 386 | "As we see above, running the script with default settings improved the file quality, but there is still a lot of room for improvement. After reviewing the initial quality reports and trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n", |
482 | 387 | "\n", |
483 | | - " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n", |
| 388 | + " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n", |
484 | 389 | "\n", |
485 | | - "For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))." |
| 390 | + "Let's check the final quality of the data:\n", |
| 391 | + "\n", |
| 392 | + " secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings" |
486 | 393 | ] |
487 | 394 | }, |
488 | 395 | { |
489 | 396 | "cell_type": "code", |
490 | 397 | "execution_count": 7, |
491 | | - "metadata": { |
492 | | - "collapsed": false |
493 | | - }, |
| 398 | + "metadata": {}, |
494 | 399 | "outputs": [ |
495 | 400 | { |
496 | 401 | "name": "stdout", |
|
559 | 464 | "name": "python", |
560 | 465 | "nbconvert_exporter": "python", |
561 | 466 | "pygments_lexer": "ipython3", |
562 | | - "version": "3.6.0" |
| 467 | + "version": "3.6.4" |
563 | 468 | } |
564 | 469 | }, |
565 | 470 | "nbformat": 4, |
|
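The three `secapr quality_check` calls in this diff all take the same form, so the input/output pairs could be kept in one place. A minimal Python sketch, assuming only the `--input` and `--output` flags that appear in the notebook cells above. The helper name `quality_check_cmd` and the `STAGES` mapping are hypothetical and not part of `secapr`. The commands are only printed here, not executed.

```python
def quality_check_cmd(input_dir, output_dir):
    """Build the argument list for one `secapr quality_check` call (hypothetical helper)."""
    return ["secapr", "quality_check", "--input", input_dir, "--output", output_dir]


# Input/output pairs copied from the notebook cells in this diff.
STAGES = {
    "raw": (
        "../../data/raw/fastq/",
        "../../data/processed/fastqc_results/raw",
    ),
    "cleaned_default": (
        "../../data/processed/cleaned_trimmed_reads_default",
        "../../data/processed/fastqc_results/cleaned_default_settings",
    ),
    "cleaned_custom": (
        "../../data/processed/cleaned_trimmed_reads",
        "../../data/processed/fastqc_results/custom_default_settings",
    ),
}

for name, (src, dst) in STAGES.items():
    # Print the command line for each stage; run it manually in the activated secapr_env.
    print(name, "->", " ".join(quality_check_cmd(src, dst)))
```

Keeping the paths in a single mapping makes it harder for the raw, default-cleaned, and custom-cleaned reports to end up in the wrong output folder.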