Commit e31aab9

small bug fix
Author: Tobias Hofmann
1 parent 030cda2 commit e31aab9

13 files changed: +51 -628 lines changed

docs/notebook/.ipynb_checkpoints/cleaning_trimming-checkpoint.ipynb

Lines changed: 14 additions & 109 deletions
@@ -17,9 +17,7 @@
  {
  "cell_type": "code",
  "execution_count": 3,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -59,96 +57,15 @@
  "metadata": {},
  "source": [
  "### 2. Quality-check your raw (and dirty) reads\n",
- "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
- "\n",
- "#### a) Prepare text-file with file-paths\n",
- "Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
- "\n",
- "<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
- "\n",
- " for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
- " \n",
- " for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
- " \n",
- "After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "collapsed": false,
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "../../data/raw/fastq/1061_R1.fastq\n",
- "../../data/raw/fastq/1063_R1.fastq\n",
- "../../data/raw/fastq/1064_R1.fastq\n",
- "../../data/raw/fastq/1065_R1.fastq\n",
- "../../data/raw/fastq/1068_R1.fastq\n",
- " ... \n",
- "../../data/raw/fastq/1140_R2.fastq\n",
- "../../data/raw/fastq/1164_R2.fastq\n",
- "../../data/raw/fastq/1165_R2.fastq\n",
- "../../data/raw/fastq/1166_R2.fastq\n",
- "../../data/raw/fastq/1167_R2.fastq\n"
- ]
- }
- ],
- "source": [
- "%%bash\n",
- "head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
- "echo ' ... '\n",
- "tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### b) Run `fastqc` for quality check"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
- "\n",
- " mkdir ../../data/processed/fastqc_results/raw\n",
- " fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt) \n",
+ "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
  "\n",
- "\n",
- "`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### 3. Visualize results"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
- "\n",
- " Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
- "\n",
- "This is what the quality check results look like for the uncleaned reads: "
+ "`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 20,
  "metadata": {
- "collapsed": false,
  "scrolled": true
  },
  "outputs": [
@@ -211,7 +128,6 @@
  "cell_type": "code",
  "execution_count": 21,
  "metadata": {
- "collapsed": false,
  "scrolled": true
  },
  "outputs": [
@@ -323,9 +239,7 @@
  {
  "cell_type": "code",
  "execution_count": 16,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -395,7 +309,7 @@
  "Let's run the script as in this example command:\n",
  "<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
  "\n",
- " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
+ " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
  " \n",
  "`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
  ]
@@ -405,24 +319,15 @@
  "metadata": {},
  "source": [
  "#### c) Check quality of the results\n",
- "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
- "\n",
- " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
- " \n",
- " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
+ "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
  "\n",
- " mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
- " fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
- " \n",
- "Finally we run our R-script which plots the results:"
+ "`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 30,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -480,17 +385,17 @@
  "source": [
  "As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the intial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
  "\n",
- " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
+ " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
  "\n",
- "For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
+ "Let's check the final quality of the data:\n",
+ "\n",
+ " secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 7,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -559,7 +464,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.0"
+ "version": "3.6.4"
  }
  },
  "nbformat": 4,

docs/notebook/cleaning_trimming.ipynb

Lines changed: 14 additions & 109 deletions
@@ -17,9 +17,7 @@
  {
  "cell_type": "code",
  "execution_count": 3,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -59,96 +57,15 @@
  "metadata": {},
  "source": [
  "### 2. Quality-check your raw (and dirty) reads\n",
- "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
- "\n",
- "#### a) Prepare text-file with file-paths\n",
- "Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
- "\n",
- "<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
- "\n",
- " for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
- " \n",
- " for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
- " \n",
- "After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "collapsed": false,
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "../../data/raw/fastq/1061_R1.fastq\n",
- "../../data/raw/fastq/1063_R1.fastq\n",
- "../../data/raw/fastq/1064_R1.fastq\n",
- "../../data/raw/fastq/1065_R1.fastq\n",
- "../../data/raw/fastq/1068_R1.fastq\n",
- " ... \n",
- "../../data/raw/fastq/1140_R2.fastq\n",
- "../../data/raw/fastq/1164_R2.fastq\n",
- "../../data/raw/fastq/1165_R2.fastq\n",
- "../../data/raw/fastq/1166_R2.fastq\n",
- "../../data/raw/fastq/1167_R2.fastq\n"
- ]
- }
- ],
- "source": [
- "%%bash\n",
- "head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
- "echo ' ... '\n",
- "tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### b) Run `fastqc` for quality check"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
- "\n",
- " mkdir ../../data/processed/fastqc_results/raw\n",
- " fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt) \n",
+ "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
  "\n",
- "\n",
- "`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### 3. Visualize results"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
- "\n",
- " Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
- "\n",
- "This is what the quality check results look like for the uncleaned reads: "
+ "`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 20,
  "metadata": {
- "collapsed": false,
  "scrolled": true
  },
  "outputs": [
@@ -211,7 +128,6 @@
  "cell_type": "code",
  "execution_count": 21,
  "metadata": {
- "collapsed": false,
  "scrolled": true
  },
  "outputs": [
@@ -323,9 +239,7 @@
  {
  "cell_type": "code",
  "execution_count": 16,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -395,7 +309,7 @@
  "Let's run the script as in this example command:\n",
  "<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
  "\n",
- " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
+ " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
  " \n",
  "`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
  ]
@@ -405,24 +319,15 @@
  "metadata": {},
  "source": [
  "#### c) Check quality of the results\n",
- "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
- "\n",
- " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
- " \n",
- " for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
+ "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
  "\n",
- " mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
- " fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
- " \n",
- "Finally we run our R-script which plots the results:"
+ "`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 30,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -480,17 +385,17 @@
  "source": [
  "As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the intial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
  "\n",
- " secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
+ " secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
  "\n",
- "For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
+ "Let's check the final quality of the data:\n",
+ "\n",
+ " secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 7,
- "metadata": {
- "collapsed": false
- },
+ "metadata": {},
  "outputs": [
  {
  "name": "stdout",
@@ -559,7 +464,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.6.0"
+ "version": "3.6.4"
  }
  },
  "nbformat": 4,
