AntonelliLab
diff --git a/docs/notebook/.ipynb_checkpoints/cleaning_trimming-checkpoint.ipynb b/docs/notebook/.ipynb_checkpoints/cleaning_trimming-checkpoint.ipynb
Lines changed: 14 additions & 109 deletions
diff --git a/docs/notebook/cleaning_trimming.ipynb b/docs/notebook/cleaning_trimming.ipynb
Lines changed: 14 additions & 109 deletions
@@ -17,9 +17,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -59,96 +57,15 @@
    "metadata": {},
    "source": [
     "### 2. Quality-check your raw (and dirty) reads\n",
-    "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
-    "\n",
-    "#### a) Prepare text-file with file-paths\n",
-    "Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
-    "\n",
-    "<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
-    "\n",
-    "    for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
-    "    \n",
-    "    for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
-    "    \n",
-    "After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "collapsed": false,
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "../../data/raw/fastq/1061_R1.fastq\n",
-      "../../data/raw/fastq/1063_R1.fastq\n",
-      "../../data/raw/fastq/1064_R1.fastq\n",
-      "../../data/raw/fastq/1065_R1.fastq\n",
-      "../../data/raw/fastq/1068_R1.fastq\n",
-      "     ...     \n",
-      "../../data/raw/fastq/1140_R2.fastq\n",
-      "../../data/raw/fastq/1164_R2.fastq\n",
-      "../../data/raw/fastq/1165_R2.fastq\n",
-      "../../data/raw/fastq/1166_R2.fastq\n",
-      "../../data/raw/fastq/1167_R2.fastq\n"
-     ]
-    }
-   ],
-   "source": [
-    "%%bash\n",
-    "head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
-    "echo '     ...     '\n",
-    "tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### b) Run `fastqc` for quality check"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
-    "\n",
-    "    mkdir ../../data/processed/fastqc_results/raw\n",
-    "    fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt)    \n",
+    "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
     "\n",
-    "\n",
-    "`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 3. Visualize results"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
-    "\n",
-    "    Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
-    "\n",
-    "This is what the quality check results look like for the uncleaned reads: "
+    "`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 20,
    "metadata": {
-    "collapsed": false,
     "scrolled": true
    },
    "outputs": [
@@ -211,7 +128,6 @@
    "cell_type": "code",
    "execution_count": 21,
    "metadata": {
-    "collapsed": false,
     "scrolled": true
    },
    "outputs": [
@@ -323,9 +239,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -395,7 +309,7 @@
     "Let's run the script as in this example command:\n",
     "<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
     "\n",
-    "    secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
+    "    secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
     "    \n",
     "`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
    ]
@@ -405,24 +319,15 @@
    "metadata": {},
    "source": [
     "#### c) Check quality of the results\n",
-    "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
-    "\n",
-    "    for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
-    "    \n",
-    "    for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
+    "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
     "\n",
-    "    mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
-    "    fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
-    "    \n",
-    "Finally we run our R-script which plots the results:"
+    "`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 30,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -480,17 +385,17 @@
    "source": [
     "As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the initial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
     "\n",
-    "    secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
+    "    secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
     "\n",
-    "For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
+    "Let's check the final quality of the data:\n",
+    "\n",
+    "    secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -559,7 +464,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.0"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
 
@@ -17,9 +17,7 @@
   {
    "cell_type": "code",
    "execution_count": 3,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -59,96 +57,15 @@
    "metadata": {},
    "source": [
     "### 2. Quality-check your raw (and dirty) reads\n",
-    "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. I'll show you how:\n",
-    "\n",
-    "#### a) Prepare text-file with file-paths\n",
-    "Prepare a file that contains the file paths (absolute paths or relative to your work_dir) of all cleaned fastq-files of interest (you can do it manually or use the following commands, after inserting the correct path to your output folder in the bash for-loop)\n",
-    "\n",
-    "<div class=\"alert alert-block alert-info\">**Adjust path:** Replace `../../data/raw/fastq/*/` with the path to your folder containing the raw reads</div>\n",
-    "\n",
-    "    for dir in ../../data/raw/fastq/*R1.fastq; do echo $dir; done > ../../data/processed/raw_fastq_file_list.txt\n",
-    "    \n",
-    "    for dir in ../../data/raw/fastq/*R2.fastq; do echo $dir; done >> ../../data/processed/raw_fastq_file_list.txt\n",
-    "    \n",
-    "After successfully running these for-loops, the resulting file `raw_fastq_file_list.txt` should contain the path to all samples that you want to quality check:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
-   "metadata": {
-    "collapsed": false,
-    "scrolled": true
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "../../data/raw/fastq/1061_R1.fastq\n",
-      "../../data/raw/fastq/1063_R1.fastq\n",
-      "../../data/raw/fastq/1064_R1.fastq\n",
-      "../../data/raw/fastq/1065_R1.fastq\n",
-      "../../data/raw/fastq/1068_R1.fastq\n",
-      "     ...     \n",
-      "../../data/raw/fastq/1140_R2.fastq\n",
-      "../../data/raw/fastq/1164_R2.fastq\n",
-      "../../data/raw/fastq/1165_R2.fastq\n",
-      "../../data/raw/fastq/1166_R2.fastq\n",
-      "../../data/raw/fastq/1167_R2.fastq\n"
-     ]
-    }
-   ],
-   "source": [
-    "%%bash\n",
-    "head -n 5 ../../data/processed/raw_fastq_file_list.txt\n",
-    "echo '     ...     '\n",
-    "tail -n 5 ../../data/processed/raw_fastq_file_list.txt"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### b) Run `fastqc` for quality check"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now run `fastqc` for all fastq files in order to produce a quality check for all these samples. Make sure to create the output directory manually before running the command (otherwise `fastqc` will return an error).\n",
-    "\n",
-    "    mkdir ../../data/processed/fastqc_results/raw\n",
-    "    fastqc -o ../../data/processed/fastqc_results/raw -f fastq $(cat ../../data/processed/raw_fastq_file_list.txt)    \n",
+    "To convince yourself that the raw reads are not fit for further processing, it is a good idea to first run some quality tests on the raw fastq files. This is easy and straightforward with the `secapr quality_check` function:\n",
     "\n",
-    "\n",
-    "`fastqc` produces two output files per sample: one zip-archive and one .html file. The easiest way to look at the test results of a specific file is to open the html file in your favorite html-reader (e.g. Firefox, Safari, etc.)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### 3. Visualize results"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Since it is somewhat cumbersome to look at all test results for all samples by manually checking all html files, we provide an R-script (in the `src/` folder of the `secapr` GitHub project) which produces a graphical overview of the test results of all samples. This makes it easier to see which samples passed which tests (rather than having to go open each individual report).\n",
-    "\n",
-    "    Rscript ../../src/fastqc_visualization.r -i ../../data/processed/fastqc_results/raw -o ../../data/processed/fastqc_results/raw/summary_all_samples_raw.pdf\n",
-    "\n",
-    "This is what the quality check results look like for the uncleaned reads: "
+    "`secapr quality_check --input ../../data/raw/fastq/ --output ../../data/processed/fastqc_results/raw`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 20,
    "metadata": {
-    "collapsed": false,
     "scrolled": true
    },
    "outputs": [
@@ -211,7 +128,6 @@
    "cell_type": "code",
    "execution_count": 21,
    "metadata": {
-    "collapsed": false,
     "scrolled": true
    },
    "outputs": [
@@ -323,9 +239,7 @@
   {
    "cell_type": "code",
    "execution_count": 16,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -395,7 +309,7 @@
     "Let's run the script as in this example command:\n",
     "<div class=\"alert alert-block alert-warning\">**Please check:** Is `secapr_env` activated? You can test with `conda info --envs`. Activate the correct environment with `source activate secapr_env`</div>\n",
     "\n",
-    "    secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads_default --index single\n",
+    "    secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads_default --index single\n",
     "    \n",
     "`secapr clean_reads` produces a subfolder for each sample in the output directory, containing the cleaned reads for the respective sample."
    ]
@@ -405,24 +319,15 @@
    "metadata": {},
    "source": [
     "#### c) Check quality of the results\n",
-    "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. Therefore we create a text file with the file paths to all cleaned fastq files and run `fastqc` by providing this list of files. Note that the `secapr clean_reads` function named all forward-read files with the tag '_READ1_' and all backward-read files with '_READ2_'.\n",
-    "\n",
-    "    for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ1.fastq; do echo $dir; done > ../../data/processed/fastq_file_list.txt\n",
-    "    \n",
-    "    for dir in ../../data/processed/cleaned_trimmed_reads_default/*/*READ2.fastq; do echo $dir; done >> ../../data/processed/fastq_file_list.txt\n",
+    "After cleaning the reads with `secapr clean_reads` with default settings we again perform the quality tests on all cleaned files, just as we did above for the raw reads. \n",
     "\n",
-    "    mkdir ../../data/processed/fastqc_results/cleaned_default_settings\n",
-    "    fastqc -o ../../data/processed/fastqc_results/cleaned_default_settings -f fastq $(cat ../../data/processed/fastq_file_list.txt)\n",
-    "    \n",
-    "Finally we run our R-script which plots the results:"
+    "`secapr quality_check --input ../../data/processed/cleaned_trimmed_reads_default --output ../../data/processed/fastqc_results/cleaned_default_settings`"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 30,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -480,17 +385,17 @@
    "source": [
     "As we see above, running the script with default settings improved the file quality but there is a lot of room for improvement. After reviewing the initial quality reports and after trying a bunch of different flags and values, I ended up with this command for the example data. See the script documentation for more information about the different flags (`secapr clean_reads -h`).\n",
     "\n",
-    "    secapr clean_reads --input data/raw/fastq/ --config data/raw/adapter_info.txt --output data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
+    "    secapr clean_reads --input ../../data/raw/fastq/ --config ../../data/raw/adapter_info.txt --output ../../data/processed/cleaned_trimmed_reads --index single --simpleClipThreshold 5 --palindromeClipThreshold 20 --seedMismatches 5 --headCrop 10\n",
     "\n",
-    "For producing the plots you repeat the commands from above (3. Clean reads with `secapr` (default settings))."
+    "Let's check the final quality of the data:\n",
+    "\n",
+    "    secapr quality_check --input ../../data/processed/cleaned_trimmed_reads --output ../../data/processed/fastqc_results/custom_default_settings"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 7,
-   "metadata": {
-    "collapsed": false
-   },
+   "metadata": {},
    "outputs": [
     {
      "name": "stdout",
@@ -559,7 +464,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.6.0"
+   "version": "3.6.4"
   }
  },
  "nbformat": 4,
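
For reviewers comparing the old and new workflow: the removed cells built a list of `*R1.fastq` and `*R2.fastq` paths with two bash for-loops and then passed that list to `fastqc -o <dir> -f fastq`. The sketch below reproduces only those manual steps in Python, so it is easier to see what the single `secapr quality_check` call replaces. It is not taken from the `secapr` source and makes no claim about how `secapr quality_check` works internally; the function names `list_fastq_files` and `build_fastqc_command` are hypothetical, and the command is only assembled, not executed.

```python
from pathlib import Path
import tempfile


def list_fastq_files(fastq_dir):
    """Return all *R1.fastq paths followed by all *R2.fastq paths.

    Mirrors the two removed bash for-loops: R1 files are listed first,
    then R2 files are appended. Each group is sorted by file name.
    (Hypothetical helper, not part of secapr.)
    """
    fastq_dir = Path(fastq_dir)
    r1_files = sorted(fastq_dir.glob("*R1.fastq"))
    r2_files = sorted(fastq_dir.glob("*R2.fastq"))
    return [str(path) for path in r1_files + r2_files]


def build_fastqc_command(file_list, out_dir):
    """Assemble the fastqc call from the removed cell as an argument list.

    Uses only the flags shown in the old notebook text: -o and -f fastq.
    The command is returned, not run. (Hypothetical helper.)
    """
    return ["fastqc", "-o", str(out_dir), "-f", "fastq", *file_list]


if __name__ == "__main__":
    # Demonstrate on empty placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1061_R1.fastq", "1061_R2.fastq", "1063_R1.fastq"):
            (Path(tmp) / name).touch()
        files = list_fastq_files(tmp)
        print(build_fastqc_command(files, Path(tmp) / "fastqc_results"))
```

As the removed text noted, `fastqc` expects the output directory to exist already, so a real run of the assembled command would need that directory created first.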