diff --git a/bash_class_assignment.ipynb b/bash_class_assignment.ipynb new file mode 100644 index 0000000..233cf95 --- /dev/null +++ b/bash_class_assignment.ipynb @@ -0,0 +1,393 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# `bash` practicals\n", + "\n", + "## Directory and file structure\n", + "\n", + "Using one command, move to your home directory." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "cd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Change to your directory on the local drive (mounted from VirtualBox)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Confirm that you are in the correct location." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/malay\n" + ] + } + ], + "source": [ + "pwd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Download the data required for the class. The files are `split` into two parts because of a GitHub limitation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "wget https://github.com/cb2edu/CB2-101-BioComp/raw/2020/01-Linux_101/data/linux_data.tar.xz.partaa\n", + "wget https://github.com/cb2edu/CB2-101-BioComp/raw/2020/01-Linux_101/data/linux_data.tar.xz.partab" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check that there are two files in the directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ls" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Join the parts."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cat linux_data.tar.xz.parta* > linux_data.tar.xz\n", + "ls" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We don't need the parts anymore." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rm linux_data.tar.xz.parta*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Decompress and extract the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "tar -xvJf linux_data.tar.xz" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using one command, list the contents of the `reference_data` directory that is within the `linux_data` directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a new folder in `linux_data` called `selected_fastq`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Copy `Irrel_kd_2.subset.fq` and `Mov10_oe_2.subset.fq` from `raw_fastq` to the `linux_data/selected_fastq` folder." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rename the `selected_fastq` folder to `exercise1`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wildcards" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Do each of the following using a single ls command without navigating to a different
directory.\n", + "\n", + "1. List all of the files in /bin that start with the letter 'c'\n", + "2. List all of the files in /bin that contain the letter 'a'\n", + "3. List all of the files in /bin that end with the letter 'o'\n", + "4. BONUS: Use one command to list all of the files in /bin that contain either 'a' or 'c'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## History\n", + "\n", + "1. Based on the output of the history command, how many commands have you typed so far?\n", + "2. Use the up arrow key to check the command you used before the history command. What is it? Does it make sense?\n", + "3. Type several random characters at the command prompt. Can you bring the cursor to the start with Ctrl + A? Next, can you bring the cursor to the end with Ctrl + E? Finally, what happens when you use Ctrl + C?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Files\n", + "\n", + "**Do the following in the terminal**\n", + "\n", + "1. Change directories into genomics_data. You can do this using a full or relative path.\n", + "2. Use the less command to open up the file Encode-hesc-Nanog.bed.\n", + "3. Search for the string chr11; you'll see all instances in the file highlighted.\n", + "4. Staying in the less buffer, use the shortcut to get to the end of the file.\n", + "5. Exit the less buffer and come back to the command prompt.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Searching files\n", + "\n", + "1. Using the `find` command, search for the sequence file `Mov10_oe_1.subset.fq`.\n", + "2. Search for the sequence CTCAATGAGCCA in Mov10_oe_1.subset.fq. How many sequences do you find?\n", + "3.
If you want to search for that sequence in **all** Mov10 replicate fastq files, what command would you use?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Searching and redirection\n", + "\n", + "Using chr1-hg19_genes.gtf, how many unique exons are present on chromosome 1?\n", + "\n", + "1. Extract only the genomic coordinates of exon features\n", + "2. Subset the dataset to keep only the genomic coordinates\n", + "3. Remove duplicate exons\n", + "4. Count the total number of exons" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Shell scripts\n", + "\n", + "1. Write a script named `listing.sh`. Add the command that prints the contents of the file `Mov10_rnaseq_metadata.txt` to the screen.\n", + "2. Add an echo statement before the command that tells the user \"This is information about the files in our dataset:\"\n", + "3. Run the new script. Report the contents of the new script and the output you got after running it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bash variables\n", + "\n", + "1. Use the `$file` variable as input to the head and tail commands, and modify the arguments to display only four lines from any file.\n", + "2. Create a new variable called meta and assign it the value Mov10_rnaseq_metadata.txt. For the following questions, use the `$meta` variable but do not change directories. Provide the code you would run to:\n", + " \n", + " a. Display the contents of the file using cat.\n", + " b. Retrieve only the lines that contain normal samples.
(Hint: use grep)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " \n", + "## Filename\n", + "\n", + "1. How would you modify the basename command above to return only Mov10_oe_1 from the filename `Mov10_oe_1.subset.fq`?\n", + "2. Use basename with the file Irrel_kd_1.subset.fq as input. Return only Irrel_kd_1 to the terminal." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `for` loop\n", + "\n", + "Write a loop to print out the number of lines in each fastq file in the dataset. The output should look something like this:\n", + "```\n", + " Irrel_kd_1.subset.fq 891684\n", + " Irrel_kd_2.subset.fq 767072\n", + " Irrel_kd_3.subset.fq 586196\n", + " Mov10_oe_1.subset.fq 1223600\n", + " Mov10_oe_2.subset.fq 1110016\n", + " Mov10_oe_3.subset.fq 690816\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Bash", + "language": "bash", + "name": "bash" + }, + "language_info": { + "codemirror_mode": "shell", + "file_extension": ".sh", + "mimetype": "text/x-sh", + "name": "bash" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}