Context and Purpose: This document presents a shell scripting problem set for a Bioinformatics course (Programming for Bioinformatics, Module 3, AY 2019-2020). It is designed to assess students' ability to use command-line utilities for manipulating biological data.
Provided Data: The exercises are based on a `fastq.txt` file, which contains data in the standard FASTQ format, including sequence identifiers (starting with '@'), DNA sequences, a '+' separator line, and quality scores.
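As an illustration of the record layout described above, the following is a minimal sketch that labels the four lines of a single FASTQ record. The record content and the helper names `sample` and `label_lines` are invented for this example and are not taken from the course's `fastq.txt`; the sketch also assumes the common four-lines-per-record layout.

```shell
#!/bin/sh
# Hypothetical one-record FASTQ sample (NOT the course's fastq.txt).
sample() {
  printf '%s\n' '@SEQ_1' 'GATTACAGATTACA' '+' 'IIBIIIIIIIIIII'
}

# Label each line by its position within the four-line record.
label_lines() {
  sample | awk '
    NR % 4 == 1 { print "identifier: " $0 }
    NR % 4 == 2 { print "sequence:   " $0 }
    NR % 4 == 3 { print "separator:  " $0 }
    NR % 4 == 0 { print "quality:    " $0 }'
}

label_lines
```

Note that the quality line may itself contain characters such as '@' or 'B', which matters for the pattern-matching exercises.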
Exercise 1: Data Extraction and Counting with `cut`, `sort`, `tail`, `wc`:
Part (a): Students must determine the output of `cat fastq.txt | cut -c 1-3 | sort -u | tail -5 | wc -l`. This pipeline extracts the first three characters of each line, sorts them while discarding duplicates, keeps the last five distinct entries in sort order, and counts the resulting lines. The answer is therefore 5 unless the file has fewer than five distinct three-character prefixes.
Part (b): Students need to predict the output of `cat fastq.txt | cut -c 1-3 | tail -5 | sort -u | wc -l`. This variation highlights the importance of command order: because `tail -5` runs before `sort -u`, duplicates are removed only among the last five lines of the file, so the count is at most 5 and drops below 5 whenever those lines share a three-character prefix.
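The effect of swapping `tail -5` and `sort -u` can be sketched on a small made-up input. The three records below, and the helper names `sample`, `count_a`, and `count_b`, are assumptions for illustration only; the real answers depend on the contents of the course's `fastq.txt`. The `tr -d ' '` step is added only to strip the padding some `wc` implementations print.

```shell
#!/bin/sh
# Byte-wise collation so the sort order does not depend on the user's locale.
export LC_ALL=C

# Hypothetical three-record FASTQ sample (NOT the course's fastq.txt).
sample() {
  printf '%s\n' \
    '@SEQ_1' 'GATTACAGATTACA' '+' 'IIBIIIIIIIIIII' \
    '@SEQ_2' 'ACGTTGCAACGTAC' '+' '@@BFFFFHHHHHJJ' \
    '@SEQ_3' 'TTGACCGTAGGCTA' '+' '@@BDDDDFFFFHHH'
}

# Part (a): deduplicate all prefixes first, then keep the last five.
count_a() { sample | cut -c 1-3 | sort -u | tail -5 | wc -l | tr -d ' '; }

# Part (b): keep the last five lines first, then deduplicate them.
count_b() { sample | cut -c 1-3 | tail -5 | sort -u | wc -l | tr -d ' '; }

echo "a: $(count_a)"   # seven distinct prefixes, capped at 5 by tail
echo "b: $(count_b)"   # last five lines contain '@@B' twice, so 4
```

On this sample, part (a) prints 5 and part (b) prints 4, because the last five lines include two quality lines with the same prefix.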
Exercise 2: Pattern Matching and Counting with `grep` and `wc`: This question asks for the output of `cat fastq.txt | grep "^..B" | wc -l`. The `grep` command filters lines that begin with any two characters followed by the letter 'B', and `wc -l` then counts these matching lines, testing basic regular expression understanding.
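A minimal sketch of how the `^..B` pattern behaves, using an invented two-record sample rather than the course file. The helper names `sample` and `count_third_b` are hypothetical, and `tr -d ' '` only normalizes the padding some `wc` implementations add.

```shell
#!/bin/sh
# Hypothetical two-record FASTQ sample (NOT the course's fastq.txt).
sample() {
  printf '%s\n' \
    '@SEQ_1' 'GATTACAGATTACA' '+' 'IIBIIIIIIIIIII' \
    '@SEQ_2' 'ACGTTGCAACGTAC' '+' '@@BFFFFHHHHHJJ'
}

# Count lines whose third character is an uppercase 'B'.
count_third_b() { sample | grep '^..B' | wc -l | tr -d ' '; }

# Show the matching lines, then the count.
sample | grep '^..B'
echo "matches: $(count_third_b)"
```

In this sample only the two quality lines match, so the count is 2. The one-character `+` line cannot match because the pattern needs at least three characters.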
Exercise 3: Advanced Pattern Matching and Counting with `awk`: Students are tasked with predicting the output of `cat fastq.txt | awk 'BEGIN{n=0} $1 ~ /^..B/{if($1 ~ /@/) n++}END{print n}'`. This `awk` script initializes a counter `n`. For each line, it checks if the first field (`$1`) matches the pattern `^..B`. If true, it further checks if the same first field contains an '@' symbol. If both conditions are met, `n` is incremented. Finally, the script prints the total count `n`. This evaluates conditional logic and field-specific pattern matching within `awk`.
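The two-stage test in the `awk` script can be sketched on an invented sample. The data and the helper names `sample` and `count_b_and_at` are assumptions for illustration; the `awk` program itself is copied from the exercise. Since the sample lines contain no whitespace, `$1` is the whole line here.

```shell
#!/bin/sh
# Hypothetical two-record FASTQ sample (NOT the course's fastq.txt).
sample() {
  printf '%s\n' \
    '@SEQ_1' 'GATTACAGATTACA' '+' 'IIBIIIIIIIIIII' \
    '@SEQ_2' 'ACGTTGCAACGTAC' '+' '@@BFFFFHHHHHJJ'
}

# Count lines whose first field has 'B' as its third character
# AND contains '@' somewhere in that same field.
count_b_and_at() {
  sample | awk 'BEGIN{n=0} $1 ~ /^..B/{if($1 ~ /@/) n++}END{print n}'
}

echo "count: $(count_b_and_at)"
```

On this sample the script prints 1: both quality lines match `^..B`, but only `@@BFFFFHHHHHJJ` also contains '@'. The identifier lines contain '@' but fail the first test, so they are never counted.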
Exercise 4: Line Deletion and Counting with `sed` and `wc`: The final task involves predicting the output of `cat fastq.txt | sed -e '/^..B/d' | sed -e '/@/d' | wc -l`. This pipeline uses `sed` twice: first to delete lines that match `^..B` (any two characters followed by 'B' at the start of the line), and then to delete any remaining lines that contain the '@' symbol. `wc -l` then counts the lines that survived both `sed` commands, demonstrating sequential filtering and deletion.
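The sequential deletion can be sketched on an invented sample. The data and the helper names `sample`, `count_two_seds`, and `count_one_sed` are assumptions for illustration. The single-`sed` variant is an equivalent formulation shown for comparison and is not part of the exercise.

```shell
#!/bin/sh
# Hypothetical two-record FASTQ sample (NOT the course's fastq.txt).
sample() {
  printf '%s\n' \
    '@SEQ_1' 'GATTACAGATTACA' '+' 'IIBIIIIIIIIIII' \
    '@SEQ_2' 'ACGTTGCAACGTAC' '+' '@@BFFFFHHHHHJJ'
}

# Pipeline as given in the exercise: two sed processes in sequence.
count_two_seds() {
  sample | sed -e '/^..B/d' | sed -e '/@/d' | wc -l | tr -d ' '
}

# Equivalent single sed invocation with two delete commands.
count_one_sed() {
  sample | sed -e '/^..B/d' -e '/@/d' | wc -l | tr -d ' '
}

echo "two seds: $(count_two_seds)"
echo "one sed:  $(count_one_sed)"
```

On this sample the first `sed` removes the two quality lines and the second removes the two identifier lines, leaving the two sequence lines and the two `+` lines, so both variants print 4.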