Essential Skills for Bioinformatics: Unix/Linux

Essential Skills for Bioinformatics: Unix/Linux · 2017-07-14 · writing pipelines. Bash allows your scripts to directly work with command-line programs without requiring any overhead


SHELL SCRIPTING

Overview

• Bash, the shell we have used interactively in this course, is a full-fledged scripting language. Unlike Python, Bash is not a general-purpose language.

• Bash is explicitly designed to make running and interfacing command-line programs as simple as possible. For this reason, Bash often takes the role of the glue language of bioinformatics, as it’s used to glue many commands together into a cohesive workflow.

Overview

• Note that Python is a more suitable language for commonly reused or advanced pipelines. Python is a more modern, fully featured scripting language than Bash.

• Compared to Python, Bash lacks several nice features useful for data-processing scripts: better numeric type support, useful data structures, better string processing, refined option parsing, availability of a large number of libraries, and powerful functions that help with structuring your programs.

• However, there’s more overhead when calling command-line programs from a Python script compared to Bash. Bash is often the best and quickest “glue” solution.

Writing and running bash scripts

• Most Bash scripts in bioinformatics are simply commands organized into a re-runnable script, with some features to check that files exist and to ensure that any error causes the script to abort.

• We will learn the basics of writing and executing Bash scripts, paying particular attention to how to create robust Bash scripts.

A robust Bash header

• By convention, Bash scripts have the extension .sh. You can create them in your favorite text editor (e.g. emacs or vi).

• Anytime you write a Bash script, you should use the following Bash script header, which sets some Bash options that lead to more robust scripts.

#!/bin/bash
set -e
set -u
set -o pipefail

A robust Bash header

• #!/bin/bash: This is called the shebang, and it indicates the path to the interpreter used to execute this script.

• set -e: By default, a shell script containing a command that fails will not cause the entire shell script to exit: the shell script will just continue on to the next line. We always want errors to be loud and noticeable. This option changes that default by terminating the script if any command exits with a nonzero exit status.

A robust Bash header

Note that set -e ignores nonzero statuses in if conditionals. Also, it ignores all exit statuses in Unix pipes except the last one.

• set -u: This option fixes another default behavior of Bash scripts: any command containing a reference to an unset variable name will still run. It prevents this type of error by aborting the script if a variable’s value is unset.

A robust Bash header

• set -o pipefail: set -e will cause a script to abort if a nonzero exit status is encountered, with some exceptions. One such exception is when a program run in a Unix pipe exits unsuccessfully. Including set -o pipefail prevents this undesirable behavior: any program that returns a nonzero exit status in the pipe will cause the entire pipe to return a nonzero status. With set -e enabled, this will lead the script to abort.
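
The effect of pipefail on a pipe's exit status can be checked with a small sketch. It uses only the standard false and true commands; the variable names are made up for illustration.

```bash
#!/bin/bash
# Without pipefail, a pipe's exit status is that of its last command.
set +o pipefail
false | true
status_default=$?          # 0: the failure of `false` is hidden

# With pipefail, any failing command makes the whole pipe fail.
set -o pipefail
if false | true
then
    status_pipefail=0
else
    status_pipefail=$?     # 1: the failure of `false` is reported
fi

echo "default: ${status_default}, pipefail: ${status_pipefail}"
# prints "default: 0, pipefail: 1"
```

The second pipe is placed inside an if conditional so that the sketch keeps running even when set -e is active.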

Running bash scripts

• Bash scripts can be run in one of two ways:

1. bash script.sh
2. ./script.sh

• While we can run any script with bash, calling the script as an executable requires that it has executable permissions. We can set these using:

chmod u+x script.sh

• This adds executable permissions for the user who owns the file. Then, the script can be run with ./script.sh.
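
Both ways of running a script can be tried end to end with the following sketch. The script name hello.sh and its message are hypothetical, and the script is written to a temporary directory so nothing in the working directory is touched.

```bash
#!/bin/bash
# Write a tiny script into a temporary directory (hypothetical name hello.sh).
tmp_dir=$(mktemp -d)
cat > "${tmp_dir}/hello.sh" <<'EOF'
#!/bin/bash
set -e
set -u
set -o pipefail
echo "hello from a script"
EOF

# Way 1: pass the script to bash; no executable permission is needed.
out_bash=$(bash "${tmp_dir}/hello.sh")

# Way 2: make it executable, then call it by its path.
chmod u+x "${tmp_dir}/hello.sh"
out_exec=$("${tmp_dir}/hello.sh")

echo "${out_bash}"
echo "${out_exec}"
rm -r "${tmp_dir}"
```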

Variables

• Processing pipelines have numerous settings that should be stored in variables. Storing these settings in variables defined at the top of the file makes adjusting settings and rerunning your pipelines much easier.

• Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value.

• Bash also reads command-line arguments into variables.

Variables

• Bash’s variables don’t have data types. It’s helpful to think of Bash’s variables as strings.

• We can create a variable and assign it a value with:

results_dir="results/"

• Note that spaces matter when setting Bash variables. Do not use spaces around the equal sign.

Variables

• To access a variable’s value, we use a dollar sign in front of the variable’s name.

• Suppose we want to create a directory for a sample’s alignment data, called <sample>_aln/, where <sample> is replaced by the sample’s name.

sample="CNTRL01A"
mkdir "${sample}_aln/"
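
The braces in ${sample}_aln/ matter because an underscore is a valid character in a variable name. The sketch below, which only builds strings and creates no directories, shows what happens with and without them; the names dir_name and wrong_name are made up for illustration.

```bash
#!/bin/bash
set -u
sample="CNTRL01A"

# With braces, Bash knows the variable name ends at "sample".
dir_name="${sample}_aln/"
echo "${dir_name}"         # prints "CNTRL01A_aln/"

# Without braces, Bash would look for a variable named "sample_aln",
# which is unset. The ${var:-} form substitutes an empty string here,
# so that set -u does not abort this demonstration.
wrong_name="${sample_aln:-}/"
echo "${wrong_name}"       # prints "/"
```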

Command-line arguments

• The variable $0 stores the name of the script, and command-line arguments are assigned to the variables $1, $2, $3, etc. Bash assigns the number of command-line arguments to $#.
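
A minimal sketch of these variables follows. So that it runs without being called with arguments, it uses the set -- builtin to fill in the positional parameters, standing in for a call such as "script.sh reads.fastq stats.txt"; both filenames are hypothetical.

```bash
#!/bin/bash
# `set --` replaces the positional parameters, as if the script had been
# called with these two (hypothetical) arguments.
set -- reads.fastq stats.txt

echo "script name: $0"
echo "number of arguments: $#"
echo "first argument: $1"
echo "second argument: $2"

n_args=$#
first_arg=$1
```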

Command-line arguments

• If you find your script requires numerous or complicated options, it might be easier to use Python instead of Bash. Python’s argparse module is much easier to use.

• Variables created in your Bash script will only be available for the duration of the Bash process running that script.

if statement

• Bash supports the standard if conditional statement. The basic syntax is:

if [commands]
then
    [if-statements]
else
    [else-statements]
fi

if statement

• A command’s exit status provides the true and false. Remember that 0 represents true/success and anything else is false/failure.

• if [commands]: [commands] could be any command, set of commands, pipeline, or test condition. If the exit status of these commands is 0, execution continues to the block after then. Otherwise execution continues to the block after else.

if statement

• [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0).

• [else-statements] is a placeholder for all statements executed if [commands] evaluates to false. The else block is optional.

if statement

• Bash is primarily designed to stitch together other commands. This is an advantage Bash has over Python when writing pipelines. Bash allows your scripts to directly work with command-line programs without requiring any overhead to call programs.

• Although it can be unpleasant to write complicated programs in Bash, writing simple programs is exceedingly easy because Unix tools and Bash harmonize well.

if statement

• Suppose we wanted to run a set of commands only if a file contains a certain string. Because grep returns 0 only if it matches a pattern in a file and 1 otherwise, we can use grep itself as the condition of an if statement.

The redirection is to tidy the output of this script such that grep’s output is redirected to /dev/null and not to the script’s standard out.
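
A sketch of this pattern, under the assumption of a small tab-separated input file created on the spot (the file contents and the pattern "gene_b" are made up):

```bash
#!/bin/bash
# Hypothetical input file, created here so the sketch is self-contained.
tmp_file=$(mktemp)
printf 'chr1\tgene_a\nchr2\tgene_b\n' > "${tmp_file}"

# grep's exit status is the condition; its matched lines go to /dev/null.
if grep "gene_b" "${tmp_file}" > /dev/null
then
    found="yes"
else
    found="no"
fi
echo "pattern found: ${found}"    # prints "pattern found: yes"
rm "${tmp_file}"
```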

test

• Like other programs, test exits with either 0 or 1. However test’s exit status indicates the return value of the test specified through its arguments, rather than exit success or error. test supports numerous standard comparison operators.

test

String/integer      Description
-z str              String str is null (has zero length)
str1 = str2         str1 and str2 are identical
str1 != str2        str1 and str2 are different
int1 -eq int2       Integers int1 and int2 are equal
int1 -ne int2       int1 and int2 are not equal
int1 -lt int2       int1 is less than int2
int1 -gt int2       int1 is greater than int2
int1 -le int2       int1 is less than or equal to int2
int1 -ge int2       int1 is greater than or equal to int2
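
Because test reports its answer only through its exit status, a quick way to explore these operators is to inspect $? after each call. The values compared below are arbitrary examples.

```bash
#!/bin/bash
# test prints nothing; its answer is its exit status (0 = true, 1 = false).
test "ATG" = "ATG"
str_equal=$?

# String comparison is case-sensitive, so this test is false; capturing
# the status after || keeps the sketch running even under set -e.
test "ATG" = "atg" || str_differ=$?

test 5 -lt 10
int_less=$?

test -z ""
is_empty=$?

echo "${str_equal} ${str_differ} ${int_less} ${is_empty}"
# prints "0 1 0 0"
```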

test

• In practice, the most common conditions you’ll be checking are to see if files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations.

test

File/directory expression   Description
-d dir                      dir is a directory
-f file                     file is a regular file
-e file                     file exists
-r file                     file is readable
-w file                     file is writable
-x file                     file is executable
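
The file tests can be tried on a temporary directory, as in the sketch below. The filenames reads.fastq and missing.txt are hypothetical.

```bash
#!/bin/bash
tmp_dir=$(mktemp -d)
touch "${tmp_dir}/reads.fastq"    # hypothetical, empty input file

if test -d "${tmp_dir}"
then
    is_dir="yes"
else
    is_dir="no"
fi

# -f is true only for regular files, so a directory fails this test.
if test -f "${tmp_dir}"
then
    dir_is_file="yes"
else
    dir_is_file="no"
fi

# -e is false for a path that does not exist.
if test -e "${tmp_dir}/missing.txt"
then
    missing_exists="yes"
else
    missing_exists="no"
fi

echo "${is_dir} ${dir_is_file} ${missing_exists}"   # prints "yes no no"
rm -r "${tmp_dir}"
```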

test

• Combining test with if statements is simple:

if test -f some_file.txt
then
    […]
fi

• Bash provides a simpler syntactic alternative:

if [ -f some_file.txt ]
then
    […]
fi

• Note the spaces around and within the brackets: these are required.

test

• When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, and ! as negation. Our familiar && and || operators won’t work inside the brackets, because these are shell operators.

if [ "$#" -ne 1 -o ! -r "$1" ]
then
    echo "usage: script.sh file_in.txt"
fi

for loop

• In bioinformatics, most of our data is split across multiple files. At the heart of any processing pipeline is some way to apply the same workflow to each of these files, taking care to keep track of sample names. Looping over files with Bash’s for loop is the simplest way to accomplish this.

• There are three essential parts to creating a pipeline to process a set of files:

1. Selecting which files to apply the commands to
2. Looping over the data and applying the commands
3. Keeping track of the names of any output files created

for loop

• Suppose we have a file called samples.txt that tells you basic information about your raw data: sample name, read pair, and where the file is.

for loop

• Suppose we want to loop over every file, gather quality statistics on each and every file, and save this information to an output file.

• First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually using:
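
A minimal sketch of creating and inspecting an array by hand follows. The filenames are hypothetical placeholders, not taken from the original slides.

```bash
#!/bin/bash
# Hypothetical filenames, listed by hand inside parentheses.
sample_files=(zmaysA_R1.fastq zmaysA_R2.fastq zmaysB_R1.fastq)

echo "${sample_files[0]}"      # first element (indexing starts at 0)
echo "${sample_files[@]}"      # all elements
echo "${#sample_files[@]}"     # number of elements

n_files=${#sample_files[@]}
```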

for loop

• But creating Bash arrays by hand is tedious and error prone. The beauty of Bash is that we can use a command substitution to construct Bash arrays.

• We can strip the path and extension from each filename using basename.
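
Both ideas can be sketched as follows. The layout of samples.txt (tab-separated, with the file path in the third column) and its contents are assumptions made for illustration.

```bash
#!/bin/bash
tmp_dir=$(mktemp -d)
# Hypothetical samples.txt: sample name, read pair, file path (tab-separated).
printf 'zmaysA\tR1\tseq/zmaysA_R1.fastq\n' >  "${tmp_dir}/samples.txt"
printf 'zmaysA\tR2\tseq/zmaysA_R2.fastq\n' >> "${tmp_dir}/samples.txt"

# Command substitution: the third column becomes the array's elements.
# This relies on word splitting, so it assumes paths without spaces.
sample_files=($(cut -f 3 "${tmp_dir}/samples.txt"))
echo "${sample_files[@]}"

# basename drops the directory; a second argument also drops that suffix.
sample_name=$(basename "${sample_files[0]}" ".fastq")
echo "${sample_name}"          # prints "zmaysA_R1"
rm -r "${tmp_dir}"
```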

for loop
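
A loop over an array of files might look like the sketch below. It is an illustration under stated assumptions, not the original slide's script: the two FASTQ files are made up and written to a temporary directory, and wc -l stands in for a real quality-statistics program.

```bash
#!/bin/bash
set -e
set -u
set -o pipefail

# Hypothetical input files, created here so the sketch is self-contained.
tmp_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "${tmp_dir}/zmaysA_R1.fastq"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "${tmp_dir}/zmaysB_R1.fastq"
sample_files=("${tmp_dir}/zmaysA_R1.fastq" "${tmp_dir}/zmaysB_R1.fastq")

for fastq_file in "${sample_files[@]}"
do
    # Derive the output name from the input name to keep track of samples.
    results_file="${tmp_dir}/$(basename "${fastq_file}" .fastq)-stats.txt"
    # wc -l stands in for a real quality-statistics program.
    wc -l < "${fastq_file}" > "${results_file}"
done

created=$(ls "${tmp_dir}")
echo "${created}"
rm -r "${tmp_dir}"
```

Quoting "${sample_files[@]}" keeps each array element intact as one loop item, even if a path contains spaces.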

Learning Unix

• https://www.codecademy.com/learn/learn-the-command-line
• http://swcarpentry.github.io/shell-novice/
• http://korflab.ucdavis.edu/bootcamp.html
• http://korflab.ucdavis.edu/Unix_and_Perl/current.html
• https://www.learnenough.com/command-line-tutorial
• http://cli.learncodethehardway.org/book/
• https://learnxinyminutes.com/docs/bash/
• http://explainshell.com/

Sequence Alignments

DATA FORMATS

Overview

• Nucleotide (and protein) sequences are stored in two plain-text formats widespread in bioinformatics: FASTA and FASTQ.

• We will discuss each format and its limitations, and then see some tools for working with data in these formats.

FASTA

• The FASTA format originates from the FASTA alignment suite, created by William R. Pearson and David J. Lipman. The FASTA format is used to store any sort of sequence data that does not require per-base quality scores.

• This includes: reference genome files, protein sequences, coding DNA sequences (CDS), transcript sequences, and so on.

FASTA

• FASTA files are composed of sequence entries, each containing two parts: a description and the sequence data.

• The description line begins with a greater-than symbol (>) and contains the sequence identifier and other optional information.

• The sequence data begins on the next line after the description, and continues until there’s another description line.

FASTA

• An example FASTA file:

• The FASTA format’s simplicity and flexibility come with an unfortunate downside: the FASTA format is a loosely defined ad hoc format.
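
The entry structure described above can be illustrated with a small sketch. The two records are made up; the sketch writes them to a temporary file and counts entries by counting description lines.

```bash
#!/bin/bash
tmp_file=$(mktemp)
# Two made-up records: a ">" description line followed by sequence lines.
cat > "${tmp_file}" <<'EOF'
>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2
TTGACA
EOF

# Each record has exactly one description line, so counting the lines
# that start with ">" counts the records.
n_records=$(grep -c '^>' "${tmp_file}")
echo "${n_records}"            # prints "2"
rm "${tmp_file}"
```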

FASTA

• In general, the following rules should be observed:

1. Sequence lines should not be too long. While a FASTA file that contains the sequence of the entire human chromosome 1 on a single line is a valid FASTA file, most tools that run on such a file would fail.

2. Some tools may accept data containing alphabets beyond those that they know how to deal with. For example, the standard alphabet for nucleotides would contain ATGC. An extended alphabet may also contain:
   1) N: A, T, G, or C
   2) W: A or T
   Search the web for “IUPAC nucleotides” to get a list of all such symbols.

FASTA

3. The sequence lines should always be the same width, with the exception of the last line. Some tools will fail to operate correctly and may not even warn the users if this condition is not satisfied. The following is technically a valid FASTA but it may cause various problems:

It should be reformatted to:
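
One way to rewrap a FASTA file to a fixed width is sketched below with awk. The record and the width of 10 are made-up illustration values, not the example from the original slide; the sketch joins each record's sequence lines and prints them back in equal-width chunks.

```bash
#!/bin/bash
tmp_file=$(mktemp)
# Made-up record with uneven sequence line widths.
cat > "${tmp_file}" <<'EOF'
>seq1
ACGTAC
GTACGTACGTACGT
GT
EOF

rewrapped=$(awk -v width=10 '
    function flush_seq() {
        # Print the accumulated sequence in chunks of "width" characters.
        while (length(seq) > width) {
            print substr(seq, 1, width)
            seq = substr(seq, width + 1)
        }
        if (length(seq) > 0) print seq
        seq = ""
    }
    /^>/ { flush_seq(); print; next }   # description line: emit pending sequence first
    { seq = seq $0 }                    # sequence line: accumulate
    END { flush_seq() }
' "${tmp_file}")

echo "${rewrapped}"
rm "${tmp_file}"
```

For this input the output is the description line followed by ACGTACGTAC, GTACGTACGT, and GT, so only the last sequence line is shorter than the rest.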

FASTA

4. Use upper-case letters. Whereas both lower-case and upper-case letters are allowed by the specification, the different capitalization may carry additional meaning, and some tools and methods will operate differently when encountering upper- or lower-case letters. Some communities (e.g. Ensembl) chose to designate all repeats and low-complexity regions with lower-case letters.
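
If the lower-case annotation is not needed, sequence lines can be converted to upper case while leaving description lines alone, as in this sketch. The record is made up, and note that the conversion discards whatever the lower-case letters were meant to mark.

```bash
#!/bin/bash
# Made-up record with lower-case bases in the sequence line.
upper=$(printf '>seq1 example\nacgtACGTnn\n' |
    awk '/^>/ { print; next } { print toupper($0) }')
echo "${upper}"
```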

FASTQ

• The FASTQ format extends FASTA by including a numeric quality score for each base in the sequence. The FASTQ format is widely used to store high-throughput sequencing data, which is reported with a per-base quality score indicating the confidence of each base call.

• It is the de facto standard by which all sequencing instruments represent data.

FASTQ

• The FASTQ format looks like:

• Line 1: The description line beginning with @. This contains the record identifier and other information.

• Line 2: Sequence data, which can be on one or many lines.

• Line 3: The line beginning with + indicates the end of the sequence.

• Line 4: Quality data, which can also be on one or many lines, but must be the same length as the sequence. Each numeric base quality is encoded with ASCII characters.
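
The four parts of a record, and the ASCII encoding of qualities, can be sketched as follows. The record is made up, and two assumptions are labeled in the comments: the file is unwrapped (exactly four lines per record), and qualities use the common Phred+33 offset, which the text above does not specify.

```bash
#!/bin/bash
tmp_file=$(mktemp)
# Made-up single record: description, sequence, "+" line, qualities.
cat > "${tmp_file}" <<'EOF'
@read1 hypothetical example record
ACGT
+
II5!
EOF

# Assumption: four lines per record, which holds only for unwrapped files.
n_lines=$(wc -l < "${tmp_file}")
n_reads=$(( n_lines / 4 ))

# Decode the first quality character. Assumption: Phred+33 encoding.
qual_line=$(sed -n '4p' "${tmp_file}")
first_char=${qual_line:0:1}
ascii_code=$(printf '%d' "'${first_char}")   # "I" has ASCII code 73
phred=$(( ascii_code - 33 ))

echo "reads: ${n_reads}, first base quality: ${phred}"
# prints "reads: 1, first base quality: 40"
rm "${tmp_file}"
```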

FASTQ

• The FASTQ format is a multi-line format just as the FASTA format is. In the early days of high-throughput sequencing, instruments always produced the entire FASTQ sequence on a single line.

• The FASTQ format suffers from the unexpected flaw that the @ sign is both a FASTQ record separator and a valid value of the quality string. For that reason it is a little more difficult to design a correct FASTQ parsing program.
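
The ambiguity is easy to reproduce. In the made-up record below the quality string happens to start with @, so counting lines that begin with @ overcounts the records. The position-based count assumes an unwrapped file with exactly four lines per record.

```bash
#!/bin/bash
tmp_file=$(mktemp)
# Made-up record whose quality line starts with "@".
cat > "${tmp_file}" <<'EOF'
@read1
ACGT
+
@III
EOF

# Naive count: every line starting with "@", including the quality line.
naive_count=$(grep -c '^@' "${tmp_file}")

# Assumption: four lines per record, so only lines 1, 5, 9, ... are headers.
header_count=$(awk 'NR % 4 == 1 && /^@/ { n++ } END { print n + 0 }' "${tmp_file}")

echo "naive: ${naive_count}, by position: ${header_count}"
# prints "naive: 2, by position: 1"
rm "${tmp_file}"
```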