Biol 59500-033 - Practical Biocomputing 1
Biology 59500-033 – Practical Biocomputing
Michael Gribskov
Hock 331
x46933
Introduction
Goals
• Basic skills in acquiring, transforming, and handling data
• Understand what is hard and what is easy to do with computers
• Basic introduction to good programming practices
• Not…
○ A bioinformatics course per se
○ Designed for professional programmers
Introduction
Course Wiki
• Because auditors do not have easy access to Blackboard, I use a Wiki
for course materials.
• Go to https://wiki.itap.purdue.edu/display/wl49402201720/spring-2017-biol-59500-033+Home
Introduction
Survey - To help me target the course to your needs
Fill this in on the class roster on the wiki to confirm that you can
access and edit the wiki: https://wiki.itap.purdue.edu/display/wl49402201720/Student+Roster
• Name
• What kind of computer do you use (Mac, Windows, UNIX, Linux etc)
• What is your programming background – None, or language and
level of expertise
• What is your major and/or area of interest; this could include a
short description of a problem that made you want to take this
course. If you are an auditor (not taking the course for credit) please
note it in this section.
Introduction
Overall Schedule
• 6-8 weeks Perl Programming
○ Basics of Computers and Programming
○ Basics of Perl
○ Regular expressions and text processing
○ Writing user agent or robot scripts
• 3 weeks Databases
• 3 - 4 weeks Putting it all together / Advanced topics
Introduction
Effort / Work Required
• Weekly Problem Sets
• Sporadic Quizzes
• One or Two Midterm Exams
• Final Project
○ A working website or script that combines data manipulation and
computing to do something novel
Introduction
Texts
• Online texts
○ Safari – SAMS Teach Yourself Perl in 24 Hours, 3rd ed
− Available on Purdue Safari,
http://proquestcombo.safaribooksonline.com/
○ Paper texts (many available, not required)
Introduction
Additional Online Texts
• I've posted some additional texts gathered from the internet on the
Wiki. If you are averse to buying a book, have a look at these and see
what you think.
Introduction
Week 1 Goals
• Get Perl up and running somewhere you have ready access
○ Your personal PC
○ Lab computer
○ Computer lab
○ Genomics Computing Facility
• Write some simple scripts that actually do something
Introduction
Reading for this week
• Perl in 24 Hours (P24h)
○ Wednesday 1/11
− Hour 1 - Installing perl etc.
− Hour 2 - Variables and operators
○ Friday 1/13
− Hour 3 - Program flow
• Programming Perl (PP)
○ Wednesday 1/11
− Ch 5 - Creating and running a perl program
− Ch 6 - Perl variables, pg 23-29
− Ch 7 – Operators, pg 43-50
○ Friday 1/13
− Ch 8 – Conditional constructs
Introduction
Where to Run Perl
• Versions: Perl 5.8-5.18 are OK, earlier versions are not
• Options
○ Install on your own PC (instructions posted on the Wiki)
○ Install on your lab computer
○ Use Genomics Computing Facility Unix computers
− Get account/password from instructor
○ Use ITAP labs or RCAC clusters
Introduction
History
• PERL was invented by Larry Wall in 1987 as Practical Extraction and
Report Language
• Gained Popularity with the advent of the world wide web
○ Perl 4 – 1991
○ Perl 5 – 1994, current version 5.22
○ Perl 6 is essentially a different language
• Widely used for CGI scripts, and processing electronic documents
○ "the Swiss Army chainsaw of programming languages"
○ The Duct tape of the Internet
○ Most common scripting language used in genomics/bioinformatics
Introduction
Basics of Perl
• Perl is an interpreted language. A Perl script can run on any
computer with a Perl interpreter. Python is another interpreted
language; Java is similar in that its compiled bytecode runs on a
virtual machine.
• Perl replaces many “shell” scripting tools used in UNIX such as sed
and awk
• Designed to be practical and useful rather than elegant, minimal, or
as an example of a theoretical philosophy
• Perl maxims
○ TMTOWTDI (Tim Toady) – There’s more than one way to do it
○ DWIM - Do what I mean or the "principle of least astonishment"
○ “What is the sound of Perl? Is it not the sound of a wall that people have
stopped banging their heads against?” – Larry Wall
Introduction
Perl Pros and Cons
• Pros
○ Powerful
○ Expressive
○ Easy and flexible to use
○ String and list processing
○ Automatic memory management
• Cons
○ Perl can be ugly (punctuation resembles cartoon cursing)
○ Perl can be excessively complex and compact, leading to unreadable
code
○ Perl is not efficient at highly mathematical operations
○ Many computer scientists look down on Perl
Am I Cut Out for Programming?
Programming Skills
• Logical thinking
• Detail oriented
• Able to solve problems based on incorrect results
• Laboratory biologists are natural programmers!
Biological Protocol
Reagents
• SOB (Super Optimal Broth)
○ 2% w/v bacto-tryptone
○ 0.5% w/v yeast extract
○ 10 mM NaCl
○ 2.5 mM KCl
• WB (Washing Buffer)
○ 10% redistilled glycerol (v/v)
○ 90% distilled water
○ chilled to 4°C
Protocol
• Use a fresh colony of DH5α to inoculate 5 ml of SOB
• Grow cells with vigorous aeration overnight at 37°C.
• Dilute 2.5 ml of cells into 250 ml of SOB in a 1 liter flask.
• Grow with vigorous aeration at 37°C until the cells reach an OD550 = 0.8.
• Harvest cells by centrifugation at 5000 RPM in a GSA rotor for 10 min.
• Repeat 2X
○ Resuspend the cell pellet in 250 ml of WB.
○ Centrifuge the cell suspension at 5,000 RPM for 15 min .
○ Carefully pour off the supernatant as soon as the rotor stops.
○ Cells washed in WB do not pellet well. If the supernatant is turbid, increase the centrifugation time and repeat step 6.
• Resuspend the cell pellet in 1 ml WB.
• Cells can be used immediately or frozen in 0.2 ml aliquots at -70°C.
• The Reagents list gives the Definitions
• "Repeat 2X" is a loop: repeat for a certain time, or number of times, or until a condition is met
• "If the supernatant is turbid" introduces conditional or alternative steps
• The last step gives the final result (output)
What is a program (script)
A detailed set of instructions for how to do something
• Definitions/symbols
○ pi = 3.1416
• Actions
○ Area = pi x radius²
• Loops
○ Repeat 2 times
○ Repeat until …
• Conditional
○ If (something) do (something)
Computers are like very hard working but very stupid lab helpers
• Instructions must be exact – computers are quite happy to do the wrong
thing over and over
• All possible alternatives must be covered – when undefined situations occur computers either
○ Do the wrong thing
○ Stop and wait (forever)
○ Fail catastrophically (shoot themselves)
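The four ingredients above (definitions, actions, loops, conditionals) can be sketched in a few lines of Perl. This is only an illustration: the names $washes and $supernatant_is_turbid are made up for the example, and use strict, use warnings, and my are good-practice additions that the slides have not introduced yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

# Definition: give a name to a value
my $pi = 3.1416;

# Action: compute something from the definitions
my $radius = 2;
my $area   = $pi * $radius**2;
print "area = $area\n";

# Loop: repeat a step a fixed number of times
my $washes = 0;
while ( $washes < 2 ) {
    $washes = $washes + 1;
    print "wash number $washes\n";
}

# Conditional: only act if a test is true (0 = false, 1 = true)
my $supernatant_is_turbid = 0;
if ( $supernatant_is_turbid ) {
    print "increase the centrifugation time\n";
}
```

Changing $supernatant_is_turbid to 1 makes the last message appear, which is the same kind of alternative step as in the protocol.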
A More Detailed Protocol
Preparation of E. coli cells for electroporation
1. Use a fresh colony of DH5α (or other appropriate host strain) to inoculate 5 ml of SOB (without magnesium) medium in a 50 ml sterile conical tube. Grow cells with vigorous aeration overnight at 37°C.
2. Dilute 2.5 ml of cells into 250 ml of SOB (without magnesium) in a 1 liter flask. Grow for 2 to 3 hours with vigorous aeration at 37°C until the cells reach an OD550 = 0.8.
3. Harvest cells by centrifugation at 5000 RPM in a GSA rotor for 10 min in sterile centrifuge bottles. (Make sure you use autoclaved bottles!).
4. Wash the cell pellet in 250 ml of ice-cold WB as follows. First, add a small amount of WB to cell pellet; pipet up and down or gently vortex until cells are resuspended. Then fill centrifuge bottle with ice cold WB and gently mix. NOTE-the absolute volume of WB added at this point is not important.
5. Centrifuge the cell suspension at 5,000 RPM for 15 min and carefully pour off the supernatant as soon as the rotor stops. Cells washed in WB do not pellet well. If the supernatant is turbid, increase the centrifugation time.
6. Wash the cell pellet a second time by resuspending in 250 ml of sterile ice-cold WB using the same technique described above. Centrifuge the cell suspension at 5000 RPM for 15 min.
7. Gently pour off the supernatant leaving a small amount of WB in the bottom of the bottle. Resuspend the cell pellet in the WB - no additional WB needs to be added – and the final volume should be about 1 ml. Cells can be used immediately or can be frozen in 0.2 ml aliquots in freezer vials using a dry ice-ethanol bath. Store frozen cells at -70°C.
Introduction
Virtues of a programmer …
• Laziness - Makes you write labor-saving programs that other people
will find useful, and document what you wrote so you don't have to
answer so many questions about it.
• Impatience - The anger you feel when the computer is being lazy.
This makes you write programs that don't just react to your needs,
but actually anticipate them. Or at least pretend to.
• Hubris - Excessive pride. Also the quality that makes you write (and
maintain) programs that other people won't want to say bad things
about.
… In contrast to Cowboy Programming
• galloping off on one's own without a prior plan (the runaway one-liner)
• unnecessarily dense, unreadable code (False Hubris)
• reinventing the wheel unnecessarily (False Impatience)
• brute-force programming (False Laziness)
Computer Architecture
[Diagram: CPU connected to Memory (cache and main memory) and to Input/Output (keyboard, display, disk, USB, network). Storage speed labels run from fast storage and slow storage to even slower storage and excruciatingly slow storage.]
Computer Architecture
[Diagram: the same architecture, with the Operating System added]
• Perl Program
○ Stored on disk
○ Executed by perl interpreter
• Perl Interpreter
○ Reads perl script and carries out instructions
Introduction
Basics
• Terminology
• Getting information into and back from a program
• Arithmetic
• Doing things repeatedly (looping)
• Making decisions (true or false?)
Introduction
Basics
• Variables – Storage – named so that values can be assigned and
accessed by symbols
• Operators – perform various operations on variables (e.g., + - * = )
• Expressions – A piece of Perl code that evaluates to a result (don’t
worry for now exactly what this means)
• Statements – variables + operators
• Functions – segments of programs that are reused. Predefined
functions look like parts of the language
• Programs / Scripts – series of statements
• Algorithms – Not the main focus of this course. Abstract
descriptions of how to accomplish a task – implemented as
programs
Basics of Programming
Simple Perl Programs
• Comments begin with #
• Every statement ends with a semicolon ;
• Comments are very important because they provide a place to
explain what the program does
# this is a comment
# this is a test program
print "testing";
Basics of Programming
Simple Perl Programs
• To get started, a few informal elements. We’ll come back to these
formally later
• Simple variable names begin with $ (technically called scalar
variables)
• A simple program
$one = 1;
$one = $one + 1;
# a simple adding program
$one = 1;
$two = 2;
$sum = $one + $two;
Basics of Programming
Simple Perl Programs
• Input operator <>
○ Reads a line of input
• Print function
○ Output may not appear on the screen until you print a newline
○ use \n to generate a newline
print "any text inside quotes\n";
$in = <>;
Basics of Programming
Simple Perl Programs
• While loop
○ while executes a block of code (inside the curly braces) until the
condition in the parentheses becomes false (1=true, 0=false).
while ( 1 ) {
    # anything here repeats forever
}
# program echo:
# repeat terminal input back to the display
while ( 1 ) {
    $in = <>;
    print $in;
}
Basics of Programming
Simple Perl Programs
• Where does <> read from?
• If there is nothing else on the command line, input is from the
keyboard
% echo.pl
% perl echo.pl
• If a file is provided, input is from the file (you will see nothing unless
you have the script print)
% echo.pl file.txt
% echo.pl <file.txt
% perl echo.pl file.txt
# program echo
# repeat terminal input back to the display
while ( $in = <> ) {
    print $in;
}
Basics of Programming
Simple Perl Programs
• Read numbers from a file and sum
> sample1.pl data1.in
> perl sample1.pl data1.in
# program sample1.pl
# Calculate the sum of a list of numbers in a file
while ( $in = <> ) {
    print $in;
    $sum = $sum + $in;
    $n_values = $n_values + 1;
}
print "There are $n_values values in the file\n";
print "The sum is $sum.\n";
data1.in (partial)
3
4.5
-1
6
7.3
8.11
Basics of Programming
Simple Perl Programs
• Conditional operators – test whether something is true or false
• Simple conditional tests
○ == tests whether two numbers are the same
○ eq tests whether two strings are the same
if ( some_condition ) {
    # this executes if true
}
$one = 1;
$two = 2;
$two == $one + $one; # true
$two == $one; # false
"me" eq "you"; # false
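On their own, the comparisons above do nothing visible; they become useful inside an if. The following is a small sketch of that combination. The counter $n_true is a made-up name for this example, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $one    = 1;
my $two    = 2;
my $n_true = 0;    # hypothetical counter of tests that were true

if ( $two == $one + $one ) {
    print "two equals one plus one\n";
    $n_true = $n_true + 1;
}
if ( $two == $one ) {
    print "this line is never printed\n";
    $n_true = $n_true + 1;
}
if ( "me" eq "you" ) {
    print "this line is never printed either\n";
    $n_true = $n_true + 1;
}
print "$n_true of the 3 tests were true\n";    # prints: 1 of the 3 tests were true
```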
Basics of Programming
Simple Perl Programs
• Read numbers from a file and sum
Ignore values of -1 (maybe this is the way I mark missing values)
# program sample1.pl
# Calculate the sum of a list of numbers in a file,
# skip any -1 values
while ( $in = <> ) {
    if ( $in != -1 ) {
        print $in;
        $sum = $sum + $in;
        $n_values = $n_values + 1;
    }
}
print "There are $n_values values in the file\n";
print "The sum is $sum.\n";
Before next class
• Identify a location where you can run perl
○ Already installed on your personal or lab computer
○ ITAP or RCAC lab
○ Download and install on your own computer (instructions on wiki)
• Check the version and make sure it is ≥ 5.8
○ type perl -v to find out the version
• Verify that you can create a file containing a perl script and run it
(see next page for suggestions)
○ perl test.pl should work everywhere
○ test.pl should work if configured to recognize .pl suffix
○ Windows – use a command window
○ Mac – use a terminal window
• It is essential that you confirm that you are able to write and run perl
scripts as soon as possible. If you have difficulty it is usually easy
to solve – don't just assume it will work, actually try it
Before next class
• Try the following to make sure you understand.
Write a script that …
○ prints out a message, the traditional message is "hello world" but you
may prefer something else such as "Mr. Watson come here I want you."
○ prints the square of each number you enter on the keyboard
○ prints the numbers in the Fibonacci series up to a value specified in a
variable
− a little harder – read the variable from the keyboard
○ calculates the sum of a series of numbers you enter from the keyboard
− hmm, how to make it stop and give the answer
• This is a learn by jumping in the deep end experience. All of these
examples are fairly easy, but may cause you some difficulty if this is
your first programming experience.
• Don't spend too long on these (remember, Perl is the sound of
people not beating their heads against the wall), but do make a list of
questions about what you don't understand. We will discuss these
in class.
January 11
Today – Hour 2
• Scalar variables
• Numeric and assignment operators
• Conditional operators
• Operator precedence
• Style
Friday
• Logical expressions
• Logical Statements
• Looping
• Lists
• Reading for Friday
○ P24H
− Hour 3 - Controlling the Program's Flow
○ PP
− ch 8: 57-73 - Conditional Constructs
Basic Perl
Scripts / Programs
• A perl script or program is a series of statements
• A statement is made up of variables and operators
○ Variables – a symbolic name that refers to a value stored in memory
perl scalars always begin with $ (sometimes called a sigil)
○ Operator – an action that is used to modify a variable
○ Statements are normally terminated by a ; (semicolon)
○ The term expression is sometimes used to refer to a fragment of code
that evaluates to a result
Variable    Memory
$x          22
$pi         3.14
$letter     a
$vowel      aeiou
$label      this is a string
$x = $pi * $r**2;
A simple statement with three variables and three operators
Basic Perl
Expressions
• Expressions are fragments of code that evaluate to a result
• Expressions are often used with the assignment operator (=) to
assign values to variables, e.g.,
○ $x = $y + 1; # y + 1 is an expression
• Understanding the logical (true/false) value of expressions is critical
to making decisions and using loops with comparisons such as
○ < (less than)
○ > (greater than)
○ == && etc.
• Because Perl does not distinguish much between strings (text) and
numbers, context is important in determining the logical value
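As a rough illustration of how Perl decides whether a value is true or false: the number 0, the string "0", and the empty string are false, and most other values are true. The sketch below collects the names of the tests that pass; all variable names are invented for the example, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $count       = 0;        # the number zero is false
my $empty       = "";       # the empty string is false
my $zero_string = "0";      # the string "0" is also false
my $text        = "0.0";    # any other non-empty string is true
my $report      = "";       # collects the names of the true tests

if ( $count )       { $report .= "count "; }
if ( $empty )       { $report .= "empty "; }
if ( $zero_string ) { $report .= "zero_string "; }
if ( $text )        { $report .= "text "; }
if ( 3 < 5 )        { $report .= "3<5 "; }

print "true: $report\n";    # prints: true: text 3<5
```

The string "0.0" is true even though it equals zero as a number, which is one reason context matters.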
Basic Perl
Scalar Variables – Numbers and Strings
• Scalar variables correspond to single numbers or strings of
characters (as opposed to arrays, matrices or other collections)
○ Numbers – unlike some languages, Perl does not distinguish between
integer, floating point, and scientific notation
○ Strings – a piece of text; one or more letters including spaces,
punctuation, digits, and nonprinting characters such as tabs, returns,
form feeds, etc.
$x = 12000;
$y = 12000.00;
$z = 1.2e4;
$name = "Gribskov";
$alphabet = "abcdefghijklmnopqrstuvwxyz";
$space = " ";
$nothing = "";
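A quick way to convince yourself that the three ways of writing 12000 give the same number is to compare and print them. This is a minimal sketch; the counter $n_equal is a made-up name, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $x = 12000;       # integer notation
my $y = 12000.00;    # floating point notation
my $z = 1.2e4;       # scientific notation

my $n_equal = 0;     # hypothetical counter of comparisons that were true
if ( $x == $y ) { $n_equal = $n_equal + 1; }
if ( $y == $z ) { $n_equal = $n_equal + 1; }

print "$n_equal comparisons were true\n";    # prints: 2 comparisons were true
print "$x $y $z\n";                          # prints: 12000 12000 12000
```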
Basic Perl
Scalar Variables - Strings
• Strings – a piece of text
• Double quotes or single quotes?
○ Double quotes are interpolating quotes, any Perl variables in the string
are replaced by their values. Perl variables are detected by their sigils
($, @, %)
○ Single quotes are non-interpolating quotes, apparent Perl variables are
retained exactly as written
$x = 12;
$name = "adam";
$name = "adam-$x"; # value of $name is adam-12
$name = 'adam-$x'; # value of $name is adam-$x
Basic Perl
Scalar Variables – Numeric operators
• For convenience we can refer to an expression with an operator as
having a left hand side (lhs) and a right hand side (rhs)
○ $x + 1: with respect to +, the lhs is $x and the rhs is 1
○ $y = $x + 1: with respect to =, the lhs is $y and the rhs is $x + 1
• Arithmetic operators
+ plus
- minus
/ divide
* multiply
** exponentiate (not ^)
% modulo (remainder) 7 % 2 is 1
• Assignment operators
= assignment, $x = 1
+= increment, $x += 12 same as $x = $x + 12
-= decrement, $x -= 12 same as $x = $x - 12
++ autoincrement, $x++ same as $x = $x + 1
-- autodecrement, $x-- same as $x = $x - 1
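The arithmetic operators in the table can be tried out directly. The sketch below applies each one to 7 and 2; the variable names are chosen just for this example, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $sum        = 7 + 2;     # 9
my $difference = 7 - 2;     # 5
my $product    = 7 * 2;     # 14
my $quotient   = 7 / 2;     # 3.5, division is not rounded to an integer
my $power      = 7 ** 2;    # 49, note ** rather than ^
my $remainder  = 7 % 2;     # 1

print "$sum $difference $product $quotient $power $remainder\n";
# prints: 9 5 14 3.5 49 1
```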
Basic Perl
Assignment Operators
• Store a new value for a variable
○ a new value is placed in memory, destroying any previously stored
value
• Newbies sometimes find statements like the following confusing
$x = $x + 1;
Assignment does the following
1. Evaluate the rhs: 22 + 1 is 23
2. Replace the current value of the lhs with the new value
Variable    Memory (before)     Memory (after)
$x          22                  23
$pi         3.14                3.14
$letter     a                   a
$vowel      aeiou               aeiou
$label      this is a string    this is a string
Basic Perl
Assignment Operators – += and -=
• Shorthand for common assignments involving addition and
subtraction
○ $x = $x + 4; is the same as $x += 4;
○ $y = $y - 7; is the same as $y -= 7;
○ no space between + and =, or - and =
• Also *=, /=, **= etc by the same principle
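To see the "same principle" at work, the sketch below applies several shorthand assignment operators in a row to one variable. The starting value is arbitrary, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $x = 3;
$x *= 4;     # same as $x = $x * 4;   $x is now 12
$x /= 2;     # same as $x = $x / 2;   $x is now 6
$x **= 2;    # same as $x = $x ** 2;  $x is now 36
$x %= 5;     # same as $x = $x % 5;   $x is now 1

print "x = $x\n";    # prints: x = 1
```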
Basic Perl
Scalar Variables
• Autoincrement and autodecrement (++ and --)
○ $x++; and $x--; are shorthand for $x = $x + 1; and $x = $x - 1;
○ position of the operator before or after the variable determines whether
increment/decrement happens before or after the value of the variable is
used in the rest of the statement.
THESE CONSTRUCTIONS ARE ERROR PRONE – AVOID THEM
$x = 12;
print ++$x;
print "\n";
print $x++;
print "\n";
print --$x;
print "\n";
print $x--;
print "\n";
$y = $x++; # easy to miss the increment
$y = ++$x; # more cryptic
while ( $x++ < 10 )
while ( ++$x < 10 ) # are these the same?
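One way to answer the question about the two while loops is to count how many times each loop body runs. This is a sketch with made-up counter names; use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $x      = 12;
my $before = ++$x;    # $x becomes 13 first, then 13 is stored in $before
my $after  = $x++;    # 13 is stored in $after first, then $x becomes 14
print "$before $after $x\n";    # prints: 13 13 14

# postfix: the old value is compared, then $i is incremented
my $i           = 0;
my $post_passes = 0;
while ( $i++ < 10 ) {
    $post_passes = $post_passes + 1;
}

# prefix: $j is incremented, then the new value is compared
my $j          = 0;
my $pre_passes = 0;
while ( ++$j < 10 ) {
    $pre_passes = $pre_passes + 1;
}

print "postfix: $post_passes passes, prefix: $pre_passes passes\n";
# prints: postfix: 10 passes, prefix: 9 passes
```

The two loops are not the same, which is a good illustration of why these constructions are error prone.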
January 13
Today
• Logical expressions
• Logical Statements
• Looping
• Lists
Next week
• Lists (arrays)
• Hashes
• Files
• Text processing
• Subroutines
Reading for today
• P24H
○ Hour 3 - Controlling the Program's Flow
○ Hour 4 - Stacking Building Blocks: Lists and Arrays
• PP
○ ch 6: 25-29 - Arrays
○ ch 8: 49-59 -Conditional Constructs
Homework 1
Why homework?
• Programming is like learning an instrument, you must practice or the concepts will just slip out of your memory
• Homework is posted on the Wiki each week
• There is no TA for this class so some standards are needed to make reviewing programs feasible
○ Homework will be graded primarily based on whether it produces the output it is supposed to. An example input and output will be provided for you to use in testing your code; a different input will usually be used for grading
○ Style and format count
○ I can only spend a limited amount of time figuring out why your program does not work, you must make sure it runs before submitting it.
○ Contact me if you can't make your script run after a reasonable amount of effort. Many times a script will not work because of some trivial typographical or syntactic problem. I can often find these quickly because I have made most of these mistakes myself in the past.
○ You can also try asking for help on the wiki. Likewise, feel free to answer questions on the wiki if you know the answer
Homework 1
Homework?
• Email completed homework to [email protected] with the
subject "biol59500 homework X" where X is replaced with the actual
homework number, e.g., "biol59500 homework 1".
• Include your script file as an attachment, not in the body of the mail.
Name the attached file with your last name and the homework
number, e.g., "huang_hw1", or "smith_hw17".
• The attached file should be able to be run as a script – it should have
no extraneous non-code content.
• Try to not embed filenames or paths in your script that will only work
on your computer. If you use these, your script will fail when I test
it.
Homework 1
fastq - sequencing read file
• 4 lines per read
○ ID (begins with @)
○ Sequence
○ + separator
○ quality
• How many reads?
• What is the average length of a read?
Basic Perl
Scalar Variables
• String operators
. (period) concatenation
.= append
x repeat; $a x $b repeats $a, $b times
index($a,$b) integer offset of string $b in string $a
substr($s,$o,$l) get substring $l long beginning at $o in string $s
length($s) length of string $s
$month = "Jan ";
$day = "twentieth ";
$year = "2006";
$month_day = $month.$day; # "Jan twentieth "
$full_date = $month; # "Jan "
$full_date .= $day; # "Jan twentieth "
$full_date .= $year; # "Jan twentieth 2006"
$six_a = "a" x 6; # $six_a is "aaaaaa"
$strlen = length($full_date); # strlen is 18
$date = substr($full_date,0,13); # date is "Jan twentieth"
Basic Perl
Scalar Variables
• substr( expression, offset [, length] )
○ Return the part of expression that begins at offset and is length
characters long (to the end of the string if length is not specified)
• index( string, substring [, position] )
○ Return the offset of the first occurrence of substring in string at or after
position (or offset zero if not specified)
○ use index for getting the offset
• length( string )
○ use for finding length of a string
$a = "you are the one";
$b = substr( $a, 4, 3 ); # "are"
$a = "you are the one";
$offset = index( $a, "are" );
$b = substr( $a, $offset, 3 ); # "are"
$a = "you are the one";
$length = length( $a ); # length is 15
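The optional arguments are easiest to understand by trying them. The sketch below uses the same phrase as above, stored in a variable named $phrase (chosen for this example); use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $phrase = "you are the one";

my $first_e  = index( $phrase, "e" );                  # 6, the e in "are"
my $second_e = index( $phrase, "e", $first_e + 1 );    # 10, the e in "the"
my $missing  = index( $phrase, "z" );                  # -1 means "not found"
my $tail     = substr( $phrase, 8 );                   # no length: "the one"

print "$first_e $second_e $missing\n";    # prints: 6 10 -1
print "$tail\n";                          # prints: the one
```

Offsets count from zero, so the first character of the string is at offset 0.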
Basic Perl
Scalar Variables
• Special characters in strings
\n newline or line feed
\r return
\t tab
\f form feed
\b backspace
• What if you need to use " or ' or \ in a string
○ Must be marked with the \ symbol to tell the Perl interpreter they are not
quotes around strings
− use \" \' \\
○ Commonly called "escaping"
• Special characters you probably won't use
\c control (ctrl)
\u force next character uppercase
\l force next character lowercase
\U force following characters to uppercase
\L force following characters to lowercase
\E end \U or \L
print "\"Don\'t do it\"\n"; # prints "Don't do it"
print ""Don't do it"\n"; # error
Bareword found where operator expected at aa.pl line 1, near """Don't"
(Missing operator before Don't?)
String found where operator expected at aa.pl line 1, near "do it"\n""
(Do you need to predeclare do?)
syntax error at aa.pl line 1, near """Don't "
Execution of aa.pl aborted due to compilation errors.
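A few of the escapes listed above in action. This is a sketch only; the strings and variable names are invented for the example, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $name = "gribskov";

my $shout   = "\U$name\E!";            # \U ... \E uppercases what lies between
my $capital = "\u$name";               # \u uppercases only the next character
my $path    = "C:\\perl\\bin";         # each \\ becomes a single backslash
my $quote   = "She said \"hello\"";    # \" is a literal double quote
my $single  = 'It\'s here';            # \' works inside single quotes

print "$shout\n";      # prints: GRIBSKOV!
print "$capital\n";    # prints: Gribskov
print "$path\n";       # prints: C:\perl\bin
print "$quote\n";      # prints: She said "hello"
print "$single\n";     # prints: It's here
```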
Basic Perl
Scalar Variables
• Perl tries to do the right thing, even when you are syntactically
wrong (DWIM – do what I mean)
$a = "aaa";
$b = "bbb";
$ab = $a + $b; # operands are strings not numbers
print "ab = $ab\n"; # the numeric value of a string is zero
$one = "1";
$two = "2";
$three = $one + $two; # operands are strings that are integers
print "three = $three\n";
$one = "1"; # mixed characters and numbers
$two = 2;
$three = $one . $two; # concatenate strings
print "three = $three\n";
$three += 4;
print "three plus 4 = $three\n";
$one_a = $one + $a;
print "one_a = $one_a\n";
$a_three = $a . $three;
print "a_three = $a_three\n";
Output:
ab = 0
three = 3
three = 12
three plus 4 = 16
one_a = 1
a_three = aaa16
Basic Perl
Conditional (Logical) Operators
• Test whether something is true or false
○ Simple conditional tests
− == tests whether two numbers are the same
− eq tests whether two strings are the same
if ( some_condition ) {
    # this executes if true
}
$one = 1;
$two = 2;
$two == $one + $one; # true
$two == $one; # false
"me" eq "you"; # false
"me" == "you"; # true, 0 == 0
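Choosing between the numeric and string comparisons matters even when the strings look like numbers. The sketch below compares the same values both ways; the flag names are made up for the example, and use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $x = "1.0";
my $y = "1";

my $same_number = 0;
my $same_string = 0;
if ( $x == $y ) { $same_number = 1; }    # compared as numbers: 1 == 1
if ( $x eq $y ) { $same_string = 1; }    # compared as text: "1.0" is not "1"
print "same number: $same_number, same string: $same_string\n";
# prints: same number: 1, same string: 0

my $larger_number = 0;
my $larger_string = 0;
if ( "10" > "9" )  { $larger_number = 1; }    # 10 is greater than 9
if ( "10" gt "9" ) { $larger_string = 1; }    # as text, "1" sorts before "9"
print "larger number: $larger_number, larger string: $larger_string\n";
# prints: larger number: 1, larger string: 0
```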
Basic Perl
Operator Precedence
• When a statement has more than one operator, there are rules that
define which operation is done first. This is called operator
precedence. Just as in basic algebra, the use of parentheses
overrules the default rules.
Associativity      Operators         Precedence
non-associative    ++ --             highest (applied first)
right              **
left               * / % x
left               + - .
non-associative    == eq
right              = += -= *= etc    lowest (last)
• Associativity
○ non-associative – applies only to the immediate operand
○ left – operations carried out left to right
○ right – operation carried out right to left
Basic Perl
Operator Precedence
• Examples
○ x + y + z = (x + y) + z
○ x + y * z = x + (y * z)
○ x**2 + y**2 * 4 = x² + 4y²
○ x += 2 > y + z = x += (2 > (y + z))
• Take home lesson: operator precedence is hard to remember
If in doubt, DON'T RELY ON IT. USE PARENTHESES
Associativity Operators
non ++ --
right **
left * / % x
left + - .
non < > <= >= gt lt ge le
non == != <=> eq ne cmp
right = += -= *= etc
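The effect of precedence and associativity can be checked by computing the same expression with and without parentheses. This is a small sketch with arbitrary values; use strict, use warnings, and my are additions the slides have not covered yet.

```perl
use strict;      # assumption: not yet covered in the slides
use warnings;

my $x = 2;
my $y = 3;
my $z = 4;

my $default = $x + $y * $z;           # * before +: 2 + 12 = 14
my $forced  = ( $x + $y ) * $z;       # parentheses first: 5 * 4 = 20
my $tower   = 2 ** 3 ** 2;            # ** is right associative: 2 ** 9 = 512
my $grouped = ( 2 ** 3 ) ** 2;        # forced left to right: 8 ** 2 = 64
my $squares = $x**2 + $y**2 * 4;      # 4 + ( 9 * 4 ) = 40

print "$default $forced $tower $grouped $squares\n";
# prints: 14 20 512 64 40
```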
Style
What we want to avoid
• Programming style is supposed to
○ Save time and effort reusing your programs
○ Make your program easy to read
○ Make it easy to figure out what it does (validate)
○ Help prevent mistakes
○ Help find mistakes (debug)
• Style is partially convention and partially esthetics
$_='\=*Sxw!jds@j$.jl.dt#Rw%^dcn"K1x(=Bl1nwl!\*1enab^h"F=!J$h%fhcq',
tr&J-ZA-Ij-za-i&A-Za-z&&s&\(&logic&&&s&\*&un&g&s&=&al&g&s&\^&it&g&&
s&%&st&g&&s&\$&ber&g&s&\#&\n&&s&"& of&g,s&([A-Z])& $1&g&&s&\\u&U&&&
s&!&es, &g&s&\\a&A&&s&1&i&g&&print" $_\n";sub liminal{"use perl!";}
Style
General Rules
• Use space to emphasize parts of code
○ Use white space (spaces, tabs, etc) and blank lines to emphasize
segments that go together
while ( $count < 3 ) {
    # do some stuff
}
is easier to read than
while ($count<3){#do some stuff
}
− Use space inside braces {} and parentheses ()
− Use space around operators: +, -, *, /, =, ==, eq, etc
• Break lines at a page width (usually about 80 characters)
○ Why? So you can print it out if you want to
Style
General
• Use indentation to indicate blocks of code
• Align braces so it's easy to see where a block begins and ends
○ Preferred style (Kernighan and Ritchie “snuggled” style)
○ Other styles
while ( $forever ) {    # Kernighan and Ritchie “snuggled” style
    $text = <>;         # read one line from terminal
    print $text;
}
while ( $forever )      # GNU style
  {
    $text = <>;         # read one line from terminal
    print $text;
  }
while ( $forever )      # BSD style
{
    $text = <>;         # read one line from terminal
    print $text;
}
Style
Variable names
• Use mnemonic variable names
○ Why? Makes it much easier to understand what the program does
○ $height, $weight, $age, $sex instead of $x1, $x2, $x3, $x4
• Develop some consistent style for capitalization etc.
○ Why? So you don’t get confused
○ Perl is case sensitive, $quaint is not the same as $Quaint
• Use underlines and capitalization to improve readability
○ Improves naturalness of code – improves your ability to
understand and remember what it does
○ $number_of_lines is easier to read than $numberoflines; either is better than $nl
○ Alternative style: numberOfLines (I prefer this for function names)
• Do not make variable names counterintuitive or misleading
○ Pretty obvious, don’t say $done when you mean $not_done
    while ( $not_done ) {
        # do some stuff
    }
Style
Variable Names
• Variable names should be unique
○ Create a new variable for a new purpose: don’t just use one from an
earlier part of the program “because it’s there”
○ Reusing variable names is confusing because the same variable means
different things in different parts of the program
    $c = 0;
    $value = 1;
    $threshold = 1000;
    # find the largest power of 2 less than the threshold
    while ( $value < $threshold ) {
        $c = $c + 1;                # here $c counts the doublings
        $value = $value * 2;        # next power of 2
    }
    # display result and find out if we should continue
    print "enter a new set of thresholds, ending with 0\n";
    $more_numbers = 1;
    while ( $more_numbers ) {
        $c = <>;                    # confusing: $c now holds a line of input
        ...
        $more_numbers = $c != 0;    # false once the entered number is zero
    }
Style
Document your code
• To avoid writing the same scripts over and over, you have to be able
to (quickly) figure out what an existing script does
• ALWAYS include comments to explain what the script does.
• Date helps you identify scripts that you used for a specific purpose
at a certain time (important when you discover bugs)
• Name is helpful if you give your scripts to your co-workers
○ Without documentation:
    while (1){ $a = <>; print $a; }
○ With documentation:
    # echo.pl
    #
    # Get input from the terminal and echo to the display
    #
    # 11 January 2006 Michael Gribskov
    #
    $forever = 1;
    while ( $forever ) {
        $text = <>;     # read one line from terminal
        print $text;
    }
Style
• Establish a style and stick to it
• Style is a matter of habit; good habits == better code
• More style as we go on
• Required for homework
○ Basic documentation at top of script
− Purpose of script and method (if not obvious)
− Author
− Date
○ Comments describing function of code segments
○ Indentation of code blocks
○ White space separating code segments
Basic Perl
Logical Expressions
• Give a result of false or true
• Operators
○ different for numbers and strings
○ Numeric test   String test   Meaning
  ==             eq            equal to
  !=             ne            not equal to
  >              gt            greater than
  >=             ge            greater than or equal to
  <              lt            less than
  <=             le            less than or equal to
  <=>            cmp           comparison, result is -1, 0, or 1*
*don't worry about cmp and <=> for now
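Why numbers and strings need different operators can be seen by comparing the same two values both ways. This is a small sketch with made-up values; the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $first  = "1.0";
my $second = "1";

my $same_number = ( $first == $second );    # compared as numbers: 1.0 equals 1
my $same_text   = ( $first eq $second );    # compared as text: "1.0" differs from "1"

if ( $same_number ) {
    print "== says $first and $second are equal\n";
}
if ( not $same_text ) {
    print "eq says $first and $second are different\n";
}
```

Choosing == or eq therefore decides what "equal" means for the values being tested.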
Basic Perl
Logical Expressions
• What is true?
○ True is anything that is not false
• Only four things are false
○ The number 0 (zero) is false.
○ The string "0" (again zero) is false.
○ The empty string ("" or '') is false.
○ The undefined value, undef , is false.
• Everything else is true.
Understanding what is true and what is false is
essential to understanding how decisions are made
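The truth rules above can be tried out directly. The sketch below tests a hand-picked list of values (chosen here for illustration) and counts how many are true and how many are false; note that strings such as "0.0" and "00" are true because they are neither "0" nor empty.

```perl
use strict;
use warnings;

my $true_count  = 0;
my $false_count = 0;

foreach my $value ( 0, "0", "", 1, -1, "0.0", "00", " ", "false" ) {
    if ( $value ) {
        print "'$value' is true\n";
        $true_count++;
    } else {
        print "'$value' is false\n";
        $false_count++;
    }
}

# A variable that was never assigned holds undef, the fourth false value
my $never_assigned;
if ( $never_assigned ) {
    $true_count++;
} else {
    print "undef is false\n";
    $false_count++;
}

print "true: $true_count false: $false_count\n";
```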
Basic Perl
Logical Expressions
• Greater than – Less than
○ Common mistake is to assume that the limit will be included. This is
only true for >= and <=.
    $count = 0;
    while ( $count < 3 ) {      # FALSE when $count == 3
        print "$count\n";
        $count++;
    }
• Trying to do the right thing
○ The operator, not the variable, decides whether values are compared
as numbers or as strings
    $five = "5";
    $ten = "10";
    $result = $five < $ten;     # true, 5 < 10
    $result = $five lt $ten;    # false, 1 sorts before 5 in strings (alphabetical)
Basic Perl
Logical Operators
• To combine logical expressions you need
○ and && (AND)
○ or || (OR)
○ not ! (NOT)
○ the written forms (and,or,not) have low precedence
○ the symbolic forms ( &&, ||, ! ) have high precedence
AND ( and && ), true only when both operands are true
    operand 1   operand 2   result
    True        True        True
    True        False       False
    False       True        False
    False       False       False
OR ( or || ), true when either operand is true
    operand 1   operand 2   result
    True        True        True
    True        False       True
    False       True        True
    False       False       False
Basic Perl
Logical Operators
• AND and OR are evaluated in drop dead (short-circuit) fashion: operands
are checked left to right, and only as many as are needed to decide the
result
• Important later when the operands may not be simple variables
• Every expression, whether numeric, string, or logical, has a logical
value.
    $x = 1;
    $y = 0;
    if ( $y || ($x=2) ){            # $y is false, so ($x=2) is evaluated
        print "x: $x y:$y\n";       # prints x: 2 y:0
    }
    $x = 1;
    $y = 0;
    if ( $y or ($x=3) ){            # $y is false, so ($x=3) is evaluated
        print "x: $x y:$y\n";       # prints x: 3 y:0
    }
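To see that the second operand really is skipped when the first one settles the result, the right-hand side can be given a visible side effect. In this sketch the subroutine checked() is a hypothetical helper, invented for the example, that counts how many times it is evaluated.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical helper: counts how often it is evaluated and returns true
sub checked {
    $calls++;
    return 1;
}

my $yes = 1;
my $no  = 0;

my $or_skipped  = $yes || checked();    # left side is true: checked() is not called
my $or_needed   = $no  || checked();    # left side is false: checked() decides the result
my $and_skipped = $no  && checked();    # left side is false: checked() is not called
my $and_needed  = $yes && checked();    # left side is true: checked() decides the result

print "checked() was called $calls times\n";
```

Four expressions are evaluated, but checked() runs only in the two cases where the left operand alone cannot decide the answer.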
Basic Perl
Operator Precedence
• Including logical operators
Associativity     Operators
non-associative   ++ --
right             **
right             !
left              * / % x
left              + - .
non-associative   < > <= >= gt lt ge le
non-associative   == != <=> eq ne cmp
left              &&
left              ||
right             = += -= *= etc
right             not
left              and
left              or
Basic Perl
Logical Statements
• Used for making decisions
○ if / elsif / else
○ Always apply to a block of code delimited by { }
• if ( logical expression ) {
      Block of code       # executes only if expression is true
  }
• if ( logical expression ) {
      Block of code       # if expression is true
  } else {
      Block of code       # if expression is false
  }
Basic Perl
Logical Statements
• Making a series of comparisons with if / elsif / else
• if ( logical expression 1 ) {
      Block of code       # if expression 1 is true
  } elsif ( logical expression 2 ) {
      Block of code       # if expression 1 is false and expression 2 is true
  } else {
      Block of code       # if both expressions are false
  }
Basic Perl
Logical Statements
• Some languages have a special statement for making multiway
decisions (called a case statement)
• if / elsif / else is the closest thing to a Perl case statement
    $action = "";
    while ( $action ne "done\n" ) {
        $action = <>;               # <> keeps the trailing newline
        if ( $action eq "add\n" ) {
        } elsif ( $action eq "subtract\n" ) {
        } elsif ( $action eq "divide\n" ) {
        } elsif ( $action eq "multiply\n" ) {
        } else {
            print "I don\'t understand command $action\n";
        }
    }
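The empty branches above can be filled in to make a working multiway decision. The sketch below is one possible version, not the course's own solution: apply_action() is a hypothetical helper, and the operands 12 and 4 are made up so the example runs without terminal input.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one named action to two numbers
sub apply_action {
    my ( $action, $left, $right ) = @_;

    my $result;
    if ( $action eq "add" ) {
        $result = $left + $right;
    } elsif ( $action eq "subtract" ) {
        $result = $left - $right;
    } elsif ( $action eq "multiply" ) {
        $result = $left * $right;
    } elsif ( $action eq "divide" ) {
        $result = $left / $right;
    } else {
        $result = "unknown action";     # none of the tests matched
    }
    return $result;
}

foreach my $action ( "add", "subtract", "multiply", "divide", "square" ) {
    my $answer = apply_action( $action, 12, 4 );
    print "$action: $answer\n";
}
```

Only the first matching branch runs; the final else catches anything that no earlier test recognized.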
Basic Perl
Logical Statements
• unless - the opposite of if
○ if the logical expression is false, the block is executed
○ unless ( logical expression ) {
      Block of code
  }
    $action = "";
    while ( $action ne "done\n" ) {
        $action = <>;               # <> keeps the trailing newline
        unless ( $action eq "quit\n" ) {
            if ( $action eq "add\n" ) {
            } elsif ( $action eq "subtract\n" ) {
            } elsif ( $action eq "divide\n" ) {
            } else {
                print "I do not understand command $action\n";
            }
        }
    }
Basic Perl
Logical Statements
• unless
○ unless can also take an else clause
○ unless is more confusing than if
○ Use else sparingly with unless
    unless ( $name eq "Frank" ) {
        print "Hi $name\n";
    } else {
        print "Oh, it\'s you again, Frank\n";
        exit;
    }
    # compared to
    if ( $name eq "Frank" ) {
        print "Oh, it\'s you again, Frank\n";
        exit;
    } else {
        print "Hi $name\n";
    }
Basic Perl
Logical Statements
• Additional syntax for if and unless – one line tests
○ expression if ( logical condition );
○ expression unless ( logical condition );
• Can be more readable in some contexts
    # these are the same
    if ( $x > 3 ) {
        $x = $y + 1;
    }
    $x = $y + 1 if ( $x > 3 );

    # these are the same
    unless ( $x > 3 ) {
        $x = $y + 1;
    }
    $x = $y + 1 unless ( $x > 3 );
Basic Perl
Looping
• Looping allows a program to do something over and over, one of the
main reasons for using a program. Looping in Perl uses
○ while
○ do / while and do / until
○ foreach
○ for
• while
○ while ( logical expression ) {
      Block of code
  }
○ while loops test the condition before every execution of the loop.
○ If you want it to be tested after the loop use do … while or do … until.
    $value = 5;
    while ( $value > 10 ) {         # false at the first test, so the block never runs
        print "$value\n";
        $value = $value - 1;
    }
Basic Perl
Looping
• do … while
○ do {
      Block of code
  } while ( logical expression );   # continues if expression is TRUE
• do … until
○ do {
      Block of code
  } until ( logical expression );   # continues if expression is FALSE
• Loop always executes at least once
• do loops test the condition after each execution of the loop, before
deciding whether to run it again.
    $value = 5;
    do {
        print "$value\n";           # prints 5 once
        $value = $value - 1;
    } while ( $value > 10 );        # 4 > 10 is false, so the loop stops
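The difference between testing before and testing after the block shows up when the iterations are counted. This sketch reuses the starting value 5 from the slides; the counter names and the second starting value 13 are assumptions added for illustration.

```perl
use strict;
use warnings;

# while: the test comes first, so the block may never run
my $while_runs = 0;
my $value      = 5;
while ( $value > 10 ) {
    $while_runs++;
    $value = $value - 1;
}

# do ... while: the block runs once before the first test
my $do_runs = 0;
$value = 5;
do {
    $do_runs++;
    $value = $value - 1;
} while ( $value > 10 );

# do ... until: keeps going while the condition is FALSE
my $until_runs = 0;
$value = 13;
do {
    $until_runs++;
    $value = $value - 1;
} until ( $value <= 10 );

print "while: $while_runs do/while: $do_runs do/until: $until_runs\n";
```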
Basic Perl
Looping
• foreach, executes once for each item in a list
○ Most common looping statement in Perl
○ temporary variable assumes the value of the list item at each iteration
○ foreach $tmp_variable ( list of values ) {
      Block of code
  }
○ foreach ( list of values ) {      # no variable named: each item is placed in $_
      Block of code
  }
    $total = 0;
    foreach $number ( 1,2,3 ) {
        $total += $number;
        print "$number $total\n";
    }
    $cycle = 0;
    foreach ( 1 .. 3 ) {
        $cycle++;
        print "cycle $cycle\n";
    }
Basic Perl
Range Operator
• .. (two periods)
○ 1 .. 10
○ $i .. $j
○ $i .. $i + 10
• very handy for foreach loops
    $total = 0;
    foreach $number ( 1 .. 3 ) {
        $total += $number;
        print "$number $total\n";
    }
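The range operator also accepts variables as end points, and a range whose start is larger than its end is simply empty. The following sketch uses made-up bounds ($low and $high are illustrative names) and also shows the default variable $_ when no loop variable is given.

```perl
use strict;
use warnings;

my $low  = 3;
my $high = 6;

# Range with variable end points: 3, 4, 5, 6
my $total = 0;
foreach my $number ( $low .. $high ) {
    $total += $number;
}
print "sum of $low .. $high is $total\n";

# With no loop variable, each item is available as $_
my $squares = "";
foreach ( 1 .. 4 ) {
    $squares .= ( $_ * $_ ) . " ";
}
print "squares: $squares\n";

# A range that counts downward is empty, so the block never runs
my $empty_runs = 0;
foreach ( $high .. $low ) {
    $empty_runs++;
}
print "iterations over $high .. $low: $empty_runs\n";
```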
Basic Perl
Looping
• for
○ familiar to C/C++ programmers, used much less in Perl
○ defines a temporary variable for use in loop with a specified initial value,
ending value, and increment for each iteration
○ for ( initial_expression; test_expression; change_expression ) {
      Block of code
  }
○ Gives detailed control over begin, end, and step of the loop
    # Print numbers 1 to 99 by 2
    for ( $i=1; $i<=100; $i+=2 ) {
        print "$i\n";
    }
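The three parts of a for statement correspond to steps that a while loop performs separately. As a rough illustration (with a shorter range than the slide example, and variable names chosen here), the two loops below visit the same values.

```perl
use strict;
use warnings;

# for: initialize, test, and change are all in one line
my $for_output = "";
for ( my $i = 1; $i <= 9; $i += 2 ) {
    $for_output .= "$i ";
}

# The same loop with while: the three parts are spread out
my $while_output = "";
my $j = 1;                      # initial_expression
while ( $j <= 9 ) {             # test_expression
    $while_output .= "$j ";
    $j += 2;                    # change_expression
}

print "for:   $for_output\n";
print "while: $while_output\n";
```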
Basic Perl
Looping
• Break outs
○ Particularly useful with forever loops
○ next – stops the current iteration, proceeds with the next normal
iteration, i.e., skip things you don't want
○ last – stop the current iteration and immediately exit the loop
i.e., stop when you find what you want
    $forever = 1;
    # break out of loop using last
    while ( $forever ) {            # compare while ( 1 ) {
        $count++;
        print "count is $count\n";
        if ( $count > 3 ) {
            last;
        }
    }
Basic Perl
Looping
• Breakouts are often most readable using the one line syntax for if
    $forever = 1;
    # skip processing using next
    while ( $forever ) {
        $count++;
        next if ( $count == 1 );
        print "count is $count\n";
        if ( $count > 3 ) {
            last;
        }
    }
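next and last work the same way inside foreach loops. The sketch below is an invented search task, not taken from the course: it skips even numbers with next and stops at the first odd multiple of 7 with last.

```perl
use strict;
use warnings;

my $found;
my $checked = 0;

foreach my $number ( 1 .. 100 ) {
    next if ( $number % 2 == 0 );   # skip even numbers
    $checked++;
    if ( $number % 7 == 0 ) {
        $found = $number;
        last;                       # stop at the first match
    }
}

print "first odd multiple of 7 is $found, found after checking $checked odd numbers\n";
```

Without last the loop would run through all 100 values even though the answer is known after the first few.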