8/4/2019 Perl for Bio in for Ma Tics
1/158
Programming for Computational Biology
Ian Holmes
Department of Bioengineering
University of California, Berkeley
Programming languages
Self-contained language
Platform-independent
Used to write O/S
C (imperative, procedural)
C++, Java (object-oriented)
Lisp, Haskell (functional); Prolog (logic)
Scripting language
Closely tied to O/S
Perl, Python, Ruby
Domain-specific language
R (statistics)
MatLab (numerics)
SQL (databases)
An O/S typically manages
Devices (see above)
Files & directories
Users & permissions
Processes & signals
Bioinformatics pipelines often involve chaining together multiple tools
Perl is the most-used bioinformatics language
Most popular bioinformatics programming languages
Bioinformatics career survey, 2008
Michael Barton
Pros and Cons of Perl
Reasons for Perl's popularity in bioinformatics (Lincoln Stein)
Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text
Perl is forgiving
Perl is component-oriented
Perl is easy to write and fast to develop in
Perl is a good prototyping language
Perl is a good language for Web CGI scripting
Problems with Perl
Hard to read ("there's more than one way to do it", cryptic syntax)
Too forgiving (no strong typing, allows sloppy code)
Perl overview
Interpreted, not compiled
Fast edit-run-revise cycle
Procedural & imperative
Sequence of instructions (control flow)
Variables, subroutines
Syntax close to C (the de facto standard minimal language)
Weakly typed (unlike C)
Redundant, not minimal ("there's more than one way to do it")
Syntactic sugar
High-level data structures & algorithms
Hashes, arrays
Operating System support (files, processes, signals)
String manipulation
Goals of this course
Concepts of computer programming
Rudimentary Perl (widely-used language)
"How Perl saved the Human Genome Project" (Lincoln Stein)
Introduction to Bioinformatics file formats
Practical data-handling algorithms
Exposure to Bioinformatics software
Structural elements
Learning Perl, Schwartz et al.
ISBN 0-596-10105-8, O'Reilly
"There's more than one way to do it"
Q: But which is best? A: TESTS
Tests (above) supersede texts (below):
The main program
The program output
Files are shown in yellow
Filename
Standard output stream
Terminal input
Description of test conditions
Terminal session
General principles of programming
Make incremental changes
Test everything you do
the edit-run-revise cycle
Write so that others can read it
(when possible, write with others)
Think before you write
Use a good text editor
Good debugging style
Perl for Bioinformatics
Section 1: Scalars and Loops
Ian Holmes
Department of Bioengineering
University of California, Berkeley
Perl basics
Basic syntax of a Perl program:
# Elementary Perl program
print "Hello World\n";
"\n" means new line
The print statement tells Perl to print the following stuff to the screen
Single or double quotes enclose a "string literal" (double quotes are "interpolated")
All statements end with a semicolon
Lines beginning with "#" are comments, and are ignored by Perl
Hello World
Variables
We can tell Perl to "remember" a particular value, using the assignment operator =:
The $x is referred to as a "scalar variable".
Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"
Scalar variable names are all prefixed with the dollar symbol.
$x = 3;
print $x;
3
$x = "ACGCGT";
print $x;
ACGCGT
Binding site for yeast transcription factor MCB
Arithmetic operations
Basic operators are + - / * %
Can also use += -= /= *= ++ --
$x = 14;
$y = 3;
print "Sum: ", $x + $y, "\n";
print "Product: ", $x * $y, "\n";
print "Remainder: ", $x % $y, "\n";
Sum: 17
Product: 42
Remainder: 2
$x = 5;
print "x started as $x\n";
$x = $x * 2;
print "Then x was $x\n";
$x = $x + 1;
print "Finally x was $x\n";
x started as 5
Then x was 10
Finally x was 11
Could write $x *= 2;
Could write $x += 1;
or even ++$x;
String operations
Concatenation: . and .=
Can find the length of a string using the function length($x)
$a = "pan";
$b = "cake";
$a = $a . $b;
print $a;
pancake
$a = "soap";
$b = "dish";
$a .= $b;
print $a;
soapdish
$mcb = "ACGCGT";
print "Length of $mcb is ", length($mcb);
Length of ACGCGT is 6
More string operations
$x = "A simple sentence";
print $x, "\n";
print uc($x), "\n";
print lc($x), "\n";
$y = reverse($x);
print $y, "\n";
$x =~ tr/i/a/;
print $x, "\n";
print length($x), "\n";
A simple sentence
A SIMPLE SENTENCE
a simple sentence
ecnetnes elpmis A
A sample sentence
17
Convert to upper case
Convert to lower case
Reverse the string
Transliterate "i"'s into "a"'s
Calculate the length of the string
Concatenating DNA fragments
$dna1 = "accacgt";
$dna2 = "taggtct";
print $dna1 . $dna2;
accacgttaggtct
"Transcribing" DNA to RNA
$dna = "accACgttAGGTct";
$rna = lc($dna);
$rna =~ tr/t/u/;
print $rna;
accacguuaggucu
DNA string is a mixture of upper & lower case
Make it all lower case
Transliterate "t" to "u"
Comparison: variables in C are typed
C does not have a basic type for strings, only individual characters.
Strings are built up from more basic elements as arrays of characters (we'll get to arrays later).
Much of this functionality is provided in C and C++ as part of the standard library.
Conditional blocks
The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this:
if (condition) { action } else { alternative }
$x = 149;
$y = 100;
if ($x > $y)
{
    print "$x is greater than $y\n";
}
else
{
    print "$x is not greater than $y\n";
}
149 is greater than 100
These braces { } tell Perl which piece of code is contingent on the condition.
Conditional operators
Numeric: > >= <
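Perl keeps two families of comparison operators: symbols such as `==` and `<` compare values as numbers, while words such as `eq` and `lt` compare them as strings. The sketch below shows how the same pair of values can compare differently under each family; `compare_both` is a hypothetical helper written for this illustration, not part of the course code.

```perl
use strict;
use warnings;

# Compare two values numerically and as strings (hypothetical helper).
sub compare_both {
    my ($x, $y) = @_;
    my $numeric = ($x == $y) ? "equal" : ($x < $y)  ? "less" : "greater";
    my $string  = ($x eq $y) ? "equal" : ($x lt $y) ? "less" : "greater";
    return ($numeric, $string);
}

my ($num, $str) = compare_both(10, 9);
print "As numbers: 10 is $num than 9\n";   # greater
print "As strings: 10 is $str than 9\n";   # less, because "1" sorts before "9"
```

Using a string operator on numbers (or the reverse) is a common source of quiet bugs, which is one reason to test comparisons explicitly.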
Logical operators
&& means "and", || means "or" (the words and and or also work, with lower precedence)
An exclamation mark ! is used to negate what follows
Thus !($x < $y) means the same as ($x >= $y)
In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus:
if (1) { print "1 is true\n"; }
if (0) { print "0 is true\n"; }
if (-99) { print "-99 is true\n"; }
1 is true
-99 is true
$x= 222;
if ($x % 2 == 0 and $x % 3 == 0)
{ print "$x is an even multiple of 3\n"; }
222 is an even multiple of 3
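Zero is not the only false value in Perl: the empty string, the string "0" and undef are false as well, and everything else is true. A minimal sketch that makes this visible, using a hypothetical helper `truth`:

```perl
use strict;
use warnings;

# Report whether Perl treats a value as true or false in a condition.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

print "1 is ",     truth(1),     "\n";   # true
print "0 is ",     truth(0),     "\n";   # false
print "-99 is ",   truth(-99),   "\n";   # true
print "'' is ",    truth(''),    "\n";   # false: the empty string
print "'0' is ",   truth('0'),   "\n";   # false: the string "0"
print "'0.0' is ", truth('0.0'), "\n";   # true: a non-empty string other than "0"
print "undef is ", truth(undef), "\n";   # false
```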
Loops
Here's how to print out the numbers 1 to 10:
This is a while loop. The code is executed while the condition is true.
$x= 1;
while ($x
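The listing above is cut short in this copy. As a stand-in, here is a minimal sketch of a while loop that produces the numbers 1 to 10; the variable `$out` and the exact formatting are assumptions made for this illustration, not the original slide's code.

```perl
use strict;
use warnings;

my $out = "";
my $x = 1;
while ($x <= 10) {       # repeat the block while the condition is true
    $out .= "$x ";       # append the current number and a space
    ++$x;                # move on; without this the loop would never end
}
print "$out\n";          # prints the numbers 1 to 10, separated by spaces
```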
A common kind of loop
Let's dissect the code of the while loop again:
This form of while loop is common enough to have its own shorthand: the for loop.
$x= 1;
while ($x
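This listing is also truncated in this copy. A for loop gathers the three pieces of such a loop (initialise, test, increment) onto one line; the sketch below is an assumed reconstruction of the idea rather than the slide's own code.

```perl
use strict;
use warnings;

my $out = "";
for (my $x = 1; $x <= 10; ++$x) {   # initialise; test; increment
    $out .= "$x ";
}
print "$out\n";                     # prints the numbers 1 to 10, separated by spaces
```

The loop body is unchanged from the while form; only the bookkeeping has moved into the parentheses.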
Loops in C++ are similar to Perl
cout is the standard output stream, part of the standard library.
Used in C++ only (C has a complicated printf command)
defined and undef
The function defined($x) is true if $x has been assigned a value:
A variable that has not yet been assigned a value has the special value undef
Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result
if (defined($newvar)) {
    print "newvar is defined\n";
} else {
    print "newvar is not defined\n";
}
newvar is not defined
C does not have defined or undef. At best, using an uninitialized value will cause a compiler error; at worst, it will lead to undefined behavior (i.e. disaster)
Reading a line of data
To read from a file, we first need to open the file and give it a filehandle.
Once the file is opened, we can read a single line from it into the scalar $x:
open FILE, "sequence.txt";
$x = <FILE>;
This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE
This reads the next line from the file, including the newline at the end, "\n".
If the end of the file is reached, $x is assigned the special value undef
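If the file cannot be opened, open returns a false value and later reads just yield undef, so it is worth checking the result. The sketch below uses the three-argument form of open with a lexical filehandle, a style that is not used elsewhere in these slides; the temporary file stands in for "sequence.txt" so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small stand-in for "sequence.txt" so the sketch runs anywhere.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print $tmp "ACGCGT\n";
close $tmp;

# Three-argument open with a lexical filehandle; stop with a message on failure.
open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
my $first  = <$fh>;    # first line, newline included
my $second = <$fh>;    # undef, because the file has only one line
close $fh;

print "First line: $first";
print "Second read returned undef\n" unless defined $second;
```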
Reading an entire file
The following piece of code reads every line in a file and prints it out to the screen:
open FILE, "sequence.txt";
while (defined ($x = <FILE>)) {
    print $x;
}
close FILE;
This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended.
A shorter version of this is as follows:
open FILE, "sequence.txt";
while ($x = <FILE>) {
    print $x;
}
close FILE;
This is equivalent to defined($x = <FILE>)
The default variable, $_
Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted:
$_ = "Hello";
print;
print length;
Hello5
So we can also read a whole file like this:
open FILE, "sequence.txt";
while (<FILE>) {
    print;
}
close FILE;
This line is equivalent to while (defined($_ = <FILE>)) {
Files in C++ are streams
Debugging
Most programs don't work first time
Most apparently "working" programs actually aren't
Bugs are cryptic
Debugging is a scientific process
As you gain experience, you will begin to "insure" against bugs with your programming technique
Mars Climate Orbiter
Mars Climate Orbiter was the third spacecraft to be launched under the Mars Surveyor program to map & explore Mars
Around 2am PDT on September 23, 1999, the spacecraft disappeared behind Mars following a manoeuvre that should have put it into Mars orbit
This failure, along with a subsequent (unexplained) craft loss, cost NASA $327.6 million
What was the problem?
Following a certain kind of engine burn, designed to stabilise the craft's angular momentum, the Orbiter sent data to the ground station, so that its trajectory could be recalibrated (by a software module called SM_FORCE)
The Orbiter also internally recomputed its trajectory following a burn
The Orbiter's internal software module used metric units (Newton-seconds) while the ground station's SM_FORCE module used Imperial (pound-seconds). The specification called for metric units
The manoeuvre executed on September 23rd was therefore computed using the wrong trajectory, taking the Orbiter too low into Mars' atmosphere
Why was the bug not detected?
The spacecraft periodically transmitted its computed trajectory to the ground station. A quick comparison between the two trajectories would have revealed the error. However,
Other bugs in the SM_FORCE module prevented its use until 4 months into the flight
The ground crew weren't aware that trajectory data from the spacecraft were available
Discrepancies were noticed, but were only reported informally by email, and not taken seriously enough
i.e. incomplete testing; ignoring unexpected results; institutional complacency.
Debugging is scientific
Finding bugs can be very frustrating
A job that you thought was nearly finished, for which you have budgeted a certain amount of time, stretches out indefinitely
Often you may have no idea what's wrong
If you think of debugging as a scientific problem and approach it systematically, much of the pain disappears
The Process of Debugging
Step 1: Identify the Problem
observe it (e.g. because a test fails)
reproduce it (so you can make it happen 100% of the time)
isolate it (strip it down to its bare essentials)
The Process of Debugging
Step 2: Gather Information
record all symptoms (disparate symptoms may be related; if not, you should tackle them systematically one by one)
follow the flow of control of the program (many ways of doing this: e.g. you can use a "debugger" to watch the variables; the time-honored method, and definitely the best, is to insert debugging print statements into your code)
note recent changes (usually the cause of bugs)
look for similar problems (can ask other developers)
check "machine environment" (e.g. if you move to a different computer, does it have less memory? less disk space?)
The Process of Debugging
Step 3: Form a Hypothesis
try to isolate the code that causes the problem
e.g. strip away all "working" code that is not essential to reproducing the bug
if you can't find the bug, use a systematic "deletion" strategy (cf. genetics!) until you have narrowed down the problem
what should that code be doing?
this can be seen as a continuation of Step 1 ("identify the problem")
debugging is a cyclic, interactive process
The Process of Debugging
Step 4: Test Your Hypothesis
do not skip this step!
often the hypothesis will come to you in a flash of inspiration, but you still need to test it
for simple bugs, testing just means fixing the problem
for more complex bugs, you'll need to proceed to the next steps...
The Process of Debugging
Step 5: Propose a Solution
keep it minimal: try not to redesign all the code unless this is absolutely necessary
then again, do not flinch from redesign if this is what is called for
Step 6: Test the Solution
also make sure you didn't break existing code
Process of Debugging: Summary
Step 1: Identify the Problem
Step 2: Gather Information
Step 3: Form a Hypothesis
Step 4: Test Your Hypothesis
Step 5: Propose a Solution
Step 6: Test the Solution
Proactive debugging
Place consistency checks in your code
also called assertions
Put comments in your code
this saves time when debugging
Comment known (and fixed) bugs
keep a record of what you've fixed
Put log messages into your code
you can make these optional (e.g. comment them out); having them there can save lots of time
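The ideas above can be sketched in a few lines of Perl: a consistency check that stops the program on unexpected input, and a log message that can be switched on or off. The names `$DEBUG`, `debug_log` and `gc_fraction` are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

my $DEBUG = 0;    # set to 1 to switch log messages on

# Print a log message to standard error, but only when debugging is enabled.
sub debug_log {
    my ($message) = @_;
    print STDERR "DEBUG: $message\n" if $DEBUG;
}

# Fraction of G and C bases in a DNA string, with a consistency check on the input.
sub gc_fraction {
    my ($dna) = @_;
    # Assertion: refuse to continue with input we do not understand.
    die "Assertion failed: unexpected character in '$dna'"
        unless $dna =~ /^[acgtACGT]+$/;
    my $gc = ($dna =~ tr/cgCG//);    # tr returns the number of matching characters
    debug_log("gc=$gc length=" . length($dna));
    return $gc / length($dna);
}

print gc_fraction("ACGCGT"), "\n";   # 4 of the 6 bases are G or C
```

A flag such as `$DEBUG` is one alternative to commenting log messages out: the messages stay in the code and can be re-enabled without editing each one.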
Perl for Bioinformatics
Section 2: Sequences and Arrays
Summary: scalars and loops
Assignment operator
Arithmetic operations
String operations
Conditional tests
Logical operators
Loops
defined and undef
Reading a file
$x= 5;
$y =$x * 3;
if ($y > 10) { print $s; }
$s = "Value of y is " .$y;
if ($y>10 && $s eq "") { exit; }
for ($x=1; $x
Pattern-matching
A very sophisticated kind of logical test is to ask whether a string contains a pattern
e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT?
$name = "YBR007C";
$dna = "TAATAAAAAACGCGTTGTCG";
if ($dna =~ /ACGCGT/)
{
    print "$name has MCB!\n";
}
YBR007C has MCB!
20 bases upstream of the yeast gene YBR007C
The pattern binding operator =~
The pattern for the MCB binding site
FASTA format
A format for storing multiple named sequences in a single file
This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488
>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT
Name of sequence is preceded by > symbol
NB sequences can span multiple lines
Call this file fly3utr.txt
Printing all sequence names in a FASTA database
open FILE, "fly3utr.txt";
while ($x = <FILE>) {
    if ($x =~ />/) {
        print $x;
    }
}
close FILE;
>CG11604
>CG11455
>CG11488
The key to this program is this block:
if ($x =~ />/) {
    print $x;
}
This pattern matches (and returns TRUE) if the variable $x contains the FASTA sequence-name symbol >
This line prints $x if the pattern matched
Pattern replacement
open FILE, "fly3utr.txt";
while (<FILE>) {
if (/>/) {
s/>//;
print;
}
}
close FILE;
CG11604
CG11455
CG11488
New statement removes the ">"
The new statement s/>// is an example of a replacement.
General form: s/OLD/NEW/ replaces OLD with NEW
Thus s/>// replaces ">" with "" (the empty string)
$_ is the default variable for these operations
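Finding all sequence lengths
[Flowchart: Start, then open file, then read line. End of file? If yes: print last sequence length, then stop. If no: remove the \n newline character at the end of the line, then ask whether the line starts with >. If yes (sequence name): unless this is the first sequence, print the last sequence length; then record the name and reset the running total of the current sequence length. If no (sequence data): add the length of the line to the running total. In both cases, go back to read line.]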
Finding all sequence lengths
open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        if (defined $len) {
            print "$name $len\n";
        }
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print "$name $len\n";
close FILE;
>CG11604 58
>CG11455 83
>CG11488 68
The chomp statement trims the newline character "\n" off the end of the default variable, $_.
Try it without this and see what happens, and whether you can work out why
>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT
Reverse complementing DNA
$dna = "accACgttAGgtct";
$revcomp = lc($dna);
$revcomp = reverse($revcomp);
$revcomp =~ tr/acgt/tgca/;
print $revcomp;
agacctaacgtggt
Start by making string lower case again. This is generally good practice
Reverse the string
Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
A common operation due to double-helix symmetry of DNA
Running external programs
Suppose you want to get the output of another program into a variable.
e.g. the following shell command prints the number of lines in the file myfile.txt
wc -l myfile.txt
You can execute a command like this from Perl using system
system "wc -l myfile.txt";
but that only prints the result to standard output; it does not give you access to the output of the command from within the Perl program.
One way to get the output is by enclosing the command in backticks:
$lines = `wc -l myfile.txt`;
An (equivalent) way is to open a pipe from the command:
open FILEHANDLE, "wc -l myfile.txt |";
$lines = <FILEHANDLE>;
Arrays
An array is a variable holding a list of items
@nucleotides = ('a', 'c', 'g', 't');
print "Nucleotides: @nucleotides\n";
Nucleotides: a c g t
We can think of this as a list with 4 entries
[Figure: the array drawn as four boxes holding a, c, g, t, labelled element 0, element 1, element 2, element 3; the array is the set of all four elements]
Note that the element indices start at zero.
Array literals
There are several, equally valid ways to assign an entire array at once.
@a = (1,2,3,4,5);
print "a = @a\n";
@b = ('a','c','g','t');
print "b = @b\n";
@c = 1..5;
print "c = @c\n";
@d = qw(a c g t);
print "d = @d\n";
a = 1 2 3 4 5
b = a c g t
c = 1 2 3 4 5
d = a c g t
This is the most common: a comma-separated list, delimited by parentheses
Accessing arrays
To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x"
Remember, element indices start at zero!
@x = ('a', 'c', 'g', 't');
print $x[0], "\n";
$i = 2;
print $x[$i], "\n";
a
g
If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array.
@x = ('a', 'c', 'g', 't');
print @x + 0;
4
Array operations
You can sort and reverse arrays...
@x = ('a', 't', 'g', 'c');
@y = sort @x;
@z = reverse @y;
print "x = @x\n";
print "y = @y\n";
print "z = @z\n";
x = a t g c
y = a c g t
z = t g c a
You can read the entire contents of a file into an array (each line of the file becomes an element of the array)
open FILE, "sequence.txt";
@x = <FILE>;
push, pop, shift, unshift
@x= ("Fame", "Power", "Money");
print "I started with @x\n";
$y = pop @x;
push @x, "Success";
print "Then I had @x\n";
$z = shift @x;
unshift @x, "Glamour";
print "Now I have @x\n";
print "I lost $y and $z\n";
I started with Fame Power Money
Then I had Fame Power Success
Now I have Glamour Power Success
I lost Money and Fame
pop removes the last element of an array
push adds an element to the end of an array
shift removes the first element of an array
unshift adds an element to the start of an array
foreach
Finding the total of a list of numbers:
@val = (4, 19, 1, 100, 125, 10);
$total = 0;
foreach $x (@val) {
    $total += $x;
}
print $total;
259
The foreach statement loops through each entry in an array
Equivalent to:
@val = (4, 19, 1, 100, 125, 10);
$total = 0;
for ($i = 0; $i < @val; ++$i) {
    $total += $val[$i];
}
print $total;
259
Iterator comparison
foreach
for
iMac G5 1.8GHz 512MB, Mac OS X 10.4.2, perl v5.8.6 built for darwin-thread-multi-2level
[yoko:~] yam% time perl -e 'for ($n = 1; $n
The @ARGV array
A special array is @ARGV
This contains the command-line arguments when the program is invoked at the Unix prompt
It's a way for the user to pass information into the program
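A minimal sketch of reading @ARGV follows. The helper `describe_args` and the script name in the usage note are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# Describe the arguments that were passed to the program.
sub describe_args {
    my @args = @_;
    return "No arguments given" if @args == 0;
    return "Got " . scalar(@args) . " argument(s); the first is '$args[0]'";
}

print describe_args(@ARGV), "\n";
```

Saved as, say, args.pl and run as `perl args.pl Cop Golgi`, this would report two arguments, the first being 'Cop'.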
Exploding a sequence into an array
The programming language C treats all strings as arrays
$dna = "accggtgtgcg";
print "String: $dna\n";
@array = split //, $dna;
print "Array: @array\n";
String: accggtgtgcg
Array: a c c g g t g t g c g
The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
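To illustrate splitting at specific points, the sketch below cuts a sequence at EcoRI sites (GAATTC, cut between the G and the first A). It relies on zero-width "lookaround" patterns, which these slides do not cover, so that no bases are lost at the cut; `ecori_digest` is a hypothetical helper, and only one strand is considered.

```perl
use strict;
use warnings;

# Cut a lower-case DNA string at every EcoRI site (g^aattc), keeping all bases.
sub ecori_digest {
    my ($dna) = @_;
    # Zero-width pattern: split between a "g" and a following "aattc".
    return split /(?<=g)(?=aattc)/, $dna;
}

my @fragments = ecori_digest("ccgaattcggtagaattcaa");
print "Fragments: @fragments\n";   # Fragments: ccg aattcggtag aattcaa
```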
Taking a slice of an array
The syntax @x[i,j,k...] returns a (3-element) array containing elements i,j,k... of array @x
@nucleotides = ('a', 'c', 'g', 't');
@purines = @nucleotides[0,2];
@pyrimidines = @nucleotides[1,3];
print "Nucleotides: @nucleotides\n";
print "Purines: @purines\n";
print "Pyrimidines: @pyrimidines\n";
Nucleotides: a c g t
Purines: a g
Pyrimidines: c t
Finding elements in an array
The grep command is used to select some elements from an array
The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element)
e.g. select all numbers over 100:
@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersOver100 = grep ($_ > 100, @numbers);
print "Numbers: @numbers\n";
print "Numbers over 100: @numbersOver100\n";
Numbers: 101 235 10 50 100 66 1005
Numbers over 100: 101 235 1005
Applying a function to an array
The map command applies a function to every element in an array
Similar syntax to grep: map(EXPR,LIST) applies EXPR to every element in LIST
Example: multiply every number by 3
@numbers = (101, 235, 10, 50, 100, 66, 1005);
@numbersTimes3 = map ($_ * 3, @numbers);
print "Numbers: @numbers\n";
print "Numbers times 3: @numbersTimes3\n";
Numbers: 101 235 10 50 100 66 1005
Numbers times 3: 303 705 30 150 300 198 3015
Perl for Bioinformatics
Section 3: Patterns and Subroutines
Review: pattern-matching
The following code:
if (/ACGCGT/) {
    print "Found MCB binding site!\n";
}
prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_
Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna) using this syntax:
if ($dna =~ /ACGCGT/) {
    print "Found MCB binding site!\n";
}
We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax:
$dna =~ s/ACGCGT/_MCB_/;
We can replace all occurrences by appending a 'g':
$dna =~ s/ACGCGT/_MCB_/g;
Regular expressions
Perl provides a pattern-matching engine
Patterns are called regular expressions
They are extremely powerful: probably Perl's strongest feature, compared to other languages
Often called "regexps" for short
Motivation: N-glycosylation motif
Common post-translational modification in ER
Membrane & secreted proteins
Purpose: folding, stability, cell-cell adhesion
Attachment of a 14-sugar oligosaccharide
Occurs at asparagine residues with the consensus sequence N-X1-X2, where
X1 can be anything (but proline & aspartic acid inhibit)
X2 is serine or threonine
Can we detect potential N-glycosylation sites in a protein sequence?
Interlude: interactive testing
This script echoes input from the keyboard
while (<STDIN>) {
    print;
}
The special filehandle STDIN means "standard input", i.e. the keyboard
Sometimes (e.g. in Windows IDEs) the output isn't printed until the script stops
This is because of buffering.
To stop buffering, set to "autoflush":
$| = 1;
while (<STDIN>) {
    print;
}
$| is the autoflush flag
Matching alternative characters
[ACGT] matches one A, C, G or T:
while (<STDIN>) {
    print "Matched: $_" if /[ACGT]/;
}
this is not printed
This is printed
Matched: This is printed
Italics denote input text
In general square brackets denote a set of alternative possibilities
Use - to match a range of characters: [A-Z]
. matches anything (except a newline)
\s matches spaces or tabs
\S is anything that's not a space or tab
[^X] matches anything but X
Matching alternative strings
/(this|that)/ matches "this" or "that"
...and is equivalent to /th(is|at)/
while (<STDIN>) {
    print "Matched: $_" if /this|that|other/;
}
Won't match THIS
Will match this
Matched: Will match this
Won't match ThE oThER
Will match the other
Matched: Will match the other
Remember, regexps are case-sensitive
Matching multiple characters
x* matches zero or more x's (greedily)
x*? matches zero or more x's (sparingly)
x+ matches one or more x's (greedily)
x{n} matches n x's
x{m,n} matches from m to n x's
Word and string boundaries
^ matches the start of a string
$ matches the end of a string
\b matches word boundaries
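A short sketch of how these quantifiers and anchors behave on a concrete string. The test sequence is made up for illustration, and `a+?` is the sparing counterpart of `a+`, by analogy with `x*?` above.

```perl
use strict;
use warnings;

my $dna = "ccaaaatttg";

my ($greedy)  = $dna =~ /(a+)/;       # as many a's as possible
my ($lazy)    = $dna =~ /(a+?)/;      # as few a's as possible
my ($counted) = $dna =~ /(a{2,3})/;   # between two and three a's
my $at_start  = ($dna =~ /^cc/) ? "yes" : "no";   # anchored at the start
my $at_end    = ($dna =~ /a$/)  ? "yes" : "no";   # anchored at the end

print "greedy=$greedy lazy=$lazy counted=$counted\n";        # greedy=aaaa lazy=a counted=aaa
print "starts with cc: $at_start; ends with a: $at_end\n";   # starts with cc: yes; ends with a: no
```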
"Escaping" special characters
\ is used to "escape" characters that otherwise have meaning in a regexp
so \[ matches the character "["
if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT]
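The effect of escaping is easiest to see by matching the same text with and without the backslash. The file-name-like string below is invented for this sketch.

```perl
use strict;
use warnings;

my $text = "gene[1].fa";

# Unescaped, "." matches any character, so it happily matches the "[" here.
my $dot_any     = ($text =~ /gene.1/)    ? "match" : "no match";
# Escaped, "\." matches only a literal full stop; "gene.1" does not occur.
my $dot_literal = ($text =~ /gene\.1/)   ? "match" : "no match";
# Escaped brackets match the literal characters "[" and "]".
my $brackets    = ($text =~ /gene\[1\]/) ? "match" : "no match";

print "gene.1    : $dot_any\n";        # match
print "gene\\.1   : $dot_literal\n";   # no match
print "gene\\[1\\] : $brackets\n";     # match
```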
Retrieving what was matched
If parts of the pattern are enclosed by parentheses, then (following the match) those parts can be retrieved from the scalars $1, $2...
e.g. /the (\S+) sat on the (\S+) drinking (\S+)/
matches "the cat sat on the mat drinking milk"
with $1="cat", $2="mat", $3="milk"
$| = 1;
while (<STDIN>) {
    if (/(a|the) (\S+)/i) {
        print "Noun: $2\n";
    }
}
Pick up the cup
Noun: cup
Sit on a chair
Noun: chair
Put the milk in the tea
Noun: milk
Note: only the first "the" is picked up by this regexp
Variations and modifiers
//i ignores upper/lower case distinctions:
while (<STDIN>) {
    print "Matched: $_" if /pattern/i;
}
pAttERn
Matched: pAttERn
//g starts search where last match left off
pos($_) is index of first character after last match
s/OLD/NEW/ replaces first "OLD" with "NEW"
s/OLD/NEW/g is "global" (i.e. replaces every occurrence of "OLD" in the string)
N-glycosylation site detector
$| = 1;
while (<STDIN>) {
    $_ = uc $_;
    while (/(N[^PD][ST])/g) {
        print "Potential N-glycosylation sequence ",
            $1, " at residue ", pos() - 2, "\n";
    }
}
Convert to upper case
Regexp uses 'g' modifier to get all matches in sequence
pos() is index of first residue after match, starting at zero; so, pos()-2 is index of first residue of three-residue match, starting at one.
The main regular expression
while (/(N[^PD][ST])/g) { ... }
PROSITE and Pfam
PROSITE: a database of regular expressions for protein families, domains and motifs
Pfam: a database of Hidden Markov Models (HMMs), equivalent to probabilistic regular expressions
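PROSITE patterns use their own notation rather than Perl's, but simple ones translate mechanically. The sketch below is a deliberately small, hypothetical translator (`prosite_to_regex` is not a real library function) that handles only "-" separators, x wildcards, [..] sets, {..} exclusions and (n) or (n,m) repeat counts. The example pattern N-{P}-[ST]-{P} is assumed here to be the form in which PROSITE writes its N-glycosylation motif.

```perl
use strict;
use warnings;

# Translate a simple PROSITE-style pattern into a Perl regular expression.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $regex = $pattern;
    $regex =~ s/-//g;                                    # separators carry no meaning
    $regex =~ s/\{([A-Z]+)\}/[^$1]/g;                    # {P}      becomes [^P]
    $regex =~ s/\((\d+(?:,\d+)?)\)/'{' . $1 . '}'/ge;    # x(2,4)   becomes x{2,4}
    $regex =~ s/x/./g;                                   # x        becomes any residue
    return $regex;
}

my $regex = prosite_to_regex('N-{P}-[ST]-{P}');
print "Regex: $regex\n";                                 # Regex: N[^P][ST][^P]
print "Site found\n" if "MKNGSAANPSQ" =~ /$regex/;       # NGSA matches
```

Exclusions are converted before repeat counts so that the braces produced for counts are not mistaken for exclusions.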
Subroutines
Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program.
Code for such a task is called a subroutine.
Examples of such tasks:
finding the length of a sequence
reverse complementing a sequence
finding the mean of a list of numbers
NB: Perl provides the subroutine length($x) to find the length of a sequence already
Finding all sequence lengths (2)
open FILE, "fly3utr.txt";
while (<FILE>) {
    chomp;
    if (/>/) {
        print_name_and_len();
        $name = $_;
        $len = 0;
    } else {
        $len += length;
    }
}
print_name_and_len();
close FILE;
sub print_name_and_len {
    if (defined ($name)) {
        print "$name $len\n";
    }
}
Subroutine definition; code in here is not executed unless subroutine is called
Subroutine calls
Reverse complement subroutine
sub revcomp {
    my $rev;
    $rev = reverse ($dna);
    $rev =~ tr/acgt/tgca/;
    return $rev;
}
$rev = 12345;
$dna = "accggcatg";
$rev1 = revcomp();
print "Revcomp of $dna is $rev1\n";
$dna = "cggcgt";
$rev2 = revcomp();
print "Revcomp of $dna is $rev2\n";
print "Value of rev is $rev\n";
Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg
Value of rev is 12345
Value of $rev is unchanged by calls to revcomp
"my" announces that $rev is local to the subroutine revcomp
"return" announces that the return value of this subroutine is whatever's in $rev
Revcomp with arguments
sub revcomp {
    my ($dna) = @_;
    my $rev = reverse ($dna);
    $rev =~ tr/acgt/tgca/;
    return $rev;
}
$dna1 = "accggcatg";
$rev1 = revcomp ($dna1);
print "Revcomp of $dna1 is $rev1\n";
$dna2 = "cggcgt";
$rev2 = revcomp ($dna2);
print "Revcomp of $dna2 is $rev2\n";
Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg
The array @_ holds the arguments to the subroutine (in this case, the sequence to be revcomp'd)
Now we don't have to re-use the same variable for the sequence to be revcomp'd
Mean & standard deviation
@xdata = (1, 5, 1, 12, 3, 4, 6);
($x_mean, $x_sd) = mean_sd (@xdata);
@ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
($y_mean, $y_sd) = mean_sd (@ydata);
sub mean_sd {
    my @data = @_;
    my $n = @data + 0;
    my $sum = 0;
    my $sqSum = 0;
    foreach $x (@data) {
        $sum += $x;
        $sqSum += $x * $x;
    }
    my $mean = $sum / $n;
    my $variance = $sqSum / $n - $mean * $mean;
    my $sd = sqrt ($variance);
    return ($mean, $sd);
}
Subroutine takes a list of $n numeric arguments
Square root
Subroutine returns a two-element list: (mean, sd)
Maximum element of an array
Subroutine to find the largest entry in an array
@num = (1, 5, 1, 12, 3, 4, 6);
$max= find_max (@num);
print "Numbers: @num\n";
print "Maximum: $max\n";
sub find_max {
    my @data = @_;
    my $max = pop @data;
    foreach my $x (@data) {
        if ($x > $max) {
            $max = $x;
        }
    }
    return $max;
}
Numbers: 1 5 1 12 3 4 6
Maximum: 12
Including variables in patterns
Subroutine to find number of instances ofa given binding site in a sequence
$dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
$mcb= "ACGCGT";
print "$dna has ",count_matches ($mcb, $dna),
" matches to $mcb\n";
sub count_matches {
my ($pattern, $text)=@_;
my $n = 0;
while ($text =~ /$pattern/g) { ++$n }
return $n;
}
ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT
Perl for Bioinformatics
Section 4: Hashes
Data structures
Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line:
Cyp12a5 Mitochondrion
MRG15 Nucleus
Cop Golgi
bor Cytoplasm
Bx42 Nucleus
Suppose this file is in "genecomp.txt"
Reading a table of data
We can split each line into a 2-element array using the split command. This breaks the line at each run of whitespace:
open FILE, "genecomp.txt";
while (<FILE>) {
  ($g, $c) = split;
  push @gene, $g;
  push @comp, $c;
}
close FILE;
print "Genes: @gene\n";
print "Compartments: @comp\n";
Genes: Cyp12a5 MRG15 Cop bor Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus
The opposite of split is join, which makes a scalar from an array:
print join (" and ", @gene);
Cyp12a5 and MRG15 and Cop and bor and Bx42
Finding an entry in a table
The following code assumes that we've already read in the table from the file:
Example: $ARGV[0] = "Cop"
$geneToFind = shift @ARGV;
print "Searching for gene $geneToFind\n";
for ($i = 0; $i < @gene; ++$i) {
if ($gene[$i] eq $geneToFind) {
print "Gene: $gene[$i]\n";
print "Compartment: $comp[$i]\n";
exit;
}
}
print "Couldn't find gene\n";
Searching for gene Cop
Gene: Cop
Compartment: Golgi
Binary search
The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through (N+1)/2 entries to find the one we want.
For the full Drosophila genome, N=12,000. This is painfully slow.
An alternative is the Binary Search algorithm:
Start with a sorted list.
Compare the middle element with the one we want. Pick the half of the list that contains our element.
Iterate this procedure to "home in" on the right element. This takes around log2(N) steps (about 14 steps for N=12,000).
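The steps above can be sketched in Perl as follows. This is a minimal illustration rather than code from the original slides: the name binary_search is an assumption, and the list is passed by reference (references are covered in Section 5) so that the array is not copied on each call.

```perl
use strict;
use warnings;

# Binary search over a list of strings that is already sorted with "cmp".
# Returns the index of $target in @$sorted, or -1 if it is absent.
sub binary_search {
    my ($sorted, $target) = @_;
    my ($lo, $hi) = (0, $#$sorted);
    while ($lo <= $hi) {
        my $mid = int (($lo + $hi) / 2);            # middle element
        my $cmp = $target cmp $sorted->[$mid];
        if    ($cmp == 0) { return $mid }           # found it
        elsif ($cmp <  0) { $hi = $mid - 1 }        # keep the lower half
        else              { $lo = $mid + 1 }        # keep the upper half
    }
    return -1;
}

my @gene = sort qw(Cyp12a5 MRG15 Cop bor Bx42);     # the search needs a sorted list
my $i = binary_search (\@gene, "Cop");
print "Found Cop at index $i\n" if $i >= 0;
```

The default sort and the cmp operator use the same string ordering, which is what keeps the comparison inside the loop consistent with the order of the list.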
Associative arrays (hashes)
Implementing algorithms like binary search is a common task in languages like C.
Conveniently, Perl provides a type of array called an associative array (also called a hash) that is pre-indexed for quick search.
An associative array is a set of key-value pairs (like our gene-compartment table)
$comp{"Cop"} = "Golgi";
Curly braces {} are used to index an associative array
Reading a table using hashes
open FILE, "genecomp.txt";
while (<FILE>) {
  ($g, $c) = split;
  $comp{$g} = $c;
}
$geneToFind = shift @ARGV;
print "Gene: $geneToFind\n";
print "Compartment: ", $comp{$geneToFind}, "\n";
...with $ARGV[0] = "Cop" as before:
Gene: Cop
Compartment: Golgi
Reading a FASTA file into a hash
sub read_FASTA {
  my ($filename) = @_;
  my (%name2seq, $name, $seq);
  open FILE, $filename;
  while (<FILE>) {
    chomp;
    if (/>/) {
      s/>//;
      if (defined $name) {
        $name2seq{$name} = $seq;
      }
      $name = $_;
      $seq = "";
    } else {
      $seq .= $_;
    }
  }
  $name2seq{$name} = $seq;
  close FILE;
  return %name2seq;
}
Formatted output of sequences
sub print_seq {
  my ($name, $seq) = @_;
  print ">$name\n";
  my $width = 50;    # 50-column output
  for (my $i = 0; $i < length($seq); $i += $width) {
    if ($i + $width > length($seq)) {
      $width = length($seq) - $i;
    }
    print substr ($seq, $i, $width), "\n";
  }
}
The term substr($x,$i,$len) returns the substring of $x starting at position $i with length $len.
For example, substr("Biology",3,3) is "log"
keys and values
keys returns the list of keys in the hash
e.g. names, in the %name2seq hash
values returns the list of values
e.g. sequences, in the %name2seq hash
%name2seq = read_FASTA ("fly3utr.txt");
print "Sequence names: ",
  join (" ", keys (%name2seq)), "\n";
my $len = 0;
foreach $seq (values %name2seq) {
  $len += length ($seq);
}
print "Total length: $len\n";
Sequence names: CG11488 CG11604 CG11455
Total length: 210
Files of sequence names
Easy way to specify a subset of a given FASTA database
Each line is the name of a sequence in a given database, e.g.
CG1167
CG685
CG1041
CG1043
Get named sequences
Given a FASTA database and a "file of sequence names", print every named sequence:
($fasta, $fosn) = @ARGV;
%name2seq = read_FASTA ($fasta);
open FILE, $fosn;
while ($name = <FILE>) {
  chomp $name;
  $seq = $name2seq{$name};
  if (defined $seq) {
    print_seq ($name, $seq);
  } else {
    warn "Can't find sequence: $name. ",
         "Known sequences: ",
         join (" ", keys %name2seq), "\n";
  }
}
close FILE;
Intersection of two sets
Two files of sequence names:
fosn1.txt
CG1167
CG685
CG1041
CG1043
fosn2.txt
CG215
CG1041
CG483
CG1167
CG1163
What is the overlap?
Find intersection using hashes:
open FILE1, "fosn1.txt";
while (<FILE1>) { $gotName{$_} = 1; }
close FILE1;
open FILE2, "fosn2.txt";
while (<FILE2>) {
  print if $gotName{$_};
}
close FILE2;
CG1041
CG1167
Assigning hashes
A hash can be assigned directly, as a list of "key=>value" pairs:
%comp = ('Cyp12a5' => 'Mitochondrion',
'MRG15'=> 'Nucleus',
'Cop' => 'Golgi',
'bor' => 'Cytoplasm',
'Bx42' => 'Nucleus');
print "keys: ", join(";",keys(%comp)), "\n";
print "values: ", join(";",values(%comp)), "\n";
keys: bor;Cop;Bx42;Cyp12a5;MRG15
values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus
The genetic code as a hash
%aa = ('ttt'=>'F', 'tct'=>'S', 'tat'=>'Y', 'tgt'=>'C',
'ttc'=>'F', 'tcc'=>'S', 'tac'=>'Y', 'tgc'=>'C',
'tta'=>'L', 'tca'=>'S', 'taa'=>'!', 'tga'=>'!',
'ttg'=>'L', 'tcg'=>'S', 'tag'=>'!', 'tgg'=>'W',
'ctt'=>'L', 'cct'=>'P', 'cat'=>'H', 'cgt'=>'R',
'ctc'=>'L', 'ccc'=>'P', 'cac'=>'H', 'cgc'=>'R',
'cta'=>'L', 'cca'=>'P', 'caa'=>'Q', 'cga'=>'R',
'ctg'=>'L', 'ccg'=>'P', 'cag'=>'Q', 'cgg'=>'R',
'att'=>'I', 'act'=>'T', 'aat'=>'N', 'agt'=>'S',
'atc'=>'I', 'acc'=>'T', 'aac'=>'N', 'agc'=>'S',
'ata'=>'I', 'aca'=>'T', 'aaa'=>'K', 'aga'=>'R',
'atg'=>'M', 'acg'=>'T', 'aag'=>'K', 'agg'=>'R',
'gtt'=>'V', 'gct'=>'A', 'gat'=>'D', 'ggt'=>'G',
'gtc'=>'V', 'gcc'=>'A', 'gac'=>'D', 'ggc'=>'G',
'gta'=>'V', 'gca'=>'A', 'gaa'=>'E', 'gga'=>'G',
'gtg'=>'V', 'gcg'=>'A', 'gag'=>'E', 'ggg'=>'G' );
Translating: DNA to protein
$prot = translate ("gatgacgaaagttgt");
print $prot;
sub translate {
my ($dna)=@_;
$dna = lc ($dna);
my $len = length ($dna);
if ($len % 3 != 0) {
die "Length $len is not a multiple of 3";
}
my $protein = "";
for (my $i = 0; $i < $len; $i += 3) {
my $codon = substr ($dna, $i, 3);
if (!defined ($aa{$codon})) {
die "Codon $codon is illegal";
}
$protein .=$aa{$codon};
}
return $protein;
}
DDESC
Counting residue frequencies
%count = count_residues ("gatgacgaaagttgt");
@residues = keys (%count);
foreach $residue (@residues) {
print "$residue: $count{$residue}\n";
}
sub count_residues {
my ($seq)=@_;
my %freq;
$seq = lc ($seq);
for (my $i = 0; $i < length($seq); ++$i) {
my $residue = substr ($seq, $i, 1);
++$freq{$residue};
}
return %freq;
}
g: 5
a: 5
c: 1
t: 4
Counting N-mer frequencies
%count = count_nmers ("gatgacgaaagttgt", 2);
@nmers = keys (%count);
foreach $nmer (@nmers) {
print "$nmer: $count{$nmer}\n";
}
sub count_nmers {
my ($seq, $n)=@_;
my %freq;
$seq = lc ($seq);
for (my $i = 0; $i
N-mer frequencies for a whole file
my %name2seq = read_FASTA ("fly3utr.txt");
while (($name, $seq)= each %name2seq) {
%count = count_nmers ($seq, 2, %count);
}
@nmers = keys (%count);
foreach $nmer (@nmers) {
print "$nmer: $count{$nmer}\n";
}
sub count_nmers {
my ($seq, $n, %freq)=@_;
$seq = lc ($seq);
for (my $i = 0; $i
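The count_nmers listings on this slide and the previous one are cut off in this copy. What follows is a minimal sketch of how such a subroutine might look, not the original code: it assumes overlapping N-mers are counted, and that any counts passed in as the trailing %freq argument are added to (which is how the whole-file version above calls it).

```perl
use strict;
use warnings;

# Count every overlapping N-mer in $seq.
# Counts passed in via %freq are carried over, so the subroutine
# can be called once per sequence to accumulate totals for a file.
sub count_nmers {
    my ($seq, $n, %freq) = @_;
    $seq = lc ($seq);
    for (my $i = 0; $i <= length ($seq) - $n; ++$i) {
        my $nmer = substr ($seq, $i, $n);
        ++$freq{$nmer};
    }
    return %freq;
}

my %count = count_nmers ("gatgacgaaagttgt", 2);
foreach my $nmer (sort keys %count) {
    print "$nmer: $count{$nmer}\n";
}
```

Passing the hash in and out copies it on every call; passing a reference to the hash (see Section 5) would avoid that.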
Files and filehandles
Opening a file:
open XYZ, $filename;
Closing a file:
close XYZ;
This XYZ is the filehandle
Reading a line:
$data = <XYZ>;
Reading an array:
@data = <XYZ>;
Printing a line:
print XYZ $data;
Read-only:
open XYZ, "<$filename";
Write-only:
open XYZ, ">$filename";
Test if file exists:
if (-e $filename) {
  print "$filename exists!\n";
}
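The operations on this slide can also be written with lexical filehandles and the three-argument form of open, which is the style usually recommended in current Perl. A minimal sketch under that assumption; the scratch file created with File::Temp is only there to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the example does not touch any real data.
my ($out, $filename) = tempfile ();

# Printing lines to the (write-only) filehandle.
print $out "Cyp12a5\tMitochondrion\n";
print $out "Cop\tGolgi\n";
close ($out) or die "Can't close $filename: $!";

# Re-open the same file read-only; the mode is a separate argument.
open (my $in, "<", $filename) or die "Can't open $filename: $!";
my @data = <$in>;                 # reading an array of lines
close ($in) or die "Can't close $filename: $!";

print "Read ", scalar (@data), " lines\n";
unlink ($filename);               # tidy up the scratch file
```

Keeping the mode separate from the filename means a filename that happens to begin with ">" or "<" cannot change how the file is opened.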
Perl for Bioinformatics
Section 5: References
Behind the Scenes
PC = memory + CPU (+ peripherals)
Memory is just a list of bytes (e.g. 2^27 bytes in a machine with 128MB of RAM)
To a first approximation, this is just one huge array. The array index is called the address
Some of the array elements are interpreted as instruction codes by the CPU
[Figure: the CPU beside memory drawn as a column of byte values, with addresses running 0, 1, 2, ..., 2^27-2, 2^27-1]
Buffer overflow attack
Hexadecimal notation
Computers use binary notation, which is tricky to interconvert to/from decimal notation
Written out in full, however, binary notation is big & unwieldy
A compromise is to use hexadecimal
Hexadecimal is base 16 (decimal is base 10, binary is base 2)
The letters A-F are used to represent the extra digits for 10-15
Binary: Decimal: Hexadecimal:
101 5 5
1011 11 B
11100 28 1C
101000011 323 143
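The rows of this table can be reproduced with Perl's built-in conversions. A small sketch: printf's %b and %X formats write a number in binary and hexadecimal, while hex and oct read such strings back into numbers.

```perl
use strict;
use warnings;

# Print each number in binary, decimal and hexadecimal.
foreach my $n (5, 11, 28, 323) {
    printf ("%-10b %-4d %X\n", $n, $n, $n);
}

# Convert strings back to numbers.
my $from_hex    = hex ("1C");          # hexadecimal string to number
my $from_binary = oct ("0b11100");     # oct() also understands a "0b" prefix
print "$from_hex $from_binary\n";      # prints "28 28"
```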
References
Recall the subroutine find_max(@x) which returns the largest element in the array @x
Count the number of times we create an array in this code.
@x = (1, 5, 1, 12, 3, 4, 6);    # Array @x created here
$max = find_max (@x);           # @x copied into @_ here
sub find_max {
  my @data = @_;                # @_ copied into @data here
  ...
All in all, we've created three copies of this array. Each copy uses up time and memory. This seems unnecessary... and it is.
Instead of passing the whole array into the subroutine, we could simply tell the subroutine where in memory the array begins.
The memory address of a particular variable is called a reference to that variable. This is a useful abstraction.
Addresses are often displayed in hexadecimal.
Reference syntax
To create a reference to
a scalar, $x: $scalar_ref = \$x;
an array, @x: $array_ref = \@x;
a hash, %x: $hash_ref = \%x;
To access a reference to
a scalar: $x = $$scalar_ref;
an array: @x = @$array_ref;
an array element: $x = $array_ref->[3];
a hash: %x = %$hash_ref;
a hash element: $x = $hash_ref->{'key'};
Alternative syntax for arrays: $x = $$array_ref[3];
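With this syntax, find_max from the References slide can be rewritten so that the array is never copied. A minimal sketch: the name find_max_ref is an assumption, chosen to keep it distinct from the earlier find_max.

```perl
use strict;
use warnings;

# Find the largest entry of an array, given a reference to the array.
# Only the reference is passed in, so the array itself is not copied.
sub find_max_ref {
    my ($dataRef) = @_;
    my $max = $dataRef->[0];
    foreach my $x (@$dataRef) {
        if ($x > $max) {
            $max = $x;
        }
    }
    return $max;
}

my @x = (1, 5, 1, 12, 3, 4, 6);
my $max = find_max_ref (\@x);
print "Maximum: $max\n";        # prints "Maximum: 12"
```

Unlike the earlier version, which used pop on its private copy, this one only reads the array, so the caller's @x is left unchanged.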
References to scalars
$x = 10;
$y = 20;
print "Initially: x=$x, y=$y\n";
$xReference = \$x;    # This reference points to $x
print "X-reference: $xReference\n";
print "Referenced variable: $$xReference\n";
$$xReference += 3;    # This changes the value of $x
print "Now: x=$x, y=$y\n";
$yReference = \$y;    # This reference points to $y
print "Y-reference: $yReference\n";
print "Referenced variable: $$yReference\n";
$$yReference *= 2;    # This changes the value of $y
print "Finally: x=$x, y=$y\n";
Initially: x=10, y=20
X-reference: SCALAR(0x1832ac0)
Referenced variable: 10
Now: x=13, y=20
Y-reference: SCALAR(0x1832ae4)
Referenced variable: 20
Finally: x=13, y=40
0x1832ac0 is the memory location used to store $x
0x1832ae4 is the memory location used to store $y
References to arrays
@x = ('a', 'c', 'g', 't');
@y = 1..10;
print "x: @x\n";
print "y: @y\n";
$xReference = \@x;             # This reference points to @x
print "X-reference: $xReference\n";
print "Referenced array: @$xReference\n";
$$xReference[3] =~ tr/t/u/;    # This changes the 4th element of @x
print "New x: @x\n";
$yReference = \@y;             # This reference points to @y
print "Referenced array: @$yReference\n";
$yReference->[3] *= 2;         # This changes the 4th element of @y (NB alternative notation)
print "New y: @y\n";
x: a c g t
y: 1 2 3 4 5 6 7 8 9 10
X-reference: ARRAY(0x1832b08)
Referenced array: a c g t
New x: a c g u
Referenced array: 1 2 3 4 5 6 7 8 9 10
New y: 1 2 3 8 5 6 7 8 9 10
Note that the type of reference is now ARRAY, not SCALAR
References to hashes
%comp = ('Cyp12a5' => 'Mitochondrion',
'MRG15' => 'Nucleus',
'Cop' => 'Golgi',
'bor' => 'Cytoplasm',
'Bx42' => 'Nucleus');
$ref = \%comp;                 # The reference points to %comp
print "Values: ", join(" ",values(%comp)), "\n";
print "Ref: $ref\n";
print "Ref values: ", join(" ",values(%$ref)), "\n";
$$ref{'MRG15'} =~ s/N/n/;      # This changes $comp{'MRG15'}
print "New values: ", join(" ",values(%comp)), "\n";
Values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
Ref: HASH(0x1832b08)
Ref values: Cytoplasm Golgi Nucleus Mitochondrion Nucleus
New values: Cytoplasm Golgi Nucleus Mitochondrion nucleus
Note lower-case 'n' after change
References to subroutines
We can also have references to subroutines
Syntax for assigning a subroutine reference:
$subref = \&read_FASTA;
Syntax for calling a subroutine reference:
%name2seq = &$subref ("fly3utr.txt");
Anonymous subroutines:
$subref = sub { print "Hello world\n"; };
&$subref();
Hello world
References to code
sub hello {
  print "Hello @_!\n";
}
my $codeRef1 = \&hello;    # The reference points to the subroutine hello
&$codeRef1 ("Mr", "President");
print "Ref: $codeRef1\n";
my $codeRef2 = sub { print "Goodbye @_!" };    # This is an anonymous subroutine reference
&$codeRef2 ("cruel", "world");
Hello Mr President!
Ref: CODE(0x180cc3c)
Goodbye cruel world!
An anonymous subroutine is one that is never named, but only referenced.
We'll be seeing more about anonymous references on the following slides.
Reasons for references
Increased efficiency/performance (pass a reference instead of the whole thing)
Allowing a subroutine to modify the value of a variable, and have this modification be propagated back to the caller of the subroutine
Allowing arrays/hashes to contain (references to) other arrays/hashes
Abstract representation of subroutines
Anonymous arrays and hashes
Recall the syntax for assigning an entire array...
@nucleotide = ('a', 'c', 'g', 't');
...and the syntax for assigning an entire hash...
%dna2rna = ('a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u');
We can also create an array and assign a reference to it, without explicitly naming the array variable:
$nucleotide_ref = ['a', 'c', 'g', 't'];
Note square brackets instead of parentheses
This is called an anonymous array. We can also create anonymous hashes:
$dna2rna_ref = {'a'=>'a', 'c'=>'c', 'g'=>'g', 't'=>'u'};
Note curly brackets
Arrays of arrays
More precisely, arrays of references-to-arrays.
Suppose we want to represent this matrix:
0 0 0 2
0 0 3 0
0 3 0 1
2 0 1 0
We could do it like this:
$row1 = [0,0,0,2];
$row2 = [0,0,3,0];
$row3 = [0,3,0,1];
$row4 = [2,0,1,0];
@matrix = ($row1,$row2,$row3,$row4);
Or, more succinctly, like this:
@matrix = ([0,0,0,2],
[0,0,3,0],
[0,3,0,1],
[2,0,1,0]);
@matrix is an array of references to arrays
This matrix could be a table of RNA base-pairing scores if the row and column indices are (A,C,G,U). The score of a pair is the number of strong hydrogen bonds that it forms. Thus, A-U and U-A pairs score +2; C-G and G-C pairs score +3; G-U and U-G pairs score +1; and all other pairs score 0.
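To look up a score in such a matrix, each base needs to be turned into a row or column index. A minimal sketch, assuming the (A,C,G,U) ordering described above; the %index hash and the pair_score subroutine are illustrative names, not part of the original slides.

```perl
use strict;
use warnings;

# Base-pairing scores; rows and columns are indexed in the order (A,C,G,U).
my @matrix = ([0,0,0,2],
              [0,0,3,0],
              [0,3,0,1],
              [2,0,1,0]);

# Map each base to its row/column index.
my %index = ('a' => 0, 'c' => 1, 'g' => 2, 'u' => 3);

# Look up the score for a pair of bases (upper or lower case).
sub pair_score {
    my ($x, $y) = @_;
    my $row = $matrix[$index{lc ($x)}];    # a reference to one row
    return $row->[$index{lc ($y)}];
}

print "G-C: ", pair_score ("G", "C"), "\n";    # prints "G-C: 3"
print "G-U: ", pair_score ("G", "U"), "\n";    # prints "G-U: 1"
```

The two-step lookup can be abbreviated to $matrix[$i][$j], since Perl allows the arrow between adjacent subscripts to be dropped.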
Arrays in C and C++
C has nothing like Perl's hashes, although various libraries (e.g. GLIB) have equivalents.
C++'s Standard Template Library offers the map template, which is similar to a hash.
The vector is a C++ template.
Templates (like C arrays) are strongly typed, unlike Perl's weakly typed arrays & hashes.
Genome annotations
GFF annotation format
Nine-column tab-delimited format for simple annotations:
SEQ1 EMBL atg 103 105 . + 0 group1
SEQ1 EMBL exon 103 172 . + 0 group1
SEQ1 EMBL splice5 172 173 . + . group1
SEQ1 netgene splice5 172 173 0.94 + . group1
SEQ1 genie sp5-20 163 182 2.3 + . group1
SEQ1 genie sp5-10 168 177 2.1 + . group1
SEQ2 grail ATG 17 19 2.1 - 0 group2
Columns, in order: Sequence name; Program; Feature type; Start residue (starts at 1); End residue (starts at 1); Score; Strand (+ or -); Coding frame ("." if not applicable); Group
Many of these now obsolete, but name/start/end/strand (and sometimes type) are useful
Methods: read, write, compareTo(GFF_file), getSeq(FASTA_file)
Reading a GFF file
This subroutine reads a GFF file
Each line is made into an array via the split command
The subroutine returns an array of such arrays
sub read_GFF {
my ($filename)=@_;
open GFF, "
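The read_GFF listing is cut off in this copy after the open call. Below is a minimal sketch of a subroutine matching the description above (each line split into an array, an array of such arrays returned). It is not the original code: the lexical filehandle, the tab-only split, and the skipping of blank and "#" comment lines are all assumptions.

```perl
use strict;
use warnings;

# Read a GFF file; return a list of array references, one per line.
# Each referenced array holds the tab-separated fields of that line.
sub read_GFF {
    my ($filename) = @_;
    open (my $fh, "<", $filename) or die "Can't open $filename: $!";
    my @gff;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/;        # skip comment lines (an assumption)
        next if $line !~ /\S/;        # skip blank lines
        push @gff, [ split (/\t/, $line) ];
    }
    close ($fh) or die "Can't close $filename: $!";
    return @gff;
}
```

A caller would then use, for example, $gff[0]->[3] for the start co-ordinate of the first feature.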
Writing a GFF file
We should be able to write as well as read all datatypes
Each array is made into a line via the join command
Arguments: filename & reference to array of arrays
sub write_GFF {
my ($filename, $gffRef)=@_;
open GFF, ">$filename" or die $!;
foreach my $gff (@$gffRef) {
print GFF join ("\t", @$gff), "\n";
}
close GFF or die $!;
}
open evaluates FALSE if the file failed to open, and $! contains the error message
close evaluates FALSE if there was an error with the file
GFF intersect detection
Let (name1,start1,end1) and (name2,start2,end2) be the co-ordinates of two segments
If they don't overlap, there are three possibilities:
name1 and name2 are different;
name1 = name2 but start1 > end2;
name1 = name2 but start2 > end1;
Checking every possible pair takes time N^2 to run, where N is the number of GFF lines (how can this be improved?)
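The three non-overlap conditions translate directly into a small test. A minimal sketch; the subroutine name segments_overlap is an assumption, and the co-ordinates in the usage lines are taken from the example GFF file shown earlier.

```perl
use strict;
use warnings;

# True if two segments, each given as (name, start, end), overlap.
sub segments_overlap {
    my ($name1, $start1, $end1, $name2, $start2, $end2) = @_;
    return 0 if $name1 ne $name2;    # different sequences
    return 0 if $start1 > $end2;     # segment 1 lies entirely after segment 2
    return 0 if $start2 > $end1;     # segment 2 lies entirely after segment 1
    return 1;
}

# The exon (103-172) and the splice site (172-173) share residue 172.
my $exon_vs_splice = segments_overlap ("SEQ1", 103, 172, "SEQ1", 172, 173);
# The start codon (103-105) ends before the sp5-20 feature (163-182) begins.
my $atg_vs_sp5 = segments_overlap ("SEQ1", 103, 105, "SEQ1", 163, 182);
print "exon vs splice5: $exon_vs_splice\n";    # prints 1
print "atg vs sp5-20: $atg_vs_sp5\n";          # prints 0
```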
Self-intersection of a GFF file
sub self_intersect_GFF {
my @gff =@_;
my @intersect;
foreach $igff (@gff) {
foreach $jgff (@gff) {
if ($igff ne $jgff) {
if ($$igff[0] eq $$jgff[0]) {
if (!($$igff[3] > $$jgff[4]
|| $$jgff[3] > $$igff[4])) {
push @intersect, $igff;
last;
}
}
}
}
}
return @intersect;
}
Note: this code is slow. Vast improvements in speed can be gained if we sort the @gff array before checking for intersection.
Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature
Converting GFF to sequence
Puts together several previously-described subroutines
Namely: read_FASTA read_GFF revcomp print_seq
($gffFile, $seqFile)=@ARGV;
@gff = read_GFF ($gffFile);
%seq = read_FASTA ($seqFile);
foreach $gffLine (@gff) {
$seqName =$gffLine->[0];
$seqStart =$gffLine->[3];
$seqEnd =$gffLine->[4];
$seqStrand =$gffLine->[6];
$seqLen =$seqEnd + 1 - $seqStart;
$subseq = substr ($seq{$seqName}, $seqStart-1, $seqLen);
if ($seqStrand eq "-") { $subseq = revcomp ($subseq); }
print_seq ("$seqName/$seqStart-$seqEnd/$seqStrand", $subseq);
}
DNA Microarrays
Normalizing microarray data
Often microarray data are normalized as a precursor to further analysis (e.g. clustering)
This can eliminate systematic bias; e.g.
if every level for a particular gene is elevated, this might signal a problem with the probe for that gene
if every level for a particular experiment is elevated, there might have been a problem with that experiment, or with the subsequent image analysis
Normalization is crude (it can eliminate real signal as well as noise), but common
Rescaling an array
For each element of the array: add a, then multiply by b
@array = (1, 3, 5, 7, 9);
print "Array before rescaling: @array\n";
rescale_array (\@array, -1, 2);    # Array is passed by reference
print "Array after rescaling: @array\n";
sub rescale_array {
  my ($arrayRef, $a, $b) = @_;
  foreach my $x (@$arrayRef) {
    $x = ($x + $a) * $b;
  }
}
Array before rescaling: 1 3 5 7 9
Array after rescaling: 0 4 8 12 16
Microarray expression data
A simple format with tab-separated fields
First line contains experiment names
Subsequent lines contain:
gene name
expression levels for each experiment
* EmbryoStage1 EmbryoStage2 EmbryoStage3 ...
Cyp12a5 104.556 102.441 55.643 ...
MRG15 4590.15 6691.11 9472.22 ...
Cop 33.12 56.3 66.21 ...
bor 5512.36 3315.12 1044.13 ...
Bx42 1045.1 632.7 200.11 ...
... ... ... ...
Messages: readFrom(file), writeTo(file), normalizeEachRow, normalizeEachColumn
Reading a file of expression data
sub read_expr {
my ($filename)=@_;
open EXPR, "
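The read_expr listing is cut off in this copy. The sketch below is one way such a subroutine could be written so that it fits the calls on the following slides, which expect a reference to the list of experiment names and a reference to a hash from gene name to an array of levels. It is an assumption, not the original code.

```perl
use strict;
use warnings;

# Read a tab-separated expression file.
# Returns (reference to array of experiment names,
#          reference to hash mapping gene name => reference to array of levels).
sub read_expr {
    my ($filename) = @_;
    open (my $fh, "<", $filename) or die "Can't open $filename: $!";

    # The first line holds the experiment names, after a placeholder column.
    my $header = <$fh>;
    die "No header line in $filename" unless defined $header;
    chomp $header;
    my @experiment = split (/\t/, $header);
    shift @experiment;               # drop the placeholder ("*" in the example)

    # Each later line holds a gene name followed by its expression levels.
    my %expr;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;   # skip blank lines
        my ($gene, @level) = split (/\t/, $line);
        $expr{$gene} = \@level;
    }
    close ($fh) or die "Can't close $filename: $!";
    return (\@experiment, \%expr);
}
```

Because @level is declared inside the loop, each gene gets a reference to its own separate array.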
Normalizing by gene
A program to normalize expression data from a set of microarray experiments
Normalizes by gene
($experiment, $expr) = read_expr ("expr.txt");
while (($geneName, $lineRef) = each %$expr) {
  normalize_array ($lineRef);
}
sub normalize_array {
  my ($data) = @_;    # NB $data is a reference to an array
  my ($mean, $sd) = mean_sd (@$data);
  @$data = map (($_ - $mean) / $sd, @$data);
}
Could also use the following: rescale_array($data,-$mean,1/$sd);
Normalizing by column
Remaps gene arrays to column arrays
($experiment, $expr) = read_expr ("expr.txt");
my @genes = sort keys %$expr;
for ($i = 0; $i < @$experiment; ++$i) {
  my @col;
  foreach $j (0..@genes-1) {
    $col[$j] = $expr->{$genes[$j]}->[$i];    # Puts column data in @col
  }
  normalize_array (\@col);                   # Normalizes (note use of reference)
  foreach $j (0..@genes-1) {
    $expr->{$genes[$j]}->[$i] = $col[$j];    # Puts @col back into %$expr
  }
}
Perl for Bioinformatics
Section 6: Advanced topics
Sorting
It is often useful to be able to sort an array
  e.g. smallest element first, largest last
Many sort algorithms exist
  Bubblesort (swaps)
  Quicksort (pivots)
  Binary tree sort (inserts)
Typically, in older languages, you have to implement one of these yourself
  although qsort is provided in C
This is changing...
Sorting string data
Perl provides the sort function to sort an
array of strings into alphabetic order:
@nucleotides = ('g', 'c', 't', 'a');
@sorted_nucleotides = sort @nucleotides;
print "Nucleotides: @nucleotides\n";
print "Sorted: @sorted_nucleotides\n";
Nucleotides: g c t a
Sorted: a c g t
Sorting numeric data
To sort numeric data, we have to provide a sort function
This is a subroutine that compares two items, $a and $b
It must return -1 if $a<$b, 0 if $a==$b, and +1 if $a>$b
Fortunately, Perl provides an operator that does just this. It is the spaceship operator: $a <=> $b
The syntax is as follows:
@x = (5, 1, 16, 2, -1, 10);
@y = sort by_number @x;
print "y: @y\n";
sub by_number {
  return $a <=> $b;
}
y: -1 1 2 5 10 16
The variables $a and $b get passed "automagically" into this subroutine. Yet another example of arbitrary Perl weirdness...
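The comparison can also be given inline as a block, which avoids naming a separate subroutine. A short sketch using the same numbers; swapping $a and $b reverses the order, and omitting the block falls back to the default string comparison.

```perl
use strict;
use warnings;

my @x = (5, 1, 16, 2, -1, 10);

my @ascending  = sort { $a <=> $b } @x;    # numeric, smallest first
my @descending = sort { $b <=> $a } @x;    # numeric, largest first
my @as_strings = sort @x;                  # default: compared as strings

print "ascending:  @ascending\n";          # -1 1 2 5 10 16
print "descending: @descending\n";         # 16 10 5 2 1 -1
print "as strings: @as_strings\n";         # -1 1 10 16 2 5
```

The last line shows why the numeric comparison matters: as strings, "10" and "16" sort before "2".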
Standard sort functions
$a <=> $b is the "standard" numeric sort
The "standard" alphabetic sort is $a cmp $b
The alphabetic sort is the one used by default:
$x = "Pears";
$y = "Apples";
$z= "Oranges";
print "$x cmp $y: ", $x cmp $y, "\n";
print "$x cmp $z: ", $x cmp $z, "\n";
print "$y cmp $z: ", $y cmp $z, "\n";
print "$x cmp $x: ", $x cmp $x, "\n";
Pears cmp Apples: 1
Pears cmp Oranges: 1
Apples cmp Oranges: -1
Pears cmp Pears: 0
Sorting a GFF file
We can "chain" multiple sort functions to sort by sequence name, then by startpoint, then by endpoint:
($infile, $outfile) = @ARGV;
@gff = read_GFF ($infile);
@gff = sort by_GFF_startpoint (@gff);    # this line does the actual sort
write_GFF ($outfile, \@gff);
sub by_GFF_startpoint {
  return ($$a[0] cmp $$b[0]              # "chaining" multiple sort comparisons
          or $$a[3] <=> $$b[3]
          or $$a[4] <=> $$b[4]);
}
This works because (X or Y or Z) = X (if X != 0) or Y (if X == 0 and Y != 0) or Z (if X == Y == 0)
Fields 0, 3 and 4 of the GFF line are the sequence name, start and end co-ordinates of the feature
Packages
Perl allows you to organise your subroutines in packages, each with its own namespace
use PackageName;
This line includes a file called "PackageName.pm" in your code
PackageName::doSomething();
This invokes a subroutine called doSomething() in the package called PackageName
Perl looks for the packages in a list of directories specified by the array @INC
print "INC dirs: @INC\n";
INC dirs: Perl/lib Perl/site/lib .
The "." means the current working directory (from Perl 5.26 onwards it is no longer in @INC by default)
Many packages available at http://www.cpan.org/
Object-oriented programming
Data structures are often associated with code
  FASTA: read_FASTA print_seq revcomp ...
  GFF: read_GFF write_GFF ...
  Expression data: read_expr mean_sd ...
Object-oriented programming makes this association explicit.
A type of data structure, with an associated set of subroutines, is called a class
The subroutines themselves are called methods
A particular instance of the class is an object
OOP concepts
Abstraction
  represent the essentials, hide the details
Encapsulation
  storing data and subroutines in a single unit
  hiding private data (sometimes all data, via accessors)
Inheritance
  abstract base interfaces
  multiple derived classes
Polymorphism
  different derived classes exhibit different behaviors in response to the same requests
OOP: Analogy
o Messages (the words in the speech balloons, and also perhaps the coffee itself)
o Overloading (Waiter's response to "A coffee", different response to "A black coffee")
o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
o Encapsulation (Customer doesn't need to know about Kitchen)
o Inheritance (not exactly used here, except implicitly: all types of coffee can be drunk or spilled, all humans can speak basic English and hold cups of coffee, etc.)
o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge, the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is a Factory Method, etc.
OOP: Advantages
Often more intuitive
  Data has behavior
Modularity
  Interfaces are well-defined
  Implementation details are hidden
Maintainability
  Easier to debug, extend
Framework for code libraries
  Graphics & GUIs
  BioPerl, BioJava
OOP: Jargon
Member, method
  A variable/subroutine associated with a particular class
Overriding
  When a derived class implements a method differently from its parent class
Constructor, destructor
  Methods called when an object is created/destroyed
Accessor
  A method that provides [partial] access to hidden data
Factory
  An [abstract] object that creates other objects
Singleton
  A class which is only ever instantiated once (i.e. there's only ever one object of this class)
  C.f. static member variables, which occur once per class
Objects in Perl
An object in Perl is usually a reference to a hash
The method subroutines for an object are found in a class-specific package
Command bless $x, MyPackage associates variable $x with package MyPackage
Syntax of method calls
  e.g. $x->save();
  this is equivalent to PackageName::save($x); (when $x has been blessed into PackageName)
Typical constructor: PackageName->new();
@EXPORT and @EXPORT_OK arrays used to export method names to user's namespace
Many useful Perl objects available at CPAN
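The points above can be put together in a few lines. A minimal sketch: the class name Sequence and its fields are hypothetical, and the revcomp method reuses the logic of the revcomp subroutine from earlier in the course.

```perl
use strict;
use warnings;

package Sequence;    # hypothetical class-specific package

# Constructor: build a hash reference and bless it into the class.
sub new {
    my ($class, $name, $seq) = @_;
    my $self = { 'name' => $name, 'seq' => $seq };
    return bless $self, $class;
}

# Method: the object arrives as the first argument.
sub revcomp {
    my ($self) = @_;
    my $rev = reverse ($self->{'seq'});
    $rev =~ tr/acgt/tgca/;
    return $rev;
}

package main;

my $x = Sequence->new ("example", "accggcatg");
print "Class: ", ref ($x), "\n";                        # prints "Class: Sequence"
print "Method call: ", $x->revcomp(), "\n";             # prints "Method call: catgccggt"
print "Plain call:  ", Sequence::revcomp ($x), "\n";    # same result
```

The last two lines show the equivalence stated above: the arrow syntax looks up revcomp in the package that $x was blessed into and passes $x as the first argument.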
AUTOLOAD
When an undefined method is called on an object, the special method AUTOLOAD is called, if defined
Special variable $AUTOLOAD contains function name
Allows implementation of e.g. default accessors for hash elements
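A default accessor of this kind might look as follows. This is a minimal sketch under stated assumptions: the class name Gene and its fields are hypothetical, and the accessor both reads a field and, when given an argument, sets it.

```perl
use strict;
use warnings;

package Gene;        # hypothetical class
our $AUTOLOAD;       # set by Perl to the fully-qualified name of the missing method

sub new {
    my ($class, %fields) = @_;
    my $self = { %fields };
    return bless $self, $class;
}

# Called for any method that is not otherwise defined in this class.
sub AUTOLOAD {
    my ($self, @args) = @_;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;                  # strip the package qualifier
    return if $name eq 'DESTROY';       # object clean-up is not an accessor
    die "No such field: $name\n"
        unless ref ($self) and exists $self->{$name};
    $self->{$name} = $args[0] if @args; # with an argument, act as a setter
    return $self->{$name};
}

package main;

my $gene = Gene->new ('name' => 'Cop', 'compartment' => 'Golgi');
print $gene->name(), " is found in the ", $gene->compartment(), "\n";
```

The DESTROY check matters because Perl looks for a DESTROY method whenever an object is freed, and that lookup would otherwise land in AUTOLOAD.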
GD.pm
A graphics package by Lincoln Stein
use GD;
# create a new image
$im = new GD::Image(100,100);
# allocate some colors
$white =$im->colorAllocate(255,255,255);
$black = $im->colorAllocate(0,0,0);
$red =$im->colorAllocate(255,0,0);
$blue =$im->colorAllocate(0,0,255);
# make the background transparent
$im->transparent($white);
# Put a black frame around the picture
$im->rectangle(0,0,99,99,$black);
# Draw a blue oval
$im->arc(50,50,95,75,0,360,$blue);
# And fill it with red
$im->fill(50,50,$red);
# Convert the image to PNG and print it out
print $im->png;
CGI.pm
CGI (Common Gateway Interface)
  Page-based web programming paradigm
CGI.pm (also by Lincoln Stein)
  Perl CGI interface
  runs on a webserver
  allows you to write a program that runs behind a webpage
CGI (static, page-based) is gradually being supplemented by AJAX
BioPerl
A set of Open Source Bioinformatics packages
  largely object-oriented
Can be downloaded from bio.perl.org
Handles various different file formats
Parses the output of BLAST and other programs
Basis for Ensembl
  the human genome annotation project
  www.ensembl.org
Example: GenBank
Example: Bio::DB::GenBank
Interface to the GenBank database
Saves having to rewrite same old parsers
use Bio::DB::GenBank;
$gb= new Bio::DB::GenBank;
$seq =$gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# or ...
$seq =$gb->get_Seq_by_acc('J00522'); # Accession Number
$seq =$gb->get_Seq_by_version('J00522.1'); # Accession.version
$seq =$gb->get_Seq_by_gi('405830'); # GI Number
Digest::MD5
MD5 is a one-way hash function
e.g. gravatar.com uses MD5 to map (authenticated) email addresses to avatar icons
use Digest::MD5 qw(md5 md5_hex md5_base64);
my $baseURL = "http://www.gravatar.com/avatar/";
while (<>) {
  chomp;
  print $baseURL, md5_hex(lc($_)), "\n";
}
Other programming languages
Procedural languages
  Interpreted/scripting languages
    "Shell" languages (TCSH, BASH, CSH)
    Python: cleaner, object-oriented
    Ruby: even more object-oriented
  Compiled languages
    C: very basic, portable and fast
    C++: more elaborate, object-oriented C
    Java: stripped-down portable C++; "safer" & cleaner
Functional languages
  More mathematical, cleaner; but less pragmatic
  Lisp, Scheme
    Lisp is the oldest. (Lots (of (parentheses)))
  Prolog, ML, Haskell
Co-ordinate transformation
Motivation: map clones to chromosomes
[Figure: a Chromosome track labelled 17455 and 17855, above a Clones track labelled 403 and 803]
Co-ordinate transformations (cont.)
What if a segment spans multiple clones?