4.1 More loops
4.2 Loops
Commands inside a loop are executed repeatedly (iteratively):
my $num=0;
print "Guess a number.\n";
while ($num != 31) {
    $num = <STDIN>;
}
print "correct!\n";

my @names = <STDIN>;
chomp(@names);
my $name;
foreach $name (@names) {
    print "Hello $name!\n";
}
4.3 Loops: for
The for loop is controlled by three statements:
• 1st is executed once, before the first iteration
• 2nd is the condition tested before every iteration; the loop stops as soon as it is false
• 3rd is executed before every re-iteration
for (my $i=0; $i<10; $i++) {
    print "$i\n";
}

my $i=0;
while ($i<10) {
    print "$i\n";
    $i++;
}
These two loops are equivalent; the only difference is that in the while version $i is still in scope after the loop.
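Because the 2nd statement is tested before every iteration, including the first, a for loop may run zero times. The following is a minimal sketch of that behaviour; the helper name count_iterations is made up for this illustration and does not appear in the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times the loop body runs
# when $i starts at $start and the condition is $i < $limit.
sub count_iterations {
    my ($start, $limit) = @_;
    my $count = 0;
    for (my $i = $start; $i < $limit; $i++) {
        $count++;
    }
    return $count;
}

print count_iterations(0, 10), "\n";    # body runs for $i = 0..9
print count_iterations(10, 10), "\n";   # condition is false at once: body never runs
```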
4.4 Breaking out of loops
next – skip to the next iteration
last – skip out of the loop
my @lines = <STDIN>;
foreach my $line (@lines) {
    if (substr($line,0,1) eq ">") { next; }
    if (substr($line,0,8) eq "**stop**") { last; }
    print $line;
}
4.5 Breaking out of loops
die – end the program and print an error message to the standard error (STDERR)
if ($score < 0) { die "score must be positive"; }
score must be positive at test.pl line 8.
Note: if you end the string with a "\n" then only your message will be printed, without the "at test.pl line 8." part
* warn does the same thing as die without ending the program
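The difference between die and warn can be tried out with the sketch below. Two things here are not from the slides and are assumptions of this example: the helper name check_score, and the use of Perl's eval block to catch the die so that the demonstration can keep running and show the message.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the score, or dies if it is negative.
sub check_score {
    my ($score) = @_;
    # The trailing "\n" suppresses the "at ... line N." suffix
    if ($score < 0) { die "score must be positive\n"; }
    return $score;
}

# warn writes to STDERR, but the program carries on
warn "checking scores...\n";

# eval (not covered in the slides) catches the die; the message ends up in $@
my $ok = eval { check_score(-5); 1 };
if (!$ok) {
    print "caught: $@";
}
print "still running\n";
```

Without the eval block, the die inside check_score would end the program at that point and "still running" would never be printed.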
4.6 The Programming Process
4.7 The programming process
It pays to plan ahead before writing a computer program:
1. Define the purpose of the program
2. Identify the required inputs
3. Decide how to present the outputs
4. Make an overall design of the program
5. Refine the design, specify more details
6. Write the code – one stage at a time and test each stage
7. Debug…
4.8 An example: SAGE libraries
SAGE (Serial Analysis of Gene Expression) is used to identify all transcripts that are expressed in a tissue:
1. Double-stranded cDNA is generated from cell extracts
2. The cDNA is cleaved with a restriction enzyme (NlaIII)
3. The 3'-most fragment of each cDNA is then collected by its poly-A tail
4. The fragments are ligated to linkers containing a recognition site for a type IIS restriction enzyme and a PCR primer site
5. This restriction enzyme cuts 15bp away from its recognition site
6. Ligation, PCR, cleavage, concatenation, cloning, sequencing… The result is a 10bp tag sequence from each mRNA
7. The 10bp sequences are searched in an mRNA database and the corresponding genes are identified
(Figure: illustration of steps 1, 2&3 and 4&5.)
4.10 Predicting the SAGE tag of an mRNA
It would be useful to know what tag to expect for each mRNA in the database.
So let's write a script:
1. Purpose: To predict the 10bp sequence of the SAGE tag of a given mRNA
2. Inputs: A list of mRNA sequences in FASTA format
>gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNA
GTGGTATTTCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA......GTCGACGGCGATGGGTTCCTGGACTCTGACGAGTTCGCGCTGGCCTTGCACTTAATCAACGTCAAGCTGGAAGGCTGCGAGCTGCCCACCGTGCTGCCGGAGCACTTAGTACCGCCGTCGAAGCGCTATGACTAGTGTCCTGTAGCATACGCATACGCACACTAGATCACACAGCCTCACAATTCCCAAAAAAAAAAAAAAAA
>gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNA
GGTAGGGCGCTACCGCCTCCGCCCGCCTCTCGCGCTGTTCCTCCGCGGTATGCCCGCGCCGGCAGCCGGC......TATTATATAGAGAAATATATTGTGTATGTAGGATGTGCTTATTGCATTACATTTATCACTTGTCTTAACTAGAATGCATTAACCTTTTTTGTACCCTGGTCCTAAAACATTATTAAAAAGAAAGGCTAAAAAAAAAAAAAAAAA
>gi|55742710|ref|NM_153068.2| Mus musculus EH-domain containing 2 (Ehd2), mRNA
TGAGGGGGCCTGGGGCCCGCCCTGCTCGCCGCTCCTAGCGCACGCGGCCCCACCCGTCTCACTCCACTGC......
4.11
3. Decide how to present the results
Simply print the header line of each mRNA and then its predicted 10bp tag, like so:
>gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNA
ATCACACAGC
>gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNA
AATGCATTAA
...
...
4.12
4. Overall design:
1. For each mRNA in the input:
    1. Read the sequence
    2. Find the most downstream recognition site of NlaIII (CTAG in this exercise)
    3. Get the 10bp tag after that site
    4. Print it
4.13
Flow diagram: Start -> Read sequence -> Find most downstream CTAG -> Get the 10bp tag -> Print the tag -> End of input? (No: back to Read sequence; Yes: End)
4.14
5. Refine the design, specify more details:
1. For each mRNA in the input (use a loop):
    1. Read the sequence
        1. Store its header line in one string variable
        2. Concatenate all lines of the sequence and store it in another string variable
    2. Find the most downstream recognition site of NlaIII (CTAG in this exercise)
        1. Go over the sequence with a loop, starting from the 3’ tail, and going back until the first CTAG is found
    3. Get the 10bp tag after that site
        1. Take a substr of length 10
    4. Print it
6. Write the code
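Steps 2 and 3 of the refined design could be sketched as follows. This is only one possible version under stated assumptions: it uses Perl's built-in rindex to locate the last occurrence of the site instead of the explicit backwards loop named in the design, the helper name predict_tag is made up, and it assumes that a site with fewer than 10 bases after it should count as "no tag". The test sequence is the 3' end of the first example mRNA shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 10bp tag that follows the most
# downstream occurrence of $site in $seq, or undef if there is none.
sub predict_tag {
    my ($seq, $site) = @_;
    my $pos = rindex($seq, $site);        # last occurrence, -1 if absent
    return undef if $pos < 0;
    my $tagStart = $pos + length($site);
    # Assumption: fewer than 10 bases after the site means no usable tag
    return undef if $tagStart + 10 > length($seq);
    return substr($seq, $tagStart, 10);
}

# 3' end of the first example mRNA (EHD4)
my $seq = "CATACGCACACTAGATCACACAGCCTCACAATTCCCAAAAAAAAAAAAAAAA";
my $tag = predict_tag($seq, "CTAG");
print defined($tag) ? "$tag\n" : "no tag\n";   # prints ATCACACAGC
```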
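4.15
Flow diagram, with "Read sequence" expanded: Read line -> Header? (Yes: Save header, then Read line; No: Concatenate to sequence, then Read line). The remaining steps are unchanged: Find most downstream CTAG -> Get the 10bp tag -> Print the tag -> End of input? (No: repeat; Yes: End)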
4.16
Flow diagram, with "Find most downstream CTAG" expanded: Start pos. at end of sequence -> Check pos. for “CTAG” -> “CTAG” at pos? (Yes: go on to Get the 10bp tag; otherwise pos--, test Pos < 0? and check again)
4.17
Flow diagram, "Find most downstream CTAG": Start pos. at end of sequence -> Check pos. for “CTAG” -> “CTAG” at pos? (Yes: done; otherwise pos--) -> Pos < 0? (Yes: Print “no tag”)
4.18
Flow diagram, "Find most downstream CTAG": Start pos. at end of sequence -> Check pos. for “CTAG” -> “CTAG” at pos? (Yes: leave the loop; otherwise pos--) -> Pos < 0? (Yes: leave the loop). After the loop: Pos < 0? (Yes: Print “no tag”; No: Print tag)
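The search loop in these diagrams could be written along the following lines. This is a sketch, not the course's own solution: the helper name find_last_site and the two test sequences are made up for illustration, and the loop starts four letters before the end because a 4-letter site cannot begin any later.

```perl
use strict;
use warnings;

# Hypothetical helper following the flow diagram: start at the end of
# the sequence and step back (pos--) until "CTAG" is found or pos < 0.
sub find_last_site {
    my ($seq) = @_;
    my $pos = length($seq) - 4;       # last position where 4 letters fit
    while ($pos >= 0) {
        if (substr($seq, $pos, 4) eq "CTAG") { last; }
        $pos--;
    }
    return $pos;                      # negative means "not found"
}

# Made-up test sequences: one with a site, one without
foreach my $seq ("TTCTAGAATGCATTAACCTT", "AAAAAAAAAAAAAAAA") {
    my $pos = find_last_site($seq);
    if ($pos < 0) {
        print "no tag\n";
    } else {
        print substr($seq, $pos + 4, 10), "\n";   # the 10bp after the site
    }
}
```

Testing Pos < 0 once more after the loop, as in the last diagram, is what distinguishes "loop ended because the site was found" from "loop ended because the start of the sequence was reached".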
4.19 FASTA: Analyzing complex input
Overall design:
1. Read the sequence
2. Do something
Let’s see how it’s done…
Flow diagram: Start -> Read line -> Header? (Yes: Save header, then Read line; No: Concatenate to sequence, then Read line) -> Do something -> End of input? (No: repeat; Yes: End)
4.20
$line = <STDIN>;
my $endOfInput = 0;
while ($endOfInput==0) {
    # 1.1. Read sequence name from FASTA header
    if (substr($line,0,1) eq ">") {
        $name = substr($line,1);
    } else...
    # 1.2. Read sequence until next FASTA header
    $seq = "";
    $line = <STDIN>;
    while (substr($line,0,1) ne ">") {
        $seq = $seq . $line;
        $line = <STDIN>;
        if (!defined($line)) {
            $endOfInput = 1;
            last;
        }
    }
    # 2. Do something
    ...
}
4.21
###################################
# 1. Foreach sequence in the input
my (@lines, $line, $name, $seq);
$line = <STDIN>;
chomp $line;
my $endOfInput = 0;
while ($endOfInput==0) {
    ################################
    # 1.1. Read sequence name from FASTA header
    if (substr($line,0,1) eq ">") {
        $name = substr($line,1);
    } else {
        die "bad FASTA format";
    }
    # 1.2. Read sequence until next FASTA header
    $seq = "";
    $line = <STDIN>;
    chomp $line;
    # Read until next header or end of input
    while (substr($line,0,1) ne ">") {
        $seq = $seq . $line;
        $line = <STDIN>;
        if (!defined($line)) {
            $endOfInput = 1;
            last;
        }
        chomp $line;
    }
    ################################
    # 2. Do something
    ...
}
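For comparison, the same reading step can be tried on a small made-up input without typing into STDIN. The sketch below is a single-loop variant, not the nested-loop code above, and it relies on a few things the slides do not cover: a sub named read_fasta (hypothetical), array references to hold each name and sequence pair, and an in-memory filehandle opened on a string.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return
# a list of [name, sequence] pairs, using the same header test as above.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    my ($name, $seq);
    while (my $line = <$fh>) {
        chomp $line;
        if (substr($line, 0, 1) eq ">") {
            # A new header closes the previous record, if there was one
            if (defined($name)) { push(@records, [$name, $seq]); }
            $name = substr($line, 1);
            $seq  = "";
        } elsif (defined($name)) {
            $seq = $seq . $line;          # concatenate sequence lines
        } else {
            die "bad FASTA format\n";     # sequence data before any header
        }
    }
    if (defined($name)) { push(@records, [$name, $seq]); }
    return @records;
}

# Made-up two-record input, read through an in-memory filehandle
my $text = ">seq1 demo\nACGT\nTTCTAG\nAATGCATTAACC\n>seq2 demo\nAAAA\n";
open(my $in, "<", \$text) or die "cannot open in-memory input\n";
foreach my $rec (read_fasta($in)) {
    my ($name, $seq) = @$rec;
    print ">$name\n$seq\n";               # step 2, "Do something", goes here
}
close($in);
```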
4.22 FASTA: An alternative approach
(which is more confusing and generally not recommended!)
my @fasta = <STDIN>;
my $oneline = join("", @fasta);   # Concatenate all lines
my ($start, $end, $name);
for (my $i=0; $i<length($oneline); $i++) {
    my $c = substr($oneline,$i,1);
    my $sub10 = substr($oneline,$i,10);
    if ($c eq ">") {
        # Save header start position
        $start = ($i+1);
    }
    if ($c eq "]") {
        # Save header end position
        $end = $i;
    }
    if (???) {
        # If we found what we were looking for...
        # Print last header
        $name = substr($oneline,$start,$end-$start+1);
    }
}