4.1 More loops

4.2 Loops

Commands inside a loop are executed repeatedly (iteratively):

my $num=0;
print "Guess a number.\n";
while ($num != 31) {
    $num = <STDIN>;
}
print "correct!\n";

my @names = <STDIN>;
chomp(@names);
my $name;
foreach $name (@names) {
    print "Hello $name!\n";
}

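4.3 Loops: for

The for loop is controlled by three statements:

• 1st is executed once, before the first iteration

• 2nd is the loop condition: it is checked before every iteration, and the loop stops when it is false

• 3rd is executed at the end of every iteration, before the condition is checked again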

for (my $i=0; $i<10; $i++) {
    print "$i\n";
}

my $i=0;
while ($i<10) {
    print "$i\n";
    $i++;
}

These two loops are equivalent.
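
As a quick check of this equivalence, the sketch below collects what each loop would print into a string and compares the two. The variable names $from_for and $from_while are made up for this illustration.

```perl
use strict;
use warnings;

# Collect what each loop would print, so the two results can be compared.
my $from_for = "";
for (my $i = 0; $i < 10; $i++) {
    $from_for .= "$i\n";
}

my $from_while = "";
my $i = 0;
while ($i < 10) {
    $from_while .= "$i\n";
    $i++;
}

print "Same output\n" if $from_for eq $from_while;
```

One small difference remains: in the for version, $i is visible only inside the loop, while in the while version it stays available after the loop ends.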

4.4 Breaking out of loops

next – skip to the next iteration

last – skip out of the loop

my @lines = <STDIN>;
foreach my $line (@lines) {
    if (substr($line,0,1) eq ">") { next; }
    if (substr($line,0,8) eq "**stop**") { last; }
    print $line;
}
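
To see which lines survive next and last, here is a small sketch that runs the same loop over a fixed list instead of <STDIN>. The input lines and the @printed array are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines, standing in for what <STDIN> would return.
my @lines = (
    "first line\n",
    ">a line starting with '>' is skipped\n",
    "second line\n",
    "**stop** nothing after this is printed\n",
    "third line\n",
);

my @printed;    # remembers what was printed, for inspection afterwards
foreach my $line (@lines) {
    if (substr($line, 0, 1) eq ">")        { next; }  # skip this line only
    if (substr($line, 0, 8) eq "**stop**") { last; }  # leave the loop entirely
    push @printed, $line;
    print $line;
}
```

Only "first line" and "second line" are printed: next drops the ">" line but keeps looping, while last ends the loop before "third line" is ever examined.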

4.5 Breaking out of loops

die – end the program and print an error message to the standard error (STDERR)

if ($score < 0) { die "score must be positive"; }

Output:

score must be positive at test.pl line 8.

Note: if you end the string with a "\n", only your message is printed (without the "at test.pl line 8." part)

* warn does the same thing as die, but without ending the program
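
The sketch below illustrates both points, assuming a helper called check_score (a made-up name). Here the die message ends with "\n", so no file name or line number is appended. The eval block is not covered in these slides; it is used here only to capture the die message so that the example can keep running.

```perl
use strict;
use warnings;

# check_score is a hypothetical helper used only for this illustration.
sub check_score {
    my ($score) = @_;
    if ($score < 0) { die "score must be positive\n"; }  # "\n": message only
    return $score;
}

# eval catches the die, so the program can continue; the message is in $@.
my $result = eval { check_score(-5) };
my $error  = $@;
print "Caught: $error" if !defined $result;

# warn also prints to STDERR, but execution continues afterwards.
warn "this is only a warning\n";
print "still running\n";
```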

4.6 The Programming Process

4.7 The programming process

It pays to plan ahead before writing a computer program:

1. Define the purpose of the program

2. Identify the required inputs

3. Decide how to present the outputs

4. Make an overall design of the program

5. Refine the design, specify more details

6. Write the code – one stage at a time and test each stage

7. Debug…

4.8 An example: SAGE libraries

SAGE (Serial Analysis of Gene Expression) is used to identify all transcripts that are expressed in a tissue:

1. Double-stranded cDNA is generated from cell extracts

2. The cDNA is cleaved with a restriction enzyme (NlaIII)

3. The 3'-most fragment of each cDNA is then collected by its poly-A tail

4. The fragments are ligated to linkers containing a recognition site for a type IIS restriction enzyme and a PCR primer site

5. This restriction enzyme cuts 15bp away from its recognition site

6. Ligation, PCR, cleavage, concatenation, cloning, sequencing… yielding a 10bp tag sequence from each mRNA

7. The 10bp sequences are searched in an mRNA database and the corresponding genes are identified

[Figure: illustration of steps (1), (2 & 3) and (4 & 5)]


4.10 Predicting the SAGE tag of an mRNA

It would be useful to know what tag to expect for each mRNA in the database.

So let's write a script:

1. Purpose: To predict the 10bp sequence of the SAGE tag of a given mRNA

2. Inputs: A list of mRNA sequences in FASTA format

>gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNA
GTGGTATTTCTTCGTTGTCTCTGGCGTGGTCACGTTGATTGGTCCGCTATCTGGACCGAAAAAAGTCGTA......GTCGACGGCGATGGGTTCCTGGACTCTGACGAGTTCGCGCTGGCCTTGCACTTAATCAACGTCAAGCTGGAAGGCTGCGAGCTGCCCACCGTGCTGCCGGAGCACTTAGTACCGCCGTCGAAGCGCTATGACTAGTGTCCTGTAGCATACGCATACGCACACTAGATCACACAGCCTCACAATTCCCAAAAAAAAAAAAAAAA

>gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNA
GGTAGGGCGCTACCGCCTCCGCCCGCCTCTCGCGCTGTTCCTCCGCGGTATGCCCGCGCCGGCAGCCGGC......TATTATATAGAGAAATATATTGTGTATGTAGGATGTGCTTATTGCATTACATTTATCACTTGTCTTAACTAGAATGCATTAACCTTTTTTGTACCCTGGTCCTAAAACATTATTAAAAAGAAAGGCTAAAAAAAAAAAAAAAAA

>gi|55742710|ref|NM_153068.2| Mus musculus EH-domain containing 2 (Ehd2), mRNA
TGAGGGGGCCTGGGGCCCGCCCTGCTCGCCGCTCCTAGCGCACGCGGCCCCACCCGTCTCACTCCACTGC......

4.11

3. Decide how to present the results

Simply print the header line of each mRNA and then its predicted 10bp tag, like so:

>gi|24646380|ref|NM_079608.2| Mus musculus EH-domain containing 4 (EHD4), mRNA
ATCACACAGC

>gi|71895640|ref|NM_001031040.1| Mus musculus EH-domain containing 3 (EHD3), mRNA
AATGCATTAA

...

...

4.12

4. Overall design:

1. For each mRNA in the input:

   1. Read the sequence

   2. Find the most downstream recognition site of NlaIII (CTAG)

   3. Get the 10bp tag after that site

   4. Print it

4.13

Flow diagram:

Start → Read sequence → Find most downstream CTAG → Get the 10bp tag → Print the tag → End of input?

   No: go back to "Read sequence"

   Yes: End

4.14

5. Refine the design, specify more details:

1. For each mRNA in the input (use a loop):

   1. Store its header line in one string variable

   2. Concatenate all lines of the sequence and store it in another string variable

2. Find the most downstream recognition site of NlaIII (CTAG)

   1. Go over the sequence with a loop, starting from the 3’ tail, and going back until the first CTAG is found

3. Get the 10bp tag after that site

   1. Take a substr of length 10

4. Print it

6. Write the code

4.15

[Flow diagram: the "Read sequence" step is expanded into: Read line; Header? (Yes: Save header; No: Concatenate to sequence); Read line. The remaining steps (Get the 10bp tag, Print the tag, End of input?) are unchanged.]

4.16–4.18

[Flow diagram: the "Find most downstream CTAG" step is expanded into: Start pos. at end of sequence; Check pos. for “CTAG”; “CTAG” at pos? (Yes: Get the 10bp tag; No: pos--, then check Pos < 0?). If Pos < 0, no “CTAG” was found: Print “no tag” instead of Print tag.]
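
The search described in this diagram can be sketched as follows. The sub name find_tag and its optional parameters are made up for this illustration; the test sequence is the 3' end of the EHD4 mRNA from slide 4.10.

```perl
use strict;
use warnings;

# find_tag is a hypothetical name. It returns the 10bp tag that follows the
# most downstream occurrence of the site, or undef when there is none.
sub find_tag {
    my ($seq, $site, $tag_len) = @_;
    $site    = "CTAG" unless defined $site;     # site used in this example
    $tag_len = 10     unless defined $tag_len;

    # Start at the last position where the site could still fit.
    my $pos = length($seq) - length($site);
    while ($pos >= 0) {
        if (substr($seq, $pos, length($site)) eq $site) { last; }
        $pos--;
    }
    return undef if $pos < 0;                   # no site found: "no tag"

    # The tag may be shorter than $tag_len if the site is near the 3' end.
    return substr($seq, $pos + length($site), $tag_len);
}

my $seq = "GCATACGCACACTAGATCACACAGCCTCACAATTCCCAAAAAAAAAAAAAAAA";
my $tag = find_tag($seq);
print defined $tag ? "$tag\n" : "no tag\n";
```

Perl's built-in rindex would find the same position in a single call (it returns -1 when the site is absent); the explicit loop is kept here to mirror the diagram.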


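4.19 FASTA: Analyzing complex input

Overall design:

1. For each sequence in the input:

   1.1. Read the sequence name from the FASTA header

   1.2. Read the sequence until the next FASTA header

2. Do something

Let’s see how it’s done…

[Flow diagram: as in 4.15, with the tag steps replaced by a generic "Do something" step]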

4.20

$line = <STDIN>;

my $endOfInput = 0;
while ($endOfInput==0) {

    # 1.1. Read sequence name from FASTA header
    if (substr($line,0,1) eq ">") {
        $name = substr($line,1);
    } else...

    # 1.2. Read sequence until next FASTA header
    $seq = "";
    $line = <STDIN>;
    while (substr($line,0,1) ne ">") {
        $seq = $seq . $line;
        $line = <STDIN>;
        if (!defined($line)) {
            $endOfInput = 1;
            last;
        }
    }

    # 2. Do something...
}


4.21

###################################
# 1. Foreach sequence in the input
my (@lines, $line, $name, $seq);
$line = <STDIN>;
chomp $line;

my $endOfInput = 0;
while ($endOfInput==0) {

    ################################
    # 1.1. Read sequence name from FASTA header
    if (substr($line,0,1) eq ">") {
        $name = substr($line,1);
    } else {
        die "bad FASTA format";
    }
    # 1.2. Read sequence until next FASTA header
    $seq = "";
    $line = <STDIN>;
    chomp $line;
    # Read until next header or end of input
    while (substr($line,0,1) ne ">") {
        $seq = $seq . $line;
        $line = <STDIN>;
        if (!defined($line)) {
            $endOfInput = 1;
            last;
        }
        chomp $line;
    }

    ################################
    # 2. Do something...

}
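
For a complete, runnable variant, here is a sketch that fills in the "Do something" step with the SAGE tag prediction. It is one possible arrangement, not the code from the slides: read_fasta is a made-up helper, the two-record input is invented, the input is read from an in-memory string instead of STDIN, and defined checks in the loop conditions replace the $endOfInput flag.

```perl
use strict;
use warnings;

# read_fasta is a hypothetical helper: it reads FASTA records from a
# filehandle and returns a list of [name, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    my $line = <$fh>;
    while (defined $line) {
        chomp $line;
        # 1.1. Read sequence name from FASTA header
        if (substr($line, 0, 1) ne ">") { die "bad FASTA format\n"; }
        my $name = substr($line, 1);

        # 1.2. Read sequence until next FASTA header or end of input
        my $seq = "";
        $line = <$fh>;
        while (defined $line && substr($line, 0, 1) ne ">") {
            chomp $line;
            $seq .= $line;
            $line = <$fh>;
        }
        push @records, [$name, $seq];
    }
    return @records;
}

# Made-up two-record input, read from an in-memory string instead of STDIN.
my $input = ">seq1 example\nGGGCTAGAAACC\nCCGGGTTTAAA\n>seq2 example\nACGTACGT\n";
open(my $fh, '<', \$input) or die "cannot open in-memory input: $!";
my @records = read_fasta($fh);
close($fh);

# 2. Do something: here, print each header and the 10bp after the last CTAG
foreach my $record (@records) {
    my ($name, $seq) = @$record;
    my $pos = rindex($seq, "CTAG");     # -1 when the site is absent
    my $tag = $pos < 0 ? "no tag" : substr($seq, $pos + 4, 10);
    print ">$name\n$tag\n";
}
```

To read real data, passing \*STDIN to read_fasta instead of the in-memory filehandle should work the same way.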


4.22 FASTA: An alternative approach (which is more confusing and generally not recommended!)

my @fasta = <STDIN>;
my $oneline = join("", @fasta);   # Concatenate all lines
for ($i=0; $i<length($oneline); $i++) {
    my $c = substr($oneline,$i,1);
    my $sub10 = substr($oneline,$i,10);

    if ($c eq ">") {
        # Save header start position
        $start = ($i+1);
    }
    if ($c eq "]") {
        # Save header end position
        $end = $i;
    }
    if (???) {
        # If we found what we were looking for...
        # Print last header
        $name = substr($oneline,$start,$end-$start+1);
    }
}
