
Bioinformatics: Theory and Practice (生物信息学理论和实践)
Tang Jijun (唐继军), jtang@cse.sc, 13928761660


#!/usr/bin/perl -w
use Bio;
use strict;
use warnings;

my $DNA = fasta_read();

# Reading frames 1-3 on the forward strand
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";

# Reverse complement, then reading frames 4-6
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;

print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
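The reverse-complement step can be tried in isolation. The following is a minimal sketch that does not depend on the course's Bio module; the helper name revcom and the toy sequence are assumptions made for illustration, and the frames are printed as raw DNA rather than translated.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    return $rc;
}

my $dna = 'ATGGCTGACT';              # toy sequence, chosen for illustration
my $rc  = revcom($dna);

# Frames 1-3 start at offsets 0, 1, 2 of the forward strand;
# frames 4-6 start at the same offsets of the reverse complement.
for my $offset (0 .. 2) {
    print "Frame ", $offset + 1, ": ", substr($dna, $offset), "\n";
    print "Frame ", $offset + 4, ": ", substr($rc,  $offset), "\n";
}
```

Each printed string is what would be handed to a translation routine such as dna2peptide.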

my $x = 10;

for (my $x = 0; $x < 5; $x++) {
    Scope();
    print $x, "\n";    # prints 0 to 4: the loop has its own $x
}

print $x, "\n";        # prints 10: the outer $x is untouched

sub Scope {
    my $x = 0;         # a third $x, private to Scope
}

my $x = 10;

for (my $x = 0; $x < 5; $x++) {
    Scope();
    print $x, "\n";    # still prints 0 to 4: the loop has its own $x
}

print $x, "\n";        # prints 0: Scope changed the outer $x

sub Scope {
    $x = 0;            # no "my": this is the $x declared at the top
}

sub extract_sequence_from_fasta_data {

    my(@fasta_file_data) = @_;
    my $sequence = '';

    foreach my $line (@fasta_file_data) {

        if ($line =~ /^\s*$/) {          # discard blank line
            next;
        } elsif($line =~ /^\s*#/) {      # discard comment line
            next;
        } elsif($line =~ /^>/) {         # discard FASTA header line
            next;
        } else {                         # keep line, add to sequence string
            $sequence .= $line;
        }
    }

    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;

    return $sequence;
}


Molecular Scissors

Molecular Cell Biology, 4th edition

R = G or A
Y = C or T
M = A or C
K = G or T
S = G or C
W = A or T
B = not A (C or G or T)
D = not C (A or G or T)
H = not G (A or C or T)
V = not T (A or C or G)
N = A or C or G or T

sub IUB_to_regexp {
    my($iub) = @_;
    my $regular_expression = '';

    # Each IUB ambiguity code maps to a regular expression character class
    my %iub2character_class = (
        A => 'A',
        C => 'C',
        G => 'G',
        T => 'T',
        R => '[GA]',
        Y => '[CT]',
        M => '[AC]',
        K => '[GT]',
        S => '[GC]',
        W => '[AT]',
        B => '[CGT]',
        D => '[AGT]',
        H => '[ACT]',
        V => '[ACG]',
        N => '[ACGT]',
    );

    # Remove the ^ signs from the recognition sites
    $iub =~ s/\^//g;

    # Translate each character in the iub sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }
    return $regular_expression;
}
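As a quick check of the translation table, the sketch below restates the same idea in self-contained form and applies it to two recognition sites. The lowercase name iub_to_regexp, the example sites, and the test sequence are assumptions made for illustration, not taken from the slides.

```perl
use strict;
use warnings;

# Map each IUB ambiguity code to a regex character class.
my %iub2character_class = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]',
    S => '[GC]', W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

sub iub_to_regexp {
    my ($iub) = @_;
    $iub =~ s/\^//g;    # drop the cut-site marker
    return join '', map { $iub2character_class{$_} } split //, $iub;
}

my $plain     = iub_to_regexp('G^AATTC');   # no ambiguity codes
my $ambiguous = iub_to_regexp('GTYRAC');    # Y and R are ambiguity codes

print "$plain\n";       # GAATTC
print "$ambiguous\n";   # GT[CT][GA]AC
print "site found\n" if 'AAGTTAACAA' =~ /$ambiguous/;
```

Like the original, this sketch assumes every character in the site is one of the fifteen codes in the table; an unknown character would produce an undefined value.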

Hash

• Initialize: my %hash = ();
• Add key/value pair: $hash{$key} = $value;
• Add more keys:
    • %hash = ( 'key1', 'value1', 'key2', 'value2' );
    • %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};

# Iterate over key/value pairs with each
while ( my ($key, $value) = each(%hash) ) {
    print "$key => $value\n";
}

# Iterate over keys and look up each value
foreach my $key ( keys %hash ) {
    my $value = $hash{$key};
    print "$key => $value\n";
}

sub parseREBASE {
    my($rebasefile) = @_;
    my %rebase_hash = ( );
    my $name;
    my $site;
    my $regexp;

    open(my $rebase_filehandle, '<', $rebasefile) or die "Cannot open file\n";

    while(<$rebase_filehandle>) {

        # Discard header lines
        ( 1 .. /Rich Roberts/ ) and next;

        # Discard blank lines
        /^\s*$/ and next;

        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);

        # The first field is the enzyme name, the last is the recognition site
        $name = shift @fields;
        $site = pop @fields;

        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);

        # Store the data into the hash
        $rebase_hash{$name} = "$site $regexp";
    }

    close($rebase_filehandle);

    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}

Range

• ( 1 .. /Rich Roberts/ ) and next
    • The range is true from the first line through the first line containing "Rich Roberts"
    • If the range is true, Perl goes on to evaluate the statement after "and"
    • If the range is not true, Perl does not evaluate the statement after "and"
• open(…) or die
    • If the file can be opened, the expression is already true, so there is no need to evaluate the statement after "or"
    • If the file cannot be opened, the expression is false, so Perl evaluates the statement after "or" to see whether it can be true
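The header-skipping idiom can be tried without a real REBASE file. The sketch below reads from an in-memory string whose contents are made up for illustration; only the two short-circuit lines inside the loop come from the slides.

```perl
use strict;
use warnings;

# Made-up stand-in for a REBASE file: two header lines, a blank line, two records.
my $text = "REBASE header\nRich Roberts\n\nEcoRI G^AATTC\nBamHI G^GATCC\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file\n";

my @kept;
while (<$fh>) {
    # True from input line 1 through the first line matching /Rich Roberts/,
    # so "next" runs for every header line.
    ( 1 .. /Rich Roberts/ ) and next;

    # Skip blank lines the same way.
    /^\s*$/ and next;

    chomp;
    push @kept, $_;
}
close $fh;

print "$_\n" for @kept;
```

Only the two enzyme records reach @kept; the header lines and the blank line are skipped.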

Array operators

• push and pop (right-most element)
    • @mylist = (1,2,3); push(@mylist,4,5,6);
    • $oldvalue = pop(@mylist);
• shift and unshift (left-most element)
    • @fred = (5,6,7); unshift(@fred,2,3,4);
    • $x = shift(@fred);
• reverse: @a = (7,8,9); @b = reverse(@a);
• sort: @a = (7,9,9); @b = sort(@a);

sub match_positions {

    my($regexp, $sequence) = @_;

    use BeginPerlBioinfo;

    my @positions = ( );

    # Collect the 1-based start position of every match
    while ( $sequence =~ /$regexp/ig ) {
        push ( @positions, pos($sequence) - length($&) + 1);
    }

    return @positions;
}
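A self-contained variant makes the position arithmetic easy to check. This sketch uses the built-in @- array (start offsets of the last match) instead of pos and $&, which is a substitution of my own rather than the slides' method; the test sequence is made up.

```perl
use strict;
use warnings;

# Return the 1-based start position of each non-overlapping match.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions;
    while ($sequence =~ /$regexp/ig) {
        push @positions, $-[0] + 1;    # $-[0] is the 0-based start of the match
    }
    return @positions;
}

my @hits = match_positions('GAATTC', 'ccGAATTCggaattcGAATTC');
print "@hits\n";    # 3 10 16
```

Because of the /g loop, each search resumes after the previous match, so overlapping occurrences would not be reported.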

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

@file_data = get_file_data("sample.dna");
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE('bionet');

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );

exit;

Print to file

• Open a file to print
    • open FILE, ">filename.txt";
    • open (FILE, ">filename.txt");
• Print to the file
    • print FILE $str;

# Write a new file (an existing file named "out" is overwritten)
open(FILE, ">out") or die "Cannot open file to write";

print FILE "Test\n";

close FILE;
exit;

# Append to the end of the file
open(FILE, ">>out") or die "Cannot open file to write";

print FILE "Test\n";

close FILE;
exit;

#!/usr/bin/perl
# $0 holds the script name; @ARGV holds the command-line arguments
print "My name is $0 \n";
print "First arg is: $ARGV[0] \n";
print "Second arg is: $ARGV[1] \n";
print "Third arg is: $ARGV[2] \n";

# $#ARGV is the index of the last argument, so the count is one more
my $num = $#ARGV + 1;
print "How many args? $num \n";
print "The full argument string was: @ARGV \n";

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# The DNA file name now comes from the first command-line argument
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE('bionet');

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );

exit;

use BeginPerlBioinfo;

my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# Both file names now come from the command line:
# the DNA file first, the REBASE file second
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE($ARGV[1]);

do {
    print "Search for what restriction site for (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction enzyme $query is not in the DNA:\n";
        }
    }
} until ( $query =~ /quit/ );

exit;

Regular Expression

• ^ beginning of string
• $ end of string
• . any character except newline
• * match 0 or more times
• + match 1 or more times
• ? match 0 or 1 times
• | alternative
• ( ) grouping and capturing ("storing")
• [ ] set of characters
• { } repetition modifier
• \ quote the next character, or introduce a special sequence

Repeats

• a* zero or more a's
• a+ one or more a's
• a? zero or one a's (i.e., optional a)
• a{m} exactly m a's
• a{m,} at least m a's
• a{m,n} at least m but at most n a's


\


[]

Perl tr/// function

• tr means transliterate: it replaces a character with another character
• $dna =~ tr/a/c/ replaces all "a" with "c" in $dna
• It also works on a range: $dna =~ tr/a-z/A-Z/ replaces all lower case letters with upper case
• tr also counts: $count = ($string =~ tr/A//)
  (you might think this also deletes all "A" from the string, but it doesn't)
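The three uses of tr can be seen side by side in a short sketch. The sequence and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $dna = 'acgtACGTaa';                 # toy sequence

# Transliterate a range: copy the string, then uppercase the copy.
(my $upper = $dna) =~ tr/a-z/A-Z/;

# Count: with an empty replacement list, tr returns the number of
# matching characters and leaves the string unchanged.
my $count_a = ($upper =~ tr/A//);

print "$upper\n";                       # ACGTACGTAA
print "Number of A: $count_a\n";        # Number of A: 4
```

To actually delete characters, tr needs the d modifier, as in tr/A//d.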

Wildcards

• Perl has a set of wildcard characters for regular expressions that are completely different from the ones used by Unix
• the dot (.) matches any character except newline
• \d matches any digit (a number from 0-9)
• \w matches any word character (a letter, digit, or underscore; not punctuation or space)
• \s matches a single white space character (use \s+ for any amount)
• ^ matches the beginning of a line
• $ matches the end of a line

(Yes, this is very confusing!)

Repeat for a count

• Use curly brackets to show that a character repeats a specific number (or range) of times:
• find an EcoRI fragment of 100-500 bp length (two EcoRI sites with any other sequence between):

if ($ecofrag =~ /GAATTC[GATC]{100,500}GAATTC/)

• The + sign is used to indicate an unlimited number of repeats (occurs 1 or more times)
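A short sketch shows the counted repeat on made-up data. The two test sequences are assumptions built for illustration, and the capture group that reports the fragment length is an addition to the pattern on the slide.

```perl
use strict;
use warnings;

# Two made-up sequences: 120 bp and 40 bp between a pair of EcoRI sites.
my $long_frag  = 'TT' . 'GAATTC' . ('ACGT' x 30) . 'GAATTC' . 'AA';
my $short_frag = 'TT' . 'GAATTC' . ('ACGT' x 10) . 'GAATTC' . 'AA';

for my $ecofrag ($long_frag, $short_frag) {
    # The parentheses capture the sequence between the two sites in $1.
    if ($ecofrag =~ /GAATTC([GATC]{100,500})GAATTC/) {
        print "Found a fragment of ", length($1), " bp\n";
    } else {
        print "No fragment of 100-500 bp\n";
    }
}
```

The first sequence matches with a 120 bp insert; the second does not, because 40 bp is below the lower bound.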

my $mystring;
$mystring = "Hello world!";

# No match: matching is case sensitive by default
if($mystring =~ m/World/) { print "Yes"; }

# Match: the i modifier ignores case
if($mystring =~ m/World/i) { print "Yes"; }

Grabbing parts of a string

• Regular expressions can do more than just ask "if" questions
• They can be used to extract parts of a line of text into variables; check this out:

/^>(\w+)\s(.+)$/;

Complete gibberish, right?

• It means:
    - look for the > sign at the beginning of a FASTA formatted sequence file
    - dump the first word (\w+) into variable $1 (the sequence ID)
    - after a space, dump the rest of the line (.+), until you reach the end of line $, into variable $2 (the description)
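Applied to a concrete header line, the pattern behaves as described. The header text below is hypothetical.

```perl
use strict;
use warnings;

my $header = '>seq1 Example sequence, hypothetical';    # made-up FASTA header

my ($id, $description);
if ($header =~ /^>(\w+)\s(.+)$/) {
    ($id, $description) = ($1, $2);
    print "ID: $id\n";
    print "Description: $description\n";
}
```

Note that \w+ stops at the first character that is not a letter, digit, or underscore, so identifiers containing characters such as | or . would need a looser class like \S+.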

$mystring = "[2004/04/13] The date of this article.";

# A single digit
if($mystring =~ m/(\d)/) {
    print "The first digit is $1.";
}

# One or more digits
if($mystring =~ m/(\d+)/) {
    print "The first number is $1.";
}

# Three numbers separated by slashes
if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) {
    print "The date is $1-$2-$3";
}

# Every number, one match per loop iteration
while($mystring =~ m/(\d+)/g) {
    print "Found number $1.";
}

# Every number at once, in list context
@myarray = ($mystring =~ m/(\d+)/g);
print join(",", @myarray);


Working with Single DNA Sequences

Learning Objectives

• Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR
• Find out about gene-prediction methods, their potential, and their limitations
• Understand how genomes and sequences are assembled

Outline

1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA sequence
4. Assembling a genome

Cleaning DNA Sequences

• In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid)
• Sequences of the vector can be mixed with your DNA sequence
• Before working with your DNA sequence, you should always clean it with VecScreen

VecScreen

• http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
• Runs a special version of Blast
• A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

What to do if hits found

• If the hits are at the ends of the sequence, you can just remove them
• If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away

Computing a Restriction Map

• It is possible to cut DNA sequences using restriction enzymes
• Each type of restriction enzyme recognizes and cuts a different sequence:
    • EcoRI: GAATTC
    • BamHI: GGATCC
• There are more than 900 different restriction enzymes, each with a different specificity
• The restriction map is the list of all potential cleavage sites in a DNA molecule
• You can compile a restriction map with www.firstmarket.com/cutter

Cannot get it to work!


http://biotools.umassmed.edu/tacg4

Making PCR with a Computer

• Polymerase Chain Reaction (PCR) is a method for amplifying DNA
• PCR is used for many applications, including
    • Gene cloning
    • Forensic analysis
    • Paternity tests
• PCR amplifies the DNA between two anchors
• These anchors are called the PCR primers

Designing PCR Primers

• PCR primers are typically 20 nucleotides long
• The primers must hybridize well with the DNA
• On biotools.umassmed.edu, find the best location for the primers:
    • Most stable
    • Longest extension

Analyzing DNA Composition

• DNA composition varies a lot
• Stability of a DNA sequence depends on its G+C content (total guanine and cytosine)
• High G+C makes very stable DNA molecules
• Online resources are available to measure the GC content of your DNA sequence
• Also for counting words and internal repeats
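G+C content is also easy to compute locally. The following is a minimal sketch, assuming the sequence contains only the letters A, C, G, and T in either case; the function name gc_content is hypothetical.

```perl
use strict;
use warnings;

# Fraction of bases that are G or C (0 for an empty sequence).
sub gc_content {
    my ($dna) = @_;
    my $length = length $dna;
    return 0 unless $length;
    my $gc = ($dna =~ tr/GCgc//);    # tr counts without changing the string
    return $gc / $length;
}

printf "G+C content: %.2f\n", gc_content('ATGGCTGACT');    # G+C content: 0.50
```

Ambiguity codes such as N are counted in the length but not in the G+C total here, which lowers the reported fraction; whether that is appropriate depends on the data.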


http://helixweb.nih.gov/emboss/html/

Counting words

• Sequence: ATGGCTGACT
• 1-letter words: A, T, G, G, C, T, G, A, C, T
• 2-letter words: AT, TG, GG, GC, CT, TG, GA, AC, CT
• 3-letter words: ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
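The overlapping words listed above can be tallied with a hash. This is a minimal sketch; the function name count_words is an assumption, and the sequence is the one from the slide.

```perl
use strict;
use warnings;

# Count every overlapping word of length $k in $dna.
sub count_words {
    my ($dna, $k) = @_;
    my %count;
    for my $i (0 .. length($dna) - $k) {
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my %dinucleotides = count_words('ATGGCTGACT', 2);
for my $word (sort keys %dinucleotides) {
    print "$word $dinucleotides{$word}\n";
}
```

For the slide's sequence, CT and TG each occur twice and the other five 2-letter words occur once.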


www.genomatix.de/cgi-bin/tools/tools.pl


EMBOSS servers

• European Molecular Biology Open Software Suite

• http://pro.genomics.purdue.edu/emboss/

ORF

• EMBOSS
• NCBI


ncbi.nlm.nih.gov/gorf/gorf.html

Internal repeats

• A word repeated in the sequence, long enough not to occur by chance
• Can be imperfect (regular expression)
• A dot plot is the best way to spot it


arbl.cvmbs.colostate.edu/molkit

Predicting Genes

• The most important analysis carried out on DNA sequences is gene prediction
• Gene prediction requires different methods for eukaryotes and prokaryotes
• Most gene-prediction methods use hidden Markov models

Predicting Genes in Prokaryotic Genome

• In prokaryotes, protein-coding genes are uninterrupted
    • No introns
• Predicting protein-coding genes in prokaryotes is considered a solved problem
    • You can expect 99% accuracy

Finding Prokaryotic Genes with GeneMark

• GeneMark is the state of the art for microbial genomes
• GeneMark can
    • Find short proteins
    • Resolve overlapping genes
    • Identify the best start codon
• Use exon.gatech.edu/GeneMark
• Click the "heuristic models"

Predicting Eukaryotic Genes

• Eukaryotic genes (human, for example) are very hard to predict
• Precise and accurate eukaryotic gene prediction is still an open problem
    • ENSEMBL contains 21,662 genes for the human genome
    • There may well be more genes than that in the genome, as yet unpredicted
• You can expect 70% accuracy on the human genome with automatic methods
• Experimental information is still needed to predict eukaryotic genes

Finding Eukaryotic Genes with GenomeScan

• GenomeScan is the state of the art for eukaryotic genes
• GenomeScan works best with
    • Long exons
    • Genes with a low GC content
• It can incorporate experimental information
• Use genes.mit.edu/genomescan

Producing Genomic Data

• Until recently, sequencing an entire genome was very expensive and difficult
• Only major institutes could do it
• Today, scientists estimate that in 10 years, it will cost about $1000 to sequence a human genome
• With sequencing so cheap, assembling your own genomes is becoming an option
• How could you do it?


Sequencing and Assembling a Genome (I)

• To sequence a genome, the first task is to cut it into many small, overlapping pieces

• Then clone each piece

Sequencing and Assembling a Genome (II)

• Each piece must be sequenced
• Sequencing machines cannot do an entire sequence at once
    • They can only produce short sequences smaller than 1 Kb
    • These pieces are called reads
• It is necessary to assemble the reads into contigs
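The core idea of joining overlapping reads can be illustrated with a toy example. This is only a sketch of exact suffix-prefix merging for two error-free reads on the same strand; it is not how production assemblers work, and the function names, the reads, and the minimum overlap of 3 are all assumptions.

```perl
use strict;
use warnings;

# Length of the longest suffix of $left that equals a prefix of $right,
# or 0 if no overlap of at least $min bases exists.
sub overlap_length {
    my ($left, $right, $min) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Merge two reads into one contig if they overlap; otherwise return undef.
sub merge_reads {
    my ($left, $right, $min) = @_;
    my $len = overlap_length($left, $right, $min);
    return undef unless $len;
    return $left . substr($right, $len);
}

my $contig = merge_reads('ATGGCTGA', 'CTGACTTT', 3);
print "$contig\n";    # ATGGCTGACTTT
```

Real reads contain sequencing errors and come from both strands, so practical assemblers rely on approximate alignment rather than exact string comparison.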

Sequencing and Assembling a Genome (III)

• The most popular program for assembling reads is PHRAP
    • Available at www.phrap.org
• Other programs exist for joining smaller datasets
    • For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php