Bio-Informatics Lectures: A Short Introduction
The History of Bioinformatics
Sanger Sequencing
PCR in the presence of fluorescent, chain-terminating dideoxynucleotides
Massively Parallel Sequencing
Illumina/Solexa
Roche/454, Emulsion PCR
Metzker, Nature Reviews Genetics 11:31-46
Illumina/Solexa: Solid-Phase Amplification
http://www.genome.gov/sequencingcosts/
~200 million sequences
1000 billion bases
Growth of GenBank and WGS
http://www.ncbi.nlm.nih.gov/genbank/statistics
Growth of UniProtKB/TrEMBL
http://www.ebi.ac.uk/uniprot/TrEMBLstats
What Does the Sequence Information Tell Us?
Bio-Informatics
Scope of this lab
DATABASES:
GenBank - http://www.ncbi.nlm.nih.gov
EMBL - http://www.ebi.ac.uk
DDBJ - http://www.ddbj.nig.ac.jp
Sequence Search and Retrieval: BLAST
Sequence Alignment: ClustalW2, MAFFT
Sequence Analysis and Domain Search: Pfam and SMART
Protein Structure and Prediction: PyMOL
Molecular Evolution: MEGA
1. Be familiar with sequence databases and some online bioinformatics tools
http://www.ebi.ac.uk/services/all
More Tools to Discover on Your Own
http://www.expasy.org
Online Tools
Scope of this lab
2. Try Some Simple Programming (Stand-alone)
Basic UNIX Commands: cd, mkdir, mv, cp, rm, cat, ls, pwd, gunzip, unzip, tar
Perl: String, Array, Hash
R: Read a file, column, row, plot, hist, heat map
Beginning with a DNA Sequence
Proteins
The primary sequence, structure, and function of a protein are inter-related
MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGG
N-terminus
C-terminus
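Going from a DNA sequence to a protein sequence can be sketched in a few lines. The example below is only an illustration, assuming the standard genetic code and a coding sequence that starts in frame; the function name `translate` is hypothetical and not part of the lecture material.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Six codons read from the N-terminus towards the C-terminus.
print(translate("ATGCAGATCTTTGTTAAG"))
```

The protein is written from the N-terminus (first codon) to the C-terminus (last codon before the stop).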
Database Sequence Similarity Searching
Definition: The application of computation, mathematical algorithms, and statistical inference to rapidly find sequences (hits) in a database that are similar to a target (query) sequence.
All similarity searching methods rely on the concept of alignment between sequences.
A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
Edit Distance
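As a minimal sketch of the idea, the function below computes the Levenshtein form of edit distance: the smallest number of substitutions, insertions, and deletions needed to turn one sequence into the other. The unit cost for every operation is an assumption made for illustration; the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, with unit cost per edit."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1       # drop char_a
            insertion = curr[j - 1] + 1  # add char_b
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))
```

A distance of 0 means the sequences are identical; larger values mean more edits separate them.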
Sequence Alignment and Dynamic Programming
Sequence Alignment Comparison and Substitution Matrix
Some popular scoring matrices are:
PAM (Point Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.
BLOSUM (BLOcks SUbstitution Matrix): for finding common motifs. For example, BLOSUM62 is derived from alignments of sequences sharing no more than 62% identity.
Experimentation has shown that the BLOSUM-62 matrix is among the best for detecting most weak protein similarities.
Log-odds matrices
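A log-odds score compares how often two residues are observed aligned with how often they would be paired by chance. The sketch below assumes the common form s(a,b) = scale * log2(q_ab / (p_a * p_b)), rounded to an integer; the frequencies used are made-up numbers for illustration, not values from any published matrix.

```python
import math


def log_odds_score(q_ab: float, p_a: float, p_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score for aligning residues a and b.

    q_ab  : observed frequency of the aligned pair (a, b)
    p_a   : background frequency of residue a
    p_b   : background frequency of residue b
    scale : 2.0 gives half-bit units (an assumption for this sketch)
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Hypothetical frequencies: the pair is seen 8 times more often than chance.
print(log_odds_score(q_ab=0.02, p_a=0.05, p_b=0.05))
```

Positive scores mark pairs seen more often than expected by chance, negative scores mark pairs seen less often, and zero marks pairs seen at the chance rate.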
Local and Global Alignments
Smith-Waterman
Needleman-Wunsch
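The two algorithms share the same dynamic programming recurrence and differ mainly in initialization and in whether a cell may reset to zero. The sketch below computes scores only (no traceback) and assumes a simple match/mismatch scheme with a linear gap penalty; the parameter values and function names are illustrative choices, not the lecture's.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("ACGT", "AGT"), local_score("TTTACGTTT", "GGACGGG"))
```

In practice the match/mismatch values would be replaced by a substitution matrix lookup for proteins.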
BLAST/FASTA Search and k-Tuple Method
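The k-tuple idea is to avoid aligning everything against everything: first index short words of length k, then look only at places where the query and a database sequence share a word. The sketch below uses exact word matches only, which is a simplification (protein BLAST also considers similar, high-scoring neighbor words); all names are hypothetical.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def seed_hits(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where an identical k-word occurs."""
    index = index_words(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, s_pos))
    return hits


print(seed_hits("MQIFVK", "AAQIFVAA", k=3))
```

Hits that share the same difference between subject and query position lie on one diagonal and are candidates for extension into a longer alignment.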
Use proteins for database similarity searches when possible
Lab 1
Sequence Search and Retrieval: BLAST
Sequence Alignment: ClustalW2, MAFFT
Sequence Analysis and Domain Search: Pfam and SMART
Protein Structure and Prediction: PyMOL
Molecular Evolution: MEGA
Sequence Format - FASTA
>AT4G05320
ATGCAGATCTTTGTTAAGACTCTCACCGGAAAGACAATCACCCTCGAGGTGGAAAGCTCCGACACCATCGACAACGTTAAGGCCAAGATCCAGGATAAGGAGGGCATTCCTCCGGATCAGCAGAGGCTTATTTTCGCCGGCAAGCAGCTAGAGGATGGCCGTACGTTGGCTGATTACAATATCCAGAAGGAATCCACCCTCCACTTGGTCCTCAGGCTCCGTGGTGGTATGCAGATTTTCGTTAAAACCCTAACGGGAAAGACGATTACTCTTGAGGTGGAGAGTTCTGACACCATCGACAACGTCAAGGCCAAGATCCAAGACAAAGAGGGTATTCCTCCGGACCAGCAGAGGCTGATCTTCGCCGGAAAGCAGTTGGAGGATGGCAGAACTCTTGCTGACTACAATATCCAGAAGGAGTCCACCCTTCATCTTGTTCTCAGGCTCCGTGGTGGTATGCAGATTTTCGTTAAGACGTTGACTGGGAAAACTATCACTTTGGAGGTGGAGAGTTCTGACACCATTGATAACGTGAAAGCCAAGATCCAAGACAAAGAGGGTATTCCTCCGGACCAGCAGAGATTGATCTTCGCCGGAAAACAACTTGAAGATGGCAGAACTTTGGCCGACTACAACATTCAGAAGGAGTCCACACTCCACTTGGTCTTGCGTCTGCGTGGAGGTATGCAGATCTTCGTGAAGACTCTCACCGGAAAGACCATCACTTTGGAGGTGGAGAGTTCTGACACCATTGATAACGTGAAAGCCAAGATCCAGGACAAAGAGGGTATCCCACCGGACCAGCAGAGATTGATCTTCGCCGGAAAGCAACTTGAAGATGGAAGAACTTTGGCTGACTACAACATTCAGAAGGAGTCCACACTTCACTTGGTCTTGCGTCTGCGTGGAGGTATGCAGATCTTCGTGAAGACTCTCACCGGAAAGACTATCACTTTGGAGGTAGAGAGCTCTGACACCATTGACAACGTGAAGGCCAAGATCCAGGATAAGGAAGGAATCCCTCCGGACCAGCAGAGGTTGATCTTTGCCGGAAAACAATTGGAGGATGGTCGTACTTTGGCGGATTACAACATCCAGAAGGAGTCGACCCTTCACTTGGTGTTGCGTCTGCGTGGAGGTATGCAGATCTTCGTCAAGACTTTGACCGGAAAGACCATCACCCTTGAAGTGGAAAGCTCCGACACCATTGACAACGTCAAGGCCAAGATCCAGGACAAGGAAGGTATTCCTCCGGACCAGCAGCGTCTCATCTTCGCTGGAAAGCAGCTTGAGGATGGACGTACTTTGGCCGACTACAACATCCAGAAGGAGTCTACTCTTCACTTGGTCCTGCGTCTTCGTGGTGGTTTCTAA
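A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of how such text could be read, the function below collects records into a dictionary keyed by the first word of each header; the function name and the choice of a dictionary are assumptions made for this example.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier = first word of header
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {key: "".join(parts) for key, parts in records.items()}


example = ">AT4G05320 shortened example\nATGCAGATC\nTTTGTTAAG\n"
print(parse_fasta(example))
```

Sequence lines belonging to the same record are joined, so wrapped and unwrapped files give the same result.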
Lab 1 - BLAST
E value: the expectation value, that is, the number of hits with a score at least as good as the observed one that would be expected by chance in a database of this size. The lower the E value, the more significant the hit.
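How an E value relates to an alignment score can be sketched with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length, and n the database length. The parameter values in the usage line are illustrative placeholders only, and real BLAST additionally applies effective-length corrections that this sketch leaves out.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score S into bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int, lam: float, k: float) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumed for illustration, not taken from the lecture).
print(e_value(raw_score=100, query_len=76, db_len=10_000_000, lam=0.267, k=0.041))
```

Because the score sits in the exponent, a modest increase in score lowers the E value sharply, while a larger database raises it in direct proportion.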
Lab 1 - Domain Search
Lab 1 - Structure Visualization
PyMOL
Lab 1 - Phylogenetics
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Maximum likelihood
Maximum parsimony
Neighbor joining
MrBayes: Bayesian Inference of Phylogeny
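Of the methods listed above, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and set the distance from the new cluster to every other cluster to the size-weighted average of the two merged ones. The example below returns only the tree topology as a Newick-like string and omits branch lengths; the input format and function name are assumptions made for illustration.

```python
def upgma(names: list, dist: dict) -> str:
    """Cluster names by UPGMA and return the topology as a nested string.

    dist maps pairs such as ("A", "B") to their pairwise distance.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        a, b = sorted(min(d, key=d.get))  # closest pair of clusters
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for other in sizes:
            if other not in (a, b):
                # Size-weighted average of the distances to the merged clusters.
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / new_size
        # Discard every distance that still refers to a or b.
        d = {pair: value for pair, value in d.items() if a not in pair and b not in pair}
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes))


# Hypothetical distance matrix for four taxa.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], distances))
```

UPGMA assumes a constant rate of change along all lineages, which is one reason the other listed methods are often preferred for real data.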