MGM workshop. 19 Oct 2010
Some frequently-used Bioinformatics Tools
Konstantinos Mavrommatis
Prokaryotic Superprogram
Outline
Pairwise Alignment
  Global/Local, Scoring
  BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
Multiple Sequence Alignment
  ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
Phylogenetic analysis and tree construction
  BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
HMM
  Protein family profiles
http://expasy.org/tools/
Alignment
Insert spaces (gaps) at arbitrary positions so that both sequences have the same length and no position holds a space in both sequences.
Find the arrangement of the two sequences that identifies regions of similarity.
Alignment methods: Dot plots
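As a rough illustration of the dot plot idea, the sketch below marks every cell where a short window of one sequence equals a window of the other, so shared regions show up as diagonal runs. The function names, the toy sequences, and the window size are assumptions made for this example, not taken from the slides.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return a grid of booleans: cell [i][j] is True when the window
    starting at seq_a[i] equals the window starting at seq_b[j]."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [[seq_a[i:i + window] == seq_b[j:j + window] for j in range(cols)]
            for i in range(rows)]


def render(grid, seq_a, seq_b):
    """Draw the grid as text: one row per start position in seq_a."""
    lines = ["  " + seq_b[:len(grid[0])]]
    for i, row in enumerate(grid):
        lines.append(seq_a[i] + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Toy sequences (hypothetical); a window of 2 suppresses single-letter noise.
grid = dot_plot("GATTACA", "GCATTACG", window=2)
print(render(grid, "GATTACA", "GCATTACG"))
```

Larger windows trade sensitivity for a cleaner picture; tools such as Dotlet expose a similar window setting.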
Global vs Local alignment
Global alignment: an alignment that assumes the two sequences are similar over their entire length
Local alignment: an alignment that searches for segments of the two sequences that match well
It may seem that one should always use local alignments. However, each has its applications.
Substitution matrices
http://www.russelllab.org/aas/
Scoring an alignment
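A minimal sketch of scoring a finished alignment column by column. The scoring scheme (match +1, mismatch -1, gap -2) and the toy alignment are assumptions for illustration; in practice the match/mismatch terms would usually come from a substitution matrix.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold a gap in both rows")
        if a == "-" or b == "-":
            total += gap          # one residue aligned to a gap
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# Toy alignment (hypothetical): 5 matches, 1 mismatch, 1 gap.
print(score_alignment("GATTACA", "GA-TGCA"))
```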
Global alignment
S1 = HGSAQVKGHG
S2 = KTEAEMKASEDLKKHGT
KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
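One common way to compute a global alignment is Needleman-Wunsch dynamic programming. The sketch below applies it to S1 and S2 as written on the slide; the linear scoring scheme (match +1, mismatch -1, gap -2) is an assumption, so the result need not reproduce the alignment shown above, which presumably used a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment: return (score, gapped_a, gapped_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap            # a aligned against leading gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap            # b aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, row_a, row_b = n, m, [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


S1 = "HGSAQVKGHG"
S2 = "KTEAEMKASEDLKKHGT"
total, row_1, row_2 = needleman_wunsch(S1, S2)
print(total)
print(row_2)
print(row_1)
```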
Local Alignment
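Local alignment is typically computed with the Smith-Waterman variant of dynamic programming: cell scores are floored at zero and the traceback starts at the highest-scoring cell. The sketch below assumes a simple scheme (match +2, mismatch -1, gap -2) and toy sequences that share one embedded segment; both are illustrative choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (score, gapped_a_segment, gapped_b_segment)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                       # restart the alignment here
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Toy sequences (hypothetical) with unrelated flanks around a shared core.
print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```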
How BLAST works
Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG
Common 3-mer, extended: GCQSQCGG
HSP:
Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
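The seed-and-extend idea can be sketched on the slide's own query and subject. This is a deliberate simplification: seeds are exact shared 3-mers, extension is ungapped with identity scoring (match +1, mismatch -1), and the extension stops once the running score falls `x_drop` below its best value. Real BLAST scores with a substitution matrix and neighbourhood words, so the segment found here is shorter than the reported HSP. All function names and parameter values are assumptions.

```python
from collections import defaultdict

QUERY = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
SUBJECT = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where the same k-mer occurs."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=5):
    """Ungapped extension of a seed; return (score, q_start, s_start, length)."""
    def walk(qi, sj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break                      # score dropped too far; give up
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    total = k * match + left_score + right_score
    return total, i - left_len, j - left_len, left_len + k + right_len


hsps = [extend(QUERY, SUBJECT, i, j) for i, j in find_seeds(QUERY, SUBJECT)]
score, q_start, s_start, length = max(hsps)   # highest-scoring segment pair
print(score)
print(QUERY[q_start:q_start + length])
print(SUBJECT[s_start:s_start + length])
```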
Types of BLAST
Nucleic sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                                 Database
blastn    nucleic acid sequence                 nucleic acid sequence database
blastp    protein sequence                      protein sequence database
blastx    nucleic acid (6-frame translation)    protein sequence database
tblastn   protein sequence                      nucleic acid sequence database (6-frame translation)
tblastx   nucleic acid (6-frame translation)    nucleic acid sequence database (6-frame translation)
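The 6-frame translation used by blastx, tblastn and tblastx can be sketched as follows: translate the sequence in three reading frames, then do the same for its reverse complement. The sketch assumes the standard genetic code and uses the example nucleic sequence from the slide; the function names and the frame labels (+1 to -3) are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return a dict mapping frame label ('+1'..'-3') to its translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frames("atcgatatatatagactgactgact").items():
    print(label, protein)
```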
Exact multiple alignment by dynamic programming
Complexity = $O(N^S \cdot 2^S \cdot S^2)$
N: length of sequences
S: number of sequences
Only feasible for 4-5 sequences max.
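To see why only a handful of sequences is feasible, the sketch below evaluates the operation count N^S * 2^S * S^2, assuming that is the bound intended on the slide and ignoring constant factors. The sequence length of 300 is an arbitrary illustrative choice.

```python
def exact_msa_cost(n, s):
    """Rough operation count for exact DP alignment of s sequences of length n."""
    return n ** s * 2 ** s * s ** 2


# Growth in the number of sequences for a hypothetical length of 300 residues.
for s in range(2, 7):
    print(s, f"{exact_msa_cost(300, s):.2e}")
```

Each additional sequence multiplies the work by roughly 2N, which is why heuristic progressive aligners are used in practice.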
Neighbor Joining
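A sketch of a single neighbor-joining step, assuming the usual formulation: compute Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), join the pair with the smallest Q, and derive the two branch lengths to the new node. The five-taxon distance matrix is a hypothetical toy example; a full implementation would repeat this step on the reduced matrix until the tree is resolved.

```python
def q_matrix(dist):
    """Neighbor-joining Q matrix for a symmetric distance matrix."""
    n = len(dist)
    totals = [sum(row) for row in dist]
    return [[0 if i == j else (n - 2) * dist[i][j] - totals[i] - totals[j]
             for j in range(n)]
            for i in range(n)]


def closest_pair(dist):
    """Indices (i, j), i < j, of the pair with the smallest Q value."""
    q = q_matrix(dist)
    n = len(dist)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda p: q[p[0]][p[1]])


def branch_lengths(dist, i, j):
    """Distances from taxa i and j to the new internal node that joins them."""
    n = len(dist)
    diff = (sum(dist[i]) - sum(dist[j])) / (2 * (n - 2))
    to_i = dist[i][j] / 2 + diff
    return to_i, dist[i][j] - to_i


# Toy distances between five taxa a..e (hypothetical values).
DIST = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
i, j = closest_pair(DIST)
print((i, j), branch_lengths(DIST, i, j))
```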
Unrooted NJ tree
Comparison of multiple sequence alignment programs
Primary sequence changes:
Profiles
CGGSV
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = .031
ln(0.8)+ln(0.4)+ln(0.8)+ln(0.6)+ln(0.2) = -3.48
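The arithmetic above can be reproduced in a few lines. The per-position probabilities are the ones quoted on the slide for CGGSV; the underlying profile matrix is not shown, so the list below is simply those five values.

```python
import math

# Probability of each residue of CGGSV at its profile position (from the slide).
column_probs = [0.8, 0.4, 0.8, 0.6, 0.2]

probability = math.prod(column_probs)
log_probability = sum(math.log(p) for p in column_probs)

print(round(probability, 3))      # product of the per-position probabilities
print(round(log_probability, 2))  # same quantity as a sum of natural logs
```

Working with log-probabilities keeps the score from underflowing on long sequences, since a product of many small numbers quickly approaches zero.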
Hidden Markov Models
Assumptions:
  Observations are ordered
  The random process can be represented by a stochastic finite state machine with emitting states
Probabilistic parameters of a Hidden Markov Model:
  x: states
  y: possible observations
  a: state transition probabilities
  b: output/emission probabilities
HMM estimation, usage & applications
Training/Estimation
Feed an architecture (given in advance) a set of observation sequences
The training process will iteratively alter its parameters to fit the training set
The trained model will assign the training sequences high probabilities
Usage
Evaluate the probability of an observation sequence given the model (Forward)
Find the most likely path through the model for a given observation sequence (Viterbi)
Applications
Gene finding
Protein family modeling
…
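The two usage modes listed above can be sketched on a toy model. The two states, their names, and all probabilities below are hypothetical values chosen for illustration; `forward` sums over every state path to get the probability of the observations, while `viterbi` keeps only the best path into each state.

```python
# Toy two-state HMM over nucleotides (all values are assumptions).
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def forward(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Probability of the observation sequence given the model."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Most likely state path and its joint probability with the observations."""
    best = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                                  # one back-pointer dict per step
    for symbol in obs[1:]:
        pointers, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] * trans[r][s])
            pointers[s] = prev
            new_best[s] = best[prev] * trans[prev][s] * emit[s][symbol]
        back.append(pointers)
        best = new_best
    last = max(states, key=best.get)
    path = [last]
    for pointers in reversed(back):            # follow pointers back to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


print(forward("AATTGGCC"))
print(viterbi("AATTGGCC"))
```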
Profile HMMs
Families of functional biological sequences
Primary sequences have diverged due to evolution, while maintaining structure/function.
Questions:
  Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  Given a set of sequences, find more sequences of the same family
Trade-offs
Advantages
•Statistics
•Modularity
•Transparency
•Prior knowledge
Disadvantages
•State independence
•Over-fitting
•Local maxima
•Speed
Questions?