Some frequently-used Bioinformatics Tools

Konstantinos Mavrommatis, Prokaryotic Superprogram
MGM workshop. 19 Oct 2010

Page 1: Some frequently-used Bioinformatics Tools

MGM workshop. 19 Oct 2010

Some frequently-used Bioinformatics Tools

Konstantinos Mavrommatis

Prokaryotic Superprogram

Page 2: Some frequently-used Bioinformatics Tools

Outline

Pairwise Alignment
  • Global/Local, Scoring
  • BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast

Multiple Sequence Alignment
  • ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA

Phylogenetic analysis and tree construction
  • BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II

HMM
  • Protein family profiles

http://expasy.org/tools/

Page 3: Some frequently-used Bioinformatics Tools

Alignment

Find an arrangement of two sequences that identifies their regions of similarity.

Insert spaces (gaps) at arbitrary positions -> both sequences reach the same length, and no position holds a space in both sequences.

Page 4: Some frequently-used Bioinformatics Tools

Alignment methods: Dot plots
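As a rough illustration of the dot plot idea, the sketch below marks every position where a residue of one sequence is identical to a residue of the other; similar regions then show up as diagonal runs. The function name `dot_plot` and the toy sequences are hypothetical, and tools such as Dotlet typically score sliding windows with a substitution matrix rather than testing single residues for identity.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_a; '*' marks identity with seq_b."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


# Toy sequences (hypothetical): the shared "CGT" appears as a diagonal of stars.
for row in dot_plot("ACGTA", "TCGTC"):
    print(row)
```

Reading the output along the diagonals is the text equivalent of looking for lines in the graphical plot.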

Page 5: Some frequently-used Bioinformatics Tools

Global vs Local alignment

Global alignment: an alignment that assumes the two sequences are similar over their entire length.

Local alignment: an alignment that searches for segments of the two sequences that match well.

It may seem that one should always use local alignment. However, each has its applications.
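The difference between the two modes can be sketched with the classic dynamic programming recurrences, Needleman-Wunsch for global and Smith-Waterman for local alignment. These algorithm names, the function names, and the scoring scheme (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; the slides do not specify them, and real tools use substitution matrices and affine gap penalties.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps cost
    for i in range(1, len(a) + 1):
        cur = [i * gap]                          # first column: leading gaps cost
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Smith-Waterman score: best-scoring pair of segments, never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Toy example (hypothetical): a short shared segment inside unrelated flanks.
long_seq, short_seq = "GGGACGTGGG", "TTACGTTT"
print("global:", global_score(long_seq, short_seq))
print("local: ", local_score(long_seq, short_seq))
```

The local score rewards only the shared segment, while the global score also pays for the unrelated flanks, which is why each mode suits a different question.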

Page 6: Some frequently-used Bioinformatics Tools

Substitution matrices

http://www.russelllab.org/aas/

Page 7: Some frequently-used Bioinformatics Tools

Scoring an alignment

Page 8: Some frequently-used Bioinformatics Tools

Global alignment

S1 = HGSAQVKGHG
S2 = KTEAEMKASEDLKKHGT

Page 9: Some frequently-used Bioinformatics Tools

KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
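To make the scoring of a finished alignment concrete, the sketch below walks the two gapped rows of the alignment above column by column and adds up a score. The scheme (match +1, mismatch -1, gap -2) and the function name `score_alignment` are hypothetical; a real score would normally come from a substitution matrix such as BLOSUM62 with separate gap-open and gap-extend penalties.

```python
def score_alignment(row1: str, row2: str, match=1, mismatch=-1, gap=-2) -> int:
    """Sum column scores of a pairwise alignment given as two gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" and y == "-":
            raise ValueError("a column may not contain two gaps")
        if x == "-" or y == "-":
            score += gap          # residue aligned to a gap
        elif x == y:
            score += match        # identical residues
        else:
            score += mismatch     # substituted residues
    return score


row1 = "KTEAEMKAESEDLKKHGT"
row2 = "--HG--SA--Q-VKGHG-"
# 4 identical columns, 6 mismatches, 8 gap columns: 4 - 6 - 16
print(score_alignment(row1, row2))  # -18
```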

Page 10: Some frequently-used Bioinformatics Tools

Local alignment

Page 11: Some frequently-used Bioinformatics Tools

How BLAST works

Query:
MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG

Subject (database):
VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG

Common 3-mer -> extend (GCQSQCGG) -> HSP:
++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG

Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
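The seed-and-extend idea on this slide can be sketched as follows: index every 3-mer of the subject, look up each 3-mer of the query, and extend every shared word in both directions. This is a deliberately simplified sketch, not BLAST itself: it extends only while residues are identical, whereas BLAST also seeds from similar (neighbourhood) words and extends with substitution-matrix scores and a drop-off threshold. The function names are hypothetical; the two sequences are the ones shown on the slide.

```python
from collections import defaultdict


def word_hits(query: str, subject: str, k: int = 3):
    """Yield (query_index, subject_index) for every shared k-mer."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            yield i, j


def extend_exact(query: str, subject: str, i: int, j: int, k: int = 3):
    """Extend a word hit left and right while residues stay identical."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, query[start_q:end_q]


QUERY = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
SUBJECT = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"

segments = {extend_exact(QUERY, SUBJECT, i, j) for i, j in word_hits(QUERY, SUBJECT)}
best = max(segments, key=lambda seg: len(seg[2]))
print(best)  # (50, 45, 'GCQSQCGG'), 0-based start positions in query and subject
```

Allowing mismatches during extension, as BLAST does, is what joins such exact stretches into the longer HSP reported on the slide.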

Page 12: Some frequently-used Bioinformatics Tools

Types of BLAST

Nucleic sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                                      Database
blastn    Nucleic acid sequence                      Nucleic acid sequence database
blastp    Protein sequence                           Protein sequence database
blastx    Nucleic acid sequence, 6-frame translated  Protein sequence database
tblastn   Protein sequence                           Nucleic acid sequence database, 6-frame translated
tblastx   Nucleic acid sequence, 6-frame translated  Nucleic acid sequence database, 6-frame translated
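The 6-frame translation used by blastx, tblastn and tblastx can be sketched in a few lines: translate the sequence in three reading frames on the forward strand and three on the reverse complement. The sketch assumes the standard genetic code and uses hypothetical function names; incomplete trailing codons are dropped and codons with unknown bases become "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Nucleic sequence from the slide.
for name, protein in six_frames("atcgatatatatagactgactgact").items():
    print(name, protein)
```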

Page 13: Some frequently-used Bioinformatics Tools


Page 14: Some frequently-used Bioinformatics Tools

Exact multiple alignment by dynamic programming

Complexity = O(N^S · 2^S · S^2)
N: length of sequences
S: number of sequences

Only feasible for 4-5 sequences max.
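A quick calculation shows why exact multiple alignment stops being practical so early. The sketch assumes the complexity on this slide reads N^S · 2^S · S^2 (N: sequence length, S: number of sequences) and uses a hypothetical length of 300 residues; the numbers are rough operation counts, not measured run times.

```python
def exact_msa_cost(n: int, s: int) -> int:
    """Rough operation count N^S * 2^S * S^2 for s sequences of length n."""
    return n**s * 2**s * s**2


# Hypothetical protein length of 300 residues.
for s in range(2, 7):
    print(f"S = {s}: about {exact_msa_cost(300, s):.1e} operations")
```

Each extra sequence multiplies the work by roughly 2N, which is why heuristic programs such as ClustalW or MAFFT are used for larger sets.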

Page 15: Some frequently-used Bioinformatics Tools


Page 16: Some frequently-used Bioinformatics Tools

Neighbor Joining

Page 17: Some frequently-used Bioinformatics Tools

Unrooted NJ tree

Page 18: Some frequently-used Bioinformatics Tools

Comparison of Multiple sequence alignment programs

Page 19: Some frequently-used Bioinformatics Tools

Primary sequence changes:

Page 20: Some frequently-used Bioinformatics Tools

Profiles

CGGSV

0.8 * 0.4 * 0.8 * 0.6 * 0.2 = .031

ln(0.8)+ln(0.4)+ln(0.8)+ln(0.6)+ln(0.2) = -3.48
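The two calculations above can be reproduced with a small sketch. The profile below is hypothetical and incomplete: it lists, for each of the five columns, only the probability of the residue that CGGSV uses, taken from the numbers on this slide; the other residue probabilities of each column are not shown here. Working with the sum of logarithms avoids numerical underflow when many small probabilities are multiplied.

```python
import math

# One dict per profile column: residue -> probability (only the entries
# needed for CGGSV are known from the slide; the rest are omitted).
PROFILE = [
    {"C": 0.8},
    {"G": 0.4},
    {"G": 0.8},
    {"S": 0.6},
    {"V": 0.2},
]


def sequence_probability(seq: str, profile: list[dict[str, float]]) -> float:
    """Product of the per-column probabilities of the residues in seq."""
    return math.prod(column[residue] for residue, column in zip(seq, profile))


def sequence_log_probability(seq: str, profile: list[dict[str, float]]) -> float:
    """Sum of natural logs of the per-column probabilities."""
    return sum(math.log(column[residue]) for residue, column in zip(seq, profile))


print(round(sequence_probability("CGGSV", PROFILE), 3))      # 0.031
print(round(sequence_log_probability("CGGSV", PROFILE), 2))  # -3.48
```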

Page 21: Some frequently-used Bioinformatics Tools

Hidden Markov Models

Assumptions:
  • Observations are ordered
  • Random process can be represented by a stochastic finite state machine with emitting states

Probabilistic parameters of a Hidden Markov Model:
  • x – states
  • y – possible observations
  • a – state transition probabilities
  • b – output/emission probabilities

Page 22: Some frequently-used Bioinformatics Tools

HMM estimation, usage & applications

Training/Estimation

Feed an architecture (given in advance) a set of observation sequences

The training process will iteratively alter its parameters to fit the training set

The trained model will assign the training sequences high probabilities

Usage

Evaluate the probability of an observation sequence given the model (Forward)

Find the most likely path through the model for a given observation sequence (Viterbi)

Applications

Gene finding

Protein family modeling
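The two usage operations listed above, Forward and Viterbi, can be sketched on a toy model. Everything concrete here is hypothetical: the two states "A" and "B", the observation symbols "x" and "y", all probabilities, and the start distribution, which is an extra parameter needed to run the recursion. The dictionaries `a` and `b` stand for the transition and emission probabilities.

```python
def forward(obs, states, start, a, b):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: start[s] * b[s][obs[0]] for s in states}
    for y in obs[1:]:
        alpha = {
            s: sum(alpha[r] * a[r][s] for r in states) * b[s][y] for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start, a, b):
    """Most likely state path for the observations and its probability."""
    v = [{s: start[s] * b[s][obs[0]] for s in states}]
    back = []
    for y in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[-1][r] * a[r][s])
            pointers[s] = prev
            column[s] = v[-1][prev] * a[prev][s] * b[s][y]
        v.append(column)
        back.append(pointers)
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for pointers in reversed(back):   # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[-1][last]


# Toy two-state model (all numbers hypothetical).
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
a = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
b = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.2, "y": 0.8}}

print(forward("xx", STATES, START, a, b))   # about 0.322
print(viterbi("xx", STATES, START, a, b))   # path ['A', 'A'], probability about 0.288
```

Forward sums over every path, so its value is never smaller than the probability of the single best path found by Viterbi. For long sequences both are usually computed in log space to avoid underflow.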

Page 23: Some frequently-used Bioinformatics Tools

Profile HMMs

Families of functional biological sequences
  • Primary sequences have diverged due to evolution, while maintaining structure/function.

Questions:
  • Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  • Given a set of sequences, find more sequences of the same family

Page 24: Some frequently-used Bioinformatics Tools


Page 25: Some frequently-used Bioinformatics Tools

Trade-offs

Advantages:
  • Statistics
  • Modularity
  • Transparency
  • Prior knowledge

Disadvantages:
  • State independence
  • Over-fitting
  • Local maxima
  • Speed

Page 26: Some frequently-used Bioinformatics Tools


Questions?