21
BNFO 240 Usman Roshan

BNFO 240

  • Upload
    race

  • View
    44

  • Download
    0

Embed Size (px)

DESCRIPTION

BNFO 240. Usman Roshan. Last time. Traceback for alignment How to select the gap penalties? Benchmark alignments Structural superimposition BAliBASE. Database searching. Suppose we have a set of 1,000,000 sequences - PowerPoint PPT Presentation

Citation preview

Page 1: BNFO 240

BNFO 240

Usman Roshan

Page 2: BNFO 240

Last time

• Traceback for alignment

• How to select the gap penalties?

• Benchmark alignments– Structural superimposition– BAliBASE

Page 3: BNFO 240

Database searching

• Suppose we have a set of 1,000,000 sequences

• You have a query sequence q and want to find the m closest ones in the database---that means 1,000,000 pairwise alignments!

• How to speed up pairwise alignments?

Page 4: BNFO 240

FASTA

• FASTA was the first software for quick searching of a database

• Introduced the idea of searching for k-mers

• Can be done quickly by preprocessing database

Page 5: BNFO 240

FASTA: combine high scoring hits into diagonal runs

Page 6: BNFO 240

BLAST

Key idea: search for k-mers (short matchig substrings) quickly by preprocessing the database.

Page 7: BNFO 240

BLAST

This key idea can also be used for speeding up pairwise alignments when doing multiple sequence alignments

Page 8: BNFO 240

Biologically realistic scoring matrices

• PAM and BLOSUM are most popular• PAM was developed by Margaret

Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

Page 9: BNFO 240

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families

• Compute probabilities of change and background probabilities by simple counting

Page 10: BNFO 240

PAM• In this model the unit of evolution is the amount

of evolution that will change 1 in 100 amino acids on the average

The scoring matrix Sab is the ratio of Mab to pb

Page 11: BNFO 240

PAM Mij matrix (x10000)

Page 12: BNFO 240

Multiple sequence alignment

• “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk

• Computationally very hard---NP-hard

Page 13: BNFO 240

Formally…

Page 14: BNFO 240

Multiple sequence alignment

Unaligned sequences

GGCTT

TAGGCCTT

TAGCCCTTA

ACACTTC

ACTT

Aligned sequences

_G_ _ GCTT_

TAGGCCTT_

TAGCCCTTA

A_ _CACTTC

A_ _C_ CTT_ Conserved regions help us to identify functionality

Page 15: BNFO 240

Sum of pairs score

Page 16: BNFO 240

Sum of pairs score

• What is the sum of pairs score of this alignment?

Page 17: BNFO 240

Profiles

• Before we see how to construct multiple alignments, how do we align two alignments?

• Idea: summarize an alignment using its profile and align the two profiles

Page 18: BNFO 240

Profile alignment

Page 19: BNFO 240

Iterative alignment(heuristic for sum-of-pairs)

• Pick a random sequence from input set S• Do (n-1) pairwise alignments and align to

closest one t in S• Remove t from S and compute profile of

alignment• While sequences remaining in S

– Do |S| pairwise alignments and align to closest one t

– Remove t from S

Page 20: BNFO 240

Iterative alignment

• Once alignment is computed randomly divide it into two parts

• Compute profile of each sub-alignment and realign the profiles

• If sum-of-pairs of the new alignment is better than the previous then keep, otherwise continue with a different division until specified iteration limit

Page 21: BNFO 240

Progressive alignment

• Idea: perform profile alignments in the order dictated by a tree

• Given a guide-tree do a post-order search and align sequences in that order

• Widely used heuristic