
Basics of sequence analysis: Ch. 6 and Ch. 7

• Sequence acquisition

• Sequence data

• Reconstructing sequence

• Sequence alignment

• Alignment algorithms

• Database searching

• Uses of alignments

http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg

ABI era

Source: wikipedia

Scaling up by brute force: the $3,000,000,000 genome

Source: G. Church

A Genome Analyzer flowcell (left) and imaging region or ‘tile’ (right), with a magnified section showing a cluster.

Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.

http://www.eurofinsdna.com

Name

Confidence call

Sequence

CCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC

[Figure: a read aligned to the reference genome.]

[Figure: two alternative alignments of the read to the genome reference, Hypothesis #1 and Hypothesis #2.]

Quality, Q, is a function of the probability, P, that the sequencer called a wrong base:

$$Q = -10 \log_{10}(P)$$

Q is estimated by the sequencing software.

• Q=10: 1 in 10 chance that the base was miscalled

• Q=20: 1 in 100 chance that the base was miscalled

• Q=30: 1 in 1000 chance that the base was miscalled
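The relation between Q and P can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from any sequencing package.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Quality score Q = -10 * log10(P) for a miscall probability P."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q={q}: miscall probability {error_prob_from_phred(q):g}")
```

Each step of 10 in Q corresponds to a tenfold drop in the miscall probability.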

[Figure: the read aligned to the genome reference under Hypothesis #1 and Hypothesis #2, shown twice; in the second version the bases are labelled with qualities Q=30 and Q=10.]

A penalty scheme to account for different types of dissimilarities

Let’s stipulate that small gaps (indels) occur in bacterial genomes at 1 in 10K positions

$$\text{Penalty}_{\text{gap}} = -10 \log_{10}(P_{\text{gap}}) = -10 \log_{10}(0.0001) = 40$$

Let’s stipulate that SNPs occur in bacterial genomes at 1 in 1K positions, but, depending on Q, a sequencing error may be more likely

$$\text{Pen} \equiv \min\{-10 \log_{10}(P_{\text{miscall}}),\ -10 \log_{10}(P_{\text{SNP}})\} \equiv \min\{Q,\ -10 \log_{10}(0.001)\} = \min\{Q,\ 30\}$$

$\text{Penalty}_{\text{gap}} = 40$

$\text{Penalty}_{\text{SNP}} = 30$

Penalty = 40

Q=40

Q=10

Penalty = 70

Penalty = 50
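The penalty scheme can be written out as a short sketch. It assumes the rates stipulated above (indels at 1 in 10K positions, SNPs at 1 in 1K); the helper names, and the idea of summing penalties over all events in one candidate alignment, are illustrative assumptions.

```python
import math

P_GAP = 1e-4  # stipulated indel rate: 1 in 10K positions
P_SNP = 1e-3  # stipulated SNP rate: 1 in 1K positions


def phred(p: float) -> float:
    """-10 * log10(p), the same scale as the base quality Q."""
    return -10.0 * math.log10(p)


def gap_penalty() -> float:
    """Penalty for one small gap (indel)."""
    return phred(P_GAP)


def mismatch_penalty(q: float) -> float:
    """Penalty for one mismatch at a base of quality q: min{Q, 30}.

    The cheaper explanation (miscall vs. true SNP) sets the penalty.
    """
    return min(q, phred(P_SNP))


def hypothesis_penalty(n_gaps: int, mismatch_quals: list[float]) -> float:
    """Total penalty of an alignment with n_gaps gaps and the given mismatches."""
    return n_gaps * gap_penalty() + sum(mismatch_penalty(q) for q in mismatch_quals)


if __name__ == "__main__":
    print(round(hypothesis_penalty(1, [40])))  # one gap plus a mismatch at Q=40
    print(round(hypothesis_penalty(1, [10])))  # one gap plus a mismatch at Q=10
```

Under these assumptions a mismatch at a low-quality base is cheap, because a sequencing error is the more likely explanation, while a mismatch at a high-quality base is capped at the SNP penalty of 30.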

How do we find our sequence in the first place?

Local alignment

Given a string P (“pattern”) of length m and a string T (“text”) of length n, find substrings a and b of P and T, respectively, having maximal global alignment score

Event             Example (pattern / text)   Penalty
Match             TTG / TTG                  +15
Mismatch          TCG / TTG                  -30
Gap in pattern    T-G / TTG                  -40
Gap in text       TCG / T-G                  -40

Smith-Waterman

Dynamic Programming
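A compact sketch of Smith-Waterman local alignment by dynamic programming follows. It uses the scores given earlier for match (+15), mismatch (-30) and gap (-40) as defaults; the function name and return format are illustrative assumptions, and only one optimal alignment is reported.

```python
def smith_waterman(p: str, t: str, match: int = 15, mismatch: int = -30, gap: int = -40):
    """Return (best score, aligned part of p, aligned part of t)."""
    m, n = len(p), len(t)
    # H[i][j] = best score of a local alignment ending at p[i-1], t[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if p[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + s,  # match or mismatch
                H[i - 1][j] + gap,    # p[i-1] opposite a gap in the text
                H[i][j - 1] + gap,    # t[j-1] opposite a gap in the pattern
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest cell until the score hits zero.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if p[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            a.append(p[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a.append(p[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    print(smith_waterman("TTG", "ACTTGCA"))
```

The table costs O(mn) time and memory, which is why database search tools replace the full computation with indexing and heuristics.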

Indexing

Dot Plots

Window:2, Stringency:1

Dot matrix analysis of DNA sequences encoding the λ cI repressor (vertical) and the P22 c2 repressor

Window: 11; Stringency: 7
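The window and stringency parameters can be illustrated with a small sketch. It assumes the usual reading: a dot is placed at (i, j) when at least `stringency` of the `window` positions starting at i and j are identical. The function names and the text rendering are illustrative.

```python
def dot_plot(x: str, y: str, window: int = 2, stringency: int = 1) -> list[tuple[int, int]]:
    """Return the (i, j) window start positions that receive a dot."""
    dots = []
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


def render(x: str, y: str, dots: list[tuple[int, int]]) -> str:
    """Draw the plot as text: x runs down the rows, y across the columns."""
    marked = set(dots)
    rows = ["  " + y]
    for i, ch in enumerate(x):
        rows.append(ch + " " + "".join("*" if (i, j) in marked else "." for j in range(len(y))))
    return "\n".join(rows)


if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGTTCGT"
    print(render(a, b, dot_plot(a, b, window=2, stringency=2)))
```

Raising the window and stringency together (for example 11 and 7) suppresses isolated chance matches while keeping diagonals that reflect real similarity.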

Analysis of the regions of low complexity

Calculation of an alignment score

Pairwise Alignment Examples (II)

A dispersed alignment without gaps may have a higher score than a more visually appealing alignment with gaps

An alignment scoring system is required to evaluate how good an alignment is

• positive and negative values assigned

• gap creation and extension penalties

• positive score for identities

• some partial positive score for conservative substitutions

• global versus local alignment

• use of a substitution matrix
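As a rough sketch of such a scoring system, the function below scores an already-aligned pair of gapped strings. The numeric values are placeholders, not taken from the slides, and the convention that the first gap position costs `gap_open` and each further position costs `gap_extend` is an assumption; a substitution matrix would replace the simple match/mismatch test.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two aligned strings of equal length, with '-' marking gaps."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_in_a = gap_in_b = False  # are we currently inside a gap run?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with two gaps")
        if x == "-":
            score += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            score += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            score += match if x == y else mismatch
            gap_in_a = gap_in_b = False
    return score


if __name__ == "__main__":
    print(score_alignment("ACGTTT", "AC--TT"))  # one gap of length 2
    print(score_alignment("ACGT", "ACGA"))      # one mismatch, no gaps
```

With a costly gap creation penalty, an ungapped alignment with a few mismatches can outscore a gapped alignment that looks cleaner by eye.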

“Window location” by FASTA and BLAST

The global alignment algorithm of Needleman and Wunsch (1970).

The local alignment algorithm of Smith and Waterman (1981).

BLAST, a heuristic version of Smith-Waterman.

Two kinds of sequence alignment: global and local

Should the result of the alignment include all residues (amino acids or nucleotides) or just those that "match"? If all of them, a global alignment is desired. In a global alignment, the presence of mismatched elements is neutral: it does not affect the overall match score.

If only the matching region is wanted, a local alignment is desired. Local alignments are accomplished by including negative scores for "mismatched" positions, so scores get worse as we move away from the region of match. Instead of starting the traceback with the highest value in the first row or column, start with the highest value in the entire matrix and stop when the score hits zero.
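For contrast with the local case, here is a minimal sketch of global alignment in the Needleman-Wunsch style. The defaults keep mismatches neutral (score 0), as described above; the linear gap penalty of -1, the charging of end gaps, and the traceback from the final corner cell are assumptions of this sketch rather than details from the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1):
    """Return (score, aligned a, aligned b) for a global alignment."""
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap  # leading gaps are charged in a global alignment
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Traceback from the corner, so every residue of both sequences is included.
    i, j = m, n
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Unlike the local version, no cell is clamped at zero and the traceback always runs from one end of both sequences to the other.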

What is Database Search?

• Find a particular (usually short) sequence in a database of sequences (or one huge sequence).

• Problem is identical to local sequence alignment, but on a much larger scale.

• We must also have some idea of the significance of a database hit.

– Databases always return some kind of hit; how much attention should be paid to the result?

• A similar problem is the global alignment of two large sequences

• General idea: good alignments contain high scoring regions.

Imperfect Alignment

• What is an imperfect alignment?

• Why imperfect alignment?

• The result may not be optimal.

• Finding the optimal alignment is usually too costly in terms of time and memory.

Database Search Methods

• Hash table based methods

– FASTA family: FASTP, FASTA, TFASTA, FASTAX, FASTAY

– BLAST family: BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ, MegaBLAST, PsiBLAST, PhiBLAST

– Others: FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS

• Suffix tree based methods

– Mummer, AVID, Reputer, MGA, QUASAR

History of sequence searching

• 1970: NW

• 1981: SW

• 1985: FASTA

• 1990: BLAST

Hash Table

• K-gram = subsequence of length K

• $A^k$ entries

– A is alphabet size

• Linear time construction

• Constant lookup time
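A k-gram index of this kind might look as follows. This sketch uses a Python dictionary keyed by the k-gram string instead of a directly addressed table with A^k slots; `kgram_code` shows how a k-gram maps to a slot number in such a table. All names are illustrative.

```python
from collections import defaultdict


def build_kgram_index(seq: str, k: int) -> dict[str, list[int]]:
    """Map every k-gram of seq to the list of its start positions (one linear pass)."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def kgram_code(kgram: str, alphabet: str = "ACGT") -> int:
    """Slot number of a k-gram in a direct table with len(alphabet) ** k entries."""
    code = 0
    for ch in kgram:
        code = code * len(alphabet) + alphabet.index(ch)
    return code


if __name__ == "__main__":
    idx = build_kgram_index("TGAGTGCGA", 2)
    print(idx["TG"], idx["GA"])
    print(kgram_code("TG"))
```

Looking up a k-gram is a single dictionary access, independent of the length of the indexed sequence.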

FASTP

Lipman & Pearson, 1985

FASTP

• Three phase algorithm

1. Find short good matches using k-grams

– K = 1 or 2

2. Find start and end positions for good matches

3. Use DP to align good matches

FASTP: Phase 1 (2)

• Similar to dot plot

• Offsets range from 1-m to n-1

• Each offset is scored as

– # matches - # mismatches

• Diagonals (offsets) with large score show local similarities
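The diagonal scoring can be sketched directly from this description. The version below compares single residues position by position for clarity, which costs O(mn); this is a simplification, since FASTP finds the matches through its k-gram lookup table. The function name is illustrative, and an offset is defined here as (position in text) minus (position in query).

```python
def diagonal_scores(query: str, text: str) -> dict[int, int]:
    """Score every diagonal (offset) as #matches - #mismatches.

    Offsets run from 1-m to n-1 for a query of length m and a text of length n.
    """
    m, n = len(query), len(text)
    scores = {}
    for offset in range(1 - m, n):
        i_start = max(0, -offset)   # first query position that overlaps the text
        i_end = min(m, n - offset)  # one past the last overlapping query position
        matches = sum(query[i] == text[i + offset] for i in range(i_start, i_end))
        mismatches = (i_end - i_start) - matches
        scores[offset] = matches - mismatches
    return scores


if __name__ == "__main__":
    scores = diagonal_scores("TTG", "ACTTGCA")
    best = max(scores, key=scores.get)
    print(best, scores[best])
```

The highest-scoring offsets are the candidate diagonals passed on to the next phase.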

FASTP: Phase 2

• 5 best diagonal runs are found

• Rescore these 5 regions using PAM250.

– Initial score

• Indels are not considered yet

FASTP: Phase 3

• Sort the aligned regions in descending score

• Optimize these alignments using Needleman-Wunsch

• Report the results

FASTA – Improvement Over FASTP

Pearson 1995

FASTA (1)

• Phase 2: Choose 10 best diagonal runs instead of 5

FASTA (2)

• Phase 2.5

– Eliminate diagonals that score less than some given threshold.

– Combine matches to find longer matches. This incurs a join penalty similar to a gap penalty.

FASTA Variations

• TFASTAX and TFASTAY: query protein against a DNA library in all reading frames

• FASTAX, FASTAY: DNA query in all reading frames against protein database

BLAST

Altschul, Gish, Miller, Myers, Lipman, 1990

BLAST (or BLASTP)

• BLAST – Basic Local Alignment Search Tool

• An approximation of Smith-Waterman

• Designed for database searches

– Short query sequence against long database sequence or a database of many sequences

• Sacrifices search sensitivity for speed

BLAST Algorithm (1)

• Eliminate low complexity regions from the query sequence.

– Replace them with X (protein) or N (DNA)

• Hash table on query sequence.

– K = 3 for proteins

Example: the query MCGPFILGTYC yields the k-grams MCG, CGP, and so on.

BLAST Algorithm (2)

• For each k-gram, find all k-grams that align with a score of at least cutoff T using BLOSUM62

– $20^k$ candidates

– ~50 on average per k-gram

– ~50n for the entire query

• Build hash table

Example: the query PQGMCGPFILGTYC yields the k-grams PQG, QGM, and so on. Candidate neighbors of PQG with T = 13:

PQG 18

PEG 15

PRG 14

PSG 13

PQA 12 (below the cutoff T, so not kept)
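The neighborhood search for one k-gram can be sketched by brute-force enumeration. To stay short, the example carries only the substitution scores needed for the word PQG over a reduced seven-letter alphabet. The reduced alphabet is an assumption of this sketch, and the score values are intended to match BLOSUM62 but should be checked against the published matrix. A real search enumerates all 20 amino acids.

```python
from itertools import product

# Assumption: reduced alphabet, for brevity only.
ALPHABET = "AEGPQRS"

# Assumption: scores intended to match BLOSUM62 for the rows P, Q and G.
SCORES = {
    "P": {"A": -1, "E": -1, "G": -2, "P": 7, "Q": -1, "R": -2, "S": -1},
    "Q": {"A": -1, "E": 2, "G": -2, "P": -1, "Q": 5, "R": 1, "S": 0},
    "G": {"A": 0, "E": -2, "G": 6, "P": -2, "Q": -2, "R": -2, "S": 0},
}


def neighborhood(word: str, threshold: int) -> list[tuple[str, int]]:
    """All candidate words scoring at least `threshold` against `word`, best first."""
    hits = []
    for cand in product(ALPHABET, repeat=len(word)):
        score = sum(SCORES[w][c] for w, c in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for neighbor, score in neighborhood("PQG", 13):
        print(neighbor, score)
```

Every neighbor found this way is entered in the hash table under the query position of the original k-gram, so a database word can seed an alignment without matching the query exactly.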

BLAST Algorithm (3)

• Sequentially scan the database and locate each k-gram in the hash table

• Each match is a seed for an ungapped alignment.

BLAST Algorithm (4)

• HSP (High Scoring Pair) = A match between a query word and the database

• Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A

• Extend the hit in both directions until the score falls more than a threshold value, X, below the best score seen so far

BLAST Algorithm (5)

• Keep only the extended matches that have a score at least S.

• Determine the statistical significance of the result
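The extension step can be sketched as an ungapped extension with a drop-off rule: extension in each direction stops once the running score has fallen more than X below the best score seen. The simple match/mismatch scores, the default X, and the function name are assumptions for illustration; BLASTP would score with a substitution matrix instead.

```python
def extend_seed(query: str, subject: str, q_start: int, s_start: int, k: int,
                x_drop: int = 5, match: int = 1, mismatch: int = -3):
    """Extend a k-long seed without gaps in both directions.

    Returns (score, q_from, q_to), where query[q_from:q_to] is the extended segment.
    """
    def score(a: str, b: str) -> int:
        return match if a == b else mismatch

    seed = sum(score(query[q_start + i], subject[s_start + i]) for i in range(k))

    # Extend to the right of the seed.
    best_r = run = r_len = 0
    i = 0
    while q_start + k + i < len(query) and s_start + k + i < len(subject):
        run += score(query[q_start + k + i], subject[s_start + k + i])
        if run > best_r:
            best_r, r_len = run, i + 1
        if best_r - run > x_drop:
            break
        i += 1

    # Extend to the left of the seed.
    best_l = run = l_len = 0
    i = 1
    while q_start - i >= 0 and s_start - i >= 0:
        run += score(query[q_start - i], subject[s_start - i])
        if run > best_l:
            best_l, l_len = run, i
        if best_l - run > x_drop:
            break
        i += 1

    return seed + best_l + best_r, q_start - l_len, q_start + k + r_len


if __name__ == "__main__":
    print(extend_seed("GGACGTACGG", "TTACGTACTT", q_start=4, s_start=4, k=2))
```

An extended match would then be kept only if its score is at least the cutoff S.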

BLASTN

• BLAST for nucleic acids

• K = 11

• Exact match instead of neighborhood search.

BLAST Variations

Program    Query          Target         Type
BLASTP     Protein        Protein        Gapped
BLASTN     Nucleic acid   Nucleic acid   Gapped
BLASTX     Nucleic acid   Protein        Gapped
TBLASTN    Protein        Nucleic acid   Gapped
TBLASTX    Nucleic acid   Nucleic acid   Gapped

Even More Variations

– PsiBLAST (iterative)

– BLAT, BLASTZ, MegaBLAST

– FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS

– Main differences are

• Seed choice (k, gapped seeds)

• Additional data structures

Suffix Tree

• Tree structure that contains all suffixes of the input sequence

• TGAGTGCGA

• GAGTGCGA

• AGTGCGA

• GTGCGA

• TGCGA

• GCGA

• CGA

• GA

• A

Suffix Tree Example

Suffix Tree Analysis

• O(n) space and construction time

– 10n to 70n space usage reported

• O(m) search time for m-letter sequence

• Good for

– Small data

– Exact matches

Suffix Array

• 5 bytes per letter

• O(m log n) search time

• Better space usage

• Slower search
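The O(m log n) search can be illustrated with a small suffix array sketch. The construction here simply sorts all suffixes, which is far slower than the specialised algorithms used by real tools and is chosen only for brevity; the function names are illustrative.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern, found with two binary searches over sa."""
    m = len(pattern)

    # Lower bound: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-letter prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "TGAGTGCGA"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "GA"))
```

Each binary search makes O(log n) comparisons of up to m letters, which gives the O(m log n) search time; the array itself stores one integer per letter, hence the better space usage than a suffix tree.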

Mummer

Other Sequence Comparison Tools

• Reputer, MGA, AVID

• QUASAR (suffix array)

Uses of sequence alignment

• Search databases

• Assess similarity, relatedness

• Identify structural variations (point, gross)

• Determine specificity of primers

• Evaluate complexity of a sequence

• Assemble sequence de novo
