Computational Biology, Part 2
Sequence Motifs
Robert F. Murphy
Copyright 1996, 1999-2009. All rights reserved.
Slides from Chapter 4
Ch04_Motifs_mod.ppt
Describing features using frequency matrices
- Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences
- Need to describe how often particular bases are found in particular positions in a sequence feature
Describing features using frequency matrices
- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
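As a minimal sketch of this definition, the frequency matrix can be built with base MATLAB functions alone. The five aligned DNA sequences and the variable names are hypothetical and do not come from the slides.

```matlab
% Hypothetical aligned DNA sequences, one per row (not from the slides).
seqs = ['TATAAT'; 'TATGAT'; 'TACAAT'; 'TATAAT'; 'GATAAT'];
alphabet = 'ACGT';                 % n = 4 characters
[numSeqs, m] = size(seqs);         % m = feature length
n = numel(alphabet);

F = zeros(n, m);                   % n-by-m frequency matrix
for j = 1:n
    % Fraction of sequences carrying character j at each position
    F(j, :) = sum(seqs == alphabet(j), 1) / numSeqs;
end

% Print the matrix with one row per character.
for j = 1:n
    fprintf('%c  %s\n', alphabet(j), sprintf('%4.2f ', F(j, :)));
end
```

Each column of F sums to 1, because every aligned sequence contributes exactly one character per position.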
Frequency matrices (continued)
Three uses of frequency matrices
- Describe a sequence feature
- Calculate probability of occurrence of feature in a random sequence
- Calculate degree of match between a new sequence and a feature
Matlab Demonstration
% Read some aligned sequences provided with the Bioinformatics Toolbox
seqs = fastaread('pf00002.fa');
seqdisp(seqs);
% Compute the frequency matrix (profile) for alignment columns 4 to 13
startposition = 4; endposition = 13;
[P,S] = seqprofile(seqs,'limits',[startposition endposition]);
% Print a header of column numbers, then one row per symbol
disp(['  ' sprintf('%5d ',1:size(P,2))]);
for i = 1:length(S)
    disp([S(i) ' ' sprintf('%4.3f ',P(i,:))])
end
% Draw the sequence logo for the same columns
seqlogo(seqs,'startat',startposition,'endat',endposition,'alphabet','aa');
Frequency matrix
Logo Example
Logos for displaying sequence motifs
- http://www.ccrnp.ncifcrf.gov/~toms/sequencelogo.html
- Free logo maker at http://weblogo.berkeley.edu/
Frequency Matrices, PSSMs, and Profiles
- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles
Methods for converting frequency matrices to PSSMs
- Using log ratio of observed to expected:
  $$\mathrm{score}(j,i) = \log \frac{m(j,i)}{f(j)}$$
  where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
- Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]
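The log-ratio conversion can be sketched in a few lines of base MATLAB. The frequency matrix M, the uniform background f, and the choice of base-2 logarithms (scores in bits) are assumptions made for illustration; the slides do not fix the base of the logarithm.

```matlab
% Hypothetical 4-by-3 frequency matrix m(j,i); rows are A, C, G, T.
M = [0.7 0.1 0.25;
     0.1 0.1 0.25;
     0.1 0.1 0.25;
     0.1 0.7 0.25];
f = [0.25; 0.25; 0.25; 0.25];      % assumed background frequencies f(j)

% score(j,i) = log( m(j,i) / f(j) ); base 2 is an assumption (bits).
ratio = M ./ repmat(f, 1, size(M, 2));
score = log2(ratio);
disp(score);
```

Characters seen more often than the background expects get positive scores, rarer ones get negative scores, and a position that matches the background exactly (the third column here) scores zero.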
Pseudo-counts
- How do we get a score for a position with zero counts for a particular character? Can't take log(0).
- Solution: add a small number to all positions with zero frequency
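One way to apply pseudo-counts is sketched here. It uses a common variant that adds the same small number to every cell of the count matrix (not only the zero cells) and then renormalizes, so that each column still sums to 1. The counts, the pseudo-count of 0.5, and the uniform background are all assumed values.

```matlab
% Hypothetical count matrix (rows A, C, G, T; columns are positions),
% built from 5 aligned sequences. Several cells are zero.
counts = [0 5 0;
          0 0 1;
          1 0 0;
          4 0 4];
pseudo = 0.5;                      % assumed pseudo-count; the value is a choice
f = [0.25; 0.25; 0.25; 0.25];      % assumed background frequencies

n = size(counts, 1);
numSeqs = sum(counts(:, 1));       % sequences per column (5 here)

% Add the pseudo-count to every cell, then renormalize each column.
M = (counts + pseudo) / (numSeqs + n * pseudo);

score = log2(M ./ repmat(f, 1, size(M, 2)));
disp(score);                       % every entry is finite
```

Without the pseudo-count, the zero cells would give log2(0) = -Inf, and any candidate containing such a character would be rejected outright regardless of how well the other positions match.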
Finding occurrences of a sequence feature using a Profile
- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
- For each position, we calculate a score by "looking up" the value corresponding to the base at that position
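The scan described above can be sketched as follows. The PSSM values and the target sequence are hypothetical, and summing the looked-up values across the window is the usual convention for log-based scores, assumed here since the slide does not spell it out.

```matlab
alphabet = 'ACGT';
% Hypothetical 4-by-3 PSSM (rows A, C, G, T); values are illustrative.
pssm = [ 1.5 -1.0 -1.0;
        -1.0 -1.0 -1.0;
        -1.0  1.5 -1.0;
        -1.0 -1.0  1.5];
target = 'CCAGTACGC';              % hypothetical target sequence

m = size(pssm, 2);
[~, rows] = ismember(target, alphabet);   % PSSM row index of each base
numWindows = numel(target) - m + 1;
scores = zeros(1, numWindows);
for p = 1:numWindows
    for i = 1:m
        % Look up the score of the base found at offset i of this window.
        scores(p) = scores(p) + pssm(rows(p + i - 1), i);
    end
end
[best, bestPos] = max(scores);
fprintf('Best score %.1f at position %d\n', best, bestPos);
```

This PSSM favors the word AGT, so the highest-scoring window is the one starting at position 3 of the target.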
Block Diagram for Building a PSSM – Aligned Sequences
- Inputs: set of aligned sequence features; expected frequencies of each sequence element
- Process: PSSM builder
- Output: PSSM
Block Diagram for Building a PSSM – Unaligned Sequences
- Inputs: set of unaligned sequences; expected frequencies of each sequence element; parameters for aligning (i.e., expected length)
- Process: PSSM builder
- Output: PSSM
Block Diagram for Searching with a PSSM
- Inputs: PSSM; set of sequences to search; threshold
- Process: PSSM search
- Outputs: sequences that match above threshold; positions and scores of matches
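The search block can be sketched as a loop over a set of sequences that keeps every window scoring above the threshold. The PSSM, the sequences, and the threshold of 4.0 are hypothetical, and the layout of the `hits` table is only one possible way to report positions and scores.

```matlab
alphabet = 'ACGT';
% Hypothetical PSSM (rows A, C, G, T) and search inputs.
pssm = [ 1.5 -1.0 -1.0;
        -1.0 -1.0 -1.0;
        -1.0  1.5 -1.0;
        -1.0 -1.0  1.5];
seqs = {'CCAGTACGC', 'TTTTTT', 'AGTAGT'};   % set of sequences to search
threshold = 4.0;                            % assumed cutoff

m = size(pssm, 2);
hits = zeros(0, 3);                % columns: sequence index, position, score
for s = 1:numel(seqs)
    [~, rows] = ismember(seqs{s}, alphabet);
    for p = 1:(numel(rows) - m + 1)
        % Linear indices select pssm(rows(p+i-1), i) for i = 1..m.
        idx = sub2ind(size(pssm), rows(p:p+m-1), 1:m);
        sc = sum(pssm(idx));
        if sc > threshold
            hits(end + 1, :) = [s, p, sc];
        end
    end
end
disp(hits);
matched = unique(hits(:, 1))';     % sequences with at least one match
```

Here `hits` corresponds to the "positions and scores of matches" output and `matched` to the "sequences that match above threshold" output.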
Block Diagram for Searching for sequences related to a family with a PSSM
- Step 1, PSSM builder. Inputs: set of aligned sequence features; expected frequencies of each sequence element. Output: PSSM
- Step 2, PSSM search. Inputs: PSSM; set of sequences to search; threshold. Outputs: sequences that match above threshold; positions and scores of matches
Consensus sequences vs. PSSMs
Should I use a consensus sequence or a frequency matrix to describe my site?
- If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence
  - Example: Restriction enzyme recognition sites
- If some allowed characters are "better" than others, use a PSSM
  - Example: Promoter sequences
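For the consensus-sequence case, one simple approach is to translate the IUB codes into regular-expression character classes and search with `regexp`. This sketch covers only a handful of codes; the HincII site GTYRAC is used as a well-known restriction site and the target sequence is hypothetical, neither being taken from the slides.

```matlab
% Map a few IUB (IUPAC) ambiguity codes to regular-expression classes.
codes   = {'R',    'Y',    'W',    'S',    'N'};
classes = {'[AG]', '[CT]', '[AT]', '[CG]', '[ACGT]'};

consensus = 'GTYRAC';              % HincII recognition site
pattern = consensus;
for k = 1:numel(codes)
    pattern = strrep(pattern, codes{k}, classes{k});
end
% pattern is now 'GT[CT][AG]AC'

target = 'AAGTCGACTTGTTAACGG';     % hypothetical sequence
starts = regexp(target, pattern);  % start position of every match
disp(starts);
```

Every allowed character at an ambiguous position is treated as equally good: the match is all or nothing, with no score attached, which is exactly the quantitative information a PSSM would add.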
Consensus sequences vs. frequency matrices
- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations
Reading for next class
- Jones/Pevzner Ch 6 through section 6.9 (p. 185)
- Read paper by Needleman and Wunsch on web site
- (recommended) Durbin et al, pp 17-32