
Generic substitution matrix-based sequence similarity evaluation

Gapped alignment:

Q:    M   A   T   W   L   I   .
A:    M   A   -   W   T   V   .
Scr:  5   4  -?  11  -1   3        Total = 22 - ?

Ungapped alignment:

Q:    M   A   T   W   L   I   .
A:    M   A   W   T   V   A   .
Scr:  5   4  -2  -2   1  -1        Total: 5

Scores are from BLOSUM62; typical gap penalties (the "?" above):

Gap opening: -6 to -15

Gap extension: -2 to -6
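The two totals on this slide can be reproduced with a few lines of Python. This is a minimal sketch rather than a full aligner: the dictionary holds only the BLOSUM62 entries needed for this example, the names `BLOSUM62_SUBSET` and `alignment_score` are illustrative, and the gap penalties (-10 to open, -2 to extend) are assumed values picked from the ranges above.

```python
# Only the BLOSUM62 entries needed for the slide example (not the full matrix).
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3,
    ("T", "W"): -2, ("L", "V"): 1, ("I", "A"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(query, subject, gap_open=-10, gap_extend=-2):
    """Score two pre-aligned, equal-length strings ('-' marks a gap).

    Assumed convention: the first position of a gap costs gap_open and
    each further position of the same gap costs gap_extend.
    """
    total = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(q, s)
            in_gap = False
    return total


gapped = alignment_score("MATWLI", "MA-WTV")    # 22 plus the gap penalty
ungapped = alignment_score("MATWLI", "MAWTVA")  # no gaps
print("gapped:", gapped, "ungapped:", ungapped)
```

With any gap-opening penalty in the quoted range the gapped alignment still scores higher than the ungapped one, which is the point of the comparison.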


Position-specific matrices reflect the structure-function relationship of a given protein family

BID_MOUSE     I A R H L A Q I G D E M
BAD_MOUSE     Y G R E L R R M S D E F
BAK_MOUSE     V G R Q L A L I G D D I
BAXB_HUMAN    L S E C L K R I G D E L
BimS          I A Q E L R R I G D E F
HRK_HUMAN     T A A R L K A L G D E L
Egl-1         I G S K L A A M C D D F

Statistical representation (alignment column 9):

G: 5 -> 71%

S: 1 -> 14%

C: 1 -> 14%
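The column statistics can be recomputed directly from the alignment. The sketch below assumes the seven rows are already aligned without gaps, as shown on the slide; the function name `column_frequencies` is illustrative.

```python
from collections import Counter

# Alignment rows copied from the slide (one row per protein).
ALIGNMENT = {
    "BID_MOUSE":  "IARHLAQIGDEM",
    "BAD_MOUSE":  "YGRELRRMSDEF",
    "BAK_MOUSE":  "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS":       "IAQELRRIGDEF",
    "HRK_HUMAN":  "TAARLKALGDEL",
    "Egl-1":      "IGSKLAAMCDDF",
}


def column_frequencies(sequences, column):
    """Return {residue: (count, fraction)} for a 0-based alignment column."""
    counts = Counter(seq[column] for seq in sequences)
    total = sum(counts.values())
    return {res: (n, n / total) for res, n in counts.items()}


# Column 9 in 1-based numbering is index 8.
freqs = column_frequencies(list(ALIGNMENT.values()), 8)
for residue, (count, fraction) in sorted(freqs.items(), key=lambda kv: -kv[1][0]):
    print(f"{residue}: {count} -> {fraction:.0%}")
```

Repeating the calculation for every column gives the full position-specific matrix for this family.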


Genomic sequence analysis

Genome organization/gene structure.

Comparing genome organization.

Identifying regulatory modules.


Genome Browser / Map viewer

o NCBI, Ensembl, species databases.

o Range selection, zoom in/out.

o Retrieving genomic sequences.

o fastacmd

o Python script


Practice: retrieve genomic sequence using the genome browser

1. Identify the range that you would like to retrieve (start and end positions) by clicking on the features in the map. It helps to use rounded positions (e.g., xx,xxx,000) for easy mapping back.

2. Enter the start and end positions in the data retrieval window.
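Mapping positions in the retrieved sequence back to genomic coordinates is simple offset arithmetic. A minimal sketch, assuming 1-based inclusive coordinates (browsers and file formats differ, so check the convention of the tool you use); the region start below is a hypothetical round number.

```python
def local_to_genomic(local_pos, region_start):
    """Position in the retrieved sequence (1-based) -> genomic coordinate."""
    return region_start + local_pos - 1


def genomic_to_local(genomic_pos, region_start):
    """Genomic coordinate -> position in the retrieved sequence (1-based)."""
    return genomic_pos - region_start + 1


# Hypothetical region retrieved from a rounded start position.
REGION_START = 12_345_000
print(local_to_genomic(250, REGION_START))         # genomic coordinate of local base 250
print(genomic_to_local(12_345_678, REGION_START))  # local position of a genomic coordinate
```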


Practice/observe: retrieve genomic sequence using fastacmd or a Python script

1. fastacmd is a program distributed together with BLAST for sequence retrieval. It takes an input file of sequence IDs and has strict requirements on the database format.

2. FindSeq_WithID.py and FindSeq_Partialmatch.py are simple Python scripts for retrieving sequences based on the FASTA identification line (the text following the “>”).
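The idea behind ID-based retrieval can be sketched as follows. This is not the code of FindSeq_WithID.py or FindSeq_Partialmatch.py, which is not shown here; `read_fasta` and `find_sequences` are hypothetical helpers, and the demo records are made up.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequences(handle, wanted, partial=False):
    """Return {header: sequence} for records whose ID matches `wanted`.

    Exact mode compares the first word of the header; partial mode
    accepts any header that contains one of the wanted strings.
    """
    wanted = set(wanted)
    hits = {}
    for header, seq in read_fasta(handle):
        seq_id = header.split()[0] if header else ""
        if partial:
            matched = any(w in header for w in wanted)
        else:
            matched = seq_id in wanted
        if matched:
            hits[header] = seq
    return hits


# Made-up records; a real run would pass an open FASTA file instead.
demo = io.StringIO(">seq0 demo record\nACGT\nTTGA\n>seq1\nGGCC\n>other_seq\nAAAA\n")
exact = find_sequences(demo, ["seq1"])
demo.seek(0)
partial = find_sequences(demo, ["seq"], partial=True)
print(exact)
print(sorted(partial))
```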


Practice: Gene structure analysis using GENSCAN

1. Identify and save the DNA sequence file.

2. Upload it to a GENSCAN server (e.g., MIT, Pasteur Institute).


GENSCAN result

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.05 Term -   1016    900  117  2  0  136   43    67 0.042   5.84
 1.04 Intr -   5852   5662  191  0  2   91  100   290 0.989  30.33
 1.03 Intr -   7544   7256  289  2  1  156  113   184 0.973  25.06
 1.02 Intr -  19042  18909  134  2  2   27   52    67 0.065  -2.13
 1.01 Init -  19898  19853   46  2  1   80  110    36 0.642   3.94
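A result table like this is easy to post-process. The sketch below parses the five exon lines above and keeps the exons with a high probability in the P column; the 0.9 cutoff and the names `Exon` and `parse_exon_line` are assumptions made for illustration.

```python
from collections import namedtuple

Exon = namedtuple("Exon", "exon_id type strand begin end length prob score")

# Exon lines copied from the result table above.
RESULT_LINES = """\
1.05 Term - 1016 900 117 2 0 136 43 67 0.042 5.84
1.04 Intr - 5852 5662 191 0 2 91 100 290 0.989 30.33
1.03 Intr - 7544 7256 289 2 1 156 113 184 0.973 25.06
1.02 Intr - 19042 18909 134 2 2 27 52 67 0.065 -2.13
1.01 Init - 19898 19853 46 2 1 80 110 36 0.642 3.94
""".splitlines()


def parse_exon_line(line):
    """Split one whitespace-delimited exon line into named fields."""
    f = line.split()
    return Exon(f[0], f[1], f[2], int(f[3]), int(f[4]), int(f[5]),
                float(f[11]), float(f[12]))


exons = [parse_exon_line(line) for line in RESULT_LINES]

# On the minus strand Begin > End, so the length is |Begin - End| + 1.
lengths_ok = all(abs(e.begin - e.end) + 1 == e.length for e in exons)

# Assumed cutoff: keep exons with P >= 0.9.
confident = [e.exon_id for e in exons if e.prob >= 0.9]
print(lengths_ok, confident)
```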


Basis of Gene structure prediction

• GC content

• Promoter signal (e.g., TATA box)

• Splicing signal

• Translation initiation signal

• …..

Probability modeling

Weighted scoring scheme

Detection


From data to model

>seq0
gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg

>seq1
gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt

>seq2
gcgctttacttaaCGTACTAGAAGCtaga

>seq3
gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactat
actttggggactaagtcg

>seq4
gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattct
taacacgg

>seq5
caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt

>……………………………….

TATA-containing core promoter sequences

###MATCH_STATE 3
0.045627 # Symbol A probability
0.088656 # Symbol C probability
0.075249 # Symbol G probability
0.790468 # Symbol T probability

###MATCH_STATE 4
0.600385 # Symbol A probability
0.134107 # Symbol C probability
0.106520 # Symbol G probability
0.158987 # Symbol T probability
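The step from sequences to per-position probabilities can be sketched by counting bases column by column. The code below treats the capitalized segment of each of the six sequences shown as one pre-aligned motif instance, which is an assumption; the MATCH_STATE values on the slide come from a model trained on the full data set, so six sequences will not reproduce them.

```python
# Sequences copied from the slide; the capitalized run marks the motif instance.
SEQUENCES = [
    "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg",
    "gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt",
    "gcgctttacttaaCGTACTAGAAGCtaga",
    "gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactatactttggggactaagtcg",
    "gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattcttaacacgg",
    "caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt",
]


def motif_site(seq):
    """Extract the capitalized segment of a sequence."""
    return "".join(ch for ch in seq if ch.isupper())


def position_probabilities(sites, pseudocount=0.0):
    """Per-column base probabilities for equal-length, aligned sites."""
    width = len(sites[0])
    probs = []
    for i in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        probs.append({base: counts[base] / total for base in "ACGT"})
    return probs


sites = [motif_site(seq) for seq in SEQUENCES]
probs = position_probabilities(sites)
for state, column in enumerate(probs, start=1):
    print(state, {base: round(p, 3) for base, p in column.items()})
```

A non-zero `pseudocount` avoids zero probabilities when the training set is small.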


From data to model

>SNR17A_15_780119_780275_INTRON
GUAUGUAAUAUACCCCAAACAUUUUACCCACAAAAAACCAGGAUUUGAAA
ACUAUAGCAUCUAAAAGUCUUAGGUACUAGAGUUUUCAUUUCGGAGCAGG
CUUUUUGAAAAAUUUAAUUCAACCAUUGCAGCAGCUUUUGACUAACACAU
UCUACAG

>SNR17B_16_281502_281373_INTRON
GUAUGUUUUAUACCAUAUACUUUAUUAGGAAUAUAACAAAGCAUACCCAA
UAAUUAGGCAAUGCGAUUGUCGUAUUCAACAACCAUCUUCUAUUUCACCA
GCUUCAGGUUUUGACUAACACAUUCAACAG

>YAL001C_1_151163_147591_INTRON_71_160
GUAUGUUCAUGUCUCAUUCUCCUUUUCGGCUCCGUUUAGGUGAUAAACGU
ACUAUAUUGUGAAAGAUUAUUUACUAACGACACAUUGAAG

>YAL003W_1_142172_143158_INTRON_81_446
GUAUGUUCCGAUUUAGUUUACUUUAUAGAUCGUUGUUUUUCUUUCUUUUU
UUUUUUUCCUAUGGUUACAUGUAAAGGGAAGUUAACUAAUAAUGAUUACU
UUUUUUCGCUUAUGUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUG
AUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUA
UCACAGUAUCUGACGAUAGCACAGAGCAGAGUAUCAUUAUUAGUUAUCUG
UUAUUUUUUUUUCCUUUUUUGUUCAAAAAAAGAAAGACAGAGUCUAAAGA

>………

……………………….

500 verified intron sequences

Modeling


Basis of Gene structure prediction

o GC content

o Promoter signal (e.g., TATA box)

o Splicing signal

o Translation initiation signal

o …..

Final score / p value
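One simple way to picture how several signals end up as a single score and p-value is a weighted sum compared against scores from background sequence. This is only an illustration of the idea, not how any particular gene finder is implemented; the signal names, weights, and all numbers below are hypothetical.

```python
def combined_score(evidence, weights):
    """Weighted sum of per-signal scores (e.g., log-odds values)."""
    return sum(weights[name] * score for name, score in evidence.items())


def empirical_p_value(observed, null_scores):
    """Fraction of background scores at least as high as the observed one.

    The +1 terms keep the estimate above zero for a finite background sample.
    """
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Hypothetical per-signal scores for one candidate gene model.
evidence = {"gc_content": 0.5, "promoter": 2.0, "splicing": 3.0, "translation_init": 1.0}
# Hypothetical weights expressing how much each signal is trusted.
weights = {"gc_content": 1.0, "promoter": 2.0, "splicing": 1.0, "translation_init": 0.5}

final_score = combined_score(evidence, weights)

# Hypothetical scores obtained from shuffled or non-coding sequence.
background = [1.0, 2.0, 3.0, 9.0, 4.0, 5.0, 6.0, 7.0, 0.0]
p_value = empirical_p_value(final_score, background)
print(final_score, p_value)
```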


Accuracy

               Accuracy per nucleotide    Accuracy per exon
Method         Sn     Sp     AC           Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN        0.93   0.93   0.91         0.78   0.81   0.80        0.09   0.05
FGENEH         0.77   0.85   0.78         0.61   0.61   0.61        0.15   0.11
GeneID         0.63   0.81   0.67         0.44   0.45   0.45        0.28   0.24
GeneParser2    0.66   0.79   0.66         0.35   0.39   0.37        0.29   0.17
GenLang        0.72   0.75   0.69         0.50   0.49   0.50        0.21   0.21
GRAILII        0.72   0.84   0.75         0.36   0.41   0.38        0.25   0.10
SORFIND        0.71   0.85   0.73         0.42   0.47   0.45        0.24   0.14
Xpound         0.61   0.82   0.68         0.15   0.17   0.16        0.32   0.13

ME: Missing exons

WE: Wrong exons

Sn: Sensitivity, the fraction of true exons (or coding nucleotides) that are predicted correctly

Sp: Specificity, the fraction of predicted exons (or coding nucleotides) that are correct
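The exon-level measures can be computed from two sets of exon coordinates. This sketch assumes the convention common in gene-prediction benchmarks, where Sp is the fraction of predictions that are correct, an exon counts as correct only if both ends match exactly, and missing or wrong exons are those with no overlap at all; the coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(predicted, annotated):
    """Exon-level Sn, Sp, ME and WE for sets of (start, end) exons."""
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)  # exact matches on both ends
    sn = true_pos / len(annotated)
    sp = true_pos / len(predicted)
    missing = sum(1 for a in annotated
                  if not any(overlaps(a, p) for p in predicted))
    wrong = sum(1 for p in predicted
                if not any(overlaps(p, a) for a in annotated))
    return {
        "Sn": sn,
        "Sp": sp,
        "(Sn+Sp)/2": (sn + sp) / 2,
        "ME": missing / len(annotated),
        "WE": wrong / len(predicted),
    }


# Hypothetical exon coordinates.
predicted_exons = [(1, 10), (20, 30), (40, 50)]
annotated_exons = [(1, 10), (20, 30), (60, 70), (80, 90)]
result = exon_accuracy(predicted_exons, annotated_exons)
print(result)
```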


DNA Pattern – Transcription factor binding sites

  A     C     G     T    Consensus
 40    13    23    23    N
 20     3    70     5    G
 55     3    40     0    R
  0    93     0     5    C
 53     8     8    30    W
 15     0     3    82    T
  0     0   100     0    G
  0    50     0    50    Y
  0    68     0    30    C
 12    35     3    48    Y
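A matrix like this is typically used by sliding it along a sequence and scoring each window. The sketch below converts the percentages above to log-odds scores and reports the best-scoring window; the pseudocount of 1, the uniform background of 0.25 per base, and the test sequence are assumptions made for illustration.

```python
import math

BASES = "ACGT"

# Rows of the matrix above: percentages for A, C, G, T at each position.
PERCENT_MATRIX = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]


def log_odds_matrix(rows, pseudocount=1.0, background=0.25):
    """Convert count/percentage rows to log2-odds scores per base."""
    pwm = []
    for row in rows:
        total = sum(row) + 4 * pseudocount
        pwm.append({
            base: math.log2(((value + pseudocount) / total) / background)
            for base, value in zip(BASES, row)
        })
    return pwm


def best_window(seq, pwm):
    """Return (start, score) of the highest-scoring window (0-based start)."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


pwm = log_odds_matrix(PERCENT_MATRIX)
# Made-up sequence with the top-scoring 10-mer embedded at index 4.
best_start, best_score = best_window("TTTTAGACATGCCTTTTT", pwm)
print(best_start, round(best_score, 2))
```

In practice a score threshold decides which windows are reported as putative binding sites.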


Stringency of the matrices

Consensus: 10 bp

  A     C     G     T    Consensus
 40    13    23    23    N
 20     3    70     5    G
 55     3    40     0    R
  0    93     0     5    C
 53     8     8    30    W
 15     0     3    82    T
  0     0   100     0    G
  0    50     0    50    Y
  0    68     0    30    C
 12    35     3    48    Y

Consensus: 20 bp

  A     C     G     T    Consensus
  4     0    13     0    G
  5     0    12     0    G
 15     0     2     0    A
  0    17     0     0    C
 17     0     0     0    A
  0     0     0    17    T
  0     0    17     0    G
  0    13     0     4    C
  0    17     0     0    C
  0    17     0     0    C
  0     0    17     0    G
  0     0    17     0    G
  2     0    15     0    G
  0    17     0     0    C
 17     0     0     0    A
  0     0     0    17    T
  0     0    17     0    G
  0     2     0    15    T
  0    13     0     4    C
  0     7     2     7    Y

Matrix names: P53_01, P53_02
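One way to quantify the stringency of a matrix is its information content: 2 bits for a fully conserved position, 0 bits for a position where all four bases are equally likely. The sketch below compares the 10 bp and 20 bp matrices on this slide; using information content as the stringency measure is an assumption of this example, and no small-sample correction is applied.

```python
import math

# Rows are (A, C, G, T) values copied from the two matrices on this slide.
MATRIX_10BP = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]
MATRIX_20BP = [
    (4, 0, 13, 0), (5, 0, 12, 0), (15, 0, 2, 0), (0, 17, 0, 0),
    (17, 0, 0, 0), (0, 0, 0, 17), (0, 0, 17, 0), (0, 13, 0, 4),
    (0, 17, 0, 0), (0, 17, 0, 0), (0, 0, 17, 0), (0, 0, 17, 0),
    (2, 0, 15, 0), (0, 17, 0, 0), (17, 0, 0, 0), (0, 0, 0, 17),
    (0, 0, 17, 0), (0, 2, 0, 15), (0, 13, 0, 4), (0, 7, 2, 7),
]


def information_content(row):
    """Information content in bits of one matrix position (0 to 2)."""
    total = sum(row)
    entropy = -sum((c / total) * math.log2(c / total) for c in row if c > 0)
    return 2.0 - entropy


def total_information(matrix):
    """Sum of per-position information content over the whole matrix."""
    return sum(information_content(row) for row in matrix)


total_10 = total_information(MATRIX_10BP)
total_20 = total_information(MATRIX_20BP)
print(round(total_10, 2), round(total_20, 2))
```

The longer matrix with many fully conserved positions carries more information, so fewer random sequences match it by chance.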


Comparing genomes

For understanding genome organization.

For identifying functionally conserved regions / sequences:

3’ and 5’ UTRs (e.g., microRNA binding sites)

Transcription factor binding sites / regulatory modules.


VISTA Genome Browser

Practice & Observe: cross-genome comparison using the VISTA Browser


Cautions with genome browser and description of genomic sequences

Coordinates change with every release/build of a genome; cite the genome release in your work and publications.

Predicted gene structure ≠ verified gene structure.


Identifying conserved regulatory modules

• Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.

• Functional requirement implies conservation at the binding site (sequence) level.


Ways to identify conserved regulatory modules

• Based on sequence similarity: MEME, rVISTA, whole-genome rVISTA for model organisms…

• Based on binding site identity: BLISS


Practice: Identify conserved TFBSs upstream of the human TNF gene.


VISTA Genome Browser

Practice & Observe: cross-genome comparison using the VISTA Browser


Practice: Identify conserved TFBSs upstream of the human TNF gene.

Use precompiled TFBS conservation data.

Load genomic sequence.


Practice: Load the BED file of TF binding sites into the UCSC Genome Browser.
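A BED file is plain tab-delimited text, so it can be produced with a short script. BED start positions are 0-based and end positions are exclusive, which is the usual source of off-by-one errors. In the sketch below the chromosome, coordinates, site names, and track name are all hypothetical.

```python
def bed_line(chrom, start, end, name, score=0, strand="+"):
    """Format one BED6 line from 1-based inclusive start/end coordinates."""
    # BED is 0-based, half-open: subtract 1 from the start, keep the end.
    return "\t".join([chrom, str(start - 1), str(end), name, str(score), strand])


# Hypothetical binding sites: (chrom, start, end, name, score, strand).
tf_sites = [
    ("chr6", 101, 110, "TF_A", 800, "+"),
    ("chr6", 250, 269, "TF_B", 650, "-"),
]

lines = ['track name="TFBS_demo" description="Hypothetical TF binding sites"']
lines += [bed_line(*site) for site in tf_sites]
bed_text = "\n".join(lines) + "\n"
print(bed_text)
```

Saving `bed_text` to a file with a `.bed` extension gives a custom track that can be uploaded to the browser.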


Large Data Set Analysis.

Hardware considerations:

1.) Data storage.

FASTA record of a protein (1,000 aa) ~ 1 KB.

Human proteome, or chromosome 21 ~ 50 MB.

Human genome ~ 1.5 GB.

HTS transcriptome analysis (4 samples @ 40 million reads each), original and derived data sets ~ 200 GB.
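The smaller figures follow from one byte per residue plus a little overhead. A rough sketch, assuming plain uncompressed FASTA with a 20-character header and 60 residues per line (both assumed values):

```python
import math


def fasta_bytes(n_residues, header_len=20, line_width=60):
    """Approximate size in bytes of one uncompressed FASTA record."""
    n_lines = math.ceil(n_residues / line_width)
    # '>' + header text + newline, then residues plus one newline per line.
    return 1 + header_len + 1 + n_residues + n_lines


print(fasta_bytes(1_000))             # a 1,000 aa protein, about 1 KB
print(fasta_bytes(50_000_000) / 1e6)  # a 50 million base record, in MB
```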


Large Data Set Analysis.

Hardware considerations:

2.) Processors and RAM.

Comparison: tblastn of 5 protein sequences against a 1.2 GB genome, ~15 sec CPU time.

Mapping a single Illumina run of 10 M reads to the human genome, ~15,000 CPU sec (> 4 hours).

RAM smaller than the data size will greatly slow down the process.


Large Data Set Analysis.

Hardware considerations:

3.) Operating system determines the availability of tools.

Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.

Easy control and automation.

Portable to Mac OS X, but often requires recompiling the source code.


Observe: demanding computation for large data set analysis.


Practice: log into UFHPC.

First step