
Generic substitution matrix-based sequence similarity evaluation

Gapped alignment:

Q:    M   A   T   W   L   I
A:    M   A   -   W   T   V
Scr:  5   4  -?  11  -1   3

Total = 22 - ? (the gap penalty)

Ungapped alignment:

Q:    M   A   T   W   L   I
A:    M   A   W   T   V   A
Scr:  5   4  -2  -2   1  -1

Total: 5

BLOSUM62:

Gap opening: -6 ~ -15

Gap extension: -2 ~ -6
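
The scoring of the two example alignments can be sketched in Python as below. This is a minimal illustration, not a full alignment scorer: the dictionary holds only the BLOSUM62 entries needed for this example, and the gap penalties (-11 to open, -2 to extend) are assumed values picked from within the ranges listed above. The names `BLOSUM62_SUBSET`, `pair_score`, and `score_alignment` are made up for this sketch.

```python
# Subset of BLOSUM62 covering only the residue pairs used in the slide example.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3,
    ("T", "W"): -2, ("L", "V"): 1, ("A", "I"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def score_alignment(query, subject, gap_open=-11, gap_extend=-2):
    """Sum substitution scores over aligned columns.

    Assumed convention: the first column of a gap run costs gap_open,
    each further consecutive gap column costs gap_extend. For simplicity
    a gap run is not split by which sequence carries the gap.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(q, s)
            in_gap = False
    return total


if __name__ == "__main__":
    print(score_alignment("MATWLI", "MA-WTV"))  # gapped alignment
    print(score_alignment("MATWLI", "MAWTVA"))  # ungapped alignment
```

With these assumed penalties the gapped alignment still scores higher than the ungapped one; a gap opening penalty harsher than -17 would reverse that ranking.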

Position-specific matrices reflect the structure-function relationship of a given protein family

BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F

Statistical representation

G: 5 -> 71%

S: 1 -> 14 %

C: 1 -> 14 %
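
The counts above correspond to the ninth column of the alignment (G, S, G, G, G, G, C). A small Python sketch of how such per-column statistics could be tallied is shown below; the function name `column_frequencies` is an assumption of this example, and a real position-specific matrix would normally also apply pseudocounts and sequence weighting, which are left out here.

```python
from collections import Counter

# The seven aligned segments from the slide, spaces removed.
ALIGNMENT = {
    "BID_MOUSE": "IARHLAQIGDEM",
    "BAD_MOUSE": "YGRELRRMSDEF",
    "BAK_MOUSE": "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS": "IAQELRRIGDEF",
    "HRK_HUMAN": "TAARLKALGDEL",
    "Egl-1": "IGSKLAAMCDDF",
}


def column_frequencies(alignment, column):
    """Return {residue: (count, fraction)} for a 1-based alignment column."""
    residues = [seq[column - 1] for seq in alignment.values()]
    counts = Counter(residues)
    n = len(residues)
    return {res: (count, count / n) for res, count in counts.items()}


if __name__ == "__main__":
    for res, (count, frac) in column_frequencies(ALIGNMENT, 9).items():
        print(f"{res}: {count} -> {frac:.0%}")
```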

Genomic sequence analysis

Genome organization/gene structure.

Comparing genome organization.

Identifying regulatory modules.

Genome Browser / Map viewer

o NCBI, Ensembl, species databases.

o Range selection, zoom in/out.

o Retrieving genomic sequences.

    o Fastacmd

    o Python script

Practice: retrieve genomic sequence using the genome browser

1. Identify the range that you would like to retrieve (start and end positions) by clicking on the features in the map. It helps to use rounded positions (e.g., xx,xxx,000) so the sequence is easy to map back.

2. Enter the start and end positions in the data retrieval window.

Practice/observe: retrieve genomic sequences using fastacmd or a Python script

1. Fastacmd is a program distributed with BLAST for sequence retrieval. It takes an input file of sequence IDs and has strict requirements on the database format.

2. FindSeq_WithID.py and FindSeq_Partialmatch.py are simple Python scripts that retrieve sequences based on the FASTA identification line (the text following the “>”).
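
The idea behind ID-based retrieval from a FASTA file can be sketched as follows. This is not the code of FindSeq_WithID.py or FindSeq_Partialmatch.py, whose contents are not shown in these slides; the function names `read_fasta` and `find_sequences` and the `partial` switch are assumptions made for illustration only.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequences(handle, wanted, partial=False):
    """Return {sequence ID: sequence} for records matching the wanted terms.

    partial=False: the first word of the header must equal a wanted ID.
    partial=True:  any wanted term may occur anywhere in the header line.
    """
    hits = {}
    for header, seq in read_fasta(handle):
        seq_id = header.split()[0] if header else ""
        if partial:
            matched = any(term in header for term in wanted)
        else:
            matched = seq_id in wanted
        if matched:
            hits[seq_id] = seq
    return hits


# Small in-memory demo file; a real run would use open("genome.fa") instead.
DEMO_FASTA = """>seq0 example record
gtcttttttttaaCTTATTTGAAGG
gcctcggtaaccg
>seq2
gcgctttacttaaCGTACTAGAAGCtaga
"""

if __name__ == "__main__":
    print(find_sequences(io.StringIO(DEMO_FASTA), {"seq2"}))
```

Reading the file record by record keeps memory use low, which matters once the input is a whole genome rather than a handful of sequences.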

Practice: Gene structure analysis using GENSCAN

1. Identify and save the DNA sequence file.

2. Upload it to a GENSCAN server (e.g., at MIT or the Pasteur Institute):

http://genes.mit.edu/GENSCAN.html

http://genome.dkfz-heidelberg.de/cgi-bin/GENSCAN/genscan.cgi

GENSCAN Result

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.05 Term -   1016    900  117  2  0  136   43    67 0.042   5.84
 1.04 Intr -   5852   5662  191  0  2   91  100   290 0.989  30.33
 1.03 Intr -   7544   7256  289  2  1  156  113   184 0.973  25.06
 1.02 Intr -  19042  18909  134  2  2   27   52    67 0.065  -2.13
 1.01 Init -  19898  19853   46  2  1   80  110    36 0.642   3.94
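
The exon table above can be post-processed with a few lines of Python, for example to keep only exons whose probability column (P) passes a cutoff. The sketch below hard-codes the five rows from the slide; the 0.5 cutoff and the names `parse_rows` and `confident_exons` are assumptions for illustration, not part of GENSCAN.

```python
# Data rows copied from the GENSCAN result above (whitespace-separated).
GENSCAN_ROWS = """\
1.05 Term - 1016 900 117 2 0 136 43 67 0.042 5.84
1.04 Intr - 5852 5662 191 0 2 91 100 290 0.989 30.33
1.03 Intr - 7544 7256 289 2 1 156 113 184 0.973 25.06
1.02 Intr - 19042 18909 134 2 2 27 52 67 0.065 -2.13
1.01 Init - 19898 19853 46 2 1 80 110 36 0.642 3.94
"""


def parse_rows(text):
    """Turn exon rows into dictionaries with typed fields."""
    exons = []
    for line in text.strip().splitlines():
        parts = line.split()
        exons.append({
            "exon": parts[0],
            "type": parts[1],
            "strand": parts[2],
            "begin": int(parts[3]),
            "end": int(parts[4]),
            "length": int(parts[5]),
            "prob": float(parts[11]),
            "score": float(parts[12]),
        })
    return exons


def confident_exons(exons, min_prob=0.5):
    """Return the IDs of exons whose probability is at least min_prob."""
    return [e["exon"] for e in exons if e["prob"] >= min_prob]


if __name__ == "__main__":
    exons = parse_rows(GENSCAN_ROWS)
    print(confident_exons(exons))
    # Sanity check: the reported length should span begin..end inclusive.
    print(all(abs(e["begin"] - e["end"]) + 1 == e["length"] for e in exons))
```

In this example the two exons with low probabilities (1.05 and 1.02) are also the ones with the lowest scores, which is why predicted structures should be treated as hypotheses until verified.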

Basis of Gene structure prediction

• GC contents

• Promoter signal (e.g., TATA box)

• Splicing signal

• Translation initiation signal

• …..

Probability modeling

Weighted scoring scheme

Detection

From data to model

>seq0
gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg
>seq1
gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt
>seq2
gcgctttacttaaCGTACTAGAAGCtaga
>seq3
gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactat
actttggggactaagtcg
>seq4
gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattct
taacacgg
>seq5
caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt
>……………………………….

TATA-containing core promoter sequences

###MATCH_STATE 3

0.045627 # Symbol A probability

0.088656 # Symbol C probability

0.075249 # Symbol G probability

0.790468 # Symbol T probability

###MATCH_STATE 4

0.600385 # Symbol A probability

0.134107 # Symbol C probability

0.106520 # Symbol G probability

0.158987 # Symbol T probability
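
Per-position symbol probabilities of this kind can be estimated, in the simplest case, by counting bases column by column over aligned motif instances. The Python sketch below does this for the six upper-case segments of the example promoter sequences. It is only a counting illustration under stated assumptions: the match-state values shown above come from a trained model and will not be reproduced by six sequences, and the names `motif_instances` and `position_probabilities` are invented for this example.

```python
import re

# The six example promoter sequences; the upper-case run marks the motif.
PROMOTER_SEQS = [
    "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg",
    "gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt",
    "gcgctttacttaaCGTACTAGAAGCtaga",
    "gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactatactttggggactaagtcg",
    "gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattcttaacacgg",
    "caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt",
]


def motif_instances(seqs):
    """Extract the first upper-case run (the marked motif) from each sequence."""
    return [re.search(r"[ACGT]+", s).group(0) for s in seqs]


def position_probabilities(instances, pseudocount=0.0):
    """Estimate P(base) at each motif position by column-wise counting."""
    length = len(instances[0])
    if any(len(inst) != length for inst in instances):
        raise ValueError("motif instances must all have the same length")
    probs = []
    for i in range(length):
        column = [inst[i] for inst in instances]
        total = len(column) + 4 * pseudocount
        probs.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return probs


if __name__ == "__main__":
    instances = motif_instances(PROMOTER_SEQS)
    for pos, dist in enumerate(position_probabilities(instances, 0.5), start=1):
        print(pos, {b: round(p, 3) for b, p in dist.items()})
```

A non-zero pseudocount keeps unseen bases from receiving probability zero, which matters when the training set is small.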

From data to model

>SNR17A_15_780119_780275_INTRON
GUAUGUAAUAUACCCCAAACAUUUUACCCACAAAAAACCAGGAUUUGAAA
ACUAUAGCAUCUAAAAGUCUUAGGUACUAGAGUUUUCAUUUCGGAGCAGG
CUUUUUGAAAAAUUUAAUUCAACCAUUGCAGCAGCUUUUGACUAACACAU
UCUACAG

>SNR17B_16_281502_281373_INTRON
GUAUGUUUUAUACCAUAUACUUUAUUAGGAAUAUAACAAAGCAUACCCAA
UAAUUAGGCAAUGCGAUUGUCGUAUUCAACAACCAUCUUCUAUUUCACCA
GCUUCAGGUUUUGACUAACACAUUCAACAG

>YAL001C_1_151163_147591_INTRON_71_160
GUAUGUUCAUGUCUCAUUCUCCUUUUCGGCUCCGUUUAGGUGAUAAACGU
ACUAUAUUGUGAAAGAUUAUUUACUAACGACACAUUGAAG

>YAL003W_1_142172_143158_INTRON_81_446
GUAUGUUCCGAUUUAGUUUACUUUAUAGAUCGUUGUUUUUCUUUCUUUUU
UUUUUUUCCUAUGGUUACAUGUAAAGGGAAGUUAACUAAUAAUGAUUACU
UUUUUUCGCUUAUGUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUG
AUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUA
UCACAGUAUCUGACGAUAGCACAGAGCAGAGUAUCAUUAUUAGUUAUCUG
UUAUUUUUUUUUCCUUUUUUGUUCAAAAAAAGAAAGACAGAGUCUAAAGA

>………

……………………….

500 verified exon sequences

Modeling

Basis of Gene structure prediction

o GC contents

o Promoter signal (e.g., TATA box)

o Splicing signal

o Translation initiation signal

o …..

Final score / p value

Accuracy

                 Accuracy per nucleotide   Accuracy per exon
Method           Sn    Sp    AC            Sn    Sp    (Sn+Sp)/2   ME    WE
GENSCAN          0.93  0.93  0.91          0.78  0.81  0.80        0.09  0.05
FGENEH           0.77  0.85  0.78          0.61  0.61  0.61        0.15  0.11
GeneID           0.63  0.81  0.67          0.44  0.45  0.45        0.28  0.24
GeneParser2      0.66  0.79  0.66          0.35  0.39  0.37        0.29  0.17
GenLang          0.72  0.75  0.69          0.50  0.49  0.50        0.21  0.21
GRAILII          0.72  0.84  0.75          0.36  0.41  0.38        0.25  0.10
SORFIND          0.71  0.85  0.73          0.42  0.47  0.45        0.24  0.14
Xpound           0.61  0.82  0.68          0.15  0.17  0.16        0.32  0.13

Sn: Sensitivity (fraction of true features that are found)

Sp: Specificity (fraction of predictions that are true positives)

ME: Missing Exons

WE: Wrong Exons
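
As a worked illustration of these measures, the sketch below computes Sn, Sp, and their average from true-positive, false-negative, and false-positive counts. It assumes the convention common in gene-prediction benchmarks, where Sp is TP / (TP + FP); the counts in the example are hypothetical and are not taken from the table.

```python
def sensitivity(tp, fn):
    """Fraction of true features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predictions that are correct: TP / (TP + FP).

    Note: this is the gene-prediction usage of "specificity"; in other
    fields the same quantity is usually called precision.
    """
    return tp / (tp + fp)


def average_accuracy(tp, fn, fp):
    """The (Sn + Sp) / 2 summary used in the per-exon columns."""
    return (sensitivity(tp, fn) + specificity(tp, fp)) / 2


if __name__ == "__main__":
    # Hypothetical counts: 100 true exons, 96 predicted, 78 predicted exactly.
    tp, fn, fp = 78, 22, 18
    print(round(sensitivity(tp, fn), 2))
    print(round(specificity(tp, fp), 2))
    print(round(average_accuracy(tp, fn, fp), 2))
```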

DNA Pattern – Transcription factor binding sites

  A    C    G    T   Consensus
 40   13   23   23   N
 20    3   70    5   G
 55    3   40    0   R
  0   93    0    5   C
 53    8    8   30   W
 15    0    3   82   T
  0    0  100    0   G
  0   50    0   50   Y
  0   68    0   30   C
 12   35    3   48   Y
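
One common way to use such a matrix is to convert it to log-odds scores and slide it along a DNA sequence, reporting the best-scoring window. The Python sketch below does this for the 10-position matrix above. The pseudocount of 1, the uniform background of 0.25, and the function names are assumptions of this example; TFBS search tools use their own scoring and cutoffs.

```python
import math

BASES = "ACGT"

# Rows of the matrix above, in A, C, G, T order.
MATRIX_10BP = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]


def log_odds_matrix(freq_rows, pseudocount=1.0, background=0.25):
    """Convert frequency rows to log2(P(base | position) / background)."""
    matrix = []
    for row in freq_rows:
        total = sum(row) + 4 * pseudocount
        matrix.append({
            base: math.log2(((value + pseudocount) / total) / background)
            for base, value in zip(BASES, row)
        })
    return matrix


def scan(sequence, matrix):
    """Return (start, score) for every window; sequence must contain only ACGT."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        hits.append((start, score))
    return hits


def best_hit(sequence, matrix):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, matrix), key=lambda hit: hit[1])


if __name__ == "__main__":
    pwm = log_odds_matrix(MATRIX_10BP)
    # Toy sequence with a consensus-like site embedded at 0-based position 4.
    print(best_hit("TTTTAGACATGCCTTTTT", pwm))
```

Raising the score cutoff that a window must pass makes the search more stringent: fewer hits, but a larger share of them resembling the consensus.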

Stringency of the matrices

Matrices: P53_01, P53_02

Consensus: 10 bp

  A    C    G    T   Consensus
 40   13   23   23   N
 20    3   70    5   G
 55    3   40    0   R
  0   93    0    5   C
 53    8    8   30   W
 15    0    3   82   T
  0    0  100    0   G
  0   50    0   50   Y
  0   68    0   30   C
 12   35    3   48   Y

Consensus: 20 bp

  A    C    G    T   Consensus
  4    0   13    0   G
  5    0   12    0   G
 15    0    2    0   A
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0   13    0    4   C
  0   17    0    0   C
  0   17    0    0   C
  0    0   17    0   G
  0    0   17    0   G
  2    0   15    0   G
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0    2    0   15   T
  0   13    0    4   C
  0    7    2    7   Y
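
One way to put a number on the stringency of a matrix is its information content: 2 bits for a position that allows a single base, 0 bits for a position where all four bases are equally likely. The Python sketch below compares the 10-position and 20-position matrices from this slide. Information content is offered here as an illustrative measure rather than the method used in the lecture, and the names `MATRIX_10BP`, `MATRIX_20BP`, `information_content`, and `total_information` are made up for this example.

```python
import math

# Rows in A, C, G, T order, copied from the two matrices on this slide.
MATRIX_10BP = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]
MATRIX_20BP = [
    (4, 0, 13, 0), (5, 0, 12, 0), (15, 0, 2, 0), (0, 17, 0, 0),
    (17, 0, 0, 0), (0, 0, 0, 17), (0, 0, 17, 0), (0, 13, 0, 4),
    (0, 17, 0, 0), (0, 17, 0, 0), (0, 0, 17, 0), (0, 0, 17, 0),
    (2, 0, 15, 0), (0, 17, 0, 0), (17, 0, 0, 0), (0, 0, 0, 17),
    (0, 0, 17, 0), (0, 2, 0, 15), (0, 13, 0, 4), (0, 7, 2, 7),
]


def information_content(row):
    """Bits of information at one position: 2 + sum(p * log2(p))."""
    total = sum(row)
    bits = 2.0
    for value in row:
        if value > 0:  # the 0 * log2(0) term is taken as 0
            p = value / total
            bits += p * math.log2(p)
    return bits


def total_information(matrix):
    """Sum of per-position information content over the whole matrix."""
    return sum(information_content(row) for row in matrix)


if __name__ == "__main__":
    print(round(total_information(MATRIX_10BP), 1))
    print(round(total_information(MATRIX_20BP), 1))
```

The longer matrix carries more total information, so random sequence is less likely to match it by chance and searches with it return fewer hits.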

Comparing genomes

For understanding genome organization.

For identifying functionally conserved regions / sequences:

o 3’, 5’ UTR (e.g., microRNA binding sites)

o Transcription factor binding sites / regulatory modules.

VISTA Genome Browser

Practice & Observe: cross-genome comparison using the VISTA Browser

Cautions with genome browser and description of genomic sequences

Coordinates change with every release/build of a genome. Refer to the genome release in your work and publications.

Predicted gene structure ≠ verified gene structure.

Identifying conserved regulatory modules

• Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.

• Functional requirements imply conservation at the binding-site (sequence) level.

Ways to Identify conserved regulatory modules

• Based on sequence similarity: MEME, rVista, whole-genome rVista for model organisms…

• Based on binding site identity: BLISS

Practice: Identify conserved TFBSs upstream of the human TNF gene.


Use precompiled TFBS conservation data.

Load genomic sequence.

Practice: Load the BED file of TF binding sites into the UCSC Genome Browser.
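
A BED file for this practice is plain tab-separated text with one feature per line: chromosome, 0-based start, end, name, score, and strand. The Python sketch below writes such a file from 1-based site coordinates. The function names and the example coordinates are placeholders invented for illustration; they are not real TF binding sites.

```python
def bed_line(chrom, start_1based, end_1based, name, score=0, strand="."):
    """Format one BED6 record from 1-based inclusive coordinates.

    BED uses a 0-based start and an exclusive end, so only the start
    needs to be shifted by one.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid coordinates")
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


def write_bed(path, sites, track_name="TFBS"):
    """Write a custom-track header followed by one BED line per site."""
    with open(path, "w") as out:
        out.write(f'track name="{track_name}"\n')
        for site in sites:
            out.write(bed_line(*site) + "\n")


if __name__ == "__main__":
    # Placeholder sites: (chrom, start, end, name, score, strand).
    example_sites = [
        ("chr6", 1001, 1010, "site_A", 850, "+"),
        ("chr6", 1201, 1220, "site_B", 600, "-"),
    ]
    write_bed("tfbs_example.bed", example_sites)
```

Because coordinates shift between genome builds, the file should be loaded against the same assembly that was used to compute the site positions.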

Large Data Set Analysis

Hardware considerations:

1.) Data storage.

o FASTA record of a protein (1,000 aa) ~ 1 KB.

o Human proteome, or Chromosome 21 ~ 50 MB.

o Human genome ~ 1.5 GB.

o HTS transcriptome analysis (4 samples @ 40 million reads each), original and derived data sets ~ 200 GB.

2.) Processors and RAM.

o Comparison: tblastn of 5 protein sequences against a 1.2 GB genome, ~15 sec CPU time. Mapping a single 10 M-read Illumina run to the human genome, ~15,000 CPU sec (> 4 hours).

o RAM < data size will greatly slow down the process.

3.) Operating system determines the availability of tools.

o Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.

o Easy control and automation.

o Portable to Mac OS X, but often requires recompiling the source code.
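
The storage and run-time figures above are back-of-the-envelope estimates, and the same arithmetic can be scripted when planning a project. The Python sketch below estimates the size of a FASTA record and converts CPU seconds to hours. The 20-character header, 60-character line width, and one byte per character are assumptions of this example.

```python
import math


def fasta_bytes(seq_len, header_len=20, line_width=60):
    """Estimate the size in bytes of one FASTA record (ASCII text)."""
    header = 1 + header_len + 1  # ">" + header text + newline
    newlines = math.ceil(seq_len / line_width)  # one newline per sequence line
    return header + seq_len + newlines


def cpu_hours(cpu_seconds):
    """Convert CPU seconds to hours."""
    return cpu_seconds / 3600


if __name__ == "__main__":
    print(fasta_bytes(1000))           # a 1,000 aa protein, roughly 1 KB
    print(round(cpu_hours(15000), 1))  # the read-mapping example, in hours
```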

Observe: demanding computation for large data set analysis.

Practice: log into UFHPC.
