Generic substitution-matrix-based sequence similarity evaluation
Gapped alignment:
Q:   M  A  T  W  L  I .
A:   M  A  -  W  T  V .
Scr: 5  4 -? 11 -1  3
Total = 22 - ? (gap penalty)
Ungapped alignment:
Q:   M  A  T  W  L  I .
A:   M  A  W  T  V  A .
Scr: 5  4 -2 -2  1 -1
Total: 5
BLOSUM62:
Gap opening: -6 ~ -15
Gap extension: -2 ~ -6
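The two totals above can be reproduced with a few lines of Python. This is only a sketch: the dictionary holds just the BLOSUM62 entries needed for these residue pairs, and the gap convention (first gap column costs the opening penalty, later columns the extension penalty) as well as the default of -6 are assumptions picked from the ranges listed above.

```python
# Subset of BLOSUM62 covering only the residue pairs in the two alignments.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("A", "A"): 4, ("W", "W"): 11,
    ("L", "T"): -1, ("I", "V"): 3, ("T", "W"): -2,
    ("L", "V"): 1, ("A", "I"): -1,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def alignment_score(query, subject, gap_open=-6, gap_extend=-2):
    """Sum column scores of a pre-aligned pair of sequences.

    Assumed convention: the first column of a gap costs gap_open and each
    further column of the same gap costs gap_extend.
    """
    total = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(q, s)
            in_gap = False
    return total


ungapped = alignment_score("MATWLI", "MAWTVA")
gapped = alignment_score("MATWLI", "MA-WTV")
print("ungapped:", ungapped, "gapped:", gapped)
```

With a gap opening penalty of -6 the gapped alignment still scores well above the ungapped one; at -15 the advantage shrinks but does not disappear.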
Position-specific matrices reflect the structure-function relationship of a given protein family
BID_MOUSE    I A R H L A Q I G D E M
BAD_MOUSE    Y G R E L R R M S D E F
BAK_MOUSE    V G R Q L A L I G D D I
BAXB_HUMAN   L S E C L K R I G D E L
BimS         I A Q E L R R I G D E F
HRK_HUMAN    T A A R L K A L G D E L
Egl-1        I G S K L A A M C D D F
Statistical representation (alignment column 9)
G: 5 -> 71%
S: 1 -> 14%
C: 1 -> 14%
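The percentages above are per-column residue frequencies. A minimal sketch of that calculation, using the seven sequences of the alignment (the function name and the zero-based column index are choices made for this example):

```python
from collections import Counter

# The seven aligned segments from the slide, one string per sequence.
ALIGNMENT = {
    "BID_MOUSE":  "IARHLAQIGDEM",
    "BAD_MOUSE":  "YGRELRRMSDEF",
    "BAK_MOUSE":  "VGRQLALIGDDI",
    "BAXB_HUMAN": "LSECLKRIGDEL",
    "BimS":       "IAQELRRIGDEF",
    "HRK_HUMAN":  "TAARLKALGDEL",
    "Egl-1":      "IGSKLAAMCDDF",
}


def column_frequencies(alignment, column):
    """Return {residue: (count, fraction)} for one zero-based column."""
    residues = [seq[column] for seq in alignment.values()]
    counts = Counter(residues)
    n = len(residues)
    return {res: (c, c / n) for res, c in counts.items()}


# Column 9 of the alignment is index 8 when counting from zero.
for residue, (count, fraction) in column_frequencies(ALIGNMENT, 8).items():
    print(f"{residue}: {count} -> {fraction:.0%}")
```

Repeating this for every column gives the position-specific matrix; fully conserved columns, such as the leucine at column 5, come out at 100%.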
Genomic sequence analysis
Genome organization/gene structure.
Comparing genome organization.
Identifying regulatory modules.
Genome Browser / Map viewer
o NCBI, Ensembl, species databases.
o Range selection, zoom in/out.
o Retrieving genomic sequences.
   o Fastacmd
   o Python script
Practice: retrieve genomic sequence using the genome browser
1. Identify the range that you would like to retrieve (start and end positions) by clicking on the features in the map. Rounded positions (e.g., xx,xxx,000) make it easier to map back later.
2. Enter the start and end positions in the data retrieval window.
Practice/observe: retrieve genomic sequence using fastacmd or a Python script
1. Fastacmd is a program distributed with BLAST for sequence retrieval. It takes an input file of sequence IDs and has strict requirements on the database format.
2. FindSeq_WithID.py and FindSeq_Partialmatch.py are simple Python scripts that retrieve sequences based on the FASTA identification line (the text following the “>”).
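The idea behind such scripts can be sketched as follows. This is not the code of FindSeq_WithID.py or FindSeq_Partialmatch.py, which are not shown here; the function names, the `partial` switch, and the demo records are all hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequences(lines, wanted, partial=False):
    """Return {header: sequence} for records matching any wanted string.

    Exact mode compares against the first word of the header (the ID);
    partial mode accepts a substring anywhere in the header line.
    """
    wanted = set(wanted)
    hits = {}
    for header, seq in read_fasta(lines):
        seq_id = header.split()[0] if header else ""
        if partial:
            matched = any(w in header for w in wanted)
        else:
            matched = seq_id in wanted
        if matched:
            hits[header] = seq
    return hits


# Hypothetical demo records; with a real file, pass open("genome.fa") instead.
demo = """>seq0 demo record
gtcttttttttaa
CTTATTTGAAGG
>seq1 another demo record
gaatataatgc
""".splitlines()

print(find_sequences(demo, ["seq0"]))
print(find_sequences(demo, ["another"], partial=True))
```

Reading the file record by record keeps memory use low, which matters once the FASTA file is a whole genome rather than a handful of sequences.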
Practice: Gene structure analysis using GENSCAN
1. Identify and save the DNA sequence file.
2. Upload it to a GENSCAN server (e.g., MIT, Pasteur Institute).
GENSCAN Result
Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
 1.05 Term -   1016    900  117  2  0  136   43    67 0.042   5.84
 1.04 Intr -   5852   5662  191  0  2   91  100   290 0.989  30.33
 1.03 Intr -   7544   7256  289  2  1  156  113   184 0.973  25.06
 1.02 Intr -  19042  18909  134  2  2   27   52    67 0.065  -2.13
 1.01 Init -  19898  19853   46  2  1   80  110    36 0.642   3.94
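A result table like this is easy to post-process. The sketch below parses the five exon rows shown above and keeps the exons whose probability (the P column) is at least 0.5; the threshold, the `Exon` record, and the function name are assumptions made for illustration, and only some of the 13 columns are kept.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    gene_exon: str   # Gn.Ex
    kind: str        # Type: Init, Intr, Term, ...
    strand: str      # S
    begin: int
    end: int
    length: int
    prob: float      # P column
    score: float     # Tscr column


GENSCAN_ROWS = """\
1.05 Term - 1016 900 117 2 0 136 43 67 0.042 5.84
1.04 Intr - 5852 5662 191 0 2 91 100 290 0.989 30.33
1.03 Intr - 7544 7256 289 2 1 156 113 184 0.973 25.06
1.02 Intr - 19042 18909 134 2 2 27 52 67 0.065 -2.13
1.01 Init - 19898 19853 46 2 1 80 110 36 0.642 3.94
"""


def parse_exons(text):
    """Parse whitespace-separated exon rows; skip header and ruler lines."""
    exons = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 13 or not f[3].isdigit():
            continue
        exons.append(Exon(f[0], f[1], f[2], int(f[3]), int(f[4]),
                          int(f[5]), float(f[11]), float(f[12])))
    return exons


exons = parse_exons(GENSCAN_ROWS)
confident = [e for e in exons if e.prob >= 0.5]
for e in confident:
    print(e.gene_exon, e.kind, e.begin, e.end, e.prob)
```

On the minus strand the begin coordinate is larger than the end coordinate, so the exon length is begin - end + 1.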
Basis of Gene structure prediction
• GC content
• Promoter signal (e.g., TATA box)
• Splicing signal
• Translation initiation signal
• …..
Probability modeling
Weighted scoring scheme
Detection
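GC content is the simplest of the signals listed above. A minimal sketch of how it could be computed, overall and in sliding windows; the window and step sizes and the example string are arbitrary choices for this example, not values used by any particular gene finder:

```python
def gc_content(seq):
    """Fraction of G and C in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=10, step=5):
    """Return (start, GC fraction) for each full window along the sequence."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


example = "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg"
print(round(gc_content(example), 2))
for start, gc in windowed_gc(example):
    print(start, round(gc, 2))
```

A predictor would combine such a measure with the other signals in a weighted score rather than use it alone.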
From data to model
>seq0
gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg
>seq1
gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt
>seq2
gcgctttacttaaCGTACTAGAAGCtaga
>seq3
gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactatactttggggactaagtcg
>seq4
gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattcttaacacgg
>seq5
caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt
>……………………………….
TATA-containing core promoter sequences
###MATCH_STATE 3
0.045627 # Symbol A probability
0.088656 # Symbol C probability
0.075249 # Symbol G probability
0.790468 # Symbol T probability
###MATCH_STATE 4
0.600385 # Symbol A probability
0.134107 # Symbol C probability
0.106520 # Symbol G probability
0.158987 # Symbol T probability
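Per-position symbol probabilities like these are estimated from aligned examples. The sketch below does this for the six promoter sequences shown earlier, assuming that the capitalized 12-base segment in each one marks the aligned motif site. It uses simple counting with a pseudocount of 1, which is an assumption; the values it prints will not match the match-state probabilities above, which come from a model trained on a larger data set.

```python
import re

PROMOTERS = {
    "seq0": "gtcttttttttaaCTTATTTGAAGGgcctcggtaaccg",
    "seq1": "gaatataatgctttcttggtggtgggatcattttagggattccgccctccTTTATAAAATACgcctagt",
    "seq2": "gcgctttacttaaCGTACTAGAAGCtaga",
    "seq3": "gttgtttgggttgaatccgTGCCTGAAAGTGaataattagacagaactatactttggggactaagtcg",
    "seq4": "gctttCATATGAATTCCtcttcgtcggtaatcatgtataaggtaaattcttaacacgg",
    "seq5": "caactacaagAGCGTATAAGGGctcgggaacccgaagacggtgagacatt",
}


def motif_sites(seqs):
    """Extract the first run of uppercase bases from each sequence."""
    sites = []
    for seq in seqs:
        match = re.search(r"[ACGT]+", seq)
        if match:
            sites.append(match.group())
    return sites


def position_probabilities(sites, pseudocount=1.0):
    """Estimate P(base) at each position from equal-length aligned sites."""
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({b: (column.count(b) + pseudocount) / total for b in "ACGT"})
    return matrix


sites = motif_sites(PROMOTERS.values())
for position, probs in enumerate(position_probabilities(sites), start=1):
    print(position, {base: round(p, 2) for base, p in probs.items()})
```

The pseudocount keeps any base from getting probability zero, which matters when only a handful of training sites are available.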
From data to model
>SNR17A_15_780119_780275_INTRON
GUAUGUAAUAUACCCCAAACAUUUUACCCACAAAAAACCAGGAUUUGAAA
ACUAUAGCAUCUAAAAGUCUUAGGUACUAGAGUUUUCAUUUCGGAGCAGG
CUUUUUGAAAAAUUUAAUUCAACCAUUGCAGCAGCUUUUGACUAACACAU
UCUACAG
>SNR17B_16_281502_281373_INTRON
GUAUGUUUUAUACCAUAUACUUUAUUAGGAAUAUAACAAAGCAUACCCAA
UAAUUAGGCAAUGCGAUUGUCGUAUUCAACAACCAUCUUCUAUUUCACCA
GCUUCAGGUUUUGACUAACACAUUCAACAG
>YAL001C_1_151163_147591_INTRON_71_160
GUAUGUUCAUGUCUCAUUCUCCUUUUCGGCUCCGUUUAGGUGAUAAACGU
ACUAUAUUGUGAAAGAUUAUUUACUAACGACACAUUGAAG
>YAL003W_1_142172_143158_INTRON_81_446
GUAUGUUCCGAUUUAGUUUACUUUAUAGAUCGUUGUUUUUCUUUCUUUUU
UUUUUUUCCUAUGGUUACAUGUAAAGGGAAGUUAACUAAUAAUGAUUACU
UUUUUUCGCUUAUGUGAAUGAUGAAUUUAAUUCUUUGGUCCGUGUUUAUG
AUGGGAAGUAAGACCCCCGAUAUGAGUGACAAAAGAGAUGUGGUUGACUA
UCACAGUAUCUGACGAUAGCACAGAGCAGAGUAUCAUUAUUAGUUAUCUG
UUAUUUUUUUUUCCUUUUUUGUUCAAAAAAAGAAAGACAGAGUCUAAAGA
>………
……………………….
500 verified exon sequences
Modeling
Basis of Gene structure prediction
o GC content
o Promoter signal (e.g., TATA box)
o Splicing signal
o Translation initiation signal
o …..
Final score / p value
Accuracy
              Accuracy per nucleotide   Accuracy per exon
Method        Sn     Sp     AC          Sn     Sp     (Sn+Sp)/2   ME     WE
GENSCAN       0.93   0.93   0.91        0.78   0.81   0.80        0.09   0.05
FGENEH        0.77   0.85   0.78        0.61   0.61   0.61        0.15   0.11
GeneID        0.63   0.81   0.67        0.44   0.45   0.45        0.28   0.24
GeneParser2   0.66   0.79   0.66        0.35   0.39   0.37        0.29   0.17
GenLang       0.72   0.75   0.69        0.50   0.49   0.50        0.21   0.21
GRAILII       0.72   0.84   0.75        0.36   0.41   0.38        0.25   0.10
SORFIND       0.71   0.85   0.73        0.42   0.47   0.45        0.24   0.14
Xpound        0.61   0.82   0.68        0.15   0.17   0.16        0.32   0.13
Sn: Sensitivity (fraction of true coding nucleotides or exons that are predicted)
Sp: Specificity (fraction of predicted coding nucleotides or exons that are true)
AC: Approximate Correlation
ME: Missing Exons
WE: Wrong Exons
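As a sketch of how Sn and Sp are obtained, the code below compares a set of annotated exons with a set of predicted exons. It uses the definitions common in gene-prediction benchmarks, Sn = TP / (TP + FN) and Sp = TP / (TP + FP), and counts an exon as correct only if both boundaries match exactly. The interval coordinates are invented for the example.

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_accuracy(actual_exons, predicted_exons):
    """Per-nucleotide sensitivity and specificity."""
    actual = positions(actual_exons)
    predicted = positions(predicted_exons)
    true_pos = len(actual & predicted)
    sn = true_pos / len(actual) if actual else 0.0
    sp = true_pos / len(predicted) if predicted else 0.0
    return sn, sp


def exon_accuracy(actual_exons, predicted_exons):
    """Per-exon sensitivity and specificity; both boundaries must match."""
    actual, predicted = set(actual_exons), set(predicted_exons)
    exact = len(actual & predicted)
    sn = exact / len(actual) if actual else 0.0
    sp = exact / len(predicted) if predicted else 0.0
    return sn, sp


# Invented example: one exon predicted exactly, one shifted by 50 bases.
actual = [(1, 100), (201, 300)]
predicted = [(1, 100), (251, 350)]
print(nucleotide_accuracy(actual, predicted))
print(exon_accuracy(actual, predicted))
```

The shifted exon still earns partial credit at the nucleotide level but none at the exon level, which is why the per-exon columns in the table are consistently lower.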
DNA Pattern – Transcription factor binding sites
  A    C    G    T   Consensus
 40   13   23   23   N
 20    3   70    5   G
 55    3   40    0   R
  0   93    0    5   C
 53    8    8   30   W
 15    0    3   82   T
  0    0  100    0   G
  0   50    0   50   Y
  0   68    0   30   C
 12   35    3   48   Y
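The consensus column can be derived from the frequencies with a simple rule. The sketch below uses one commonly applied convention: a single base if it exceeds 50% and is at least twice as frequent as the runner-up, a two-base IUPAC code if the top two bases together exceed 75%, and N otherwise. The source does not state which rule produced its consensus, so treat this as an assumption; it does reproduce the column shown above.

```python
# Two-base IUPAC ambiguity codes.
IUPAC_PAIRS = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("AC"): "M", frozenset("GT"): "K",
}

# Rows of the matrix above, in A, C, G, T order.
MATRIX = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]


def consensus_symbol(counts):
    """Return the IUPAC consensus symbol for one (A, C, G, T) row."""
    total = sum(counts)
    ranked = sorted(zip(counts, "ACGT"), reverse=True)
    (top, top_base), (second, second_base) = ranked[0], ranked[1]
    if top > 0.5 * total and top >= 2 * second:
        return top_base
    if top + second > 0.75 * total:
        return IUPAC_PAIRS[frozenset(top_base + second_base)]
    return "N"


consensus = "".join(consensus_symbol(row) for row in MATRIX)
print(consensus)
```

Row 3 (A 55, G 40) shows why the second condition matters: A is the majority base, but G is too frequent to ignore, so the position is written as R.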
Stringency of the matrices
P53_02 (Consensus – 10 bp)
  A    C    G    T   Consensus
 40   13   23   23   N
 20    3   70    5   G
 55    3   40    0   R
  0   93    0    5   C
 53    8    8   30   W
 15    0    3   82   T
  0    0  100    0   G
  0   50    0   50   Y
  0   68    0   30   C
 12   35    3   48   Y
P53_01 (Consensus – 20 bp)
  A    C    G    T   Consensus
  4    0   13    0   G
  5    0   12    0   G
 15    0    2    0   A
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0   13    0    4   C
  0   17    0    0   C
  0   17    0    0   C
  0    0   17    0   G
  0    0   17    0   G
  2    0   15    0   G
  0   17    0    0   C
 17    0    0    0   A
  0    0    0   17   T
  0    0   17    0   G
  0    2    0   15   T
  0   13    0    4   C
  0    7    2    7   Y
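One way to put a number on the stringency of the 10 bp and 20 bp matrices is their information content: a position where only one base is ever observed contributes 2 bits, and a position where all four bases are equally likely contributes 0. The sketch below applies that standard measure to both matrices; it is offered as an illustration, not as the comparison method of the original slide, and it ignores small-sample correction.

```python
import math

# Rows in A, C, G, T order, copied from the two matrices above.
TEN_BP = [
    (40, 13, 23, 23), (20, 3, 70, 5), (55, 3, 40, 0), (0, 93, 0, 5),
    (53, 8, 8, 30), (15, 0, 3, 82), (0, 0, 100, 0), (0, 50, 0, 50),
    (0, 68, 0, 30), (12, 35, 3, 48),
]
TWENTY_BP = [
    (4, 0, 13, 0), (5, 0, 12, 0), (15, 0, 2, 0), (0, 17, 0, 0),
    (17, 0, 0, 0), (0, 0, 0, 17), (0, 0, 17, 0), (0, 13, 0, 4),
    (0, 17, 0, 0), (0, 17, 0, 0), (0, 0, 17, 0), (0, 0, 17, 0),
    (2, 0, 15, 0), (0, 17, 0, 0), (17, 0, 0, 0), (0, 0, 0, 17),
    (0, 0, 17, 0), (0, 2, 0, 15), (0, 13, 0, 4), (0, 7, 2, 7),
]


def column_information(counts):
    """Information content of one position in bits (2 minus Shannon entropy)."""
    total = sum(counts)
    info = 2.0
    for count in counts:
        if count > 0:
            p = count / total
            info += p * math.log2(p)
    return info


def matrix_information(matrix):
    """Total information content of a matrix, summed over positions."""
    return sum(column_information(row) for row in matrix)


print("10 bp matrix:", round(matrix_information(TEN_BP), 1), "bits")
print("20 bp matrix:", round(matrix_information(TWENTY_BP), 1), "bits")
```

The longer matrix carries far more information, so it matches fewer random sequences but may also miss weaker genuine sites.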
Comparing genomes
For understanding genome organization.
For identifying functionally conserved regions / sequences.
   3’, 5’ UTR (e.g., microRNA binding sites)
   Transcription factor binding sites / regulatory modules.
Vista Genome Browser
Practice & Observe: cross-genome comparison using the VISTA Browser
Cautions with genome browser and description of genomic sequences
Coordinates change with every release/build of the genome. Always state the genome release in your work and publications.
Predicted gene structure ≠ verified gene structure.
Identifying conserved regulatory modules
• Regulatory module: a set of TF binding sites that controls a particular aspect of transcriptional regulation.
• Functional requirements lead to conservation at the binding-site (sequence) level.
Ways to Identify conserved regulatory modules
• Based on sequence similarity: MEME, rVista, Whole genome rVista for model organisms…
• Based on binding site identity: BLISS
Practice: Identify conserved TFBSs upstream of the human TNF gene.
Use precompiled TFBS conservation data.
Load genomic sequence.
Practice: Load the BED file of TF binding sites to UCSC genome browser.
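A BED file for such a custom track is plain tab-separated text: chromosome, start, end, name, score, strand. BED starts are 0-based and ends are exclusive, so a site reported with 1-based inclusive coordinates needs its start reduced by one. The sketch below shows that conversion; the site names, coordinates, scores, and track label are made up for illustration and are not real TNF binding sites.

```python
def to_bed_line(chrom, start_1based, end_1based, name, score=0, strand="+"):
    """Format one 6-column BED record from 1-based inclusive coordinates."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    fields = [chrom, str(start_1based - 1), str(end_1based),
              name, str(score), strand]
    return "\t".join(fields)


def to_bed(sites, track_name="TFBS_demo"):
    """Build BED text with a track line followed by one record per site."""
    lines = [f'track name="{track_name}" description="Predicted TF binding sites"']
    lines.extend(to_bed_line(*site) for site in sites)
    return "\n".join(lines) + "\n"


# Hypothetical sites: (chrom, start, end, name, score, strand).
demo_sites = [
    ("chr6", 1001, 1010, "hypothetical_site_1", 900, "+"),
    ("chr6", 1250, 1261, "hypothetical_site_2", 650, "-"),
]

print(to_bed(demo_sites), end="")
```

The chromosome names and coordinates must refer to the same genome build that is selected in the browser, otherwise the features land in the wrong place.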
Large Data Set Analysis.
Hardware considerations:
1.) Data storage.
   FASTA record of a protein (1,000 aa) ~ 1 KB.
   Human proteome, or chromosome 21 ~ 50 MB.
   Human genome ~ 1.5 GB.
   HTS transcriptome analysis (4 samples @ 40 million reads each), original and derived data sets ~ 200 GB.
2.) Processors and RAM.
   Comparison: tblastn of 5 protein sequences against a 1.2 GB genome, ~15 sec CPU time.
   Mapping a single 10 M read Illumina run to the human genome, ~15,000 CPU sec (> 4 hours).
   When RAM is smaller than the data size, processing slows down greatly.
3.) Operating system determines the availability of tools.
   Linux is the default development system for most bioinformatics groups. It is also the OS of the UFHPC.
   Easy control and automation.
   Portable to Mac OS X, but often requires recompiling the source code.
Observe: demanding computation for large data set analysis.
Practice: log into UFHPC.
First step