Genomics and Bioinformatics
The "new" biology
What is genomics?
Genome: All the DNA contained in the cell of an organism
Genomics: The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH)
A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH)
Which organism’s genome was sequenced first?
Genome sequencing chronology
Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000
http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg
http://www.alsa.org/research/article.cfm?id=822
http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg
http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg
Genome sequencing chronology
Year  Organism                Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans  First multi-cellular organism 97,000,000        ~19,000
1999  Human chromosome 22     First human chromosome        49,000,000        673
2000  Arabidopsis thaliana    First plant genome            150,000,000       ~25,000
2001  Human                   First human genome            3,000,000,000     ~30,000
http://www.sih.m.u-tokyo.ac.jp/chem1.gif
http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg
Genome sequencing projects (as of 1/26/2007)
Sequencing strategies: Hierarchical shotgun sequencing
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Genome size range
What is in the genomes? Why is there such a big difference in size?
[Figure: genome size ranges in bp, on a log scale from 10^4 to 10^11, for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
Information contents in a genome
Genes
  Protein coding genes
  RNA genes
Regulatory elements
  Gene expression control
  Chromatin remodeling
  Matrix attachment sites
“Non-functional” elements
  Selfish elements
  “Junk” DNA ??
The “central dogma” of molecular biology
Central dogma
[Diagram: DNA -> RNA (transcription) -> protein (translation); DNA -> DNA (replication)]
Expanded “central dogma” of molecular biology
A more comprehensive view
[Diagram: DNA -> RNA (transcription) -> protein (translation) -> metabolite -> phenotype; DNA -> DNA (replication)]
New disciplines due to the advance in genomics
Omics
[Diagram: DNA -> RNA -> protein -> metabolite -> phenotype, annotated with disciplines and data types]
Disciplines: structural genomics, transcriptomics, proteomics, metabolomics
Data types:
  Genomic DNA sequences
  Transcript sequences, microarray data
  Cis-elements, TF binding sites
  Epigenetic regulation
  Shotgun protein sequencing, subcellular location
  Post-translational modification, protein interaction, protein structure
  Metabolite concentration, metabolic flux
  Genetic interactions, systematic KO
  Disease information
Nature omics gateway
http://www.nature.com/omics/subjects/index.html
Three perspectives of our biological world
The cellular level, the individual, the tree of life
Rosenzweig et al., 2002. Conservation Biol.
Image: http://www.tolweb.org/tree/
Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html
~10^14 cells per individual
2-100x10^6 species
~3x10^4 genes
Further complications
Cell-cell interactions
Cell types
Environmental conditions
Developmental programming
Interactions at the organismal level
Interactions at the population, ecosystem level
Definition of bioinformatics
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
Q: What kinds of data are we talking about?
http://www.bisti.nih.gov/
Example: Sequence assembly
Cut into ~150kb pieces
Clone into Bacterial Artificial Chromosome (BAC)
Mapped to determine order of the BAC clones (golden/tiling path)
Shear a BAC clone randomly
Sequencing
Assemble sequence reads
http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html
Sequence assembly
Challenges
  The presence of gaps
    Due to incomplete coverage
  Sequencing error and quality issues: worse at the end of reactions
    So can’t rely on perfectly identical sequences all the time
  Sequences derived from one strand of DNA
    Need to take orientations of reads into account
  Non-random sequencing of DNA
  Presence of repeats
http://www.cbcb.umd.edu/research/assembly_primer.shtml
[Figure: correct layout vs. mis-assembly]
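Two of the challenges listed for sequence assembly, gaps from incomplete coverage and reads coming from either strand, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the toy genome, the read length, the read counts and all function names are made up for illustration, and reads are sampled uniformly at random, which real sequencing does not guarantee.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_reads(genome, n_reads, read_len, rng):
    """Sample reads at uniformly random positions, each from a random strand."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        fragment = genome[start:start + read_len]
        if rng.random() < 0.5:  # read derived from the reverse strand
            fragment = reverse_complement(fragment)
        reads.append((start, fragment))
    return reads


def count_uncovered(genome_len, reads, read_len):
    """Count genome positions that no read covers (these form the gaps)."""
    covered = [False] * genome_len
    for start, _ in reads:
        for pos in range(start, start + read_len):
            covered[pos] = True
    return covered.count(False)


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(1000))  # hypothetical toy genome
for n_reads in (20, 50, 200):
    reads = simulate_reads(genome, n_reads, read_len=50, rng=rng)
    gaps = count_uncovered(len(genome), reads, read_len=50)
    print(n_reads, "reads:", gaps, "uncovered positions")
```

An assembler has to compare each read with both the other reads and their reverse complements, because the strand of origin is not known in advance.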
Overlap-layout consensus
The relationships between reads can be represented as a graph
  Nodes (vertices): reads
  Edges (lines): connections between “overlapping reads”
Goal: identify a path through that graph that visits each node exactly once (a Hamiltonian path)
[Figure: reads 1-4 laid out along the genome, and the corresponding graph with nodes 1-4]
http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
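The overlap-layout idea can be sketched on a handful of toy reads: build a graph whose edges are suffix-prefix overlaps, then look for a path that visits each read exactly once. This is an illustrative sketch, not how production assemblers work: the reads are hypothetical, error-free and all from one strand, and the brute-force search over permutations is only feasible for a few reads.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Directed edges (a, b) -> overlap length, for overlapping read pairs."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k > 0:
                    graph[(a, b)] = k
    return graph


def assemble(reads, min_len=3):
    """Brute-force search for a Hamiltonian path, then merge reads along it."""
    graph = overlap_graph(reads, min_len)
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            contig = order[0]
            for a, b in zip(order, order[1:]):
                contig += b[graph[(a, b)]:]  # append the non-overlapping tail
            return contig
    return None  # no path visits every read exactly once


reads = ["TACGTT", "ATGGCG", "GTTAGC", "GCGTAC"]  # hypothetical toy reads
print(assemble(reads))  # ATGGCGTACGTTAGC
```

Repeats add extra edges to the graph, so more than one path can visit every read; that ambiguity is one source of mis-assembly.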
Example: Gene prediction
How can we identify functional elements in the genomes?
How can we assign functions to these elements?
How can we determine/predict the structures of these elements?
How can we reconstruct networks describing the relationships and dynamics between these elements?
How can we link genotypes to phenotypes?
Characteristics of protein coding genes
Similarity to other genes
  Assuming there is some level of conservation.
  Substitutions that change amino acids vs. those that don’t.
http://www.mun.ca/biology/scarr/MGA2_03-20.html
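The distinction between substitutions that change the amino acid and those that do not can be made concrete with the standard genetic code. The sketch below assumes two aligned, gap-free, in-frame DNA sequences; the function name and the toy sequences are hypothetical, and it only counts differing codons rather than estimating substitution rates.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(cds_a, cds_b):
    """Count differing codons as synonymous or nonsynonymous."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("expected two aligned, gap-free, in-frame sequences")
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3], cds_b[i:i + 3]
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            counts["synonymous"] += 1  # same amino acid
        else:
            counts["nonsynonymous"] += 1  # amino acid changes
    return counts


# Hypothetical toy sequences: ATG CTT GAA vs. ATG CTC GTA
print(classify_substitutions("ATGCTTGAA", "ATGCTCGTA"))
```

Here CTT and CTC both encode leucine (synonymous), while GAA to GTA changes glutamate to valine (nonsynonymous). In a conserved protein coding region, synonymous differences are expected to be relatively more common.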
Hidden Markov Model and gene finding
Goal: Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish)
How is the probability determined?
p = p(EL-CHI)*p(CHI-MAD) = 0.5*0.4 = 0.2
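The trip analogy amounts to multiplying step probabilities along a path and keeping the path with the largest product. The sketch below uses the two values given in the slide (EL-CHI = 0.5, CHI-MAD = 0.4); the alternative route through "MKE" and its probabilities are hypothetical and exist only so that there is something to compare against. A real gene-finding HMM also has emission probabilities and uses dynamic programming (the Viterbi algorithm) instead of enumerating paths; neither is shown here.

```python
def path_probability(path, edge_prob):
    """Multiply the probabilities of the consecutive steps along a path."""
    p = 1.0
    for step in zip(path, path[1:]):
        p *= edge_prob[step]
    return p


def best_path(start, goal, edge_prob):
    """Enumerate all simple paths from start to goal; keep the most probable."""
    best, best_p = None, 0.0
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal:
            p = path_probability(path, edge_prob)
            if p > best_p:
                best, best_p = path, p
            continue
        for (a, b) in edge_prob:
            if a == path[-1] and b not in path:
                stack.append(path + [b])
    return best, best_p


edge_prob = {
    ("EL", "CHI"): 0.5,   # from the slide
    ("CHI", "MAD"): 0.4,  # from the slide
    ("EL", "MKE"): 0.3,   # hypothetical alternative route
    ("MKE", "MAD"): 0.5,  # hypothetical alternative route
}
path, p = best_path("EL", "MAD", edge_prob)
print(path, round(p, 3))
```

With these numbers the route through CHI scores 0.5 * 0.4 = 0.2 and the hypothetical route scores 0.3 * 0.5 = 0.15, so the first is chosen.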
Example: Sequence alignment
Align retinol-binding protein and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
>RBP
MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIV
>lactoglobulin
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
Goal of pairwise sequence alignment (PSA)
Find an alignment between 2 sequences with the maximum score
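One standard way to find the maximum-scoring global alignment is dynamic programming (the Needleman-Wunsch algorithm). The sketch below computes only the best score, not the alignment itself. The scoring scheme (match +1, mismatch -1, gap -2) is an assumption chosen for illustration; protein alignments such as the RBP example normally use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "AGT"))   # 1
```

The second call aligns ACGT with A-GT: three matches and one gap give 3 - 2 = 1.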
Extreme value distribution
Normal vs. extreme value distribution
[Figure: probability (0 to 0.40) vs. x (-5 to 5) for the normal distribution and the extreme value distribution]
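The difference between the two curves is easiest to see in the right tail. The sketch below assumes the standard forms of both distributions (normal with mean 0 and standard deviation 1; extreme value, or Gumbel, with location 0 and scale 1), since the parameters used in the figure are not recoverable from the slide.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def gumbel_pdf(x):
    """Density of the standard extreme value (Gumbel) distribution."""
    return math.exp(-x - math.exp(-x))


def normal_tail(x):
    """P(X > x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def gumbel_tail(x):
    """P(X > x) for the standard extreme value distribution."""
    return 1 - math.exp(-math.exp(-x))


for x in range(-5, 6):
    print(f"{x:3d}  normal={normal_pdf(x):.4f}  extreme={gumbel_pdf(x):.4f}")
print("P(X > 3):", f"normal={normal_tail(3):.5f}", f"extreme={gumbel_tail(3):.5f}")
```

The extreme value distribution is skewed to the right, so a high score is far less surprising under it than under a normal distribution. This matters when judging whether an alignment score is significant.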
Example: Microarray
A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion
http://shadygrove.umbi.umd.edu/microarray/Microarray.gif
Microarray data analysis
A simplified pipeline
http://www.microarray.lu/images/overview_1.jpg
What’s in the CEL files
Intensities of perfect and mismatch probes
library(affy)   # provides pm(); M is assumed to be an AffyBatch, e.g. from ReadAffy()
#### Dimension of the data matrix
nrow(M); ncol(M)
### Perfect match
pm <- pm(M)     # perfect match intensities
dim(pm)         # dimension of the pm matrix
pm[1:5,]        # the first five rows
summary(pm)     # summary stat for the pm matrix
     GSM131151.CEL GSM131152.CEL GSM131153.CEL GSM131160.CEL GSM131161.CEL GSM131162.CEL
[1,]         252.5         267.0         349.0         424.8         213.5         237.8
[2,]         138.0         129.8         147.5         335.5         215.3         142.3
[3,]         172.3         155.5         174.8         411.8         241.0         128.3
[4,]         163.3         142.8         155.5         494.3         225.5         119.5
[5,]         259.5         257.3         245.3         505.5         308.8         217.0

 GSM131151.CEL     GSM131152.CEL     GSM131153.CEL     GSM131160.CEL
 Min.   :   56.3   Min.   :   67.5   Min.   :   69.5   Min.   :   96.0
 1st Qu.:  144.3   1st Qu.:  143.3   1st Qu.:  157.3   1st Qu.:  303.6
 Median :  212.5   Median :  215.0   Median :  234.8   Median :  414.5
 Mean   :  423.1   Mean   :  437.5   Mean   :  458.4   Mean   :  648.2
 3rd Qu.:  383.5   3rd Qu.:  397.8   3rd Qu.:  426.0   3rd Qu.:  637.0
 Max.   :39818.5   Max.   :39268.0   Max.   :28628.0   Max.   :24854.5
Probe intensity behaviors between arrays
Distributions vary widely between experiments
### Summarize the intensity
par(mfrow=c(1,2))   # get a plotting region with 1 row, 2 col
hist(M)             # generate log2 histograms
boxplot(M)          # generate log2 boxplots
[Figure: histograms and boxplots of log intensity for each array]
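The point that distributions vary between arrays can be checked even on the five perfect-match probes printed in the output. The sketch below is written in Python rather than R and uses only those five rows, so it illustrates the comparison and does not replace the full histograms and boxplots; the variable names are hypothetical.

```python
import math
import statistics

arrays = ["GSM131151.CEL", "GSM131152.CEL", "GSM131153.CEL",
          "GSM131160.CEL", "GSM131161.CEL", "GSM131162.CEL"]

# The five perfect-match probe rows shown in the output (one column per array).
pm_rows = [
    [252.5, 267.0, 349.0, 424.8, 213.5, 237.8],
    [138.0, 129.8, 147.5, 335.5, 215.3, 142.3],
    [172.3, 155.5, 174.8, 411.8, 241.0, 128.3],
    [163.3, 142.8, 155.5, 494.3, 225.5, 119.5],
    [259.5, 257.3, 245.3, 505.5, 308.8, 217.0],
]

# Median log2 intensity per array, the centre line a boxplot would show.
log2_medians = {
    name: statistics.median(math.log2(row[j]) for row in pm_rows)
    for j, name in enumerate(arrays)
}

for name, value in log2_medians.items():
    print(f"{name}: median log2 intensity = {value:.2f}")
```

On these five probes GSM131160.CEL sits about one log2 unit above most of the other arrays. Differences of this kind are what normalization is meant to remove before arrays are compared.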
Example: Identification of cis-elements
The on-off switches and rheostats of a cell operating at the gene level.
They control whether and how vigorously genes will be transcribed into RNAs.
http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml
Motif model: Position Frequency Matrix (PFM)
f_{b,i}: frequency of base b at the i-th position
D’haeseleer (2006) Nature Biotech. 24:423
Motif model: Position Weight Matrix (PWM)
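Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana)
$W_{b,i} = \ln \frac{(n_{b,i} + p_b)/(N + 1)}{p_b}$
where n_{b,i} is the number of times base b occurs at position i, N is the number of sites, and p_b is the background frequency of base b.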
1 2 3 4 5
A 8 0 4 4 2
T 0 0 0 2 2
G 0 8 4 2 2
C 0 0 0 0 2
Position Frequency Matrix
1 2 3 4 5
A 1.1 -2.2 0.4 0.4 -0.2
T -2.2 -2.2 -2.2 -0.2 -0.2
G -2.2 1.6 1.0 0.3 0.3
C -2.2 -2.2 -2.2 -2.2 0.3
Position Weight Matrix
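The conversion from the frequency matrix to the weight matrix can be reproduced in a few lines. The sketch assumes the weight is W(b,i) = ln(((n(b,i) + p_b) / (N + 1)) / p_b), with N = 8 sites and the background frequencies given for Arabidopsis thaliana; this reading matches the weights in the table to within rounding. The `score_site` helper is a hypothetical addition that shows how such a matrix is typically used.

```python
import math

# Position Frequency Matrix from the slide (counts per base, positions 1-5).
pfm = {
    "A": [8, 0, 4, 4, 2],
    "T": [0, 0, 0, 2, 2],
    "G": [0, 8, 4, 2, 2],
    "C": [0, 0, 0, 0, 2],
}
background = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}


def pfm_to_pwm(pfm, background):
    """Convert counts to log-odds weights, using p_b as a pseudocount."""
    n_sites = sum(counts[0] for counts in pfm.values())  # N, here 8
    return {
        base: [
            math.log((n + background[base]) / (n_sites + 1) / background[base])
            for n in counts
        ]
        for base, counts in pfm.items()
    }


def score_site(site, pwm):
    """Sum the position-specific weights of a candidate site (hypothetical helper)."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm = pfm_to_pwm(pfm, background)
for base in "ATGC":
    print(base, [round(w, 1) for w in pwm[base]])
print("AGGAC:", round(score_site("AGGAC", pwm), 2))
print("TTTTT:", round(score_site("TTTTT", pwm), 2))
```

A site that matches the frequent bases at each position gets a positive score, while a poor match scores well below zero. The pseudocount p_b keeps the logarithm finite where a base was never observed.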
Example: Cis-regulatory logic
Based on a high confidence set of binding sites: 3,353 interactions
between 116 regulators and 1,296 promoters
Harbison et al. (2004) Nature 431:99
Identification of putative cis elements
Pearson's correlation coefficient as the similarity measure.
k-means clustering to identify co-regulated genes.
Motifs identified only with AlignACE
Beer and Tavazoie (2004) Cell 117:185
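Pearson's correlation coefficient measures whether two expression profiles rise and fall together, regardless of their absolute levels. The sketch below computes it from the definition; the three expression profiles are hypothetical values made up for illustration, and the clustering step itself is not shown.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles across five conditions.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 20.0, 31.0, 39.0, 50.0]  # same trend, different scale
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]       # opposite trend

print(round(pearson(gene_a, gene_b), 3))
print(round(pearson(gene_a, gene_c), 3))
```

Genes such as `gene_a` and `gene_b`, whose profiles are highly correlated, would tend to end up in the same cluster; `gene_c` would not.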
Bayesian network
Bayes' theorem
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Bayesian network
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$
Charniak (1991) Bayesian networks without tears
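Both formulas can be tried on a two-node network. The sketch below assumes a hypothetical network in which a transcription factor being bound (`tf_bound`) influences whether a gene is on (`gene_on`); the variable names and all probabilities are invented for illustration. It computes a posterior with Bayes' theorem and the joint probability as a product of P(X_i | parents(X_i)).

```python
from itertools import product


def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) from Bayes' theorem, with P(B) by total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


def joint_probability(assignment, parents, cpt):
    """P(X1, ..., Xn) as the product of P(Xi | parents(Xi))."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p_true = cpt[var][parent_values]  # P(var = True | parent values)
        p *= p_true if value else 1 - p_true
    return p


# Hypothetical network: tf_bound -> gene_on
parents = {"tf_bound": (), "gene_on": ("tf_bound",)}
cpt = {
    "tf_bound": {(): 0.2},                       # P(tf_bound)
    "gene_on": {(True,): 0.9, (False,): 0.1},    # P(gene_on | tf_bound)
}

posterior = bayes_posterior(0.2, 0.9, 0.1)  # P(tf_bound | gene_on)
print(round(posterior, 3))

total = sum(
    joint_probability({"tf_bound": t, "gene_on": g}, parents, cpt)
    for t, g in product([True, False], repeat=2)
)
print(round(total, 3))  # joint probabilities sum to 1
```

Observing that the gene is on raises the probability that the factor is bound from the prior of 0.2 to 0.18 / 0.26, about 0.69.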
Final example: Relationships between sequences
Sanger and colleagues (1950s): 1st sequence
Insulin from various mammals
Trees
An acyclic, undirected graph with nodes and edges
[Figure: a rooted tree drawn against a time axis and an unrooted tree for OTUs A-E, with branch lengths (one unit marked). Labels: operational taxonomic unit, ancestral taxonomic units, external branch, internal branch]
Li 1997. Molecular Evolution. p101
Enumerating trees
Suppose there are n OTUs (n ≥ 3)
Bifurcating rooted trees:
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted trees:
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
For 10 OTUs
  3.4x10^7 possible rooted trees
  2.0x10^6 possible unrooted trees
http://w3.uniroma1.it/cogfil/philotrees.jpg
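The counts for 10 OTUs follow from the standard formulas for bifurcating trees, N_R = (2n-3)! / (2^(n-2) (n-2)!) and N_U = (2n-5)! / (2^(n-3) (n-3)!). The sketch below evaluates them with exact integer arithmetic; the function names are hypothetical.

```python
from math import factorial


def rooted_trees(n):
    """Number of bifurcating rooted trees for n OTUs."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n OTUs."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For n = 10 this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the rounded values of 3.4x10^7 and 2.0x10^6. The counts grow so fast that exhaustively evaluating every tree is impractical for more than a handful of sequences.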
Impacts of genomics and bioinformatics
New ways to ask and answer questions?
  Hypothesis driven vs. data driven
  A matter of scale
  A matter of integration
  Quantitative emphasis
  Multidisciplinary approaches
How is genomics different from genetics?
  Whole genome approach versus a few genes
  Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion.
  Genetics looks at single genes, one at a time, as a snapshot.
  Genomics is trying to look at all the genes as a dynamic system, over time, and determine how they interact and influence biological pathways and physiology, in a much more global sense.
The END