The MoBIoS Project: Molecular Biological Information System
Daniel P. Miranker
Dept. of Computer Sciences &
Center for Computational Biology and Bioinformatics
University of Texas
Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang
Problem:
In the life sciences, database management systems (DBMS) serve as glorified file managers.
Little use of sophisticated data and pattern-based retrieval
Real scientific and technological problems
When biological data is put into an RDBMS
• Primary data is stored in text or blob fields
– Annotations may be relational
• Data retrieval
– Filter DB, sequential dump, O(n), to utilities
• E.g. BLAST
Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. Coli    membrane   AGGCCTA
Linear Data Scans, O(n), Endemic in Life Sciences
Sequences: DNA, RNA, Protein databases
Mass spectra: proteomics
Small molecules & protein structure: protein interaction, rational drug design
Pathways (graphs)
Phylogenies (graphs, trees in particular)
Scope: To Find Common Ground, Both Biology and DBMSs Have to Move
[Diagram labels: DBMS; Biological Information System]
Metric-Space Database as the Common Ground
A metric space is a pair, M = (D, d), where
• D is a set of points
• d is a [metric] distance function with the following properties:
– d(x, y) = d(y, x) (symmetry)
– d(x, y) > 0 for x ≠ y, d(x, x) = 0 (non-negativity)
– d(x, z) <= d(x, y) + d(y, z) (triangle inequality)
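As a small illustration of the three properties, the sketch below checks them by brute force on a finite sample, using Hamming distance over equal-length strings. The function names `hamming` and `is_metric_on` and the sample strings are illustrative assumptions, not part of MoBIoS.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))

def is_metric_on(points, d) -> bool:
    """Brute-force check of the metric properties on a finite sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):                 # symmetry
            return False
        if d(x, x) != 0:                       # d(x, x) = 0
            return False
        if x != y and d(x, y) <= 0:            # d(x, y) > 0 for distinct points
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):        # triangle inequality
            return False
    return True

sample = ["ACA", "CAA", "AAC", "ATC", "AAA"]
print(is_metric_on(sample, hamming))

# Squared difference is symmetric and non-negative but breaks the
# triangle inequality: (0 - 2)^2 = 4 > (0 - 1)^2 + (1 - 2)^2 = 2.
print(is_metric_on([0, 1, 2], lambda a, b: (a - b) ** 2))
```

A check like this only tests a sample; it can refute the metric properties for a distance function but cannot prove them in general.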
Definition - By Analogy
A Spatial Database Management System:
• Extend relational DBMS
– Special indexes for 2D and 3D data; k-d and R-trees
– New data types
• Geographic information systems
– Topographic maps
– Buildings and the like
A Metric-Space Database Management System:
• Extend relational DBMS
– Special indexes for metric spaces
– New data types
• Biological information system
– Life science data types
Develop index structures to support distance & nearest-neighbor queries
• Well studied in main memory
– But by no means a closed problem
• In databases (external/disk-based methods)
– Embryonic
– Many myths
• Often assumed to be the basis of multimedia database systems
How to build a metric-space index
• Three algorithmic classes [Tasan, Ozsoyoglu 04]
– Vantage points
– Hyperplanes
– Bounding spheres
Vantage Point Method [Burkhard&Keller73]
Choose a point, VP, and a radius, R
• Given VP and R, the predicates
– d(VP, x) < R
– d(VP, x) >= R
divide the set into two equal halves
• Apply recursively
Query, q, range r
If
• d(q, VP) > R + r
then
• all neighbors of q lie outside the sphere of radius R around VP
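The construction and the pruning rule can be sketched in a few lines of Python. This is a minimal illustration, not the MoBIoS index: it assumes the first point of each subset is taken as the vantage point and that R is the median distance to it, and the names `VPNode`, `build`, and `range_search` are hypothetical. The second pruning test (skip the outside half when d(q, VP) + r < R) is not on the slide; it follows from the triangle inequality in the same way.

```python
from dataclasses import dataclass
from statistics import median
from typing import Any, Callable, List, Optional

@dataclass
class VPNode:
    vp: Any                        # vantage point VP
    radius: float                  # R, here the median distance from VP
    inside: Optional["VPNode"]     # subtree of points with d(VP, x) < R
    outside: Optional["VPNode"]    # subtree of points with d(VP, x) >= R

def build(points: List[Any], d: Callable[[Any, Any], float]) -> Optional[VPNode]:
    """Recursively split the points around a vantage point."""
    if not points:
        return None
    vp, rest = points[0], points[1:]   # naive vantage-point choice (assumption)
    if not rest:
        return VPNode(vp, 0.0, None, None)
    dists = [d(vp, x) for x in rest]
    radius = median(dists)
    inside = [x for x, dx in zip(rest, dists) if dx < radius]
    outside = [x for x, dx in zip(rest, dists) if dx >= radius]
    return VPNode(vp, radius, build(inside, d), build(outside, d))

def range_search(node, q, r, d, hits=None):
    """Collect every indexed point x with d(q, x) <= r."""
    if hits is None:
        hits = []
    if node is None:
        return hits
    dq = d(q, node.vp)
    if dq <= r:
        hits.append(node.vp)
    # If d(q, VP) > R + r, all neighbors are outside the sphere: skip "inside".
    if dq <= node.radius + r:
        range_search(node.inside, q, r, d, hits)
    # If d(q, VP) + r < R, all neighbors are inside the sphere: skip "outside".
    if dq + r >= node.radius:
        range_search(node.outside, q, r, d, hits)
    return hits

def abs_diff(a, b):
    """Toy metric on numbers, used only for the example."""
    return abs(a - b)

points = [3, 10, 7, 1, 15, 8, 20, 12]
tree = build(points, abs_diff)
print(sorted(range_search(tree, 9, 2, abs_diff)))
```

Each pruned branch saves the distance computations for every point below it, which is where the index gains over a linear scan.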
Multi-vantage point method
• Consider d(VPi, x) a projection onto an axis
• Looks like a k-d tree
– Choose number k & d
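To make the projection idea concrete, the sketch below maps each point to the tuple of its distances to the vantage points and uses those coordinates to discard candidates. By the triangle inequality, |d(VPi, q) - d(VPi, x)| <= d(q, x) for every i, so the largest such difference is a lower bound on d(q, x). This is plain pivot filtering over a flat table, offered as an assumption-laden illustration rather than the MVP-tree used in MoBIoS; `project` and `range_query` are hypothetical names.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def project(x, pivots, d):
    """Coordinates of x: its distance to each vantage point."""
    return tuple(d(vp, x) for vp in pivots)

def range_query(q, r, data, pivots, d):
    """Return (matches, number of data points whose true distance was computed)."""
    # In a real index this table would be computed once, offline.
    table = [(x, project(x, pivots, d)) for x in data]
    cq = project(q, pivots, d)
    matches, verified = [], 0
    for x, cx in table:
        # Lower bound on d(q, x) from the projected coordinates.
        bound = max(abs(a - b) for a, b in zip(cq, cx))
        if bound > r:
            continue               # cannot be within range; skip d(q, x)
        verified += 1
        if d(q, x) <= r:
            matches.append(x)
    return matches, verified

data = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA", "GGG", "GGT"]
pivots = ["AAA", "GGG"]
matches, verified = range_query("ACA", 1, data, pivots, hamming)
print(matches, verified, len(data))
```

The filter never drops a true match; it only reduces how many full distance computations are needed.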
Myths
• Solved problem; M-trees [Ciaccia et al. 96, 97]
– I can’t get them to work on anything but their original synthetic data generator
• Good choice for vantage points is to find corners [Yianilos 93] (farthest-first clustering)
– Might be true for Euclidean spaces
– Early result, not true for our data
• High-dimensional indexing always asymptotically reduces to linear scans
– Formal result based on an assumption of uniform data distributions
Comparison of Three Methods of Metric-Space Indexing
[Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT. Panel 1: number of distance calculations vs. radius (0 to 10) for RBT, GHT, and MVPT. Panel 2: number of I/Os vs. radius (0 to 10) for RBT, GHT, and MVPT.]
Open problems
• Is there a general metric-space index structure that is good for most workloads?
– We are optimistic that MVP trees, with further tuning, will be a useful answer
– Hyperplane methods are fair game; there is circumstantial evidence that they are a key component in Google’s search engine
• No work addresses clustering data pages on disk
• Metric-space join algorithms
Biological Models are Usually Based on Similarity
Similarity
• Biologists like scoring functions that reward each similar feature with a positive number
• Intuitive
Distance
• More similar → smaller numbers
• Identical → 0
But Do Metric Models Capture Biology?
• Metrics are a subset of possible mathematical models
Sequence Problem 1
Sequence similarity based on weighted edit distance
Accepted weight matrices, PAM & BLOSUM, are not metric
Log-odds matrices: negative values
Defy simple algebraic normalization [Taylor & Jones 93, Linial et al. 97]
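A toy calculation shows why a log-odds score cannot serve directly as a distance. The two-letter alphabet and all probabilities below are invented for illustration and are not PAM or BLOSUM values; `log_odds` is a hypothetical helper that computes log(P(b|a) / q_b).

```python
import math

def log_odds(p_b_given_a: float, q_b: float) -> float:
    """Log-odds score: log of observed substitution rate over background rate."""
    return math.log(p_b_given_a / q_b)

# Invented toy values: background frequencies and substitution probabilities.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.9, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.9}

score = {(a, b): log_odds(p[(a, b)], q[b]) for (a, b) in p}

for pair, s in sorted(score.items()):
    print(pair, round(s, 3))
```

Identical letters get a positive score instead of 0, and an unlikely substitution gets a negative score, so both d(x, x) = 0 and non-negativity fail before the triangle inequality is even considered.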
Our First Result: mPAM [Xu & Miranker 04]
Dayhoff et al.’s PAM Derivation [74]
• Took a set of closely related protein sequences
• Developed a phylogenetic tree
• Counted substitutions to transform one sequence to another
• Tree determines a measure of time
PAM vs. mPAM: t = 1/f
Using original substitution counts
PAM: frequency of substitution
$S(a, b \mid t) = \log \dfrac{P(b \mid a, t)}{q_b}$
mPAM: expected time between substitutions
$D(a, b) = 1 / \log\bigl(1 - \sum_x P(a, x)\,P(b, x)\bigr)$
Sequence Problem 2
• Sequences are long units (identity for storage and retrieval)
– Genes
– Chromosomes
• Analysis comprises comparing small substrings
Soln: Sequence View
• New view type
• Breaks sequences into q-grams
CREATE SEQUENCEVIEW rice_sview AS
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
Materialize as an Index
Genomes
Rowid   Seq
R1      ACAACA
R2      ATCAAA
R3      …

Rowid   Offset   Logical Fragment
R1      1        ACA
R1      2        CAA
R1      3        AAC
R1      4        ACA
…       …        …
R2      1        ATC
R2      2        TCA
R2      3        CAA
R2      4        AAA
…       …        …
Index node labels: D(ACA) ≤ 1; D(CAA) ≤ 0; D(ATC) ≤ 1; D(AAA) ≤ 2
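The fragment table can be reproduced with a short sketch. It assumes that the arguments `3, 1` in `CREATE FRAGMENTS` mean a fragment length of 3 and a step of 1, which is what the fragment rows suggest, and that offsets are 1-based. It also assumes row R1 holds `ACAACA`, the sequence implied by its four fragments. The function name `fragments` is hypothetical.

```python
def fragments(seq: str, q: int = 3, step: int = 1):
    """Return (offset, q-gram) pairs with 1-based offsets."""
    return [(i + 1, seq[i:i + q]) for i in range(0, len(seq) - q + 1, step)]

# Assumed contents of the Genomes table (R1 inferred from its fragments).
genomes = {"R1": "ACAACA", "R2": "ATCAAA"}

# Materialize the logical view as (rowid, offset, fragment) rows.
sequence_view = [(rowid, offset, frag)
                 for rowid, seq in genomes.items()
                 for offset, frag in fragments(seq)]

for row in sequence_view:
    print(*row)
```

Because each row keeps its rowid and offset, a fragment that matches a query can be traced back to its position in the original sequence.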
Status
• Started with McKoi
– A Java open source object-relational DBMS
– (Think of Postgres written in Java)
• Added
– Biological data types
– Metric-space index
– Extending SQL engine (in progress)
Computed in MoBIoS: Compare Arabidopsis Genome × Rice Genome
1. Locate nucleotide patterns of the form of a primer pair candidate: 18 matching nucleotides, then a gap 400 to 3000 long (in both Rice and Arab.), then 18 matching nucleotides
2. Eliminate non-unique primer candidates
3. Merge overlapping primer candidates
• Usual implementations O(n^2), n = 10^9
mSQL Query to locate candidate primer pairs
SELECT merge(R1.fragment, A1.fragment)
FROM
  G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE
  distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0 AND
  distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND
  (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND
  (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND
  (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND
  (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
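The predicate of this query can be spelled out as a brute-force Python sketch on toy data. It is meant only to show what the WHERE clause selects, not how MoBIoS evaluates it: there is no index here, the fragment length and gap bounds are parameters scaled down for the example, and the names `qgrams` and `candidate_primer_pairs` are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def qgrams(seq, q):
    """(1-based offset, fragment) pairs for every q-gram of seq."""
    return [(i + 1, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_primer_pairs(rice, arab, q, max_dist, min_gap, max_gap):
    """Return (r1, r2, a1, a2) offsets satisfying the query's WHERE clause."""
    # Fragment pairs across the two genomes within the Hamming threshold.
    matches = [(ro, ao)
               for ro, rf in qgrams(rice, q)
               for ao, af in qgrams(arab, q)
               if hamming(rf, af) <= max_dist]
    pairs = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            # Offset differences must fall in the allowed gap range in both genomes.
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                pairs.append((r1, r2, a1, a2))
    return pairs

# Toy sequences: two shared 6-mers separated by different filler.
rice = "ACGTAC" + "TTTT" + "GGCATG"
arab = "ACGTAC" + "CCCCC" + "GGCATG"
pairs = candidate_primer_pairs(rice, arab, q=6, max_dist=1, min_gap=8, max_gap=12)
print((1, 11, 1, 12) in pairs)
```

The nested loops make the quadratic cost of a naive evaluation visible, which is the cost the sequence-view index is intended to avoid.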
Query Plan
Inputs: Arab. Genome, O(n); Rice Genome, O(m)
Offline: Build Sequence View, O(n log n)
Compare, O(m log n): Indexed Nested Loop
Eliminate Duplicates
Eliminate Low-Complexity Primers (LZ compression)
Merge Overlapping Primers
Result: ~10,000 conserved primer pair candidates
Preliminary Results
• Found 13,418 possible primer pairs from MoBIoS
• 100 best candidates BLASTed for matches in GenBank
– 15 matched other plant genes and the primers
– At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis
MoBIoS Architecture (Molecular Biological Information System)
Analysing Mass-Spectra
Spectrum = Histogram of mass/charge ratios of a collection of peptides
Similarity = Shared peaks count = Inner product
(0100101) • (0111100) = 2
Cosine distance approximates the inner product
$D_{rs} = 1 - \dfrac{x_r x_s'}{(x_r' x_r)^{1/2}\,(x_s' x_s)^{1/2}}$
We have shown that MoBIoS can store and retrieve mass spectra using cosine distance, and that it scales
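The shared-peaks count and the cosine distance can be checked on the example vectors with a short sketch. The function names `shared_peaks` and `cosine_distance` are illustrative assumptions; spectra are represented as plain lists of 0/1 peak indicators.

```python
import math

def shared_peaks(x, y):
    """Inner product; for 0/1 peak vectors this counts the shared peaks."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 minus the cosine of the angle between x and y."""
    norm = math.sqrt(shared_peaks(x, x)) * math.sqrt(shared_peaks(y, y))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1 - shared_peaks(x, y) / norm

r = [0, 1, 0, 0, 1, 0, 1]   # (0100101)
s = [0, 1, 1, 1, 1, 0, 0]   # (0111100)

print(shared_peaks(r, s))
print(cosine_distance(r, s))
```

Normalizing by the vector lengths keeps a spectrum with many peaks from scoring well simply because it has more chances to share a peak.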
mSQL Query for Protein Identification by Mass-Spec.
Signature Database Look
SELECT Prot.accession_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE
  MS.enzyme = DS.enzyme = E and
  Cosine_Distance(S, MS.spectrum, range1) and
  DS.accession_id = MS.accession_id = Prot.accession_id and
  DS.ms_peak = P and MPAM250(PS, DS.sequence, range2);
Matching Electrostatic Shape of Molecules
Still benefit from grid services:
• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10^6
• Rational drug design: O(log n) finite element solutions to traverse the search tree
• Make a service call to the grid for these operations only
• Mirror data contents to minimize I/O
• Since the need is intermittent, one grid serves many MoBIoS servers
[Diagram: the GRID mirrors the DB contents of the MoBIoS server over high-speed I/O. Labeled calls: recluster → new index; shape match (FEM) → distance (real).]
Hyperplanes [Uhlmann 91]
• If d(x, h1) < d(x, h2) then x is assigned to h1; otherwise x is assigned to h2
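A one-level sketch of the hyperplane rule follows. The assignment step is the rule stated above; the pruning test in `range_search` is an addition derived from the triangle inequality (the h1 side can be skipped when d(q, h1) - d(q, h2) > 2r, and symmetrically for h2), so treat it as an assumption rather than the slide's content. The names `partition` and `range_search` and the toy data are hypothetical.

```python
def partition(points, h1, h2, d):
    """Assign each point to the closer of the two pivots h1 and h2."""
    side1, side2 = [], []
    for x in points:
        (side1 if d(x, h1) < d(x, h2) else side2).append(x)
    return side1, side2

def range_search(q, r, h1, h2, side1, side2, d):
    """Return all points within r of q, skipping a side when it cannot match."""
    d1, d2 = d(q, h1), d(q, h2)
    hits = []
    if d1 - d2 <= 2 * r:           # otherwise no point on the h1 side is in range
        hits += [x for x in side1 if d(q, x) <= r]
    if d2 - d1 <= 2 * r:           # otherwise no point on the h2 side is in range
        hits += [x for x in side2 if d(q, x) <= r]
    return hits

def abs_diff(a, b):
    """Toy metric on numbers, used only for the example."""
    return abs(a - b)

points = [1, 2, 4, 6, 9, 13, 14, 18]
h1, h2 = 2, 14
side1, side2 = partition(points, h1, h2, abs_diff)
print(side1, side2)
print(sorted(range_search(3, 2, h1, h2, side1, side2, abs_diff)))
```

Applying `partition` recursively to each side, with fresh pivots at every level, yields a tree in the style of a generalized hyperplane tree.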
Develop a Hierarchical Clustering
Hierarchy of bounding spheres, (center, radius)
• Bounding spheres may overlap
• Inspired by R-trees
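A hierarchy of bounding spheres can be searched with one test per node: a query ball (q, r) can contain a point of a sphere only if d(q, center) <= radius + r. The sketch below is a minimal illustration with hand-built leaves, not the clustering procedure used in MoBIoS; the class `Sphere` and the helpers `leaf`, `parent`, and `range_search` are hypothetical names, and centers are chosen naively.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Sphere:
    center: Any
    radius: float
    points: List[Any] = field(default_factory=list)        # set on leaves
    children: List["Sphere"] = field(default_factory=list)  # set on inner nodes

def leaf(points, d):
    """Leaf sphere: first point as center, radius covering all points."""
    center = points[0]
    return Sphere(center, max(d(center, x) for x in points), points=list(points))

def parent(children, d):
    """Inner sphere covering every child sphere (children may overlap)."""
    center = children[0].center
    radius = max(d(center, c.center) + c.radius for c in children)
    return Sphere(center, radius, children=list(children))

def range_search(node, q, r, d):
    """Return all points within r of q, pruning spheres the query cannot reach."""
    if d(q, node.center) > node.radius + r:
        return []
    hits = [x for x in node.points if d(q, x) <= r]
    for child in node.children:    # overlapping children may all be visited
        hits.extend(range_search(child, q, r, d))
    return hits

def abs_diff(a, b):
    """Toy metric on numbers, used only for the example."""
    return abs(a - b)

root = parent([leaf([1, 2, 4], abs_diff),
               leaf([5, 9, 3], abs_diff),     # overlaps the first leaf
               leaf([20, 22], abs_diff)], abs_diff)
print(sorted(range_search(root, 4, 1, abs_diff)))
```

Because spheres may overlap, a query can descend into several children at the same level, which is the same trade-off R-trees make with overlapping rectangles.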