The MoBIoS Project: Molecular Biological Information System

Daniel P. Miranker

Dept. of Computer Sciences &

Center for Computational Biology and Bioinformatics

University of Texas

Weijia Xu, Rui Mao, Will Briggs, Smriti Ramakrishnan, Shu Wang, Lulu Zhang

Problem:

In the life sciences, database management systems (DBMS) serve as glorified file managers.

Little use of sophisticated data and pattern-based retrieval

Real scientific and technological problems

When biological data is put into an RDBMS

• Primary data is stored in text or blob fields

– Annotations may be relational

• Data retrieval: filter DB, sequential dump, O(n), to utilities

• E.g., BLAST

Organism   Function   Sequence
Yeast      membrane   AACCGGTTT
Yeast      mitosis    TATCGAAA
E. coli    membrane   AGGCCTA

Linear Data Scans, O(n), Endemic in Life Sciences

• Sequences: DNA, RNA, protein databases

• Mass spectra: proteomics

• Small molecules & protein structure: protein interaction, rational drug design

• Pathways (graphs)

• Phylogenies (graphs, trees in particular)

Scope: To Find Common Ground, Both Biology and DBMSs Have to Move

Biological Information System

Metric-Space Database as the Common Ground

A metric space is a pair, M = (D, d), where D is a set of points and d is a [metric] distance function with the following properties:

• d(x, y) = d(y, x) (symmetry)

• d(x, y) > 0 for x ≠ y, and d(x, x) = 0 (non-negativity)

• d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
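The three properties can be checked empirically on a finite sample of points. The following is a minimal sketch, not part of MoBIoS: the names `hamming` and `is_metric` are illustrative, and passing the check on a sample shows only that no violation was found in that sample.

```python
from itertools import product

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def is_metric(points, d) -> bool:
    """Check symmetry, non-negativity and the triangle inequality on a sample."""
    for x, y in product(points, repeat=2):
        if d(x, y) != d(y, x):                      # symmetry
            return False
        if d(x, y) < 0 or (x == y) != (d(x, y) == 0):  # non-negativity
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):             # triangle inequality
            return False
    return True

sample = ["ACA", "CAA", "AAC", "ATC", "TCA", "AAA"]
hamming_ok = is_metric(sample, hamming)
# Squaring a metric generally breaks the triangle inequality.
squared_ok = is_metric(sample, lambda a, b: hamming(a, b) ** 2)
print(hamming_ok, squared_ok)
```

On this sample the Hamming distance passes, while its square fails: ACA and CAA are at squared distance 4, but each is at squared distance 1 from AAA.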

Definition - By Analogy

A Spatial Database Management System:

• Extend relational DBMS

• Special indexes for 2D and 3D data; k-d and R-trees

• New data types

• Geographic information systems

– Topographic maps

– Buildings and the like

A Metric-Space Database Management System:

• Extend relational DBMS

• Special indexes for metric spaces

• New data types

• Biological information system

– Life science data types

Develop index structures to support distance & nearest-neighbor queries

• Well studied in main memory

– But by no means a closed problem

• In databases (external/disk-based methods)

– Embryonic

– Many myths

• Often assumed to be the basis of multimedia database systems

How to build a metric-space index

• Three algorithmic classes [Tasan, Ozsoyoglu 04]

– Vantage points

– Hyperplanes

– Bounding spheres

Vantage Point Method [Burkhard&Keller73]

Choose a point, VP

and a radius, R

• Given VP and R, the predicates

– d(VP,x) < R

– d(VP,x) ≥ R

divide the set into two halves (equal halves when R is the median distance from VP)

• Apply recursively

Query q, range r

• If d(q,VP) > R + r, then all neighbors are outside the sphere
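The construction and the pruning rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MoBIoS implementation: the vantage point is picked at random, R is taken as the median distance, and the names `VPNode`, `build`, and `range_query` are hypothetical. The sketch also uses the symmetric rule (skip the outside half when d(q,VP) + r < R), which follows from the triangle inequality in the same way.

```python
import random
from dataclasses import dataclass
from statistics import median
from typing import Callable, List, Optional

@dataclass
class VPNode:
    vp: object                     # vantage point
    radius: float                  # R: median distance from vp to the other points
    inside: Optional["VPNode"]     # points with d(vp, x) < R
    outside: Optional["VPNode"]    # points with d(vp, x) >= R

def build(points: List, d: Callable, rng: random.Random) -> Optional[VPNode]:
    if not points:
        return None
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))  # random choice is an assumption
    if not points:
        return VPNode(vp, 0.0, None, None)
    dists = [d(vp, x) for x in points]
    r = median(dists)
    inside = [x for x, dx in zip(points, dists) if dx < r]
    outside = [x for x, dx in zip(points, dists) if dx >= r]
    return VPNode(vp, r, build(inside, d, rng), build(outside, d, rng))

def range_query(node: Optional[VPNode], q, r: float, d: Callable, out: List) -> List:
    """Collect every indexed point x with d(q, x) <= r."""
    if node is None:
        return out
    dq = d(q, node.vp)
    if dq <= r:
        out.append(node.vp)
    if dq - r < node.radius:       # otherwise all neighbors are outside the sphere
        range_query(node.inside, q, r, d, out)
    if dq + r >= node.radius:      # otherwise all neighbors are inside the sphere
        range_query(node.outside, q, r, d, out)
    return out

dist = lambda a, b: abs(a - b)     # toy metric: points on a line
pts = list(range(0, 100, 3))
tree = build(pts, dist, random.Random(0))
hits = sorted(range_query(tree, 50, 4, dist, []))
print(hits)
```

The result should match a linear scan over `pts`; the index only reduces how many distance computations are needed to get there.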

Multi-vantage point method

• Consider d(VPi, x) a projection onto an axis

• Looks like a k-d tree

– Choose number k & d

• Solved problem; M-trees [Ciaccia et al. 96, 97]

– I can’t get them to work on anything but their original synthetic data generator

• Good choice for vantage points is to find corners [Yianilos93] (farthest-first clustering)

– Might be true for Euclidean spaces

– Early result, not true for our data

• High-dimensional indexing always asymptotically reduces to linear scans.

– Formal result based on an assumption of uniform data distributions.

Comparison of Three Methods of Metric-Space Indexing

[Figure 9. Comparison of metric-space index structures: RBT, GHT, and VPT. Two panels, each plotted against query radius (0 to 10) for RBT, GHT, and MVPT: number of distance calculations, and number of I/Os.]

Open problems

• Is there a general metric-space index structure that is good for most workloads?

– We are optimistic that MVP trees, with further tuning, will be a useful answer

– Hyperplane methods are fair game; there is circumstantial evidence that they are a key component of Google’s search engine.

• No work addresses clustering data pages on disk.

• Metric-space join algorithms

Biological Models are Usually Based on Similarity

Similarity

• Biologists like scoring functions that reward each similar feature with a positive number

• Intuitive

Distance:

• More similar: smaller numbers

• Identical: 0

But Do Metric Models Capture Biology?

• Metrics are a subset of possible mathematical models

Sequence Problem 1

Sequence similarity based on weighted edit distance

Accepted weight matrices, PAM & BLOSUM, are not metric

Log-odds matrices: negative values

Defy simple algebraic normalization [TaylorJones93, Linialetal97]

Our First Result: mPAM [Xu&Miranker04]

Dayhoff et al.’s PAM Derivation [74]

• Took a set of closely related protein sequences

• Developed a phylogenetic tree

• Counted substitutions to transform one sequence to another

• Tree determines a measure of time

PAM vs. mPAM: t = 1/f

Using original substitution counts

PAM: frequency of substitution

$S(a,b \mid t) = \log \dfrac{P(b \mid a,t)}{q_b}$

mPAM: expected time between substitutions

$D(a,b) = \dfrac{1}{\log\left(1 - \sum_x P(a,x)\,P(b,x)\right)}$

Sequence Problem 2

• Sequences are long units (identity for storage and retrieval)

– Genes

– Chromosomes

• Analysis comprises comparing small substrings

Solution: Sequence View

• New view type

• Breaks sequences into q-grams

create SEQUENCEVIEW rice_sview as
SELECT CREATE FRAGMENTS (…, 3, 1)
FROM …
WHERE …
USING HAMMING-DISTANCE
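What the view computes can be sketched in a few lines. This is an illustration only: it assumes the arguments `3, 1` mean a fragment length of 3 and a step of 1 between consecutive fragments, which the slide does not state explicitly, and the function names `fragments` and `hamming` are hypothetical.

```python
from typing import Iterator, Tuple

def fragments(seq: str, q: int = 3, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (1-based offset, q-gram) for every window of length q."""
    for start in range(0, len(seq) - q + 1, step):
        yield start + 1, seq[start:start + q]

def hamming(a: str, b: str) -> int:
    """Mismatch count between two fragments; assumes len(a) == len(b)."""
    return sum(x != y for x, y in zip(a, b))

rows = list(fragments("ATCAAA"))
print(rows)   # [(1, 'ATC'), (2, 'TCA'), (3, 'CAA'), (4, 'AAA')]
print(hamming("CAA", "AAA"))   # 1
```

Each (offset, fragment) pair corresponds to one row of the materialized view, and the Hamming distance is the metric the index is built on.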

Materialize as an Index

Genomes

Rowid   Seq
R1      CAACA
R2      ATCAAA
R3      …

Rowid   Offset   Logical Fragment
R1      1        ACA
R1      2        CAA
R1      3        AAC
R1      4        ACA
…       …        …
R2      1        ATC
R2      2        TCA
R2      3        CAA
R2      4        AAA
…       …        …

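[Figure: index tree over the fragments. Node labels include D(ACA), D(CAA), D(ATC), and D(AAA), with distance bounds ≤ 0, ≤ 1, and ≤ 2.]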

Status

• Started with McKoi

– A Java open source object-relational DBMS

– (Think of Postgres written in Java)

• Added

– Biological data types

– Metric-space index

– Extending SQL engine (in progress)

Computed in MoBIoS: Compare Arabidopsis Genome × Rice Genome

1. Locate nucleotide patterns of the form (primer pair candidate):

18 matching nucleotides, then a gap 400 – 3000 long (in both rice and Arabidopsis), then 18 matching nucleotides

2. Eliminate non-unique primer candidates

3. Merge overlapping primer candidates

• Usual implementations are O(n²), n = 10⁹

mSQL Query to locate candidate primer pairs

SELECT merge(R1.fragment, A1.fragment)
FROM G1_sview R1, G1_sview R2, G2_sview A1, G2_sview A2
WHERE distance('HAMMINGDISTANCE', R1.fragment, A1.fragment) <= 1.0 AND
      distance('HAMMINGDISTANCE', R2.fragment, A2.fragment) <= 1.0 AND
      (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) >= 400 AND
      (FRAGOFFSET(R2.fragment)-FRAGOFFSET(R1.fragment)) <= 3000 AND
      (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) >= 400 AND
      (FRAGOFFSET(A2.fragment)-FRAGOFFSET(A1.fragment)) <= 3000
GROUP BY R1.fragment, A1.fragment;
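The semantics of the query can be illustrated with a brute-force sketch on toy data. It is deliberately naive (every fragment of one sequence is compared with every fragment of the other) and is not how MoBIoS evaluates the query. The function names, the 0-based offsets, and the small parameter values in the example are assumptions chosen so the toy input produces a result.

```python
from typing import List, Tuple

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def qgrams(seq: str, q: int) -> List[Tuple[int, str]]:
    """All (0-based offset, fragment) pairs of length q."""
    return [(i, seq[i:i + q]) for i in range(len(seq) - q + 1)]

def candidate_primer_pairs(rice: str, arab: str, q: int, max_mismatch: int,
                           min_gap: int, max_gap: int):
    """Return (rice_off1, arab_off1, rice_off2, arab_off2) tuples."""
    # Near-matching fragments between the two sequences (the distance predicates).
    matches = [(i, j)
               for i, f in qgrams(rice, q)
               for j, g in qgrams(arab, q)
               if hamming(f, g) <= max_mismatch]
    # Pair up matches whose offset differences fall in the allowed range
    # in both sequences (the FRAGOFFSET predicates).
    out = []
    for r1, a1 in matches:
        for r2, a2 in matches:
            if min_gap <= r2 - r1 <= max_gap and min_gap <= a2 - a1 <= max_gap:
                out.append((r1, a1, r2, a2))
    return out

pairs = candidate_primer_pairs("ACGTTTTTGGCA", "ACGTAAAAAGGCA",
                               q=4, max_mismatch=0, min_gap=6, max_gap=10)
print(pairs)
```

With the slide's parameters the call would use `q=18`, `max_mismatch=1`, `min_gap=400`, and `max_gap=3000`; at genome scale the first step is where a metric-space index replaces the quadratic comparison.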

Query Plan

• Inputs: Arabidopsis genome, O(n); rice genome, O(m)

• Offline: build sequence view, O(n log n)

• Compare: O(m log n), indexed nested loop

• Eliminate duplicates

• Eliminate low-complexity primers (LZ compression)

• Merge overlapping primers

• ~10,000 conserved primer pair candidates

Preliminary Results

• Found 13,418 possible primer pairs from MoBIoS

• 100 best candidates BLASTed for matches in GenBank

– 15 matched other plant genes and the primers

– At least 2 of 15 showed potential after PCR amplification against Helianthus and Phalaenopsis.

MoBIoS Architecture (Molecular Biological Information System)

Analysing Mass-Spectra

Spectrum = Histogram of mass/charge ratios of a collection of peptides

Similarity = Shared peaks count = Inner Product

(0100101) • (0111100) = 2

Cosine distance approximates the inner product

$D_{rs} = 1 - \dfrac{x_r^{\top} x_s}{(x_r^{\top} x_r)^{1/2}\,(x_s^{\top} x_s)^{1/2}}$

Shown: store and retrieve mass spectra using cosine distance, and it scales
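A small worked example of the two measures on the binary spectra from the slide, (0100101) and (0111100). The function names are illustrative, and the spectra are treated as plain lists of 0/1 peak indicators.

```python
import math
from typing import Sequence

def shared_peaks(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product; for 0/1 spectra this counts the shared peaks."""
    return sum(a * b for a, b in zip(x, y))

def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 minus the cosine of the angle between the two spectra."""
    norm = math.sqrt(shared_peaks(x, x)) * math.sqrt(shared_peaks(y, y))
    if norm == 0:
        raise ValueError("cosine distance is undefined for an all-zero spectrum")
    return 1.0 - shared_peaks(x, y) / norm

r = [0, 1, 0, 0, 1, 0, 1]
s = [0, 1, 1, 1, 1, 0, 0]
print(shared_peaks(r, s))        # 2
print(cosine_distance(r, s))     # 1 - 2 / (sqrt(3) * sqrt(4)), about 0.4226
```

The inner product rewards shared peaks with a larger score, while the cosine distance normalizes by the spectrum lengths and gives 0 for identical spectra.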

mSQL Query for Protein Identification by Mass-Spec.

Signature Database Look

SELECT Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS, mass_spectra MS
WHERE MS.enzyme = DS.enzyme = E and
      Cosine_Distance(S, MS.spectrum, range1) and
      DS.accession_id = MS.accession_id = Prot.accesion_id and
      DS.ms_peak = P and MPAM250(PS, DS.sequence, range2);

Matching Electrostatic Shape of Molecules

Still benefit from grid services:

• Intermittently, but regularly, compile (recluster) the indices: O(n log n), n > 10⁶

• Rational drug design: O(log n) finite element solutions to traverse the search tree

• Make a service call to the grid for these operations only

• Mirror data contents to minimize I/O

• Since the need is intermittent, one grid serves many MoBIoS servers

[Figure: architecture diagram. Labels: GRID, Mirror DB-Contents, MoBIoS Server, recluster, New index, Shape match (FEM), Distance (real), High speed I/O.]

Hyper-planes [Uhlmann91]

• If d(x,h1) < d(x,h2) then x is assigned to h1
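The assignment rule, together with the pruning test that follows from the triangle inequality, can be sketched as below. This is an illustrative fragment under stated assumptions: the names `partition` and `sides_to_visit` are hypothetical, ties go to h2, and how the pivots h1 and h2 are chosen is left open.

```python
from typing import Callable, List, Tuple

def partition(points: List, h1, h2, d: Callable) -> Tuple[List, List]:
    """Assign each point to the closer pivot; ties go to h2 (an assumption)."""
    near_h1 = [x for x in points if d(x, h1) < d(x, h2)]
    near_h2 = [x for x in points if d(x, h1) >= d(x, h2)]
    return near_h1, near_h2

def sides_to_visit(q, r: float, h1, h2, d: Callable) -> List[str]:
    """Which halves can contain a point within distance r of q."""
    d1, d2 = d(q, h1), d(q, h2)
    sides = []
    if d1 - d2 < 2 * r:    # otherwise no point assigned to h1 can be within r of q
        sides.append("h1")
    if d2 - d1 <= 2 * r:   # otherwise no point assigned to h2 can be within r of q
        sides.append("h2")
    return sides

dist = lambda a, b: abs(a - b)   # toy metric: points on a line
left, right = partition(list(range(0, 30, 5)), 10, 20, dist)
print(left, right)                          # [0, 5, 10] [15, 20, 25]
print(sides_to_visit(5, 2, 10, 20, dist))   # ['h1']
print(sides_to_visit(15, 2, 10, 20, dist))  # ['h1', 'h2']
```

A query close to the boundary between the two pivots has to descend into both halves, which is why the pruning power depends on the query radius.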

Develop a Hierarchical Clustering

• Hierarchy of bounding spheres (center, radius)

• Bounding spheres may overlap

• Inspired by R-trees
