Life Sciences: Data Revolution

Building Gene Expression Databases

Session : 40382

Microarray Centre

MRC Clinical Sciences Centre and Imperial College, UK

Mahendra Navarange

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition process and data characteristics

System requirements

Design issues

Code snippets

What is Life Sciences?

Includes
– Biology
– Biotechnology
– Chemistry
– Pharmaceuticals
– Agriculture / Plant Science
– Environmental Sciences
– ????

Objective
– Understand the molecular and evolutionary basis of living organisms

Focus Areas

Genomics
– Human Genome Project
– Draft published in 2000
– Finished version on 14 April 2003
– Sequencing data doubles every year

Transcriptomics
– Study of transcription (gene expression)

Proteomics
– Study of translation (protein synthesis)

Courtesy F. Hoffmann-La Roche Ltd.

Data…Data…Data

Sanger Centre 5 TB

Celera ~100 TB+ (2001)

[Chart: data volume in TB (axis 0 to 700) by year, 1999 to 2007]

Data Revolution in Life Sciences

Impact of technology
– High throughput platforms (HTP)
  – Robotics
  – Miniaturisation

Data driven science
– Datawarehousing technologies
– Data mining and visualisation software

Life Sciences Information Technology

Databases

Genomics
– Sanger
– NCBI
– TIGR
– KEGG

Transcriptomics
– ArrayExpress

Proteomics
– Protein Data Bank (PDB)
– SWISSPROT

Entrez

Using Life Sciences Data

– Identify causes of genetic diseases
– Discover new drug compounds
– Personalised medicine
– Develop new diagnostics

[Diagram: Drug Discovery Pipeline. Target Identification → Target Validation → HTP Screening → Hits → Leads → Clinical Trials → FDA]

Life Sciences: The Future

“…biology is changing from a purely laboratory-based science to an information-based science.”

Eric Lander, Director, Whitehead Institute, MIT

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition process and data characteristics




Transcriptomics

– Comparing gene expression across databases
– Collaborate to share expertise

Benefits
– Diagnostics
– Screen target drug compounds
– Identify toxic side effects
– Screen patients for clinical trials

Workflow

[Diagram labels: Experiment design; HTP Data; Preliminary Analysis; Further Analysis; Collaboration; NCBI; GO; Local DB; Literature]

HTP Microarray Platform: Hardware

Courtesy Affymetrix Inc., Dell Inc

Microarray Data Acquisition

Courtesy Affymetrix Inc.

Courtesy Fisher Scientific

Microarray Data

High density microarray
– ~500,000 spots of ~18 µm size
– >20,000 genes
– Typical file size 45 MB
– No. of files produced in a typical experiment: 10-20

Courtesy Affymetrix Inc.

Life Sciences Data Explosion

Data Characteristics
– Image data generated by HTP platforms, annotation by researchers
– Large volume and size
– Varied data types

Datawarehousing challenges
– Non-summarisable
– High dimensionality
– Limited knowledge of underlying biological processes
– No standard industry data models or best practices

Agenda

MiMiR: database for gene expression data

Data acquisition process and data characteristics




System Requirements

– Seamless data integration
– Handle wide range of datatypes
– Processor intensive and I/O intensive
– Exponential growth in data storage
– Open architecture, collaboration

System Requirements

– Rapid changes: new databases, technologies and instruments
– Competitive pressures, quick response, low access times
– Plug and play capability
– Security

MiMiR: MIcroarray Data MIning Resource

MiMiR – Microarray Datawarehouse
– ~250 GB, expected to double in the next few months
– ~2500 images, over 1500 BioAssays
– 52 tables, largest table 15 GB

Infrastructure
– Oracle 9i Release 1 on Windows 2000
– Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk
– 1 TB NAS capacity

Requirements vs. Solutions

Integrate different types of data sources
– Use of XML for data exchange
– Use of Oracle UltraSearch

Efficient data retrieval
– Stringent response time standards on procedures
– Index-Organized Tables, Partitioning

Security
– Firewall
– Single Sign-On servers (in progress)

Rapid change management
– BC4J framework, JDeveloper
– Extreme programming, prototyping

MiMiR System Architecture

[Diagram labels: MiMiR; Images; Annotation; Spot Info; Ext Ref; Blast; MAGE-ML; Application Server; JDeveloper; 9iAS Admin; XSQL; XSU; XDK; BC4J; JClient; JSP; ArrayExpress; Private]

Oracle Products Used

– Oracle 9i Database Server/Client (Release 1)
  – Partitioning
  – Join indexing
– Oracle 9i JDeveloper (9.0.2)
– Oracle 9i Application Server (BC4J)
– Oracle XML features
  – Oracle PL/SQL packages for XML
  – Oracle XSQL publishing framework
  – XDK (DOMParser and SAXParser)
  – XSU
– Oracle Data Mining (Future)
– Oracle Collaboration Suite (Future)

Why Oracle?

– Readily scalable
– Manage wide variety of data types
– Integrated development tools
– Support XML and Java
– High performance middleware
– Secure collaboration

Agenda

MiMiR: database for gene expression data

Data acquisition and profiling

System requirements



Oracle and XML: Design Issues

Storage
– Storing XML in tables
– Storing XML in CLOBs
– Hybrid

Generation
– XDK for Java, PL/SQL
– XSU

Transformation
– XSL Stylesheet
– Views

Processing
– XDK DOMParser
– XDK SAXParser

Searching
– XPath
– Oracle Text

Publishing
– XSQL publishing framework
– XSL
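The two Processing options differ in how much of a document they hold in memory: a DOM parser builds the whole tree first, while a SAX parser reports elements as they stream past. A minimal sketch of that difference follows. It assumes the JDK's built-in JAXP parsers as stand-ins for the Oracle XDK DOMParser and SAXParser classes, and a ROWSET/ROW document shape as produced by XSU. The class and method names are illustrative and are not part of MiMiR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: JDK JAXP parsers stand in for the Oracle XDK parsers.
public class RowCounter {

    private static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** DOM: the whole document becomes an in-memory tree before it is queried. */
    public static int countRowsDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(stream(xml));
            return doc.getElementsByTagName("ROW").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** SAX: each start tag is handled as it streams past; no tree is kept. */
    public static int countRowsSax(String xml) {
        final int[] rows = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("ROW".equals(qName)) {
                    rows[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return rows[0];
    }

    public static void main(String[] args) {
        String xml = "<ROWSET>"
                + "<ROW num=\"1\"><PAIRS>20</PAIRS></ROW>"
                + "<ROW num=\"2\"><PAIRS>16</PAIRS></ROW>"
                + "</ROWSET>";
        System.out.println("DOM rows: " + countRowsDom(xml));
        System.out.println("SAX rows: " + countRowsSax(xml));
    }
}
```

Both methods return the same count. The trade-off is that the DOM version allows random access to the tree afterwards, whereas the SAX version keeps memory use roughly constant regardless of how many rows the document contains.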

Oracle and XML: XSQL Example

<?xml version="1.0" encoding='windows-1252'?>
<?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
<xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
  select * from array
</xsql:query>
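The XSQL page above runs the query and passes the resulting XML to the stylesheet named in the xml-stylesheet processing instruction. The stylesheet step on its own can be sketched with the JDK's javax.xml.transform API. This is a sketch under assumptions: the inline stylesheet is a hypothetical stand-in for mimirArray.xsl, which is not shown on the slides, the input is assumed to have the default ROWSET/ROW shape, and the CHIP_TYPE column is borrowed purely for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: applies a small XSL stylesheet to ROWSET/ROW XML.
public class RowsetToHtml {

    // Hypothetical stand-in for mimirArray.xsl: one list item per ROW.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/ROWSET\">"
        + "<ul>"
        + "<xsl:for-each select=\"ROW\">"
        + "<li><xsl:value-of select=\"CHIP_TYPE\"/></li>"
        + "</xsl:for-each>"
        + "</ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms ROWSET XML into a list fragment using the stylesheet above. */
    public static String transform(String rowsetXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(rowsetXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ROWSET><ROW num=\"1\">"
                + "<CHIP_TYPE>MG-U74Av2</CHIP_TYPE>"
                + "</ROW></ROWSET>";
        System.out.println(transform(xml));
    }
}
```

Keeping the presentation in a stylesheet means the same query output can be rendered differently for different consumers without touching the SQL.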

Oracle and XML: Design Issues

Agenda

MiMiR: database for gene expression data

Data profiling

System requirements

Design issues


An Example

Creating XML from 500,000 records in the database

Solution 1

Using the XSU Java API to get an XML DOM.

import java.sql.Connection;
import oracle.xml.parser.v2.XMLDocument;
import oracle.xml.sql.query.OracleXMLQuery;

// createConnection appears to be the application's own connection helper
Connection conn = createConnection.createConnection();
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// The whole result set is built as an in-memory DOM tree
XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
xmlDoc.print(out);  // out: the output stream or writer in scope

Solution 2

Using the XSU Java API to get an XML string.

Connection conn = createConnection.createConnection();
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// Replaced by getXMLString() below:
// XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
// xmlDoc.print(out);
System.out.println(q1.getXMLString());

Solution 3

Using the dbms_xmlquery package to get XML output from SQL

select dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') from dual

<?xml version = '1.0'?>
<ROWSET>
  <ROW num="1">
    <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
    <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
    <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
    <POSITIVE>2</POSITIVE>
    <NEGATIVE>5</NEGATIVE>
    <PAIRS>20</PAIRS>
    <PAIRS_USED>20</PAIRS_USED>
    <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
    ...
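Solutions 1 and 2 both materialise the complete result (as a DOM tree or as a String) before anything is written out, which matters at 500,000 records. One alternative, not taken from the slides, is to write each row as soon as it is read. The sketch below uses the JDK's StAX XMLStreamWriter rather than XSU, and emits the same ROWSET/ROW shape as the output above. The class name, the method name and the use of a Map per row are assumptions for illustration; in practice the rows would come from a JDBC ResultSet.

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Illustrative only: streams rows to XML without building a DOM or a full String first.
public class RowsetStreamer {

    /**
     * Writes rows as <ROWSET><ROW num="n">...</ROW></ROWSET>, one row at a time.
     * Each map entry becomes a child element named after its key.
     * Returns the number of rows written.
     */
    public static int writeRowset(Iterable<Map<String, String>> rows, Writer out) {
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("ROWSET");
            int rowNumber = 0;
            for (Map<String, String> row : rows) {
                writer.writeStartElement("ROW");
                writer.writeAttribute("num", Integer.toString(++rowNumber));
                for (Map.Entry<String, String> column : row.entrySet()) {
                    writer.writeStartElement(column.getKey());
                    writer.writeCharacters(column.getValue());
                    writer.writeEndElement();
                }
                writer.writeEndElement(); // ROW
            }
            writer.writeEndElement(); // ROWSET
            writer.writeEndDocument();
            writer.flush();
            writer.close();
            return rowNumber;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHIP_TYPE", "MG-U74Av2");
        row.put("PAIRS", "20");
        StringWriter out = new StringWriter();
        writeRowset(Collections.singletonList(row), out);
        System.out.println(out);
    }
}
```

The demonstration writes to a StringWriter for convenience. Passing a file or response writer instead keeps only the current row in memory.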

Summary

– The life sciences are generating enormous amounts of data using HTP
– The data is non-summarisable, distributed and has varied data types
– Data integration and secure collaboration are key to success
– MiMiR

Acknowledgements

– Dr. Helen Causton
– Prof. Tim Aitman
– Dr. Laurence Game
– Helen Banks
– Nicola Cooley
– Vihar Wadekar
– Helen Figueira
– MGED Data Society (www.mged.org)


Contact: [email protected]

http://microarray.csc.mrc.ac.uk


What Next: Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery
