Life Sciences: Data Revolution. Building Gene Expression Databases. Session: 40382. Mahendra Navarange, Microarray Centre, MRC Clinical Sciences Centre and Imperial College, UK.


Page 1: Life Sciences:  Data Revolution

Life Sciences: Data Revolution

Building Gene Expression Databases

Session : 40382

Microarray Centre

MRC Clinical Sciences Centre and Imperial College, UK

Mahendra Navarange

Page 2: Life Sciences:  Data Revolution

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition process and data characteristics

System requirements

Design issues

Code snippets

Page 3: Life Sciences:  Data Revolution

What is Life Sciences?

Includes
– Biology
– BioTechnology
– Chemistry
– Pharmaceuticals
– Agriculture / Plant Science
– Environmental Sciences
– ????

Objective
– Understand the molecular and evolutionary basis of living organisms

Page 4: Life Sciences:  Data Revolution

Focus Areas

Genomics
– Human Genome Project
– Draft published in 2000
– Finished version on 14 April 2003
– Sequencing data doubles every year

Transcriptomics
– Study of transcription (gene expression)

Proteomics
– Study of translation (protein synthesis)

Courtesy F. Hoffmann-La Roche Ltd.

Page 5: Life Sciences:  Data Revolution

Data…Data…Data

Sanger Centre 5TB

Celera ~ 100TB+ (2001)

[Chart: data volume in TB (axis 0 to 700) by year, 1999 to 2007]

Page 6: Life Sciences:  Data Revolution

Data Revolution in Life Sciences

Impact of technology

High throughput platforms (HTP)
– Robotics
– Miniaturisation

Data driven science

Datawarehousing technologies

Data mining and visualisation software

Life Sciences Information Technology

Page 7: Life Sciences:  Data Revolution

Databases

Genomics
– Sanger
– NCBI
– TIGR
– KEGG

Transcriptomics
– ArrayExpress

Proteomics
– Protein Data Bank (PDB)
– SWISSPROT
– Entrez

Page 8: Life Sciences:  Data Revolution

Using Life Sciences Data

Identify causes of genetic diseases

Discover new drug compounds

Personalised medicine

Develop new diagnostics

Target Identification

Target Validation

HTP Screening

Hits

Leads

Clinical Trials

FDA

Drug Discovery Pipeline

Page 9: Life Sciences:  Data Revolution

Life Sciences: The Future

“...biology is changing from a purely laboratory-based science to an information-based science.”

Eric Lander, Director, Whitehead Institute, MIT

Page 10: Life Sciences:  Data Revolution

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition process and data characteristics

System requirements

Design issues

Code snippets

Page 11: Life Sciences:  Data Revolution

Transcriptomics

Comparing gene expression across databases

Collaborate to share expertise

Benefits
– Diagnostics
– Screen target drug compounds
– Identify toxic side effects
– Screen patients for clinical trials

Page 12: Life Sciences:  Data Revolution

Workflow

Experiment design

HTP Data

Preliminary Analysis

Further Analysis

Collaboration

NCBI

GO

Local DB

Literature

Page 13: Life Sciences:  Data Revolution

HTP Microarray Platform: Hardware

Courtesy Affymetrix Inc., Dell Inc

Page 14: Life Sciences:  Data Revolution

Microarray Data Acquisition

Courtesy Affymetrix Inc.

Courtesy Fisher Scientific

Page 15: Life Sciences:  Data Revolution

Microarray Data

High density microarray

~500,000 spots of ~18 µm size

>20,000 genes

Typical file size: 45 MB

No. of files produced in a typical experiment: 10-20

Courtesy Affymetrix Inc.
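
As a back-of-the-envelope check on the figures above, a typical experiment of 10 to 20 files at about 45 MB each comes to roughly 450 to 900 MB. The sketch below just performs that multiplication; the class and method names are illustrative and not part of MiMiR.

```java
/** Rough storage estimate for one microarray experiment, using the slide's figures. */
class ExperimentVolume {
    // Typical file size quoted on the slide, in MB.
    static final int FILE_SIZE_MB = 45;

    /** Total size in MB for an experiment that produces the given number of files. */
    static int totalMb(int fileCount) {
        return FILE_SIZE_MB * fileCount;
    }

    public static void main(String[] args) {
        // The slide quotes 10 to 20 files per typical experiment.
        System.out.println("Low estimate:  " + totalMb(10) + " MB");
        System.out.println("High estimate: " + totalMb(20) + " MB");
    }
}
```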

Page 16: Life Sciences:  Data Revolution
Page 17: Life Sciences:  Data Revolution

Life Sciences Data Explosion

Data Characteristics
– Image data generated by HTP platforms, annotation by researchers
– Large volume and size
– Varied data types

Datawarehousing challenges
– Non-summarisable
– High dimensionality
– Limited knowledge of underlying biological processes
– No standard industry data models or best practices

Page 18: Life Sciences:  Data Revolution

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition process and data characteristics

System requirements

Design issues

Code snippets

Page 19: Life Sciences:  Data Revolution

System Requirements

Seamless data integration

Handle wide range of datatypes

Processor intensive and I/O intensive

Exponential growth in data storage

Open architecture, collaboration

Page 20: Life Sciences:  Data Revolution

System Requirements

Rapid changes – new databases, technologies and instruments

Competitive pressures, quick response, low access times

Plug and play capability

Security

Page 21: Life Sciences:  Data Revolution

MiMiR: Microarray Data Mining Resource

MiMiR – Microarray Datawarehouse
– ~250 GB, expected to double in the next few months
– ~2500 images, over 1500 BioAssays
– 52 tables, largest table 15 GB

Infrastructure
– Oracle 9i Release 1 on Windows 2000
– Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk
– 1 TB NAS capacity
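
The figures on this slide invite a quick capacity check. The sketch below counts how many doublings of the ~250 GB warehouse fit before it outgrows the 400 GB disk or the 1 TB NAS. The slide gives no doubling period beyond "the next few months", so the code counts doublings rather than months. Treating 1 TB as 1024 GB is an assumption, as are the class and method names; how the data is actually split between local disk and NAS is not stated.

```java
/** Counts doublings before a growing data volume outgrows a storage capacity. */
class GrowthEstimate {
    /** Number of doublings until a volume starting at startGb strictly exceeds capacityGb. */
    static int doublingsToExceed(double startGb, double capacityGb) {
        int doublings = 0;
        double size = startGb;
        while (size <= capacityGb) {
            size *= 2;      // one doubling period passes
            doublings++;
        }
        return doublings;
    }

    public static void main(String[] args) {
        double currentGb = 250;   // current warehouse size from the slide
        System.out.println("Doublings to outgrow 400 GB disk: "
                + doublingsToExceed(currentGb, 400));
        // Assumption: 1 TB taken as 1024 GB.
        System.out.println("Doublings to outgrow 1 TB NAS:    "
                + doublingsToExceed(currentGb, 1024));
    }
}
```

Under these assumptions the local disk is outgrown after one doubling and the NAS after three.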

Page 22: Life Sciences:  Data Revolution

Requirements vs. Solutions

Integrate different types of data sources
– Use of XML for data exchange
– Use of Oracle Ultra Search

Efficient data retrieval
– Stringent response time standards on procedures
– Index-Organized Tables, Partitioning

Security
– Firewall
– Single Sign-On servers (in progress)

Rapid change management
– BC4J framework, JDeveloper
– Extreme programming, prototyping

Page 23: Life Sciences:  Data Revolution

MiMiR System Architecture

MiMiR

Images

Annotation

Spot Info

Ext Ref

Blast

MAGE-ML

Application Server

JDeveloper 9iAS Admin

XSQL XSU XDK BC4J JClient

JSP

ArrayExpress

Private

Page 24: Life Sciences:  Data Revolution

Oracle Products Used

Oracle 9i Database Server/Client (Release 1)
– Partitioning
– Join indexing

Oracle 9i JDeveloper (9.0.2)

Oracle 9i Application Server (BC4J)

Oracle XML features
– Oracle PL/SQL packages for XML
– Oracle XSQL publishing framework
– XDK (DOMParser and SAXParser)
– XSU

Oracle Data Mining (Future)

Oracle Collaboration Suite (Future)

Page 25: Life Sciences:  Data Revolution

Why Oracle?

Readily scalable

Manage wide variety of data types

Integrated development tools

Support XML and Java

High performance middleware

Secure collaboration

Page 26: Life Sciences:  Data Revolution

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data acquisition and profiling

System requirements

Design issues

Code snippets

Page 27: Life Sciences:  Data Revolution

Oracle and XML: Design Issues

Storage
– Storing XML in tables
– Storing XML in CLOBs
– Hybrid

Generation
– XDK for Java, PL/SQL
– XSU

Transformation
– XSL Stylesheet
– Views

Processing
– XDK DOMParser
– XDK SAXParser

Searching
– XPath
– Oracle Text

Publishing
– XSQL publishing framework
– XSL
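
The Processing item contrasts two parsing styles: a DOM parser builds the whole document in memory, while a SAX parser reports elements as events while it reads. The sketch below shows the difference by counting ROW elements both ways. It uses the JDK's standard JAXP parsers as stand-ins for the Oracle XDK DOMParser and SAXParser (an assumption made so that the example runs without Oracle libraries), and the sample document, class name and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Counts ROW elements with a DOM parser and with a SAX parser. */
class RowCounter {
    // Hypothetical sample document in a ROWSET/ROW layout.
    static final String SAMPLE =
          "<ROWSET>"
        + "<ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW>"
        + "<ROW num=\"2\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW>"
        + "</ROWSET>";

    private static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** DOM: the whole tree is built in memory before it can be queried. */
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(stream(xml));
            return doc.getElementsByTagName("ROW").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** SAX: elements are seen one at a time; no tree is kept. */
    static int countWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("ROW".equals(qName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("DOM count: " + countWithDom(SAMPLE));
        System.out.println("SAX count: " + countWithSax(SAMPLE));
    }
}
```

Both methods return the same count. The practical difference is memory: the DOM version holds every node at once, which matters for large result sets, while the SAX version keeps only a counter.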

Page 28: Life Sciences:  Data Revolution

Oracle and XML: XSQL Example

    <?xml version="1.0" encoding='windows-1252'?>
    <!--
    | Uncomment the following processing instruction and replace
    | the stylesheet name to transform output of your XSQL Page using XSLT
    <?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
    -->
    <?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
    <xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
      select * from array
    </xsql:query>
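
The XSQL page above pairs a query with a stylesheet: the query result is produced as XML and the stylesheet reshapes it for display. The sketch below illustrates only that second step, using the JDK's standard XSLT API rather than the XSQL framework itself. The contents of mimirArray.xsl and the columns of the array table are not given in the slides, so the stylesheet, the ARRAY_NAME column and the sample rows are all hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Applies an XSLT stylesheet to a ROWSET-style query result. */
class RowsetToText {
    // Hypothetical query result; the real columns of the array table are not given.
    static final String ROWSET =
          "<ROWSET>"
        + "<ROW num=\"1\"><ARRAY_NAME>A1</ARRAY_NAME></ROW>"
        + "<ROW num=\"2\"><ARRAY_NAME>A2</ARRAY_NAME></ROW>"
        + "</ROWSET>";

    // Hypothetical stylesheet standing in for mimirArray.xsl:
    // emits each ARRAY_NAME followed by a semicolon, as plain text.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/ROWSET\">"
        + "<xsl:for-each select=\"ROW\">"
        + "<xsl:value-of select=\"ARRAY_NAME\"/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet over the XML and returns the transformed text. */
    static String transform(String xml, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(ROWSET, STYLESHEET));
    }
}
```

Keeping the query and the presentation in separate files, as the XSQL page does, means the same result can be rendered differently by swapping only the stylesheet.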

Page 29: Life Sciences:  Data Revolution

Oracle and XML: Design Issues

Page 30: Life Sciences:  Data Revolution

Agenda

What is Life Sciences?

MiMiR: database for gene expression data

Data profiling

System requirements

Design issues

Code snippets

Page 31: Life Sciences:  Data Revolution

An Example

Creating XML from 500,000 records in the database

Page 32: Life Sciences:  Data Revolution

Solution 1

Using the XSU Java API to get an XML DOM.

    import java.sql.Connection;
    import oracle.xml.parser.v2.XMLDocument;
    import oracle.xml.sql.query.OracleXMLQuery;

    // createConnection is the application's own connection helper
    Connection conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // Builds the whole result set as an in-memory DOM tree
    XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    xmlDoc.print(out);

Page 33: Life Sciences:  Data Revolution

Solution 2

Using the XSU Java API to get an XML string.

    import java.sql.Connection;
    import oracle.xml.sql.query.OracleXMLQuery;

    // createConnection is the application's own connection helper
    Connection conn = createConnection.createConnection();
    String query = "SELECT * FROM IMAGE_QUANTITATION i "
                 + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
    OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
    q1.keepCursorState(true);
    // The DOM calls from Solution 1 are disabled:
    // XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
    // xmlDoc.print(out);
    System.out.println(q1.getXMLString());

Page 34: Life Sciences:  Data Revolution

Solution 3

Using the dbms_xmlquery package to get XML output from SQL

    select dbms_xmlquery.getXML(
      'select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa'''
    ) from dual

    <?xml version = '1.0'?>
    <ROWSET>
      <ROW num="1">
        <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
        <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
        <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
        <POSITIVE>2</POSITIVE>
        <NEGATIVE>5</NEGATIVE>
        <PAIRS>20</PAIRS>
        <PAIRS_USED>20</PAIRS_USED>
        <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
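
For a result of 500,000 records, how the XML is materialised matters: a DOM or a single string holds the whole document in memory at once. As an illustration of an alternative, and not one of the three solutions in these slides, the sketch below writes the same ROWSET/ROW layout one row at a time with the JDK's StAX writer. The in-memory row maps stand in for a JDBC result set, and the class and method names are hypothetical.

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

/** Streams rows out as ROWSET/ROW XML without building a document tree. */
class RowsetStreamer {
    /** Writes each row (column name to value) as a ROW element numbered from 1. */
    static void write(Iterable<Map<String, String>> rows, Writer out) {
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            w.writeStartElement("ROWSET");
            int num = 0;
            for (Map<String, String> row : rows) {
                w.writeStartElement("ROW");
                w.writeAttribute("num", Integer.toString(++num));
                for (Map.Entry<String, String> col : row.entrySet()) {
                    w.writeStartElement(col.getKey());
                    w.writeCharacters(col.getValue());
                    w.writeEndElement();
                }
                w.writeEndElement();   // closes ROW
            }
            w.writeEndElement();       // closes ROWSET
            w.writeEndDocument();
            w.close();
        } catch (Exception e) {
            throw new IllegalStateException("XML streaming failed", e);
        }
    }

    /** Small demonstration with one row; values are taken from the sample output. */
    static String demo() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHIP_TYPE", "MG-U74Av2");
        row.put("PAIRS", "20");
        StringWriter out = new StringWriter();
        write(Collections.singletonList(row), out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because each row is written and then forgotten, memory use stays flat as the row count grows. In a real application the Writer would wrap a file or a network stream rather than a StringWriter.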

Page 35: Life Sciences:  Data Revolution

Summary

Life sciences is generating enormous amounts of data using HTP

The data is non-summarisable, distributed and has varied data types

Data integration and secure collaboration are key to success

MiMiR

Page 36: Life Sciences:  Data Revolution

Acknowledgements

Dr. Helen Causton

Prof. Tim Aitman

Dr. Laurence Game

Helen Banks

Nicola Cooley

Vihar Wadekar

Helen Figueira

MGED Data Society (www.mged.org)

Page 37: Life Sciences:  Data Revolution

Session : 40382

Contact: [email protected]

http://microarray.csc.mrc.ac.uk

Life Sciences: Data Revolution

Building Gene Expression Databases

What Next: Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery