Session : 40382. Life Sciences: Data Revolution. Building Gene Expression Databases. Mahendra Navarange. Microarray Centre, MRC Clinical Sciences Centre and Imperial College, UK. Agenda. What is Life Science? MiMiR : database for gene expression data
Life Sciences: Data Revolution
Building Gene Expression Databases
Session : 40382
Microarray Centre
MRC Clinical Sciences Centre and Imperial College, UK
Mahendra Navarange
Agenda
What is Life Sciences?
MiMiR: database for gene expression data
– Data acquisition process and data characteristics
– System requirements
– Design issues
– Code snippets
What is Life Sciences?
Includes
– Biology
– Biotechnology
– Chemistry
– Pharmaceuticals
– Agriculture / Plant Science
– Environmental Sciences
– ????
Objective
– Understand the molecular and evolutionary basis of living organisms
Focus Areas
Genomics
– Human Genome Project
– Draft published in 2000
– Finished version on 14 April 2003
– Sequencing data doubles every year
Transcriptomics
– Study of transcription (gene expression)
Proteomics
– Study of translation (protein synthesis)
Courtesy F. Hoffmann-La Roche Ltd.
Data…Data…Data
Sanger Centre 5TB
Celera ~ 100TB+ (2001)
[Chart: data volume in TB (axis 0 to 700) by year, 1999 to 2007]
Data Revolution in Life Sciences
Impact of technology
– High throughput platforms (HTP): robotics, miniaturisation
Data driven science
– Data warehousing technologies
– Data mining and visualisation software
Life Sciences Information Technology
Databases
Genomics
– Sanger
– NCBI
– TIGR
– KEGG
Transcriptomics
– ArrayExpress
Proteomics
– Protein Data Bank (PDB)
– SWISSPROT
Entrez
Using Life Sciences Data
Identify causes of genetic diseases
Discover new drug compounds
Personalised medicine
Develop new diagnostics
[Diagram: Drug Discovery Pipeline. Target Identification, Target Validation, HTP Screening, Hits, Leads, Clinical Trials, FDA]
Life Sciences : The Future
“…biology is changing from a purely laboratory-based science to an information based science.”
Eric Lander,
Director, Whitehead Institute, MIT
Agenda
What is Life Sciences?
MiMiR: database for gene expression data
– Data acquisition process and data characteristics
– System requirements
– Design issues
– Code snippets
Transcriptomics
Comparing gene expression across databases
Collaborate to share expertise
Benefits
– Diagnostics
– Screen target drug compounds
– Identify toxic side effects
– Screen patients for clinical trials
Workflow
[Diagram: Experiment design, HTP Data, Preliminary Analysis, Further Analysis, Collaboration; linked sources: NCBI, GO, Local DB, Literature]
HTP Microarray Platform : Hardware
Courtesy Affymetrix Inc., Dell Inc
Microarray Data Acquisition
Courtesy Affymetrix Inc.
Courtesy Fisher Scientific
Microarray Data
High density microarray
~ 500,000 spots of ~18 µm size
>20,000 genes
Typical file size 45MB
No. of files produced in a typical experiment: 10-20
Courtesy Affymetrix Inc.
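The figures on the Microarray Data slide imply a per-experiment raw data volume, which can be sketched as simple arithmetic. The class and method names are illustrative, and the 100-experiment workload in main is a made-up planning scenario, not a number from the talk.

```java
// Back-of-envelope storage estimate using the slide's figures
// (45 MB per file, 10 to 20 files per experiment).
// Class and method names are illustrative, not part of MiMiR.
final class MicroarrayStorageEstimate {

    /** Raw data volume in MB for a given number of files of a given size. */
    static long volumeMb(int files, int fileSizeMb) {
        return (long) files * fileSizeMb;
    }

    public static void main(String[] args) {
        int fileSizeMb = 45; // typical file size quoted on the slide

        long low = volumeMb(10, fileSizeMb);  // lower bound: 10 files
        long high = volumeMb(20, fileSizeMb); // upper bound: 20 files
        System.out.println("Per experiment: " + low + " to " + high + " MB");

        // Hypothetical planning scenario: 100 experiments at the upper bound.
        long hundred = volumeMb(20 * 100, fileSizeMb);
        System.out.println("100 experiments: up to " + hundred + " MB");
    }
}
```

This counts raw instrument files only; annotation and derived tables add to the total.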
Life Sciences Data Explosion
Data Characteristics
– Image data generated by HTP platforms, annotation by researchers
– Large volume and size
– Varied data types
Datawarehousing challenges
– Non-summarisable
– High dimensionality
– Limited knowledge of underlying biological processes
– No standard industry data models or best practices
Agenda
What is Life Sciences?
MiMiR: database for gene expression data
– Data acquisition process and data characteristics
– System requirements
– Design issues
– Code snippets
System Requirements
Seamless data integration
Handle wide range of datatypes
Processor intensive and I/O intensive
Exponential growth in data storage
Open architecture, collaboration
System Requirements
Rapid changes – new databases, technologies and instruments
Competitive pressures, quick response, low access times
Plug and play capability
Security
MIcroarray Data MIning Resource
MiMiR – Microarray Datawarehouse
– ~250GB. Expected to double in next few months
– ~2500 images, over 1500 BioAssays
– 52 tables, largest table 15GB
Infrastructure
– Oracle 9i Release 1 on Windows 2000
– Dell PowerEdge Quad Processor, 2 GB memory, 400 GB hard disk
– 1 TB NAS capacity
Requirements vs. Solutions
Integrate different types of data sources
– Use of XML for data exchange
– Use of Oracle UltraSearch
Efficient data retrieval
– Stringent response time standards on procedures
– Index-organized tables, partitioning
Security
– Firewall
– Single Sign-On servers (in progress)
Rapid change management
– BC4J framework, JDeveloper
– Extreme programming, prototyping
MiMiR System Architecture
[Diagram labels: MiMiR; Images; Annotation; Spot Info; Ext Ref; Blast; MAGE-ML; Application Server; JDeveloper; 9iAS Admin; XSQL; XSU; XDK; BC4J; JClient; JSP; ArrayExpress; Private]
Oracle Products Used
Oracle 9i Database Server/Client (Release 1)
– Partitioning
– Join indexing
Oracle 9i JDeveloper (9.0.2)
Oracle 9i Application Server (BC4J)
Oracle XML features
– Oracle PL/SQL packages for XML
– Oracle XSQL publishing framework
– XDK (DOMParser and SAXParser)
– XSU
Oracle Data Mining (Future)
Oracle Collaboration Suite (Future)
Why Oracle?
Readily scalable
Manage wide variety of data types
Integrated development tools
Support XML and Java
High performance middleware
Secure collaboration
Agenda
What is Life Sciences?
MiMiR: database for gene expression data
– Data acquisition and profiling
– System requirements
– Design issues
– Code snippets
Oracle and XML : Design Issues
Storage
– Storing XML in tables
– Storing XML in CLOBs
– Hybrid
Generation
– XDK for Java, PL/SQL
– XSU
Transformation
– XSL Stylesheet
– Views
Processing
– XDK DOMParser
– XDK SAXParser
Searching
– XPath
– Oracle Text
Publishing
– XSQL publishing framework
– XSL
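To make the "Processing" and "Searching" options concrete, here is a minimal sketch of a DOM parse followed by an XPath search over a ROWSET/ROW document. It uses the JDK's standard JAXP classes as a stand-in for the Oracle XDK DOMParser, so it runs without an Oracle installation. The sample rows, the probe names other than AFFX-MurIL2_at, and all class and method names are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// DOM parse + XPath search with standard JAXP (not the Oracle XDK parser).
final class RowsetXPathSearch {

    // Hand-written sample in ROWSET/ROW layout; values are illustrative.
    static final String SAMPLE =
          "<ROWSET>"
        + "<ROW num=\"1\"><ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME><PAIRS_USED>20</PAIRS_USED></ROW>"
        + "<ROW num=\"2\"><ELE_SET_NAME>probe_b</ELE_SET_NAME><PAIRS_USED>12</PAIRS_USED></ROW>"
        + "<ROW num=\"3\"><ELE_SET_NAME>probe_c</ELE_SET_NAME><PAIRS_USED>20</PAIRS_USED></ROW>"
        + "</ROWSET>";

    /** Returns ELE_SET_NAME of every row whose PAIRS_USED equals the given value. */
    static List<String> namesWithPairsUsed(String xml, int pairsUsed) {
        try {
            // Build the full document tree in memory (the DOM approach).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Select matching rows with an XPath predicate.
            XPath xpath = XPathFactory.newInstance().newXPath();
            String expr = "/ROWSET/ROW[PAIRS_USED=" + pairsUsed + "]/ELE_SET_NAME";
            NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);

            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(hits.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namesWithPairsUsed(SAMPLE, 20)); // [AFFX-MurIL2_at, probe_c]
    }
}
```

The DOM route keeps the whole document in memory, which is convenient for ad hoc XPath queries but is the trade-off to watch for large result sets.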
Oracle and XML : XSQL Example
<?xml version="1.0" encoding='windows-1252'?>
<!--
| Uncomment the following processing instruction and replace
| the stylesheet name to transform output of your XSQL Page using XSLT
<?xml-stylesheet type="text/xsl" href="YourStylesheet.xsl" ?>
-->
<?xml-stylesheet type="text/xsl" href="mimirArray.xsl"?>
<xsql:query connection="micro" xmlns:xsql="urn:oracle-xsql">
  select * from array
</xsql:query>
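The XSQL page above runs a query and passes the resulting ROWSET document through a stylesheet. The transformation step on its own can be sketched with the JDK's javax.xml.transform API. The contents of mimirArray.xsl are not shown in the talk, so the stylesheet below is a hypothetical stand-in, and the columns and values of the sample rows are assumptions as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Applies an XSL stylesheet to a ROWSET document, as an XSQL page would do
// after running its query. Stylesheet and data are illustrative only.
final class RowsetStylesheetDemo {

    // Stand-in for query output; column name and values are assumptions.
    static final String ROWSET =
          "<ROWSET>"
        + "<ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE></ROW>"
        + "<ROW num=\"2\"><CHIP_TYPE>EXAMPLE-CHIP</CHIP_TYPE></ROW>"
        + "</ROWSET>";

    // Hypothetical stylesheet: one text line per row, "<num>: <chip type>".
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/ROWSET\">"
        + "<xsl:for-each select=\"ROW\">"
        + "<xsl:value-of select=\"@num\"/><xsl:text>: </xsl:text>"
        + "<xsl:value-of select=\"CHIP_TYPE\"/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the XML with the stylesheet and returns the result as a string. */
    static String render(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(ROWSET, STYLESHEET));
    }
}
```

Keeping presentation in the stylesheet means the same query output can be published in several formats without touching the SQL.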
Oracle and XML: Design Issues
Agenda
What is Life Sciences?
MiMiR: database for gene expression data
– Data profiling
– System requirements
– Design issues
– Code snippets
An Example
Creating XML from 500,000 records in the database
Solution 1
Using XSU Java API to get XMLDOM.
import java.sql.Connection;
import oracle.xml.parser.v2.XMLDocument;
import oracle.xml.sql.query.OracleXMLQuery;
Connection conn = createConnection.createConnection();  // site-specific helper returning a JDBC connection
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// Build the whole result as an in-memory DOM tree, then print it
XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
xmlDoc.print(System.out);
Solution 2
Using XSU Java API to get XMLString.
Connection conn = createConnection.createConnection();  // site-specific helper returning a JDBC connection
String query = "SELECT * FROM IMAGE_QUANTITATION i "
             + "WHERE QUANT_FILENAME = 'PMB2002011001Aaa'";
OracleXMLQuery q1 = new OracleXMLQuery(conn, query);
q1.keepCursorState(true);
// The DOM calls from Solution 1 are replaced by a single string call:
// XMLDocument xmlDoc = (XMLDocument) q1.getXMLDOM();
// xmlDoc.print(System.out);
System.out.println(q1.getXMLString());
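Solutions 1 and 2 both hold the complete result in memory, as a DOM tree or as one string, before printing it. For comparison, the sketch below writes the same ROWSET/ROW layout row by row to any Writer using the JDK's StAX XMLStreamWriter. This is not part of XSU and not the approach used in the talk. The in-memory list of rows stands in for a JDBC result set, and all names are illustrative.

```java
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Writes rows in ROWSET/ROW layout incrementally with StAX (JDK standard API).
// The rows come from memory here; in practice they would be read from a cursor.
final class RowsetStreamWriter {

    /** Emits one ROW element per map; map keys become column element names. */
    static void writeRowset(Iterable<Map<String, String>> rows, Writer out) {
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("ROWSET");
            int num = 1;
            for (Map<String, String> row : rows) {
                w.writeStartElement("ROW");
                w.writeAttribute("num", Integer.toString(num++));
                for (Map.Entry<String, String> column : row.entrySet()) {
                    w.writeStartElement(column.getKey());
                    w.writeCharacters(column.getValue()); // escapes &, < and > as needed
                    w.writeEndElement();
                }
                w.writeEndElement(); // ROW
            }
            w.writeEndElement(); // ROWSET
            w.flush();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the column order stable in the output.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("CHIP_TYPE", "MG-U74Av2");
        row.put("PAIRS", "20");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);

        StringWriter out = new StringWriter();
        writeRowset(rows, out);
        System.out.println(out);
    }
}
```

Because each row is written as soon as it is read, memory use stays flat when the Writer targets a file or a network stream rather than a string.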
Solution 3
Using dbms_xmlquery package to get XML output from SQL
SELECT dbms_xmlquery.getXML('select * from IMAGE_QUANTITATION where quant_filename=''PMB2002011001Aaa''') FROM dual;
<?xml version = '1.0'?>
<ROWSET>
  <ROW num="1">
    <IMAGE_ID>PMB2002011003Aaa</IMAGE_ID>
    <CHIP_TYPE>MG-U74Av2</CHIP_TYPE>
    <ELE_SET_NAME>AFFX-MurIL2_at</ELE_SET_NAME>
    <POSITIVE>2</POSITIVE>
    <NEGATIVE>5</NEGATIVE>
    <PAIRS>20</PAIRS>
    <PAIRS_USED>20</PAIRS_USED>
    <PAIRS_IN_AVG>19</PAIRS_IN_AVG>
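A consumer of output in this ROWSET/ROW layout does not have to load it into a DOM tree. The sketch below makes a single SAX pass that counts rows and totals one column, using the JDK's SAX parser in place of the XDK SAXParser. The sample document is a hand-completed version of the layout above; its second row and all class and method names are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Single-pass SAX scan over a ROWSET document (standard JAXP, not Oracle XDK).
final class RowsetSaxScan {

    /** Aggregates collected during one pass over the document. */
    static final class Totals {
        int rows;
        int pairsUsed;
    }

    // Well-formed sample; the second row is invented for the example.
    static final String SAMPLE =
          "<?xml version = '1.0'?>\n"
        + "<ROWSET>\n"
        + " <ROW num=\"1\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE><PAIRS_USED>20</PAIRS_USED></ROW>\n"
        + " <ROW num=\"2\"><CHIP_TYPE>MG-U74Av2</CHIP_TYPE><PAIRS_USED>18</PAIRS_USED></ROW>\n"
        + "</ROWSET>";

    /** Counts ROW elements and sums PAIRS_USED without building a tree. */
    static Totals scan(String xml) {
        final Totals totals = new Totals();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
                if ("ROW".equals(qName)) {
                    totals.rows++;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("PAIRS_USED".equals(qName)) {
                    totals.pairsUsed += Integer.parseInt(text.toString().trim());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
        return totals;
    }

    public static void main(String[] args) {
        Totals t = scan(SAMPLE);
        System.out.println(t.rows + " rows, PAIRS_USED total " + t.pairsUsed);
    }
}
```

SAX sees each element once and then discards it, so this style suits result sets on the scale of the 500,000-record example, at the cost of random access to the document.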
Summary
Life sciences is generating an enormous amount of data using HTP
The data is non-summarisable, distributed and has varied data types
Data integration and secure collaboration are key to success
MiMiR
Acknowledgements
Dr. Helen Causton
Prof. Tim Aitman
Dr. Laurence Game
Helen Banks
Nicola Cooley
Vihar Wadekar
Helen Figueira
MGED Data Society (www.mged.org)
Session : 40382
Contact: [email protected]
http://microarray.csc.mrc.ac.uk
Life Sciences: Data Revolution
Building Gene Expression Databases
What Next : Opportunities for collaboration for development of Knowledge Management Systems for Drug Discovery