49
Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Embed Size (px)

Citation preview

Page 1: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Databases for Microarrays

Vidhya JagannathanSIB, Lausanne

Page 2: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Overview Microarray data in a nutshell Why databases? What data to represent? What is a database? Different data models E-R modelling Microarray Databases Standards being developed

Page 3: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Microarray Experiment

Page 4: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Microarray Data in a Nutshell

Lots of data to be managed before and after the experiment.

Data to be stored before the experiment . Description of the array and the sample. Direct access to all the cDNA and gene sequences,

annotations, and physical DNA resources. Data to be stored after the experiment

Raw Data - scanned images. Gene Expression Matrix - Relative expression levels observed

on various sites on the array. Hence we can see that database software capable of

dealing with larger volumes of numeric and image data is required.

Page 5: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Why Databases?

Tailored to datatype Tailored to the Scientists Intuitive ways to query the data

Diagrams, forms, point and click, text etc. Support for efficient answering of queries.

Query optimisation, indexes, compact physical storage.

Page 6: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Data Representation Goal: Represent data in an intuitive and

convenient manner Without unnecessary replication of information Making it easy to write queries to find required

information Supporting efficient retrieval of required information

Page 7: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

What is a Database? A database is an organised collection of pieces of

structured electronic information. Example 1: Libraires use a database system to keep track of

library inventory and loans. Example 2: All airlines use database system to manage their

flights and reservations. The collection of records kept for a common purpose such as

these is known as a database. The records of the database normally reside on a

hard disk and the records are retrieved into computer memory only when they are accessed.

So the reasons are obvious why we need to discuss about a Microarray database.

Page 8: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Data Models Describes a container for data and

methods to store and retrieve data from that container.

Abstract math algorithms and concepts. Cannot touch a data model. Very useful

Page 9: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Types of Data Models

Ad-hoc file formats (not really data models!) Relational data model Object-relational data model Object-oriented data model XML (Extensible Markup Language)

Page 10: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Ad-hoc File Formats The various 'ad-hoc' file formats in use for

microarray data are: Flat file formats. Spread sheet formats. Not the least - Even MS-Word documents !!!

Very rudimentary method to store data . Sometimes contains redundant information. Extremely inefficient for retrieval of particular

subsets of the results.

Page 11: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Relational Data Model Most prevalent and used in many databases

developed today.

The collection of related information is represented as a set of tables.

Data value is stored in the intersection of row and column

Column values are of the same kind. A Simple data validation.

Rows are unique. So no data redundancy and every row is meaningful and can be identified by the unique key.

Utilises Structured Query Language (SQL) for data storage, retrival and manipulation.

Page 12: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

CONTIG_STRANDCONTIG_ENDCONTIG_STARTCONTIG_IDGENE_ID

Complement 23210722308745NT_0106051.3GB2VN32

Complement23607782354807 NT_0106058.3GB2VN

GENE

Example

Table

Row or Record

Field or Column

Page 13: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Example

CONTIG_STRANDCONTIG_ENDCONTIG_BEGINCONTIG_IDGENE_ID

Complement 23210722308745NT_0106051.3GB2VN32

Complement23607782354807 NT_0106058.3GB2VN

GENE

Molecular function

TYPEDESCRIPTIONGENE_IDCLASS_ID

DNA bindingGB2VN32GO:0003677

GeneMSH(Drosophila)GB2VNMSX2

CLASSIFICATION

Page 14: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Advantages of Relational Model Allows information to be broken up into logical

units and stored in tables. Allows combining data from different tables in

different ways to derive useful information. Great for queries involving information from

multiple original sources. Can easily gather related information.

e.g. information about a particular gene from multiple datasets/experiments

Page 15: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Object Oriented Model Object Oriented Model allows real world data

to be represented as objects. Objects encapsulate the data and provide

methods to access or manipulate it. Objects with specific structure and set of

methods are said to belong to the object class.

Allows new classes to be created by extending the description of the parent class.

Child classes inherit the data and methods of the parent class.

Page 16: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Example

DNA Protein

InheritsInherits

ProteinProteinDNADNA

get_bio_seq()Biomolecule

. String seq

OODBMS

Page 17: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Example - ArrayExpress

External links

ArrayExpress

Experiment

e.g., publication, webresource

Referencee.g., organism

taxonomy

Ontology

Sample Array

e.g., gene inSWISS-prot

Database

Hybridization

ExpressionValue

Page 18: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Database Design Entity-Relationship Concept

Entity ARelationship

Entity B

Examples

Page 19: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Entities are real world objects

ex: gene contain attributes

ex: gene_id, sequence are drawn as rectangle boxes that holds the

name of the entity and attribute in two different notations as there is no standard!

Gene

gene_idsequence

Genegene_idsequence

notation 1 notation 2

Page 20: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Relationship Relationships provide connections between two or more

entities ex: Which genes were used in which experiment

When two entities are involved in a relationship, it is known as binary relationship.

When three entities are invoved in a relationship, it is called as ternary relationship.

When more than three entities are involved in a relationship, it is usually broken in to one or more binary or ternary relationships.

are drawn as a line linking the involved entities as:

Gene Experimentused_in

Page 21: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Experiment Experiment-Id Date Image

Experimenter Experimenter-Id Name E-mail Dept. Institution

Sample Sample-Id Organism Cell-type {Drug-Ids}

Array Array-Id Manufacturer Type Batch

Gene gene-id sequence Expt-Exptr

Expt-Sample

Expt-Array

Expression-valuevalue

* 1

Many-to-one

Notation

Multivalued attribute

Example E-R Diagram

Page 22: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Object relational data model

Improved relational model by adding some features from object data models.

Information is represented as in relational models but column values not restricted to one mutliple values are allowed.

Example (sample table in previous slides):

sampe-id organism celltype drug_id

d1 d2

s001 ecoli c123 ac1 nm

ac3 nms002 ecoli c123

Page 23: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Queries, queries, queries!! Given a collection of microarray

generated gene expression data, what kind of questions the users wish to pose.

Constructing an extensive list of possible interesting queries and data mining problems that has to be supported by the database will facilitate the design process.

Page 24: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Query to the data Which genes are linked ? Which genes are expressed similarly to my gene

XYZ? Which genes have a changed the expression in a

second condition ? Which genes are co-expressed in differing

conditions ? classification (of tumors, diseased tissues etc.):

which patterns are characteristic for a certain class of samples, which genes are involved?

Queries, queries, queries!!

Page 25: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

More Queries !!! Queries that add a link in additional

knowledge functional classification of genes: Are changes

clustered in particular classes? metabolic pathway information: Is a certain

pathway/route in a pathway affected? disease information & clinical follow up: correlation

to expression patterns. phenotype information for mutants: Are there

correlations between particular phenotypes and expression patterns?

Page 26: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

More Queries !!! in what region is the interesting gene located

in the genome? is there synteny in this region with other

species? is there a known trait that maps to this

region?

Page 27: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Query Language Language in which user requests information

from the database. SQL

• Data definition helps you implement your model and data manipulation helps you modify and retrive data

Advantages: • Can specify query declaratively and let database

system figure out best way of finding answers• Supports queries of medium complexity• Specialized languages • SQL language statements are not abstract but

very close to spoken language.

Page 28: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Basic SQL Queries Find the image for experiment number 1345

select image from experimentwhere experiment-id = 1345;

Find the experiment-id and image of all experiments involving e-coli select experiment-id, image from experiment,

sample where experiment.sample-id = sample.sample-id and sample.organism = `e.coli‘;

All combinations of rows from the relations in the from clause are considered, and those that satisfy the where conditions are output

Page 29: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Interfacing

SQL queries are carried out on terminal screen which is not very useful and user friendly for an end user, so applications are created to interface more friendly staments with the SQL statements A web form is a typical example of interface for SQL Applications for data loading.

More complex queries (e.g. data mining such as classification and clustering) are very imporatant part of the Microarray Analyis Protocol

It is very important to interface the various applications we use to analyse the retrieved data with database.

Page 30: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Page 31: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Gene Expression Databases Require Integration

There are many different types of data presenting numerous relationships.

There are a number of Databases with lots of information.

Experiments need to be compared because the experiments are very difficult to perform and very expensive.

Solution: Make all the databases talk the same language.

XML was the choice of data interchange format.

Page 32: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Why XML? Why XML ?: XML provides the method for defining the

meaning or semantics of data. Example : A XML file of the earlier table we defined

<gene_features> <gene_id>GBVN32</gene_id> <contig_id>NT_010651</contig_id> <contig_start>2354807</contig_start> <contig_end>2360778</contig_end> <contig_strand>Complement</contig_strand></gene_features>

Page 33: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Mapping XML to Relational Database The Data Structure in XML is defined in Document Type

Descrciptor as follows <!ELEMENT gene_id (#PCDATA)><!ELEMENT contig_id (#PCDATA)><!ELEMENT contig_start (#PCDATA)><!ELEMENT contig_end (#PCDATA)><!ELEMENT contig_sequence (#PCDATA)>

This kind of DTD also helps us to have control over the vocabulary used.

SQL: create table gene ( gene_id varchar(5) primary key, contig_id varchar(10) not null,contig_start integer not null, contig_end integer not null, contig_sequence text not null);

So the DTD can be directly mapped into a relational database.

Page 34: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

MAGE-ML As Data Interchage Format

Databases

Expression Data

Converter (program)

MAGE-ML

Page 35: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Existing Microarray Databases Several gene expression databases exist:Both

commercial and non-commercial. Most focus on either a particular technolgy or a

particular organism or both. Commercial databases:

Rosetta Inpharmatics and Genelogic, the specifics of their internal structure is not available for internal scrutiny due to their proprietary nature.

Some non-commercial efforts to design more general databases merit particular mention.

We will discuss few of the most promising ones ArrayExpress - EBI The Gene expression Omnibus (GEO) - NLM The Standford microarray Database ExpressDB - Harvard Genex - NCGR

Page 36: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Page 37: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

ArrayExpress Public repository of microarray based gene

expression data. Implemented in Oracle at EBI. Contains:

several curated gene expression datasets possible introduction of an image server to archive raw

image data associated with the experiments. Accepts submissions in MAGE-ML format via a web-

based data annotation/submission tool called MIAMExpress. A demo version of MIAMExpress is available at:

http://industry.ebi.ac.uk/~parkinso/subtool/subtype.html Provides a simple web-based query interface and is directly

linked to the Expression Profiler data analysis tool which allows expression data clustering and other types of data exploration directly through the web.

Page 38: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Gene Express Omnibus The Gene Expression Omnibus ia a gene expression

database hosted at the National library of Medicine It supports four basic data elements

Platform ( the physical reagents used to generate the data) Sample (information about the mRNA being used) Submitter ( the person and organisation submitting the data) Series ( the relationship among the samples).

It allows download of entire datasets, it has not ability to query the relationships

Data are entered as tab delimited ASCII records,with a number of columns that depend on the kind of array selected.

Supports Serial Analysis of Gene Expression (SAGE) data.

Page 39: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Stanford Microarray Database Contains the largest amount of data. Uses relational database to answer queries. Associated with numerious clustering and

analysis features. Users can access the data in SMD from the

web interface of the package. Disadvantage :

It supports only Cy3/Cy5 glass slide data It is designed to exclusively use an oracle

database Has been recently released outside without

anykind of support !!

Page 40: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

MaxdSQL Minor changes to the ArrayExpress object data

model allowed it to be instantiated as a relational database, and MaxdSQL is the resulting implementation.

MaxdSQL supports both Spotted and Affymetrix data and not SAGE data.

MaxdSQL is associated with the maxdView, a java suite of analysis and visualisation tools.This tool also provides an environment for developing tools and intergrating existing software.

MaxdLoad is the data-loading application software.

Page 41: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Page 42: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

GeneX Open source database and integrated tool set

released by NCGR http://www.ncgr.org. Open source - provides a basic infrastructure upon

which others can build. Stores numeric values for a spot measurement

(primary or raw data), ratio and averaged data across array measurements.

Includes a web interface to the database that allow users to retrieve: Entire datasets, subsets Guided queries for processing by a particular analysis routine Download data in both tab delimited form and GeneXML

format ( more descriptions later)

Page 43: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

Page 44: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

ExpressDB ExpressDB is a relational database containing yeast

and E.coli RNA expression data. It has been conceived as an example on how to

manage that kind of data. It allows web-querying or SQL-querying. It is linked to an integrated database for functional

genomics called Biomolecule Interaction Growth and Expression Database (BIGED).

BIGED is intended to support and integrate RNA expression data with other kinds of functional genomics data

Page 45: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

This survey is based on the article published in

BRIEFINGS IN BIOINFORMATICS, Vol 2, No 2, pp 143-158, May 2001:A comparison of microarray databases

Survey of existing microarray systems

Page 46: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

The Microarray Gene Expression Database Group (MGED)History and Future: Founded at a meeting in November, 1999 in Cambridge, UK. In May 2000 and March 2001: development of recommendations

for microarray data annotations (MAIME, MAML). MGED 2nd meeting:

establishment of a steering committee consisting of representatives of many of the worlds leading microarray laboratories and companies

MGED 4th meeting in 2002: MAIME 1.0 will be published MAML/GEML and object models will be accepted by the OMG concrete ontology and data normalization recommendations will be

published. information can be obtained from

http://www.mged.org

Page 47: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

The Microarray Gene Expression Database Group (MGED)

Goals:

Facilitate the adoption of standards for DNA-array experiment annotation and data representation.

Introduce standard experimental controls and data normalization methods.

Establish gene expression data repositories.

Allow comparision of gene expression data from different sources.

Page 48: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

MGED Working Groups

Goals: MIAME: Experiment description and data representation

standards - Alvis Brazma

MAGE: Introduce standard experimental controls and data normalization methods - Paul Spellman. This group includes the MAGE-OM and MAGE-ML development.

OWG: Microarray data standards, annotations, ontologies and databases - Chris Stoeckert

NWG: Standards for normalization of microarray data and cross-platform comparison - Gavin Sherlock

Page 49: Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

Vidhya Jagannathan, SIB, Lausanne

References URL:

Tutorial on Information Management for Genome Level Bioinformatics, Paton and Goble, at VLDB 2001: http://www.dia.uniroma3.it/~vldbproc/#tutEuropea

European Molecular Biology Network http://www.embnet.org/

Univ. Manchester site (with relational version of Microarray data representation, and links to other sites)• http://www.bioinf.man.ac.uk

Database textbook with absolutely no bioinformatics coverage

For Microarray Data http://linkage.rockefeller.edu/wli/microarray/