Microarray Data Standardisation - Microarray Gene Expression Database group (MGED), December 2000


Microarray Data Standardisation

Microarray Gene Expression Database group -- MGED

December, 2000


Public data repositories for microarray data

There is a growing consensus in the life science community on the need for public repositories of gene expression data, analogous to DDBJ/EMBL/GenBank for sequences.


Some of the reasons:

• Gradually building up gene expression profiles for various organisms, tissues, cell types, developmental stages and states, and under the influence of various compounds

• Building up systematic knowledge about gene functions and networks through links to other genomics databases

• Comparison of profiles, and access to and analysis of data by third parties

• Cross-validation of results and platforms (quality control)


Systematic gene expression profiling initiatives in the public domain

The International Life Sciences Institute (ILSI) is coordinating a program undertaken by ~25 pharmaceutical and food companies to generate toxicity-related gene expression data under defined experimental conditions:
– evaluate gene expression profiles in standardised test systems following exposure to toxicants
– relate changes in gene expression to other measures of toxicity


Microarray data handling and analysis - a major bottleneck (calculations by Jerry Lanfear)

• Experiments:
– 100 000 genes in human
– 320 cell types
– 2000 compounds
– 3 time points
– 2 concentrations
– 2 replicates

• Data:
– 8 x 10^11 data points
– 1 x 10^15 bytes = 1 petabyte of data
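The data-point estimate on this slide is the product of the listed experimental factors. The sketch below reproduces that arithmetic. The bytes-per-data-point figure is not stated on the slide; it is derived here from the two quoted totals and should be read as an implied value only.

```python
# Factors listed on the slide (calculation attributed to Jerry Lanfear).
factors = {
    "genes": 100_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

# Number of data points = product of all factors.
data_points = 1
for count in factors.values():
    data_points *= count

total_bytes = 10**15  # 1 petabyte, as quoted on the slide

# Assumption: derived from the two quoted totals, not given in the source.
implied_bytes_per_point = total_bytes / data_points

print(f"data points: {data_points:.2e}")  # about 8 x 10^11
print(f"implied bytes per data point: {implied_bytes_per_point:.0f}")
```

An implied figure of roughly a kilobyte per data point would be consistent with storing images and intermediate results alongside each value, but the slide does not say how the petabyte estimate was obtained.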


Expression data repository projects

• Public repositories in the making:
– GEO - NCBI
– GeneX - NCGR
– ArrayExpress - EBI

• In-house databases - Stanford, MIT, University of Pennsylvania

• Organism-specific databases - mouse at the Jackson Laboratory

• Proprietary databases - Gene Logic, NCI


Difficulties

• Raw data are images

• What is needed for higher-level analysis and mining is the gene expression matrix (genes/samples/gene expression levels)
– lack of standard measurement units for gene expression
– lack of standards for sample annotation


Raw data - images

Treated sample labeled red (Cy5); control sample labeled green (Cy3)

Competitive hybridization onto chip

• Red dot - gene overexpressed in treated sample

• Green dot - gene underexpressed in treated sample

• Yellow dot - equally expressed

• Intensity - “absolute” level

• red/green - ratio of expression
– 2: 2x overexpressed
– 0.5: 2x underexpressed

• log2(red/green) - “log ratio”
– 1: 2x overexpressed
– -1: 2x underexpressed

[Figure: cDNA spotted microarray, Stanford University (yeast, 1997)]
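The ratio and log ratio defined on this slide can be computed per spot from the two channel intensities. The sketch below is a minimal illustration; the function names and intensity values are hypothetical, and it skips background subtraction and normalisation, which real pipelines would apply first.

```python
import math

def expression_ratio(red: float, green: float) -> float:
    """Ratio of treated (red, Cy5) to control (green, Cy3) intensity."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return red / green

def log_ratio(red: float, green: float) -> float:
    """Base-2 logarithm of the red/green ratio."""
    return math.log2(expression_ratio(red, green))

# Invented intensities for three spots: over-, under- and equally expressed.
examples = [(2000.0, 1000.0), (500.0, 1000.0), (1000.0, 1000.0)]
for red, green in examples:
    print(red, green, expression_ratio(red, green), log_ratio(red, green))
```

The log ratio is symmetric around zero: a twofold increase and a twofold decrease give +1 and -1, whereas the plain ratio gives 2 and 0.5.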


Gene expression matrix

[Figure: gene expression matrix with genes as rows, samples as columns, and gene expression levels as entries]


Gene expression levels

• What we would like to have
– gene expression levels expressed in some standard units (e.g. molecules per cell)
– reliability measure associated with each value (e.g. standard deviation)

• What we do have
– each experiment using different units
– no reliability information
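A reliability measure such as the standard deviation can only be attached to a value when replicate measurements exist. The sketch below shows one way to summarise replicates as a mean and a sample standard deviation. The gene labels and values are invented, and the choice of the sample standard deviation is an assumption, not something the slides prescribe.

```python
import statistics

# Hypothetical replicate log2 ratios for two genes (invented values).
replicates = {
    "gene_a": [1.1, 0.9, 1.0],   # replicates agree closely
    "gene_b": [0.2, 1.5, -0.8],  # replicates disagree
}

def summarise(values):
    """Return (mean, sample standard deviation) of replicate measurements."""
    if len(values) < 2:
        raise ValueError("need at least two replicates")
    return statistics.mean(values), statistics.stdev(values)

summary = {gene: summarise(values) for gene, values in replicates.items()}
for gene, (mean, sd) in summary.items():
    print(f"{gene}: mean={mean:.2f} sd={sd:.2f}")
```

With a single measurement per gene the function refuses to report a value, which mirrors the "no reliability information" situation described on the slide.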


Comparing expression data

[Figures: measurements labeled in different units (cm, inch), followed by measurements whose units are unknown (?, ?)]


Measurement units

• In the longer term:
– standard controls for experiments (on chips and in the samples)
– replicate measurements

• Temporary solution:
– storing intermediate analysis results (including the images) and annotations of how they were obtained, i.e., the evidence


Comparing expression data - problem 2

• How do gene names relate across different data matrices?

• How do samples relate across different data matrices?
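Two matrices can be combined row by row only if their gene identifiers can be mapped to each other. The sketch below assumes the simplest case: both matrices already use the same identifier scheme (yeast systematic ORF names here) and have distinct sample labels. The matrices, sample labels and values are invented, and `merge_on_shared_genes` is a hypothetical helper.

```python
# Hypothetical matrices: gene identifier -> {sample label -> expression value}.
matrix_a = {
    "YAL001C": {"s1": 1.2},
    "YAL002W": {"s1": -0.3},
}
matrix_b = {
    "YAL002W": {"t1": 0.4},
    "YAL003W": {"t1": 2.0},
}

def merge_on_shared_genes(a, b):
    """Keep only genes present in both matrices and join their sample columns.

    Assumes sample labels do not collide; on a collision the value from b wins.
    """
    shared = sorted(a.keys() & b.keys())
    return {gene: {**a[gene], **b[gene]} for gene in shared}

merged = merge_on_shared_genes(matrix_a, matrix_b)
print(merged)
```

When the two sources name genes differently, the intersection is empty or misleading, which is why a shared reference for gene identifiers and sample descriptions is needed before such a merge is meaningful.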


Sample annotation

• Gene expression data have meaning only in the context of the experimental conditions of the target system

• Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc.) are needed for unambiguous sample annotation

• Sample annotations in current public databases are typically useless


In the longer term

• Standard units for gene expression measurements

• Standards for sample annotation


More immediate actions

• Understand what information about microarray experiments should be captured to make the descriptions reasonably self-contained

• Develop a data exchange format able to capture this minimum information

• Develop recommendations on how data should be normalised and what controls should be used


MGED group

The MGED group is an open discussion group initially established at the Microarray Gene Expression Database meeting MGED 1 (14-15 November 1999, Cambridge, UK). The goal of the group is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalisation methods. The underlying goal is to facilitate the establishment of gene expression data repositories, the comparability of gene expression data from different sources, and the interoperability of different gene expression databases and data analysis software. Since 1999 the group has held two general meetings, and a third is planned for 2001.

For more see www.mged.org


MGED participants include

• Affymetrix
• Berkeley
• DDBJ
• DKFZ
• EMBL
• Gene Logic
• Incyte
• Max Planck Institute
• NCBI
• NCGR
• NHGRI
• Sanger Centre
• Stanford
• Uni Pennsylvania
• Uni Washington
• Whitehead Institute


Working groups

• Microarray experiment annotations and minimum information standards (A. Brazma)

• XML data communication standards and interfaces (P. Spellman)

• Ontology for sample description (M. Bittner)

• Cross-platform comparison and normalisation (F. Holstege, R. Bumgarner)

• Future user group - queries, query languages and data mining (M. Vingron)


MGED state of the art

• Formulation of the “minimum information about a microarray experiment” (MIAME) to ensure its interpretability and reproducibility

• Data exchange format based on XML - microarray markup language (MAML) submitted to OMG in November


MIAME six parts:

1. Experimental design: the set of hybridisation experiments as a whole
2. Array design: each array used and each element (spot) on the array
3. Samples: samples used, the extract preparation and labeling
4. Hybridizations: procedures and parameters
5. Measurements: images, quantitation, specifications
6. Controls: types, values, specifications

see www.mged.org for details


MIAME concepts

• MIAME is aimed at the co-operative data submitter

• Concept of “qualifier, value, source” lists, where the source is either user-defined or an external reference

• Reusable information can be referenced, but should be provided at least once (array descriptions, standard protocols)

• Raw data should be reported, together with the authors' interpretations
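The “qualifier, value, source” idea can be pictured as a list of small records attached to a sample or an array. The sketch below is one possible in-memory representation and is not taken from the MIAME documents. The class name, field names and example annotations are hypothetical, and encoding a user-defined source as `None` is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Annotation:
    """One (qualifier, value, source) triple."""
    qualifier: str
    value: str
    # Assumption: None means the term is user defined; otherwise the field
    # names the external reference the value is taken from.
    source: Optional[str] = None

# Invented annotations for a single sample.
sample_annotations: List[Annotation] = [
    Annotation("organism", "Saccharomyces cerevisiae", "NCBI Taxonomy"),
    Annotation("growth condition", "heat shock, 37 C"),  # user defined
]

def externally_sourced(annotations: List[Annotation]) -> List[Annotation]:
    """Return the annotations whose value comes from an external reference."""
    return [a for a in annotations if a.source is not None]

for annotation in externally_sourced(sample_annotations):
    print(annotation.qualifier, "=", annotation.value, "from", annotation.source)
```

Keeping the source next to each value lets a reader distinguish controlled terms, which can be compared across experiments, from free-text terms that need manual interpretation.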


MAML

• MAML is an XML-based data exchange format able to capture MIAME-compliant information

• The work is still in progress; the first draft has been submitted to OMG as a data exchange standard for microarray data


MAML concepts

• Annotations + data; data can be given as a set of external 2D matrices

• Data format is independent of the particular scanner or image analysis software

• Sample and treatment can be represented as a directed acyclic graph (DAG)

• Concept of composite images and composite spots


Sample and treatment representation

[Figure: graph in which Sample 1, Sample 2 and Sample 3 are linked through treatments to Array 1 and Array 2]
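Representing samples and treatments as a directed acyclic graph means that each node records what it was derived from, and no chain of derivations loops back on itself. The sketch below uses the Python standard library's `graphlib` (Python 3.9 or later) to hold such a graph and list its nodes in processing order. The node names and edges are invented for illustration and do not reproduce the figure or any MAML element names.

```python
from graphlib import TopologicalSorter

# Hypothetical derivation graph: node -> set of nodes it was derived from.
derived_from = {
    "treated culture": {"stock culture"},
    "control culture": {"stock culture"},
    "labeled extract (Cy5)": {"treated culture"},
    "labeled extract (Cy3)": {"control culture"},
    "hybridisation on array 1": {
        "labeled extract (Cy5)",
        "labeled extract (Cy3)",
    },
}

# static_order() yields every node after all of its predecessors and raises
# graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(derived_from).static_order())
for step in order:
    print(step)
```

A graph rather than a tree is needed because one hybridisation combines two labeled extracts, and one stock can give rise to several treated samples.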


Expression matrix - raw and processed

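[Figure: two matrices. Processed data: genes (rows) by samples (columns), holding gene expression levels. Raw data: spots (rows) by images (columns), holding spot/image quantitations.]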


Microarray image analysis data representation

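[Figure: data layout with three axes: images, spots and quantitations. The image axis holds primary images and composite images (e.g., green/red ratios); the spot axis holds primary spots and composite spots.]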


MAML future

• The NOMAD microarray LIMS system will export data in MAML format

• ArrayExpress and GEO will import data in MAML format

• We hope that OMG will accept MAML as the industry standard

• We hope that MAML will become a de facto standard


MGED steering committee

• Meeting in Bethesda on 17 Nov 2000

• MIAME was accepted; a publication urging journals and funding agencies to adopt it will be prepared

• MGED will become an ISCB Special Interest Group

• Next general MGED meeting in Stanford, March 29-31


Top level object model for gene expression database

[Figure: top-level object model of ArrayExpress. Classes shown: Experiment, Sample, Array, Hybridization, ExpressionValue. External links: Reference (e.g., publication, web resource), Ontology (e.g., organism taxonomy), Database (e.g., gene in SWISS-PROT).]