The Bioinformatics of Microarrays Microarray Outreach Team Fall 2005


Outline

• Biology, Statistics, Data mining common term definitions
• Transcriptome caveats and limitations
• Experimental Design
• Scan to intensity measures
• Low level analysis
• Data mining – how to interpret > 6000 measures
  – Databases
  – Software
  – Techniques
  – Comparing to prior HT studies, across platforms?

Issues

Bioinformatics, Computational Biology, Data Mining

• Bioinformatics is an interdisciplinary field concerned with the information processing problems of computational biology and with a unified treatment of the data mining methods used to solve them.

• Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.
  – Genomes (viruses, bacteria, fungi, plants, insects,…)
  – Proteins and Proteomes
  – Biological Sequences
  – Molecular Function and Structure

• Data Mining is searching for knowledge in data
  – Knowledge mining from databases
  – Knowledge extraction
  – Data/pattern analysis
  – Data dredging
  – Knowledge Discovery in Databases (KDD)

Basic Terms in Biology

Example:
• The human body contains ~100 trillion cells
• Inside each cell is a nucleus
• Inside the nucleus are two complete sets of the human genome (except in egg and sperm cells, which carry one set, and mature red blood cells, which have no nucleus)
• Each set of the genome includes an estimated 20,000-25,000 protein-coding genes (earlier estimates ranged from 30,000 to 80,000) on the same 23 chromosomes
• Gene – A functional hereditary unit that occupies a fixed location on a chromosome, has a specific influence on phenotype, and is capable of mutation.
• Chromosome – A DNA-containing linear body in the cell nucleus responsible for the determination and transmission of hereditary characteristics

Basic Terms in Data Mining

• Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.

• Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.

• A pattern is a conservative statement about a probability distribution.
  – Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution

Problems in Bioinformatics Domain

– Data production at the levels of molecules, cells, organs, organisms, populations

– Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, …

– Prediction of Molecular Function and Structure

– Computational biology: synthesis (simulations) and analysis (machine learning)

Subcellular localization provides a simple goal for genome-scale functional prediction

Determine how many of the ~6000 yeast proteins go into each compartment

Subcellular Localization, a standardized aspect of function

[Figure: cell compartments: Nucleus, Membrane, Extra-cellular [secreted], ER, Cytoplasm, Mitochondria, Golgi]

"Traditionally" subcellular localization is "predicted" by sequence patterns

[Figure: sequence patterns (NLS, TM-helix, Sig. Seq., HDEL, Import Sig.) shown against the compartments Nucleus, Membrane, ER, Cytoplasm, Mitochondria, Golgi]

Subcellular localization is associated with the level of gene expression

[Figure: Expression Level in Copies/Cell for the compartments Nucleus, Membrane, ER, Cytoplasm, Mitochondria, Golgi]

Combine Expression Information & Sequence Patterns to Predict Localization

[Figure: sequence patterns (NLS, TM-helix, Sig. Seq., HDEL, Import Sig.) and Expression Level in Copies/Cell shown against the compartments Nucleus, Membrane, ER, Cytoplasm, Mitochondria, Golgi]

Major Objective: Discover a comprehensive theory of life’s organization at the molecular level
– The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
– The central dogma of molecular biology???

Proteins are very complicated molecules built from 20 different amino acids.

Dynamic Nature of Yeast Genome

eORF= essential

kORF= known

hORF= homology identified

shORF= short

tORF= transposon identified

qORF= questionable

dORF= disabled

First published sequence claimed 6274 genes– a # that has been revised many times, why?

The Affy detection oligonucleotide sequences are frozen at the time of synthesis, how does this impact downstream data analysis?

Microarray Data Process

1. Experimental Design
2. Image Analysis – raw data
3. Normalization – “clean” data
4. Data Filtering – informative data
5. Model building
6. Data Mining (clustering, pattern recognition, et al.)
7. Validation

Experimental Design

A good microarray design has 4 elements
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of biological materials should minimize systematic bias
3. Simple and statistically sound arrangement that minimizes cost and gains maximal information
4. Compliance with MIAME

• The goal of statistics is to find signals in a sea of noise
• The goal of exp. design is to reduce the noise so signals can be found with as small a sample size as possible

Observational Study vs. Designed Experiment

• Observational study
  – Investigator is a passive observer who measures variables of interest, but does not attempt to influence the responses
• Designed experiment
  – Investigator intervenes in the natural course of events

What type is our DMSO exp?

Experimental Replicates

• Why?
  – In any exp. system there is a certain amount of noise, so even 2 identical processes yield slightly different results
  – Sources?
  – In order to understand how much variation there is, it is necessary to repeat an exp. a number of independent times
  – Replicates allow us to use statistical tests to ascertain if the differences we see are real
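As a minimal sketch of how replicates feed a statistical test, the snippet below computes Welch's t statistic for one gene measured in two conditions. The function name `welch_t` and the intensity values are hypothetical and chosen only for illustration; the slides do not name a specific test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic for two independent groups of replicates."""
    # The sample variance needs at least two replicates per group.
    if len(a) < 2 or len(b) < 2:
        raise ValueError("need at least two replicates per group")
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log2 intensities for one gene, three replicates per condition.
control = [8.1, 8.3, 7.9]
treated = [9.4, 9.6, 9.2]
t_stat = welch_t(treated, control)
print(f"t = {t_stat:.2f}")
```

With a single measurement per condition the variance term cannot be estimated at all, which is the practical reason replicates are required before any such test can be run.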

Technical vs. Biological Replicates

As we progress from the starting material to the scanned image, we move from a system dominated by biological effects to one dominated by chemistry and physics noise.

Within the Affy platform the dominant variation is usually biological, so the best strategy is to produce replicates as high up the experimental tree as possible.

From probe level signals to gene abundance estimates


The job of the expression summary algorithm is to take a set of Perfect Match (PM) and Mis-Match (MM) probes, and use these to generate a single value representing the estimated amount of transcript in solution, as measured by that probeset.

To do this, .DAT files containing array images are first processed to produce a .CEL file, which contains measured intensities for each probe on the array.

It is the .CEL files that are analysed by the expression calling algorithm.

http://bioinformatics.picr.man.ac.uk/mbcf/example_ma.jsp

PM and MM Probes

• The purpose of each MM probe is to provide a direct measure of background and stray signal (perhaps due to cross-hybridisation) for its perfect-match partner. In most situations the signal from each probe pair is simply the difference PM - MM.

• For some probepairs, however, the MM signal is greater than the PM value; we have an apparently impossible measure of background.
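The PM - MM difference and the problem case described above can be sketched as follows. This is an illustration only: the function name `probe_pair_signals` and the intensities are hypothetical, and the sketch merely flags pairs where MM exceeds PM; it does not reproduce how MAS5 or any other algorithm resolves them.

```python
def probe_pair_signals(pm, mm):
    """Return (differences, flagged) for paired PM and MM intensities."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM lists must have the same length")
    # Naive background-corrected signal for each probe pair.
    diffs = [p - m for p, m in zip(pm, mm)]
    # Indices of probe pairs where MM exceeds PM (negative difference).
    flagged = [i for i, d in enumerate(diffs) if d < 0]
    return diffs, flagged


# Hypothetical intensities for one probeset with four probe pairs.
pm = [1200.0, 950.0, 400.0, 1500.0]
mm = [300.0, 200.0, 650.0, 350.0]
diffs, flagged = probe_pair_signals(pm, mm)
print(diffs)    # [900.0, 750.0, -250.0, 1150.0]
print(flagged)  # [2]
```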

Signal Intensity

• Following these calculations, the MAS5 algorithm now has a measure of the signal for each probe in a probeset.

• Other algorithms, e.g. RMA, GCRMA and dChip, have been developed by academic teams to improve the precision and accuracy of this calculation

• In our exp. we will use RMA and GCRMA

Low level data analysis / pre-processing

GeneSpring, R language, Bioconductor

• Varying biological or cellular composition among sample types.
• Differences in sample preparation, labeling or hybridization
• Non-specific cross-hybridization of target to probes.

These lead to systematic differences between individual arrays

• Raw Data Quality Control

• Scaling

• Normalization and filtering.
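One simple form of the scaling step is to multiply every intensity on an array by a constant so that all arrays share the same mean. The sketch below assumes plain mean scaling with hypothetical array names and values; it is not the procedure used by GeneSpring, RMA or GCRMA, which apply their own normalization methods.

```python
from statistics import mean


def scale_arrays(arrays, target=None):
    """Scale each array so that its mean intensity equals a common target."""
    means = {name: mean(values) for name, values in arrays.items()}
    if target is None:
        # Default target: the average of the per-array means.
        target = mean(list(means.values()))
    return {
        name: [v * target / means[name] for v in values]
        for name, values in arrays.items()
    }


# Hypothetical raw intensities: chip_B is uniformly twice as bright as chip_A.
raw = {
    "chip_A": [100.0, 200.0, 300.0],
    "chip_B": [200.0, 400.0, 600.0],
}
scaled = scale_arrays(raw)
print(scaled["chip_A"])  # [150.0, 300.0, 450.0]
print(scaled["chip_B"])  # [150.0, 300.0, 450.0]
```

A single scale factor per array removes overall brightness differences but cannot correct intensity-dependent effects, which is one reason more elaborate normalization methods exist.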

[Figure: staffing labels for the processing steps: Scott, GMC scientists, Anjie, GMC scientists + entire UVM outreach team]

Data processing is complete. Now what?

Overview of Microarray Problem

[Figure: diagram relating Biology Application Domain, Experiment Design and Hypothesis, Microarray Experiment, Image Analysis, Data Analysis, Data Mining, Data Warehouse, Validation, Statistics, Artificial Intelligence (AI), and Knowledge discovery in databases (KDD)]

Back to Biology

• Do the changes you see in gene expression make sense BIOLOGICALLY?

• How do we know?

• If they don’t make sense, can you hypothesize as to why those genes might be changing?

• Leads to many, many more experiments

The Gene Ontologies

A Common Language for Annotation of Genes from Yeast, Flies and Mice

…and Plants and Worms

…and Humans

…and anything else!

Gene Ontology Objectives

• GO represents concepts used to classify specific parts of our biological knowledge:
  – Biological Process
  – Molecular Function
  – Cellular Component

• GO develops a common language applicable to any organism

• GO terms can be used to annotate gene products from any species, allowing comparison of information across species

Srinija Srinivasan, Chief Ontologist, Yahoo!

The ontology. Dividing human knowledge into a clean set of categories is a lot like trying to figure out where to find that suspenseful black comedy at your corner video store. Questions inevitably come up, like are Movies part of Art or Entertainment? (Yahoo! lists them under the latter.) -Wired Magazine, May 1996

The 3 Gene Ontologies

• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity

• Biological Process = biological goal or objective
  – broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme

Example: Gene Product = hammer

Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment

Biological Examples: Molecular Function, Biological Process, Cellular Component

term: MAPKKK cascade (mating sensu Saccharomyces)

goid: GO:0007244

definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.

definition_reference: PMID:9561267

comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'.

Terms, Definitions, IDs

definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
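The term record shown above is a set of `key: value` fields, so it maps naturally onto a dictionary. The sketch below is a hypothetical parser for that slide layout only; it is not a reader for the official GO file formats, and the obsolete check simply looks for the "OBSOLETE" prefix seen in the definition field.

```python
def parse_term(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


# Fields copied from the record shown on the slide.
raw_term = """\
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
"""

term = parse_term(raw_term)
is_obsolete = term["definition"].startswith("OBSOLETE")
print(term["goid"], is_obsolete)  # GO:0007244 True
```

Splitting on the first colon only matters here because the identifiers themselves (GO:0007244, PMID:9561267) contain colons.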

SGD

SGD public microarray data sets available for public query

Homework
1. Go to http://www.yeastgenome.org/ and find 3 candidate genes of known function and one of undefined function that you might predict to be altered by DMSO treatment.
2. What GO biological processes and molecular functions are associated with your candidate genes?
3. Where in the cell does the protein reside?
4. What other proteins are known or inferred to interact with yours? How was this interaction determined? Is this a genetic or physical interaction?
5. Find the expression of at least one of your known genes in another publicly deposited microarray data set.
   1. What is the name of the data set, and how did you find it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3 reasons why the observed transcript level difference may not be confirmed through a second technology like RT-qPCR?
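For the fold change question in item 5, a fold change is commonly reported as the ratio of treated to control signal, often on a log2 scale so that increases and decreases are symmetric. The sketch below assumes that convention; the function name and the intensities are hypothetical, and a public data set may define its values differently.

```python
from math import log2


def fold_change(treated, control):
    """Return (ratio, log2 ratio) for two positive expression values."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    ratio = treated / control
    return ratio, log2(ratio)


# Hypothetical signals for one gene.
print(fold_change(800.0, 200.0))  # (4.0, 2.0): four-fold increase
print(fold_change(100.0, 400.0))  # (0.25, -2.0): four-fold decrease
```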

http://www.yeastgenome.org/
