High-throughput omic datasets and clustering Colin Dewey BMI/CS 576 [email protected] Fall 2015

Clustering approaches for high-throughput omic datasets

High-throughput omic datasets and clustering
Colin Dewey
BMI/CS 576
www.biostat.wisc.edu/[email protected] 2015

Goals for today
- Recap of molecules of life
- High-throughput datasets/omic datasets
- Clustering approaches to analyzing high-throughput datasets
- What is clustering?
- Applications to mRNA/gene expression levels

Molecules of life
- DNA
- RNA
  - mRNA
  - ncRNA
- Proteins
- Metabolites
While DNA is mostly static, the epigenome, RNA, proteins, and metabolites change between cell types, tissues, environments, and conditions.

The central dogma

High-throughput datasets and omes
- Aim to measure as many components of a cell as possible simultaneously
- Types of omes
  - Genome: collection of DNA in a cell
  - Epigenome: all of the chemical modifications on the genome
  - Transcriptome: all of the RNA in a cell
  - Proteome: all of the proteins in a cell
  - Metabolome: all of the metabolites present in a cell
  - Interactome: all of the interactions within a cell
- Omics data provide a comprehensive description of nearly all components of the cell
Joyce & Palsson, Nature Mol Cell Biol. 2006

Databases with omic data

Joyce & Palsson, Nature Mol Cell Biol. 2006

Understand a cell as a system
- Measure: identify the parts of a system
  - Parts: different types of biomolecules (genes, proteins, metabolites)
  - High-throughput assays to measure these molecules
- Model: how these parts are put together
  - Clustering
  - Network inference and analysis

Transcriptomes/genome-wide mRNA levels are dynamic

- What is varied: individuals, strains, cell types, environmental conditions, disease states, etc.
- What is measured: RNA quantities for thousands of genes, exons, or other transcribed sequences
(Figure labels: genes, mRNAs)

Bio-techniques to measure transcriptomes
- Microarrays
  - cDNA/spotted arrays
  - Oligonucleotide arrays
- Sequencing
  - RNA-seq

Microarrays
- A microarray is a solid support on which pieces of DNA are arranged in a grid-like array
  - Each piece is called a probe
- Measures RNA abundances by exploiting complementary hybridization
  - DNA from the labeled sample is called the target

Microarrays

Complementary hybridization
(Figure sequences: AGCGGTTCGAATACC, TCGCGAAGCTAGACA, CCGAAATAGCCAGTA, UCGCCAAGCUUAUGG)
- Due to Watson-Crick base pairing, complementary single-stranded DNA/RNA molecules hybridize (bond to each other)

Complementary hybridization: one way to do it in practice
- Put (a large part of) the actual gene DNA sequence on the array
- Convert mRNA to cDNA using reverse transcriptase
(Figure sequences: AGCGGTTCGAATACC, UCGCCAAGCUUAUGG, TCGCCAAGCTTATGG; figure labels: actual gene, cDNA, mRNA, reverse transcriptase)

Spotted vs. oligonucleotide arrays
- Spotted arrays: synthesize samples of cDNA (full-length transcripts or shorter sequences) and then spot them onto the array
  - Usually two colors
- Oligonucleotide arrays: synthesize sets of DNA oligonucleotides (short, fixed-length sequences, typically 25-60 nucleotides long) on the array itself (in situ)
  - Commercially available: Affymetrix, Nimblegen
  - Usually one color
- In both cases, mRNA is converted to DNA, labeled, hybridized, and detected by fluorescence scanning

Spotted versus oligonucleotide array

From Vermeeren and Luc Michiels 2011, DOI: 10.5772/19432

A video about two-color DNA microarrays

Microarray measurements
- Can't detect the absolute amount of mRNA present for a given gene, but can measure a relative quantity
- For two-color arrays, the measurements represent

$$\log \frac{\text{red}_i}{\text{green}_i}$$

where red is the test expression level and green is the reference level for gene G in the i-th experiment
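
A minimal sketch of the two-color measurement, assuming the common log-ratio convention log(red / green). The base-2 logarithm, the function name `log_ratio`, and the intensity values are illustrative assumptions, not taken from the slides.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one gene in one two-color experiment.

    red is the test-sample intensity and green is the reference intensity.
    Base 2 is an assumption here; the slides do not state the base.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Hypothetical (red, green) intensities for one gene in three experiments.
intensities = [(800.0, 200.0), (150.0, 150.0), (50.0, 400.0)]
ratios = [log_ratio(red, green) for red, green in intensities]
print(ratios)  # [2.0, 0.0, -3.0]
```

With this convention, positive values mean higher expression in the test sample than in the reference, negative values mean lower, and zero means no change.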

RNA-seq measurements
- Measurements are digital: counts of sequenced reads for each gene/transcript
- The measurements still represent relative amounts of each transcript: the counts depend on how many reads are sequenced
Wang et al, Nature Genetics 2009
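
To illustrate why raw read counts are only relative, the sketch below rescales counts to counts per million (CPM), one simple depth normalization. CPM is an illustrative choice that the slides do not name, and the gene names and counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million.

    counts maps a gene name to its number of sequenced reads.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# The same hypothetical sample sequenced at two depths: every raw count
# differs tenfold, but the relative amounts are identical.
shallow = {"geneA": 10, "geneB": 30, "geneC": 60}
deep = {"geneA": 100, "geneB": 300, "geneC": 600}

same_after_scaling = counts_per_million(shallow) == counts_per_million(deep)
print(same_after_scaling)  # True
```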

A typical RNA-seq pipeline

RNA-seq vs. microarrays
- Advantages of RNA-seq
  - Don't need a reference sequence for the genes/genome being assayed
  - Low background noise
  - Large dynamic range (10^5 vs. 10^2 for microarrays)
  - High technical reproducibility
- Disadvantage
  - More expensive, but cost is rapidly falling

Gene expression profiles
- We will assume we have a 2D matrix of gene expression measurements
  - rows represent genes
  - columns represent different experiments, time points, individuals, etc.
- We will refer to individual rows or columns as profiles
  - a row is a profile for a gene
  - a column is a profile for an experiment, time point, etc.
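
The row/column convention can be sketched with a small nested list. The gene names, experiment labels, and most values are hypothetical, and the helpers `gene_profile` and `experiment_profile` are illustrative names.

```python
# Rows are genes, columns are experiments (hypothetical values).
genes = ["gene1", "gene2", "gene3"]
experiments = ["t0", "t1", "t2", "t3"]
matrix = [
    [1.2, 1.4, 0.8, -2.3],
    [0.1, -0.4, 0.5, 0.9],
    [-1.0, -1.1, 0.2, 2.0],
]


def gene_profile(matrix, i):
    """Return row i: the profile of gene i across all experiments."""
    return list(matrix[i])


def experiment_profile(matrix, j):
    """Return column j: the profile of experiment j across all genes."""
    return [row[j] for row in matrix]


print(gene_profile(matrix, 0))        # [1.2, 1.4, 0.8, -2.3]
print(experiment_profile(matrix, 3))  # [-2.3, 0.9, 2.0]
```

Clustering genes compares rows with each other; clustering experiments compares columns.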

Gene-expression profiles for yeast cell cycle
- Rows represent yeast genes
- Columns represent time points as yeast goes through the cell cycle
- Color represents expression level relative to baseline (red = high, green = low, black = baseline)
Spellman 1998

Gene-expression profiles for leukemia patients
- Rows represent genes
- Columns represent people with 2 subtypes of leukemia: ALL and AML
- Each column corresponds to a microarray measurement

Gene-expression profiles for genes that induce differentiation
Ivanova et al., Nature 2006

Commonly asked questions from expression datasets
- If we measure gene expression in a normal versus a disease cell type, which genes have different expression levels across the two groups?
  - Differential expression
- Which genes seem to be changing together?
  - Clustering genes based on expression profiles of genes across all conditions
- Which treatments/individuals have similar profiles?
  - Clustering samples based on gene expression profiles of all genes
- What does a gene do?
  - What functional class does a given gene belong to?
- What class is a sample from?
  - e.g., does this patient have ALL or AML?
  - How will this sample react to a particular drug?

Clustering of gene expression data
- Task definition
- Distance metric
- Flat clustering
  - K-means
- Model-based clustering
  - Gaussian mixture models
- Hierarchical clustering
  - Top-down and bottom-up

Task definition: clustering gene expression profiles
Given: expression profiles for a set of genes or experiments/individuals/time points (whatever the columns represent)

Do: organize profiles into clusters such that
- profiles in the same cluster are highly similar to each other
- profiles from different clusters have low similarity to each other
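
One hedged way to check the two conditions for a given cluster assignment is to compare the average distance between profiles in the same cluster with the average distance between profiles in different clusters. Euclidean distance, the helper names, and the toy profiles below are illustrative assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise_distances(profiles, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i, k in combinations(range(len(profiles)), 2):
        d = euclidean(profiles[i], profiles[k])
        if labels[i] == labels[k]:
            within.append(d)
        else:
            between.append(d)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)


# Four hypothetical profiles assigned to two clusters.
profiles = [[1.0, 1.1], [0.9, 1.0], [-1.0, -1.2], [-1.1, -0.9]]
labels = [0, 0, 1, 1]
within, between = mean_pairwise_distances(profiles, labels)
print(within < between)  # True
```

A good clustering gives a small within-cluster value relative to the between-cluster value.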

Motivation for clustering
- Exploratory data analysis
  - understanding general characteristics of data
  - visualizing data
- Generalization
  - infer something about an object (e.g. a gene) based on how it relates to other objects in the cluster

Example output of clustering

(Figure labels: clusters, genes, environment conditions)
Gasch & Eisen, 2002

The clustering landscape
- There are many different clustering algorithms
- They differ with respect to several properties
  - hierarchical vs. flat
  - hard (no uncertainty about which profiles belong to a cluster) vs. soft clusters
  - overlapping (a profile can belong to multiple clusters) vs. non-overlapping
  - deterministic (same clusters produced every time for a given data set) vs. stochastic
  - distance (similarity) measure used

Distance measures
- Central to all clustering algorithms is a measure of distance between the objects being clustered
- Clustering algorithms aim to group similar things together
  - Optimize some measure of within-cluster similarity
- Defining the right similarity or distance is an important factor in getting good clusters
- Most algorithms will work with symmetric dissimilarities
  - Dissimilarities may not be distances
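
As a small illustration of a symmetric dissimilarity that is not a distance, the sketch below uses squared Euclidean distance. This is an example chosen here, not one named in the slides. It is symmetric and non-negative but can violate the triangle inequality.

```python
def squared_euclidean(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Three hypothetical one-dimensional profiles on a line.
a, b, c = [0.0], [1.0], [2.0]

direct = squared_euclidean(a, c)                            # 4.0
detour = squared_euclidean(a, b) + squared_euclidean(b, c)  # 1.0 + 1.0 = 2.0

is_symmetric = squared_euclidean(a, c) == squared_euclidean(c, a)
violates_triangle = direct > detour

print(is_symmetric, violates_triangle)  # True True
```

Going from a to c directly costs more than going through b, which a true distance never allows.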

Notation for microarray data
The data form a tall matrix of numbers, with rows corresponding to genes and columns corresponding to conditions (example entries: 1.2, 1.4, 0.8, -2.3, ...)
- Data matrix: n genes (rows) by p microarrays/samples (columns)
- x_i: expression profile of the i-th gene
- x_{ij}: expression value of the i-th gene in the j-th microarray

Different dissimilarity measures
Euclidean distance between two vectors x_i and x_k

$$d(x_i, x_k) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{kj})^2}$$

Manhattan distance

$$d(x_i, x_k) = \sum_{j=1}^{p} |x_{ij} - x_{kj}|$$
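
Both measures can be sketched in a few lines using their standard definitions: Euclidean distance is the square root of the summed squared differences over the p columns, and Manhattan distance is the sum of absolute differences. The function names and example profiles are illustrative.

```python
import math


def euclidean_distance(x_i, x_k):
    """Square root of the sum over columns j of (x_ij - x_kj)^2."""
    if len(x_i) != len(x_k):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_k)))


def manhattan_distance(x_i, x_k):
    """Sum over columns j of |x_ij - x_kj|."""
    if len(x_i) != len(x_k):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(x_i, x_k))


# Two hypothetical gene profiles measured on p = 3 microarrays.
x_i = [1.0, 2.0, 3.0]
x_k = [4.0, 6.0, 3.0]

print(euclidean_distance(x_i, x_k))  # 5.0
print(manhattan_distance(x_i, x_k))  # 7.0
```

Euclidean distance weights large single-column differences more heavily than Manhattan distance because of the squaring.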

Distance metrics
Properties of a metric or distance function d, for all profiles x_i, x_k, x_l:
- (non-negativity) $d(x_i, x_k) \ge 0$
- (identity) $d(x_i, x_k) = 0$ if and only if $x_i = x_k$
- (symmetry) $d(x_i, x_k) = d(x_k, x_i)$
- (triangle inequality) $d(x_i, x_k) \le d(x_i, x_l) + d(x_l, x_k)$

Summary
- High-throughput biological measurement technology is enabling comprehensive molecular profiling of cells
- We are often faced with large matrices of such profiling data
  - Rows = biomolecular entities (e.g., genes)
  - Columns = samples (e.g., patients or time points)
- The clustering task is a first step towards understanding such data
- Core to clustering is the definition of similarity or distance
