Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

1 of 28

2013-09-03, Nikolas Pontikos, Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE

Nikolas Pontikos

PhD Student, CIMR

2 of 28


Biological Context: Cell Phenotypes

Different types of cells can be identified based on their shape and size and on the surface markers (proteins) that they express:

Lymphocytes, Granulocytes, Neutrophils

CD4+ Lymphocytes, CD8+ Lymphocytes

CD45RA+, CD45RA-

CD stands for Cluster of Differentiation: these are surface proteins which can be used as markers to distinguish different cell types.

3 of 28


What is Flow Cytometry?

[Figure: scatter plot of Side Scatter (granularity) against Forward Scatter (cell size), with lymphocytes, neutrophils and granulocytes labelled.]

1998-2012 Abcam plc. All rights reserved

Cell      ForwardScatter  SideScatter  CD4  CD127  CD45RA  CD25
1         2110            309          103  254    4       70
2         1565            252          57   278    341     59
...       ...             ...          ...  ...    ...     ...
110,992   964             256          78   199    9       345

110,992 points

4 of 28


The Transitional Phenotype of Cells

[Figure: density of log10 CD45RA intensity, with two modes labelled Memory Cells (lower intensity) and Naive Cells (higher intensity).]

As cells transition from one cell type (state) to another, they lose or gain expression of certain markers.

Here the CD45RA marker is lost as cells transition from naive to memory status.

This results in a bimodal distribution of CD45RA intensity.

5 of 28


Manual Method of Identifying Cell Phenotypes

% of CD25+ Naive Cells

% of Memory Cells

6 of 28


Issues with Manual Analysis of Flow Cytometry Data

Identifying all possible cell subsets is tedious and error-prone.

P parameters result in on the order of P^2 two-dimensional comparisons.

Manual analysis also introduces operator bias.

Unexpected or rare cell populations may be missed.
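To make the P^2 growth concrete, here is a small sketch that counts the two-dimensional plots needed to inspect every pair of parameters. The exact number of unordered pairs is P(P-1)/2, which is on the order of P^2. The marker names come from the example data table earlier in the talk; the function name is only illustrative.

```python
from itertools import combinations
from math import comb

# Parameters from the example data table: two scatter channels plus four CD markers.
markers = ["ForwardScatter", "SideScatter", "CD4", "CD127", "CD45RA", "CD25"]

def pairwise_plots(names):
    """Return every unordered pair of parameters, i.e. every 2D plot to inspect."""
    return list(combinations(names, 2))

plots = pairwise_plots(markers)
print(len(plots))  # 15 == 6 * 5 / 2

# The count grows quadratically with the number of parameters P.
for p in (6, 10, 30):
    print(p, comb(p, 2))
```

With 30 parameters, as in a mass cytometry panel of that size, there would already be 435 plots to gate by hand.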

7 of 28


Flow Data vs Genetic Data

             Flow Data           Genetic Data
N            cells               individuals
P            cellular markers    SNPs
Typical P    ~ 10                ~ 100,000
Typical N    ~ 1,000,000         ~ 1000
Ratio        N > 10,000 x P      N < 100 x P

Distance-based clustering:
- hclust
- kmeans

Density-based clustering:
- identifying regions of significantly high density
- fitting mixture models

8 of 28


Motivation for SPADE

Heading towards high-dimensional data sets:
- pooling of datasets
- mass cytometry

Distance-based methods are fast, at the expense of storing the entire distance matrix.

Distance-based clustering is well suited to high-dimensional data sets where the data is too sparse for density-based methods.
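To give a rough sense of the storage cost of a distance matrix: a full set of pairwise distances for N cells has N(N-1)/2 distinct entries. The sketch below assumes 8-byte floating-point distances (an assumption, not something stated in the talk) and includes the 110,992-cell example from earlier.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed to store the N*(N-1)/2 distinct pairwise distances."""
    return n_points * (n_points - 1) // 2 * bytes_per_value

for n in (1_000, 110_992, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"N = {n:>9,}: {gib:,.1f} GiB")
```

The quadratic growth is one reason to cluster a downsampled subset of cells rather than every event.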

9 of 28


SPADE: spanning-tree progression analysis of density-normalised events

Primarily a visualisation tool for revealing structure in point clouds as obtained from flow cytometry.

A clustering method with rare event detection thanks to density-dependent downsampling.

Four main steps in SPADE:

1) Density-dependent downsampling
2) Agglomerative clustering
3) Minimum spanning tree construction
4) Upsampling
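Clustering and tree construction only see the downsampled cells, so the final step has to map every original cell back onto a cluster. Here is a minimal sketch of that upsampling step, assuming each original cell takes the cluster label of its nearest downsampled cell in Euclidean distance. The function name and toy coordinates are illustrative, not SPADE's actual implementation.

```python
from math import dist

def upsample(all_cells, sampled_cells, sampled_labels):
    """Assign every original cell the cluster label of its nearest downsampled cell."""
    labels = []
    for cell in all_cells:
        nearest = min(range(len(sampled_cells)),
                      key=lambda k: dist(cell, sampled_cells[k]))
        labels.append(sampled_labels[nearest])
    return labels

# Toy two-marker data: two downsampled cells, already clustered (labels 0 and 1).
sampled_cells = [(0.0, 0.0), (3.0, 3.0)]
sampled_labels = [0, 1]
all_cells = [(0.1, -0.2), (2.8, 3.1), (0.4, 0.3), (3.2, 2.9)]

print(upsample(all_cells, sampled_cells, sampled_labels))  # [0, 1, 0, 1]
```

Once every cell has a label, per-node statistics such as median marker intensity can be computed over the full data set rather than over the downsampled subset.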

10 of 28


Results from paper

Outline of SPADE as applied to a simulated data set
- Proof of concept
- Structure of data preserved and rarer cell population identified

Analysis of mouse hematopoiesis using flow cytometry data
- Ability to reconstruct a known hierarchy
- Comparison to manual gating
- Identified cell population missed in manual gating (dendritic cells)

Analysis of human hematopoiesis using mass cytometry data
- Joining multiple stimulation experiments on core markers
- Non-targeted cell population identified (NK cells)

11 of 28


SPADE: Spanning-tree Progression Analysis of Density-normalised Events

(i) A simulated two-parameter flow cytometry data set, with one rare population and three abundant populations.

(ii) Result of density-dependent down-sampling of the original data.

(iii) Agglomerative clustering result of the down-sampled cells. Adjacent clusters are drawn in alternating colors.

(iv) Minimum spanning tree that connects the cell clusters.

(v) Colored SPADE trees. Nodes are colored by the median intensities of protein markers of cells in each node, allowing visualization of the behaviors of the two markers across the entire heterogeneous cell population.

Input

Output

12 of 28


Density-dependent down-sampling: an example

[Figure: two scatter plots of CD45RA against CD25, labelled N=200 and N=50.]

After downsampling the density has been flattened to the target density while preserving rare clusters.

The green nodes can be used to build the minimum spanning-tree.

13 of 28


Extracting Cellular Hierarchy

Identification of dendritic cells missed by manual gating in f

Mouse Data

14 of 28


Identification of untargeted* cell type

NK Cells NK Cells

* CD127 and CD16 are typically not used to identify NK Cells.

Human Data

15 of 28


Joining multiple flow experiments on core surface markers

Core markers used to build tree. Other markers, functional or additional surface markers, used to annotate it.

[Figure: density plot of CD4 (N = 229591, bandwidth = 0.08117).]

Core markers need to align across experiments.

Human Data

16 of 28


Visualisation of response

Pooling of experiments on common tree structure allows visualisation across many different experimental conditions.

Human Data

17 of 28


Visualises high-dimensional data.

Exposes the hierarchy in a bottom-up manner thanks to the spanning tree.

Identification of novel and rare cell types in flow cytometry thanks to density-dependent downsampling.

Pooling of multiple experiments on common tree structure for meta-analysis.

Conclusion

18 of 28


Application to our data sets

19 of 28


Visualising response to stimulation

[Figure: four SPADE trees of lymphocytes, from resting through increasing dose, coloured by median Alexa.Fluor.488.A (colour scale 0.64 to 2.14; range 0.02 to 0.98 percentile), with CD45RA+ CD25- and CD45RA- CD25+ regions labelled.]

Data courtesy of Tony Cutler

Applied to Flow Data:

20 of 28


Minimum spanning trees relate to single-linkage hierarchical clustering, as used in heatmaps for viewing genetic data such as SNP arrays.
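One way to see the relation is through Kruskal's algorithm, which builds a minimum spanning tree by adding edges in order of increasing length and skipping any edge that would close a cycle. Each accepted edge merges two groups of points, and its length is the distance between the closest pair across those groups. That is the single-linkage merge rule, so the sorted MST edge lengths equal the single-linkage merge heights. Here is a minimal sketch on toy one-dimensional points; the data and function names are illustrative only.

```python
from itertools import combinations

def kruskal_mst(points):
    """Return MST edges (length, i, j) for 1D points using Kruskal's algorithm."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((abs(points[i] - points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    mst = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # edge joins two separate groups: accept it
            parent[ri] = rj
            mst.append((length, i, j))
    return mst

def gap(c1, c2):
    """Single-linkage distance: the closest pair of members across two clusters."""
    return min(abs(x - y) for x in c1 for y in c2)

def single_linkage_heights(points):
    """Naive single-linkage clustering; returns the merge heights in order."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        d, a, b = min((gap(clusters[a], clusters[b]), a, b)
                      for a, b in combinations(range(len(clusters)), 2))
        heights.append(d)
        clusters[a] += clusters[b]   # a < b, so deleting b leaves index a valid
        del clusters[b]
    return heights

points = [0.0, 1.0, 3.0, 7.0, 7.5]
mst_lengths = [length for length, _, _ in kruskal_mst(points)]
print(mst_lengths)                     # [0.5, 1.0, 2.0, 4.0]
print(single_linkage_heights(points))  # [0.5, 1.0, 2.0, 4.0]
```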

21 of 28


Copy Number Imputation from SNP Log R Ratio and Theta

[Figure: copy number calls using qPCR data. Scatter plot with KIR3DS1 and KIR3DL1 axes (0 to 3), labelled Gene A and Gene B, with copy-number groups 0-2, 1-2, 0-1, 1-1, 2-1, 3-0, 2-0 and 1-0, and group proportions 60.38%, 29.58%, 3.66%, 1.83%, 2.10%, 0.47%, 0.34% and 1.63%.]

[Figure: heatmap of Log R Ratio and Theta from ImmunoChip of 30 SNPs in the gene A and B region; rows are labelled by SNP, for example Log_R_Ratio.seq.rs674268 and Theta.seq.rs598452.]

Applied to Genetic Data:

22 of 28


[Figure: a minimal spanning tree shown alongside a single-linkage hclust dendrogram.]

23 of 28


Minimum spanning trees: another way of visualising high-dimensional data?

24 of 28


Lum, P. Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., et al. (2013). Extracting insights from the shape of complex data using topology. Scientific Reports, 3. doi:10.1038/srep01236

Topological Data Analysis

25 of 28


Comparing distribution of cells across the same tree

P and Q are cell distributions across the nodes of the same tree.

P and Q can be compared using the Earth Mover's Distance (EMD):

$$\mathrm{EMD}(P,Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$

subject to minimising $\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}$, where

$f_{ij}$ = number of cells moved from node i to node j

$d_{ij}$ = number of hops in the shortest path from node i to node j

Qiu, P. (2012). Inferring Phenotypic Properties from Single-Cell Characteristics. PLoS ONE, 7(5), e37038. doi:10.1371/journal.pone.0037038
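When P and Q hold the same total number of cells and d_ij counts hops along the tree, the minimising flow has a simple form: every tree edge must carry exactly the surplus of cells on one side of it. The EMD is then the sum of those absolute surpluses divided by the total flow. The sketch below uses this tree shortcut rather than a general transport solver; the adjacency-list format, node numbering and function name are assumptions made for illustration.

```python
def tree_emd(adjacency, p, q):
    """Earth Mover's Distance between p and q on a tree with hop-count distances.

    adjacency: dict node -> list of neighbouring nodes (must describe a tree).
    p, q: dict node -> number (or fraction) of cells; totals must match.
    """
    total = sum(p.values())
    if abs(total - sum(q.values())) > 1e-9:
        raise ValueError("p and q must have the same total mass")

    root = next(iter(adjacency))
    parent = {root: None}
    order = [root]
    for node in order:                 # breadth-first traversal from the root
        for nb in adjacency[node]:
            if nb not in parent:
                parent[nb] = node
                order.append(nb)

    surplus = {n: p.get(n, 0) - q.get(n, 0) for n in adjacency}
    work = 0.0
    for node in reversed(order):       # children are processed before parents
        if parent[node] is not None:
            work += abs(surplus[node])              # cells crossing the edge to the parent
            surplus[parent[node]] += surplus[node]
    return work / total

# Tree with edges 0 - 1, 1 - 2 and 1 - 3.
tree = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
p = {0: 10, 1: 0, 2: 0, 3: 0}
q = {0: 0, 1: 0, 2: 5, 3: 5}
print(tree_emd(tree, p, q))  # 2.0: every cell moves two hops
```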

26 of 28


Calculating density

All these points will be assigned the same local density (LD).

According to the target density (TD) and outlier density (OD), SPADE keeps each cell i with the following probability:
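The formula itself did not survive on this slide. As described in the original SPADE paper, the rule is piecewise: cells with local density at or below OD are discarded as outliers, cells with density between OD and TD are always kept, and denser cells are kept with probability TD / LD_i. Here is a minimal sketch under that assumption; the neighbourhood-count density estimate, the radius value and the function names are illustrative choices, not the reference implementation.

```python
import random
from math import dist

def local_density(cells, radius):
    """Count, for each cell, how many cells lie within `radius` of it (itself included)."""
    return [sum(dist(c, other) <= radius for other in cells) for c in cells]

def keep_probability(ld, target_density, outlier_density):
    """Piecewise keep probability: 0 for outliers, 1 up to TD, TD / LD above TD."""
    if ld <= outlier_density:
        return 0.0
    if ld <= target_density:
        return 1.0
    return target_density / ld

def downsample(cells, radius, target_density, outlier_density, seed=0):
    """Keep each cell at random according to its density-dependent probability."""
    rng = random.Random(seed)
    densities = local_density(cells, radius)
    return [c for c, ld in zip(cells, densities)
            if rng.random() < keep_probability(ld, target_density, outlier_density)]

# Toy data: a dense cluster of 200 cells and a rare cluster of 10 cells.
rng = random.Random(1)
dense = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(200)]
rare = [(rng.gauss(4, 0.3), rng.gauss(4, 0.3)) for _ in range(10)]
kept = downsample(dense + rare, radius=1.0, target_density=10, outlier_density=1)
print(len(kept))
```

Cells in the dense cluster are thinned heavily, while cells in the rare cluster have local density at or below TD and are kept unless they are isolated enough to count as outliers.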

27 of 28


Building Minimal Spanning Tree

The minimal spanning tree (MST) is the set of edges with the smallest total length that connects all nodes without forming a cycle.

Layout of tree determined by Fruchterman-Reingold algorithm (see Methods).

The MST is related to the single-linkage hierarchical clustering algorithm (see later).
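Here is a minimal sketch of building a minimum spanning tree over cluster centres with Prim's algorithm, assuming Euclidean distance between the centres. The talk does not say which MST algorithm SPADE uses, so Prim's is an illustrative choice, and the coordinates are made up.

```python
from math import dist

def prim_mst(nodes):
    """Return MST edges (i, j, length) over points `nodes` using Prim's algorithm."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(nodes):
        # Pick the shortest edge that links the tree to a node not yet in it.
        length, i, j = min((dist(nodes[i], nodes[j]), i, j)
                           for i in in_tree
                           for j in range(len(nodes)) if j not in in_tree)
        in_tree.add(j)
        edges.append((i, j, length))
    return edges

# Hypothetical cluster centres in two marker dimensions.
centres = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
tree = prim_mst(centres)
print(tree)                                   # three edges for four nodes
print(sum(length for _, _, length in tree))   # total length 8.0
```

A tree over n nodes always has n - 1 edges, which is why it can be laid out in two dimensions far more readably than the full distance matrix.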

28 of 28


[Figure: heatmaps of the Log R Ratio and Theta SNP variables, with panels titled "Average Linkage" and "Single Linkage"; rows and columns are labelled by SNP, for example Log_R_Ratio.seq.rs674268 and Theta.seq.rs598452.]
