Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014....

Preview:

Citation preview

Efficient and Accurate Clustering for Large-Scale Genetic Mapping

V. Strnadová (Neeley) , Aydın Buluç , Jarrod Chapman , Joseph Gonzalez ,

John Gilbert , Stefanie Jegelka , Daniel Rokhsar , Leonid Oliker

* ++ § ¶

*,++ * *, ¶ §

§++ *,§, ¶ *

Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

Motivation

• High-throughput sequencing methods have produced a flood of inexpensive genetic information

• Genetic maps are important to breeding studies but genetic mapping software is prohibitively slow on large data sets

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

Data

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

cluster

Data

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

𝑚8

𝑚15

𝑚2

𝑚1

𝑚9

Linkage group 1

Linkage group 2

cluster

Linkage group 1 Linkage group 2

Data

𝑚11

𝑚6

𝑚13

𝑚3

𝑚12

𝑚10

𝑚4

𝑚7

𝑚5𝑚14

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝒎𝟏

𝒎𝟐

𝒎𝟖𝒎𝟗

𝒎𝟏𝟓

𝒎𝟑

𝒎𝟒𝒎𝟓

𝒎𝟔

𝒎𝟕𝒎𝟏𝟎𝒎𝟏𝟏𝒎𝟏𝟐

𝒎𝟏𝟑𝒎𝟏𝟒

Linkage group 1

Linkage group 2

cluster

Data

𝑚8

𝑚15

𝑚2

𝑚1

𝑚9

Linkage group 1 Linkage group 2

𝑚11

𝑚6

𝑚13

𝑚3

𝑚12

𝑚10

𝑚4

𝑚7

𝑚5𝑚14

The Need for Large-Scale Clustering in Genetic Mapping

• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers

• A major bottleneck is the linkage-group-finding phase

• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers

• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers

• A major bottleneck is the linkage-group-finding phase

• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers

Our solution: A fast, scalable clustering algorithm tailored to genetic marker data

The Need for Large-Scale Clustering in Genetic Mapping

Standard Approach to Genetic Marker Clustering

cluster

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

Data

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

Standard Approach to Genetic Marker Clustering

(1)

𝑚1𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

• (2) Cut all edges below a LOD threshold

• (3) The resulting connected components = linkage groups

Standard Approach to Genetic Marker Clustering

(1)

𝑚1𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

• (2) Cut all edges below a LOD threshold

• (3) The resulting connected components = linkage groups

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑁𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+𝑁𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑁𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅

𝑅+𝑁𝑅

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗

𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝑳𝑶𝑫(𝒎𝒊,𝒎𝒋) = 𝐥𝐨𝐠𝟏𝟎(𝟏 − 𝟏 𝟑)

𝟐( 𝟏 𝟑)𝟏

𝟎. 𝟓𝟑= 𝟎. 𝟎𝟕𝟒

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗

𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(2)

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(3) The resulting connected components = linkage groups

(3)

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

Key idea: Maintain a set of representative or “sketch” points which reveal the cluster structure

LOD threshold

Input: set of markers, LOD threshold 𝜏, non-missing threshold 𝜂, low-quality threshold 𝑐, cluster size threshold 𝜎

Output: set of clusters C and set of representative points R

Phase I: Build initial set of clusters and set of representative points using high-quality markers (those with at least 𝜂 non-missing entries)

Phase II: Add low quality markers (less than 𝜂 non-missing entries) to intialset of clusters

Phase III: Attempt to merge small clusters with large

The BubbleCluster Algorithm: Overview

The BubbleCluster Algorithm

𝒎

Iteration i:find 𝑟𝑀𝐴𝑋 ∶= 𝑟𝑗 for which 𝐿𝑂𝐷(𝑚, 𝑟𝑗) is

maximal;set 𝐶𝑀𝐴𝑋 ≔ 𝐶𝐾 ∈ 𝐶 containing 𝑟𝑀𝐴𝑋

𝐿𝑂𝐷(𝑚, 𝑟𝑗)

𝑟𝑗𝐶1 𝐶2

The BubbleCluster Algorithm

If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)

𝒎𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)

𝑟𝑀𝐴𝑋𝐶1 𝐶2

The BubbleCluster Algorithm

If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)𝐶 = 𝐶 ∪ {𝑚}

𝒎

𝑟𝑀𝐴𝑋𝐶1 𝐶2

𝐶3

The BubbleCluster Algorithm

Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )

𝒎

𝑟𝑀𝐴𝑋

𝐶1𝐶2

𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)

The BubbleCluster Algorithm

Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )𝐶𝑀𝐴𝑋 = 𝐶𝑀𝐴𝑋 ∪ {𝑚}

𝒎

𝑟𝑀𝐴𝑋

𝐶1 = 𝐶𝑀𝐴𝑋𝐶2

Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

The BubbleCluster Algorithm

The BubbleCluster Algorithm

Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )Add 𝑚 to representative points of 𝐶𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋

The BubbleCluster Algorithm

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋

The BubbleCluster Algorithm

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

The BubbleCluster Algorithm

𝒎

𝐶1 𝐶2

If 𝑚 has a LOD score above the threshold to two clusters,

𝐶𝑁𝐸𝑊 = 𝐶1 ∪ 𝐶2

If 𝑚 has a LOD score above the threshold to two clusters, Then merge the clusters and add m to the merged cluster

𝒎

The BubbleCluster Algorithm

End of Phase IStop when all markers with at least 𝜂 non-missing entries have been processed

Running time: 𝑶 |𝜢| 𝐥𝐨𝐠𝟐 𝚮 + |𝜢||𝑹|where: |𝛨| = size of high-quality marker set,

|𝑅| = size of representative point set

Phase II: add low-quality markers

Phase III: merge small clusters

BubbleCluster Parameters

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃)𝑁𝑅𝜃𝑅

0.5𝑅+𝑁𝑅

Highest achievable LOD: lim𝑅→0log10

(1 − 𝜃)𝑁𝑅𝜃𝑅

0.5𝑅+𝑁𝑅= 𝑁𝑅 log10 2

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝜏1

𝜏2

BubbleCluster Parameters

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Example LOD: log10(1 − 1 3)

2( 1 3)1

0.53= 0.074

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

BubbleCluster Parameters

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Highest achievable LOD: lim𝑅𝑖𝑗→0log10

(1 − 𝜃𝑖𝑗) 𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅+ 𝑅𝑖𝑗

= 𝑅𝑖𝑗log10 2

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

BubbleCluster Parameters

Evaluation Metric: 𝐹-score

Given a golden standard clustering, the 𝐹-score measures the quality of another clustering by comparing it to the golden standard

Range: 0 – 1

The 𝐹-score is a harmonic mean of precision and recallAn 𝐹-score of 1 indicates perfect precision and perfect recall for every golden

standard cluster

Results: BubbleCluster on Real Data Sets

Dataset Size Time F-Score

Barley 64K 15 sec 0.9993

Switchgrass 113K 8.9 min 0.9745

Switchgrass 548K 1.9 hrs 0.9894

Wheat 1.58M 1.22 hrs N/A *

* Results under review at Genome Biology

Comparison of Clustering Algorithms for Simulated Data

ClusteringMethod

12.5K Markers 25K Markers

F-score Time F-score Time

JoinMap 0.99964 14 min 0.99982 46 min

MSTMap 0.99964 4.5 min 0.99982 20 min

PIC 0.47024 11 sec (+ 4min)

0.60782 44 sec (+ 16.5min)

BubbleCluster 0.99964 6 sec 0.99982 15 sec

Simulated data created with Nicholas Tinker’s Spaghetti Software.

4.4

9.8

21.0

47.2

96.6

237.8

6.7

14.0

31.7

86.4

198.7

475.1

1

2

4

8

16

32

64

128

256

512

1024

12.5 25 50 100 200 400

1E-5

1E-4

1E-3

1E-2

1E-1

1E+0

Ru

nti

me

(s)

Dataset Size (in thousands of markers)

Erro

r (1

-Fs

core

)

error, 65% missing

error, 35% missing

runtime, 65% missing

runtime, 35% missing

Scaling Results for Bubble Cluster on Simulated Data

Effect of the LOD threshold

LOD threshold 5 10 15 20 25 30

F-Score 0.6225 0.9999 0.9999 0.9999 0.9999 0.9999

Time (s) 48.6 67.0 70.9 82.0 106 170

Fixed missing entry threshold, increasing LOD threshold

200K markers, 300 individuals, 35% missing rate

𝜏1𝜏2 vs.

Effect of the missing data threshold

Fixed LOD threshold, increasing non-missing entry threshold

Non-missing threshold

132 166 172 179 186 192

F-Score 0.9999 0.9999 0.9992 0.9930 0.9610 0.8948

Time (s) 82.0 84.6 82.7 83.0 81.7 82.0

200K markers, 300 individuals, 35% missing rate

𝜏 𝜏𝜏vs.

ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

𝒎

𝐶1 𝐶2

ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

𝒎

𝐶1 𝐶2While remaining highly accurate, we outperform popular existing tools in both runtime and scalability

Clustering Method 12.5K Markers 25K Markers

F-score Time F-score Time

JoinMap 0.99964 14 min 0.99982 46 min

MSTMap 0.99964 4.5 min 0.99982 20 min

PIC 0.47024 11 sec (+ 4 min)

0.60782 44 sec (+ 16.5 min)

BubbleCluster 0.99964 6 sec 0.999982 15 sec

Future Work

• Use representative points as starting point for ordering phase

Future Work

• Use representative points as starting point for ordering phase

• Provide a more thorough theoretical analysis of achievable clustering as well as order accuracy given assumptions about error and missing data rates

• Develop efficient and accurate, large-scale genetic mapping software

Thank You

Code for BubbleCluster soon available at: www.ucsb.edu/~veronika

Backup Slides

Choosing LOD and non-missing thresholdGoal: minimize 𝑷(mistake) and maximize 𝑭- score

Let 𝑝 = 𝑃 𝐿𝑂𝐷 𝑚𝑖 , 𝑚𝑗 > 𝐿𝑂𝐷𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 𝑥𝑖 ∈ 𝐶𝑖 , 𝑥𝑗 ∈ 𝐶𝑗 , 𝑖 ≠ 𝑗)

Let 𝜏 = LOD threshold

By definition of the LOD score, 𝑝 =1

10𝜏

Let 𝑛𝑐𝑜𝑚𝑝 = number of LOD comparisons we make

Then, if we want to ensure that (1 − 𝑝)𝑛𝑐𝑜𝑚𝑝< 1 − 𝜀 then we need:

𝜏 > log10(1

1 − (1 − 𝜀) 1𝑛𝑐𝑜𝑚𝑝)

At the same time, we want to include a marker in the high-quality set only if we expect that it will achieve a LOD of 𝜏 or greater with another marker, requiring:

𝑛𝑛𝑚 > 𝜏(1 − 𝜇) log10 2

Where 𝜇 is the missing rate and 𝑛𝑛𝑚 is the number of non-missing entries in the marker

Evaluation Metric for Cluster Quality

F-score (range: 0 to 1)• Given a “golden standard clustering”, the F-score measures the

quality of a clustering as follows:

• The F-score between a golden standard cluster 𝑔 and a test cluster 𝑐is a harmonic mean of precision 𝑃 and recall 𝑅:

𝐹𝑠𝑐𝑜𝑟𝑒 𝑔, 𝑐 =2𝑃𝑅

𝑃 + 𝑅• The overall F-score between the golden standard clustering G and a

test clustering C is a weighted average of the F-scores for each golden standard cluster 𝑔:

𝑜𝑣𝑒𝑟𝑎𝑙𝑙_𝐹𝑠𝑐𝑜𝑟𝑒 𝐺, 𝐶 =1

𝑚

𝑔 ∋ 𝐺

𝑔 ∗ max𝑐 ∋𝐶𝐹𝑠𝑐𝑜𝑟𝑒(𝑔, 𝑐)

Local Linearity Assumption

Although the LOD score does not obey the triangle inequality, we assume that it does at close ranges and with enough data

𝑳𝑶𝑫(𝒎, 𝒓𝟏)𝑳𝑶𝑫(𝒎, 𝒓𝟐)

𝒎

𝒓𝟏𝒓𝟐𝑳𝑶𝑫(𝒓𝟐, 𝒓𝟏)

𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒎, 𝒓𝟏&&

𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐&&

𝑳𝑶𝑫 𝒎, 𝒓𝟏 > 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐

𝒎 is a new boundary point

The Effect of the LOD threshold

The standard approach to clustering can be viewed as single linkage clustering

Time complexity: 𝑂(𝑀2)

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

The Effect of the LOD threshold

A high LOD threshold ensures that the only edges that remain in the completely connected graph are between markers that are extremely likely to be genetically linked

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

The Effect of the LOD threshold

Example: LOD threshold = 10

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

The Effect of the LOD threshold

Example: LOD threshold = 8

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

The Effect of the LOD threshold

Example: LOD threshold = 7

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Recommended