Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Machine Learning for Personalized Medicine
Karsten Borgwardt
ETH Zurich Fraunhofer-Institut Kaiserslautern, September 30, 2016
Department Biosystems
The Need for Machine Learning in Computational Biology
BGI Hong Kong, Tai Po Industrial Estate, Hong Kong
High-throughput technologies:
Genome and RNA sequencing
Compound screening
Genotyping chips
Bioimaging
Molecular databases are growing much faster than our knowledge of biological processes.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 2 / 76
The Evolution of Bioinformatics
Classic Bioinformatics: Focus on Molecules
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 3 / 76
Classic Bioinformatics: Focus on Molecules
Large collections of molecular data
Gene and protein sequencesGenome sequenceProtein structuresChemical compounds
Focus: Inferring properties of molecules
Predict the function of a gene given its sequencePredict the structure of a protein given its sequencePredict the boundaries of a gene given a genome segmentPredict the function of a chemical compound given its molecular structure
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 4 / 76
Example: Predicting Function from Structure
Structure-Activity Relationship
Source: Joska T M , and Anderson A C Antimicrob. Agents Chemother. 2006;50:3435-3443
Fundamental idea: Similarity in structure implies similarity in function
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 5 / 76
Measuring the Similarity of Graphs
How similar are two graphs?
How similar is their structure?How similar are their node labels and edge labels?
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 6 / 76
Graph Comparison
1 Graph isomorphism and subgraph isomorphism checking
Exact matchExponential runtime
2 Graph edit distances
Involves definition of a cost functionTypically subgraph isomorphism as intermediate step
3 Topological descriptors
Lose some of the structural information represented by the graph orExponential runtime effort
4 Graph kernels (Gartner et al, 2003; Kashima et al. 2003)
Goal 1: Polynomial runtime in the number of nodesGoal 2: Applicable to large graphsGoal 3: Applicable to graphs with attributes
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 7 / 76
Graph Kernels I
KernelsKey concept: Move problem to feature space H.Naive explicit approach:
Map objects x and x′ via mapping φ to H.Measure their similarity in H as 〈φ(x), φ(x′)〉.
Kernel Trick: Compute inner product in H as kernel in input space k(x, x′) = 〈φ(x), φ(x′)〉.
R2 ⇒ HDepartment Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 8 / 76
Graph Kernels II
Graph kernels
Kernels on pairs of graphs(not pairs of nodes)Instance of R-Convolution kernels (Haussler, 1999):
Decompose objects x and x′ into substructures.Pairwise comparison of substructures via kernels to compare x and x′.
A graph kernel makes the whole family of kernel methods applicable to graphs.
G G’Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 9 / 76
Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
1
34
2
1
5
1
34
5
2
2
1,4
3,2454,1135
2,35
1,4
5,234
1,4
3,2454,1235
5,234
2,3
2,45
1st iterationResult of steps 1 and 2: multiset-label determination and sortingGiven labeled graphs G and G’
2,35
6
7
8
10
11
12
4,1135
1,4
5,234
3,245
4,1235
2,3
2,45 139
1st iterationResult of step 3: label compression
13 13
6 6 6 7
8 9
11 1210 10
1st iterationResult of step 4: relabeling
End of the 1st iterationFeature vector representations of G and G’
φ (G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)(1)
WLsubtree
φ (G’) = (
Counts oforiginal
node labels
1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1
Counts ofcompressednode labels
)(1)
WLsubtree
a b
c d
e
k (G,G’)=< φ (G), φ (G’) >=11.(1)
WLsubtree(1) (1)
WLsubtree WLsubtree
G’G
G’G G’G
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 10 / 76
Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, NIPS 2009)
1
34
2
1
5
1
34
5
2
2
1,4
3,2454,1135
2,35
1,4
5,234
1,4
3,2454,1235
5,234
2,3
2,45
1st iterationResult of steps 1 and 2: multiset-label determination and sortingGiven labeled graphs G and G’
2,35
6
7
8
10
11
12
4,1135
1,4
5,234
3,245
4,1235
2,3
2,45 139
1st iterationResult of step 3: label compression
13 13
6 6 6 7
8 9
11 1210 10
1st iterationResult of step 4: relabeling
End of the 1st iterationFeature vector representations of G and G’
φ (G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)(1)
WLsubtree
φ (G’) = (
Counts oforiginal
node labels
1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1
Counts ofcompressednode labels
)(1)
WLsubtree
a b
c d
e
k (G,G’)=< φ (G), φ (G’) >=11.(1)
WLsubtree(1) (1)
WLsubtree WLsubtree
G’G
G’G G’G
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 11 / 76
Subtree-like Patterns
1
2
3
4
5
6
1
1 3 1 51 2 4 5
2 63
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 12 / 76
Weisfeiler-Lehman Kernel: Theoretical Runtime Properties
Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)Algorithm: Repeat the following steps h times
1 Sort: Represent each node v as sorted list Lv of its neighbors (O(m))2 Compress: Compress this list into a hash value h(Lv ) (O(m))3 Relabel: Relabel v by the hash value h(Lv ) (O(n))
Runtime analysis
per graph pair: Runtime O(m h)for N graphs: Runtime O(N m h + N2 n h) (naively O(N2 m h))
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 13 / 76
Weisfeiler-Lehman Kernel: Empirical Runtime Properties
101
102
103
10−1
100
101
102
103
104
105
Number of graphs N
Runtim
e in s
econds
200 400 600 800 10000
200
400
600
Graph size n
Runtim
e in s
econds
2 4 6 80
5
10
15
20
Subtree height h
Runtim
e in s
econds
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.90
5
10
15
Graph density cR
untim
e in s
econds
pairwise
global
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 14 / 76
Weisfeiler-Lehman Kernel: Runtime and Accuracy
MUTAG NCI1 NCI109 D&D
10 sec1 minute
1 hour
1 day
10 days
100 days
1000 days
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
WLRG3 GraphletRWSP
graph size
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 15 / 76
The Evolution of Bioinformatics
Modern Bioinformatics: Focus on Individuals
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 16 / 76
Modern Bioinformatics: Focus on Individuals
High-throughput technologies now enable the collection of molecular information onindividuals
Microarrays to measure gene expression levelsChips to determine the genotype of an individualSequencing to determine the genome sequence of an individual
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 17 / 76
Genetics: Association Studies
Genome-Wide Association Studies (GWAS)
bco D. Weigel
One considers genome positions that differ between individuals, that is Single NucleotidePolymorphisms (SNPs) (more general: genetic locus or genomic variant).
Problem size: 105-107 SNPs per genome, 102 to 105 individualsDepartment Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 18 / 76
Genetics: Manhattan Plots
The standard statistical analysis in Genetics: Generating a Manhattan plot ofassociation signals
0 4000000 8000000 12000000 16000000
2
4
6
Manhattan-plot for chromosome Chr2
-log1
0(p-
valu
e)
chromosomal position [bp]
-log10(p-value) Bonferroni threshold [0.05]
Phenotype: Flower color-related trait of Arabidopsis thaliana
A plot of genome positions versus p-values of association/correlation.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 19 / 76
Genetics: Missing Heritability
More than 1200 new disease loci were detected over the last decade.The phenotypic variance explained by these loci is disappointingly low:
REVIEWS
Finding the missing heritability of complexdiseasesTeri A. Manolio1, Francis S. Collins2, Nancy J. Cox3, David B. Goldstein4, Lucia A. Hindorff5, David J. Hunter6,Mark I. McCarthy7, Erin M. Ramos5, Lon R. Cardon8, Aravinda Chakravarti9, Judy H. Cho10, Alan E. Guttmacher1,Augustine Kong11, Leonid Kruglyak12, Elaine Mardis13, Charles N. Rotimi14, Montgomery Slatkin15, David Valle9,Alice S. Whittemore16, Michael Boehnke17, Andrew G. Clark18, Evan E. Eichler19, Greg Gibson20, Jonathan L. Haines21,Trudy F. C. Mackay22, Steven A. McCarroll23 & Peter M. Visscher24
Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases andtraits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relativelysmall increments in risk, and explain only a small proportion of familial clustering, leading many to question how theremaining, ‘missing’ heritability can be explained. Here we examine potential sources of missing heritability and proposeresearch strategies, including and extending beyond current genome-wide association approaches, to illuminate the geneticsof complex diseases and enhance its potential to enable effective disease prevention or treatment.
Many common human diseases and traits are known tocluster in families and are believed to be influenced byseveral genetic and environmental factors, but untilrecently the identification of genetic variants contributing
to these ‘complex diseases’ has been slow and arduous1. Genome-wideassociation studies (GWAS), in which several hundred thousand tomore than a million single nucleotide polymorphisms (SNPs) areassayed in thousands of individuals, represent a powerful new tool forinvestigating the genetic architecture of complex diseases1,2. In the pastfew years, these studies have identified hundreds of genetic variantsassociated with such conditions and have provided valuable insightsinto the complexities of their genetic architecture3,4.
The genome-wide association (GWA) method represents animportant advance compared to ‘candidate gene’ studies, in whichsample sizes are generally smaller and the variants assayed are limitedto a selected few, often on the basis of imperfect understanding ofbiological pathways and often yielding associations that are difficultto replicate5,6. GWAS are also an important step beyond family-basedlinkage studies, in which inheritance patterns are related to severalhundreds to thousands of genomic markers. Despite many clearsuccesses in single-gene ‘Mendelian’ disorders7,8, the limited successof linkage studies in complex diseases has been attributed to their lowpower and resolution for variants of modest effect9–11.
The underlying rationale for GWAS is the ‘common disease,common variant’ hypothesis, positing that common diseases areattributable in part to allelic variants present in more than 1–5% ofthe population12–14. They have been facilitated by the development ofcommercial ‘SNP chips’ or arrays that capture most, although not all,common variation in the genome. Although the allelic architecture ofsome conditions, notably age-related macular degeneration, for themost part reflects the contributions of several variants of large effect(defined loosely here as those increasing disease risk by twofold ormore), most common variants individually or in combination conferrelatively small increments in risk (1.1–1.5-fold) and explain only asmall proportion of heritability—the portion of phenotypic variancein a population attributable to additive genetic factors3. For example,at least 40 loci have been associated with human height, a classiccomplex trait with an estimated heritability of about 80%, yet theyexplain only about 5% of phenotypic variance despite studies of tensof thousands of people15. Although disease-associated variants occurmore frequently in protein-coding regions than expected from theirrepresentation on genotyping arrays, in which over-representation ofcommon and functional variants may introduce analytical biases, thevast majority (.80%) of associated variants fall outside codingregions, emphasizing the importance of including both coding andnon-coding regions in the search for disease-associated variants3.
1National Human Genome Research Institute, Building 31, Room 4B09, 31 Center Drive, MSC 2152, Bethesda, Maryland 20892-2152, USA. 2National Institutes of Health, Building 1,Room 126, MSC 0148, Bethesda, Maryland 20892-0148, USA. 3Departments of Medicine and Human Genetics, University of Chicago, Room A612, MC 6091, 5841 South MarylandAvenue, Chicago, Illinois 60637, USA. 4Duke University, The Institute for Genome Sciences and Policy (IGSP), Box 91009, Durham, North Carolina 27708, USA. 5National HumanGenome Research Institute, Office of Population Genomics, Suite 4076, MSC 9305, 5635 Fishers Lane, Rockville, Maryland 20892-9305, USA. 6Department of Epidemiology, HarvardSchool of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, USA. 7University of Oxford, Oxford Centre for Diabetes, Endocrinology and Metabolism, ChurchillHospital, Old Road, Oxford OX3 7LJ, UK, and Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK. 8GlaxoSmithKline, 709Swedeland Road, King of Prussia, Pennsylvania 19406, USA. 9McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 North BroadwayBRB579, Baltimore, Maryland 21205, USA. 10Yale University, Department of Medicine, Division of Digestive Diseases, 333 Cedar Street, New Haven, Connecticut 06520-8019, USA.11deCODE Genetics, Sturlugata 8, Reykjavik IS-101, Iceland. 12Lewis-Sigler Institute for Integrative Genomics, Howard Hughes Medical Institute, and Department of Ecology andEvolutionary Biology, Princeton University, Princeton, New Jersey 08544, USA. 13The Genome Center, Washington University School of Medicine, 4444 Forest Park Avenue, CampusBox 8501, Saint Louis, Missouri 63108, USA. 14National Human Genome Research Institute, Center for Research on Genomics and Global Health, Building 12A, Room 4047, 12 SouthDrive, MSC 5635, Bethesda, Maryland 20892-5635, USA. 15Department of Integrative Biology, University of California, 3060 Valley Life Science Building, Berkeley, California 94720-3140, USA. 16Stanford University, Health Research and Policy, Redwood Building, Room T204, 259 Campus Drive, Stanford, California 94305, USA. 17Department of Biostatistics,University of Michigan, 1420 Washington Heights, Ann Arbor, Michigan 48109-2029, USA. 18Department of Molecular Biology and Genetics, 107 Biotechnology Building, CornellUniversity, Ithaca, New York 14853, USA. 19Howard Hughes Medical Institute and University of Washington, Department of Genome Sciences, 1705 North-East Pacific Street, FoegeBuilding, Box 355065, Seattle, Washington 98195-5065, USA. 20University of Queensland, School of Biological Sciences, Goddard Building, Saint Lucia Campus, Brisbane, Queensland4072, Australia. 21Vanderbilt University, Center for Human Genetics Research, 519 Light Hall, Nashville, Tennessee 37232-0700, USA. 22Department of Genetics, North CarolinaState University, Box 7614, Raleigh, North Carolina 27695, USA. 23Department of Genetics, Harvard Medical School, 77 Avenue Louis Pasteur, NRB 0330, Boston, Massachusetts02115, USA. 24Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia.
Vol 461j8 October 2009jdoi:10.1038/nature08494
747 Macmillan Publishers Limited. All rights reserved©2009
Manolio et al., Nature 2009Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 20 / 76
Genetics: Potential Reasons for Missing Heritability
Polygenic architectures
Most current analyses neglect additive or multiplicative effects between loci → need forsystems biology perspective
Small effect sizes
Not detectable with small sample sizes
Phenotypic effect of other genetic, epigenetic or non-genetic factors
Genetic properties ignored so far, e.g. rare SNPs
Chemical modifications of the genome
Environmental effect on phenotypeDepartment Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 21 / 76
Machine Learning in Genetics I
Moving to a Systems Biology Perspective
Multi-locus models:
Algorithms to discover trait-related systems of genetic loci
Increasing sample size:
Algorithms that support large-scale genotyping and phenotyping
Deciding whether additional information is required:
Tests that quantify the impact of additional (epi)genetic factors
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 22 / 76
Machine Learning in Genetics II
Moving to a Systems Biology Perspective
Multi-locus models:
Efficient algorithms for discovering trait-related SNP pairs: Epistasis discovery (KDD 2011, Human
Heredity 2012)
Increasing sample size:
Large-scale genotyping in A. thaliana (Nature Genetics 2011)
Automated image phenotyping of guppy fish (Bioinformatics 2012)
Automated image phenotyping of human lungs (IPMI 2013)
Deciding whether additional information is required:
Assessing the stability of methylation across generations of Arabidopsis lab strains (Nature 2011)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 23 / 76
Machine Learning in Genetics II
Moving to a Systems Biology Perspective
Multi-locus models:Efficient algorithms for discovering trait-related SNP pairs: Epistasis discovery (KDD 2011, Human
Heredity 2012)
Increasing sample size:
Large-scale genotyping in A. thaliana (Nature Genetics 2011)
Automated image phenotyping of guppy fish (Bioinformatics 2012)
Automated image phenotyping of human lungs (IPMI 2013)
Deciding whether additional information is required:
Assessing the stability of methylation across generations of Arabidopsis lab strains (Nature 2011)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 23 / 76
Epistasis: Computational Bottlenecks
Scale of the problem
Typical datasets include order 105 − 107 SNPs.
Hence we have to consider order 1010 − 1014 SNP pairs.
Enormous multiple hypothesis testing problem.
Enormous computational runtime problem.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 24 / 76
Epistasis: Common approaches in the literature
Exhaustive enumeration
Only with special hardware such as Cloud Computing or GPU implementations (e.g.
Kam-Thong et al., EJHG 2010, ISMB 2011, Hum Her 2012)
Filtering approaches
Statistical criterion, e.g. SNPs with large main effect (Zhang et al., 2007)
Biological criterion, e.g. underlying PPI (Emily et al., 2009)
Index structure approaches
fastANOVA, branch-and-bound on SNPs (Zhang et al., 2008)
TEAM, efficient updates of contingency tables (Zhang et al., 2010)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 25 / 76
Multi-Locus Models: Discovering Trait-Related Interactions
A AA
A
A
CT
C
CG
G
C
A AA
A
A
CT
G
CG
G
C
A AA
A
A
AT
C
CG
G
C
Problem statement
Find the pair of SNPs most correlated with a binary phenotype
argmax(i ,j)
|r(xi � xj , y)|
xi and xj represent one SNP each and y is the phenotype; xi , xj , y are all m-dimensionalvectors, given m individuals.
There can be up to n = 107 SNPs, and order 1014 SNP pairs.
Existing approaches: low recall or worst-case O(n2) timeDepartment Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 26 / 76
Difference in Correlation for Epistasis Detection
We phrase epistasis detection as a difference in correlation problem:
argmaxi ,j
|ρcases(xi , xj)− ρcontrols(xi , xj)|. (1)
Different degree of linkage disequilibrium of two loci in cases and controls
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 27 / 76
The Lightbulb Algorithm (Paturi et al., COLT 1989)
Maximum correlation
The lightbulb algorithm tackles the maximum correlation problem on an m× n matrix Awith binary entries:
argmaxi ,j
|ρA(xi , xj)|. (2)
Quadratic runtime algorithm
As in epistasis detection, the problem can be solved by naive enumeration of all n2
possible solutions.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 28 / 76
The Lightbulb Approach
Lightbulb algorithm
1 Given a binary matrix A with m rows and n columns.
2 Repeat l times:
Sample k rowsIncrease a counter for all pairs of columns that match on these k rows.
3 The counters divided by l give an estimate of the correlation P(xi = xj).
Subquadratic runtime
With probability near 1, the lightbulb algorithm retrieves the most correlated pair in
O(n1+
ln c1ln c2 ln2 n), where c1 and c2 are the highest and second highest correlation score.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 29 / 76
Difference Between the Epistasis and Lightbulb Problem Setting
Discrepancies
Difference in correlation
SNPs are non-binary in general
Pearson’s correlation coefficient
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 30 / 76
Step 1: Difference in Correlation
Theorem
Given a matrix of cases A and a matrix of controls B of identical size.
Finding the maximally correlated pair on(A AB 1− B
)(3)
and on (A 1− AB B
)(4)
is identical to
argmaxi ,j
|ρA(xi , xj)− ρB(xi , xj)|. (5)Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 31 / 76
Step 2: Locality Sensitive Hashing (Charikar, 2002)
Given a collection of vectors in Rm we choose a random vector r from the m-dimensionalGaussian distribution. Corresponding to this vector r, we define a hash function hr asfollows:
hr(xi ) =
{1 if r>xi ≥ 0
0 if r>xi < 0(6)
Theorem
For vectors xi , xj , Pr [hr(xi ) = hr(xj)] = 1−θ(xi , xj)
π, where θ(xi , xj) is the angle
between the two vectors xi and xj .
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 32 / 76
Step 3: Pearson’s Correlation Coefficient
Link between correlation and cosine
Karl Pearson defined the correlation of 2 vectors xi , xj in Rm as
ρ =cov(xi , xj)
σxiσxj
, (7)
that is the covariance of the two vectors divided by their standard deviations. Anequivalent geometric way to define it is:
ρ = cos(θ(xi − xi , xj − xj)), (8)
where xi and xj are the mean value of xi and xj , respectively.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 33 / 76
The Lightbulb Epistasis Algorithm (Achlioptas et al., KDD 2011)
Algorithm
1 Binarize original matrices A0 and B0 into A and B by locality sensitive hashing.
2 Compute maximally correlated pair p1 on
(A AB 1− B
)via lightbulb.
3 Compute maximally correlated pair p2 on
(A 1− AB B
)via lightbulb.
4 Report the maximum of p1 and p2.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 34 / 76
Experiments: Arabidopsis SNP dataset
Results on Arabidopsis SNP dataset# SNPs Measurements Pairs Exponent Speedup Top 10 Top 100 Top 500 Top 1K
100,000 8,255,645 8,186,657 1.38 611 1.00 0.86 0.82 0.80100,000 52,762,001 51,732,700 1.54 97 1.00 1.00 0.99 0.98
Runtime
Runtime is empirically O(n1.5).
Epistasis detection on the human genome would require 1 day of computation on atypical desktop PC.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 35 / 76
Experiments: Runtime versus Recall
0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.981.3
1.35
1.4
1.45
1.5
1.55
1.6
Recall among top 1000 SNP pairs (in %)
Exp
on
en
t o
f ru
ntim
e (
ba
se
n)
dbgap Schizophrenia dataset
Hapsample simulated dataset
Arabidopsis thaliana dataset
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 36 / 76
Multi-Locus Models: Current Work
Other important aspects
Including prior knowledge on relevance of SNPs (Limin Li et al., ISMB 2011)
Accounting for relatedness of individuals (Rakitsch et al., Bioinformatics 2012)
Measuring statistical significance (Sugiyama et al., arxiv 2014)
Modelling correlations between multiple phenotypes (Rakitsch et al., NIPS 2013)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 37 / 76
Increasing Sample Size: Genotyping (Cao et al., Nat. Gen. 2011)
Coding UTR Intron TE Intergenic0
1
2
3
4
5
1 20 40 60 80
a
b c
Frac
tion
Annotation Accessions
Milli
on S
NPs
d e
2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0Missing genotypes (%)
96.5
97.0
97.5
98.0
98.5
99.0
Bur-0C24Kro-0Ler-1
500
1,000
1,500
2,000
2,500
3,000
10 20 40 60 80Accessions
00
30 50 70
Freq
uenc
y
Pred
ictio
n ac
cura
cy (%
)
0
0.1
0.2
0.3
0.4
0.5Accessible sitesSNPsSmall indelsSV deletions
Coding UTR Intron TE Intergenic0
1
2
3
4
5
1 20 40 60 80
a
b c
Frac
tion
Annotation Accessions
Milli
on S
NPs
d e
2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0Missing genotypes (%)
96.5
97.0
97.5
98.0
98.5
99.0
Bur-0C24Kro-0Ler-1
500
1,000
1,500
2,000
2,500
3,000
10 20 40 60 80Accessions
00
30 50 70
Freq
uenc
y
Pred
ictio
n ac
cura
cy (%
)
0
0.1
0.2
0.3
0.4
0.5Accessible sitesSNPsSmall indelsSV deletions Setup
80 fully sequences genomes fromA. thaliana (3 million SNPs)
4 strains with 250.000 SNPs
Can we predict the remainingSNPs?
Result
Employed BEAGLE to predictmissing SNPs in 4 strains
Missing sites can be accuratelypredicted (>96% accuracy)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 38 / 76
Increasing Sample Size: Phenotyping (Karaletsos et al., Bioinf. 2012)
Setup
Guppy image collections
Re-occurring color patterns arephenotypes
How to phenotype the guppiesautomatically?
Result
Proposed Markov Random Fieldfor pattern discovery
Recovers color patterns found bymanual annotation
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 39 / 76
Increasing Sample Size: Phenotyping (Feragen et al., NIPS 2013c)
Setup
Collections of CT-scans of humanlungs
Structural differences may belinked to disease (COPD)
How to measure differences inlung structure?
Result
Proposed novel, efficient similaritymeasure on geometric trees (treekernel)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 40 / 76
Additional Factors: Epigenetic Influences (Becker et al., Nature 2011, Hagmann et al.,
PLoS Genetics, 2015)
Founder plant
Generation 0
Generation 3
Generation 31
Generation 32
29 39 49 59 69 79 89 99 109 119
4 8
Setup
33 generations of lab strains of A.thaliana
How stable is the methylationstate of genome positions acrossgenerations?
Result
Position-specific methylationvaries greatly
Region-wide methylation is morestable
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 41 / 76
An Online Resource for Machine Learning on Complex Traits
We published easyGWAS (https://easygwas.org/), a machine learning platform foranalysing complex traits (Grimm et al., arXiv 2012):
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 42 / 76
The Evolution of Bioinformatics
Future of Bioinformatics: Personalized Medicine
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 43 / 76
Personalized Medicine: Biomarker Discovery
Personalized Medicine
Tailoring medical treatment to the molecular properties of a patient
Biomarker Discovery
Detecting molecular components that are indicative of disease outbreak, progression ortherapy outcome
Biomarker
The term ‘biomarker’, short for ‘biological marker’, refers to a broad subcategory of medicalsigns — that is, objective indications of medical state observed from outside the patient —which can be measured accurately and reproducibly (Strimbu and Tavel, 2010).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 44 / 76
Personalized Medicine: Where We Stand
Producing molecular data: Sequencing costs
USD 300,000,000 cost of sequencing a human genome in 2001USD 1,000 cost of sequencing a human genome in 2014
Storing molecular data: Electronic health records
29% of U.S. physicians used an electronic health system in 200693% reported actively using medical records in 2013
Using molecular data: Products
13 prominent examples of personalized medicine drugs, treatments and diagnostics productsavailable in 2006113 prominent examples of personalized medicine drugs, treatments and diagnostics productsavailable in 2014
Source: http://www.ageofpersonalizedmedicine.org/
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 45 / 76
Significant Pattern Mining
Karsten Borgwardt
ETH Zurich Fraunhofer-Institut Kaiserslautern, September 30, 2016
Department Biosystems
Biomarker Discovery
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 46 / 76
Biomarker Discovery as a Pattern Mining Problem
Finding groups of disease-related molecular factors
Single genetic variants, gene expression levels, protein abundancies are often notsufficiently indicative of disease outbreak, progression or therapy outcome.
Searching for combinations of these molecular factors creates an enormous search space,and two inherent problems:
1 Computational level: How to efficiently search this large space?2 Statistical level: How to properly account for testing an enormous number of hypotheses?
The vast majority of current work in this direction (e.g. Achlioptas et al., KDD 2011)focuses on Problem 1, the computational efficiency.
But Problem 2, multiple testing, is also of fundamental importance!
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 47 / 76
Biomarker Discovery as a Pattern Mining Problem
Feature Selection: Find features that distinguish classes of objects
Pattern Mining: Find higher-order combinations of binary features, so-called patterns,to distinguish one class from another
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 48 / 76
Pattern Mining
Definition of a pattern
A pattern is a set of dimensions S .
A binary vector x contains a pattern S if∏i∈S
xi = 1, where xi is dimension i of x.
Definition of a frequent pattern
Assume we are given n vectors {xj}nj=1.
If at least θ vectors contain pattern S , then S is a frequent pattern.
θ is a pre-defined frequency threshold.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 49 / 76
Pattern Mining
Frequent pattern mining
Analogous definitions for frequent subgraph mining and frequent substring mining.
Numerous branch-and-bound algorithms have been proposed for finding frequentpatterns (Aggarwal and Han, 2014).
Problem statement: Significant Pattern Mining
Assume we are given n vectors {xj}nj=1, each of which has a binary class label yj ∈ 0, 1.
Our goal is to find all patterns S that are statistically significant enriched in one ofthe two classes y = 0 or y = 1.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 50 / 76
Statistical Significance and Testability
Fisher’s exact test
Contingency Table
|S ∈ x| |S /∈ x|y = 0 a n0 − a n0
y = 1 x − a n − n0 − x + a n − n0
x n − x n
A popular choice is Fisher’s exact test to test whether S is overrepresented in one of thetwo classes.
The common way to compute p-values for Fisher’s exact test is based on thehypergeometric distribution and assumes fixed total marginals (x , n0, n).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 51 / 76
Statistical Significance and Testability
Multiple testing correction in pattern mining
The number of candidate patterns grows exponentially with the cardinality of thepattern.
If we do not correct for multiple testing, α per cent of all candidate patterns will be falsepositives.
If we do correct for multiple testing, e.g. via Bonferroni correction (α
#tests), then we
lose any statistical power.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 52 / 76
Statistical Significance and Testability
Multiple testing correction in pattern mining
The number of candidate patterns grows exponentially with the cardinality of thepattern.
If we do not correct for multiple testing, α per cent of all candidate patterns will be falsepositives.
If we do correct for multiple testing, e.g. via Bonferroni correction (α
#tests), then we
lose any statistical power.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 52 / 76
Statistical Significance and Testability
Multiple testing correction in pattern mining
The number of candidate patterns grows exponentially with the cardinality of thepattern.
If we do not correct for multiple testing, α per cent of all candidate patterns will be falsepositives.
If we do correct for multiple testing, e.g. via Bonferroni correction (α
#tests), then we
lose any statistical power.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 52 / 76
Statistical Significance and Testability
Tarone’s trick
Tarone’s insight: When working with discrete test statistics (e.g. Fisher’s exact test),there is a minimum p-value that a given pattern can obtain, based on its total frequency.
Tarone’s trick (1990): Ignore those patterns in multiple testing correction, for which theminimum p-value is larger than the Bonferroni-corrected significance threshold.
If the p-values are conditioned on the total marginals (e.g. in Fisher’s exact test),Tarone’s trick does not increase the Family Wise Error rate.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 53 / 76
Statistical Significance and Testability
Tarone’s trick
Tarone’s insight: When working with discrete test statistics (e.g. Fisher’s exact test),there is a minimum p-value that a given pattern can obtain, based on its total frequency.
Tarone’s trick (1990): Ignore those patterns in multiple testing correction, for which theminimum p-value is larger than the Bonferroni-corrected significance threshold.
If the p-values are conditioned on the total marginals (e.g. in Fisher’s exact test),Tarone’s trick does not increase the Family Wise Error rate.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 53 / 76
Statistical Significance and Testability
Tarone’s trick
Tarone’s insight: When working with discrete test statistics (e.g. Fisher’s exact test),there is a minimum p-value that a given pattern can obtain, based on its total frequency.
Tarone’s trick (1990): Ignore those patterns in multiple testing correction, for which theminimum p-value is larger than the Bonferroni-corrected significance threshold.
If the p-values are conditioned on the total marginals (e.g. in Fisher’s exact test),Tarone’s trick does not increase the Family Wise Error rate.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 53 / 76
Mining Significant Patterns
Tarone’s approach (1990)
For a discrete test statistics T (S) for a pattern S , such as in Fisher’s exact test, there isa minimum obtainable p-value, pmin(S).
For some S , pmin(S) >α
m. Tarone refers to them as untestable hypotheses U .
Tarone’s strategy: Ignore untestable hypotheses U when counting the number of testsm for Bonferroni correction.
If the p-values of the test are conditioned on the total marginals (as in Fisher’s exacttest), this does not affect the Family-Wise Error Rate.
Difficulty: There is an interdependence between m and U .
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 54 / 76
Mining Significant Patterns
Tarone’s approach (1990)
For a discrete test statistics T (S) for a pattern S , such as in Fisher’s exact test, there isa minimum obtainable p-value, pmin(S).
For some S , pmin(S) >α
m. Tarone refers to them as untestable hypotheses U .
Tarone’s strategy: Ignore untestable hypotheses U when counting the number of testsm for Bonferroni correction.
If the p-values of the test are conditioned on the total marginals (as in Fisher’s exacttest), this does not affect the Family-Wise Error Rate.
Difficulty: There is an interdependence between m and U .
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 54 / 76
Mining Significant Patterns
Tarone’s approach (1990)
For a discrete test statistics T (S) for a pattern S , such as in Fisher’s exact test, there isa minimum obtainable p-value, pmin(S).
For some S , pmin(S) >α
m. Tarone refers to them as untestable hypotheses U .
Tarone’s strategy: Ignore untestable hypotheses U when counting the number of testsm for Bonferroni correction.
If the p-values of the test are conditioned on the total marginals (as in Fisher’s exacttest), this does not affect the Family-Wise Error Rate.
Difficulty: There is an interdependence between m and U .
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 54 / 76
Mining Significant Patterns
Tarone’s approach (1990)
Assume k is the number of tests that we correct for.
m(k) is the number of testable hypotheses at significance levelα
k.
Then the optimization problem is
min k
s. t. k ≥ m(k)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 55 / 76
Mining Significant Patterns
Tarone’s approach (1990)
Assume k is the number of tests that we correct for.
m(k) is the number of testable hypotheses at significance levelα
k.
procedure Tarone
k := 1;while k < m(k) do
k := k + 1;
return k
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 56 / 76
Mining Significant Patterns
Terada’s link to frequent itemset mining (Terada et al., PNAS 2013)
For 0 ≤ x ≤ n1, the minimum p-value pmin(S) decreases monotonically with x .
One can use frequent itemset mining to find all S that are testable at level α, withfrequency ψ−1(α).
They propose to use a decremental search strategy:
procedure Terada’s decremental search (LAMP)
k := ”very large”;while k > m(k) do
k := k − 1;m(k) := frequent itemset mining(D, ψ−1(
α
k));
return k + 1
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 57 / 76
Mining Significant Patterns
0 n1n2
n2 n
x
10−35.0
10−30.0
10−25.0
10−20.0
10−15.0
10−10.0
10−5.0
1
Min
imum
att
ain
able
p-v
alu
e o
f x
log(Ψ(x))
log(Ψ(x))
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 58 / 76
Example: PTC dataset (Helma et al., 2001)
Maximum size of subgraph nodes
Corr
ectio
n fa
ctor
5 10 15 limitless
1000
100000
10000000
BonferroniReduced
Maximum size of subgraph nodes
Num
ber o
f sig
ni�c
ant s
ubgr
aphs
5 10 15 limitless
0
1
2
3
4 BonferroniReduced
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 59 / 76
Significant Subgraph Mining (Sugyiama et al., SDM 2015)
Significant Subgraph Mining
Each object is a graph.
A pattern is a subgraph in these graphs.
Typical application in Drug Development: Find subgraphs that discriminate betweenmolecules with and without drug effect.
Counting all tests (= all patterns) requires exponential runtime in the number of nodes.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 60 / 76
Significant Subgraph Mining (Sugyiama et al., SDM 2015)
Incremental search with early stopping
procedure Incremental search with early stopping
θ := 0repeat
θ := θ + 1; FSθ := 0;repeat
find next frequent subgraph at frequency θFSθ := FSθ + 1
until (no more frequent subgraph found) or (FSθ >α
ψ(θ))
until FSθ ≤α
ψ(θ)return ψ(θ)
α
ψ(θ)is the maximum correction factor, such that subgraphs with frequency θ can be significant at level ψ(θ).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 61 / 76
Significant Subgraph Mining on PTC Dataset
Maximum size of subgraph nodes
Corr
ectio
n fa
ctor
5 10 15 limitless
1000
100000
10000000
BonferroniReduced
Maximum size of subgraph nodes
Num
ber o
f sig
ni�c
ant s
ubgr
aphs
5 10 15 limitless
0
1
2
3
4 BonferroniReduced
Maximum size of subgraph nodes
Runn
ing
time
5 10 15 limitless
0.1
10
1000
100000Brute-forceDecrementalIncremental
Dataset from Helma et al. (2001)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 62 / 76
Significant Subgraph Mining: Correction Factor
Corr
ectio
n fa
ctor
5 10 15 Limitless 5 10 15 Limitless 5 10 15
Corr
ectio
n fa
ctor
5 10 15 Limitless 5 10 15 Limitless 5 10 15
Max. size of subgraph nodes
Corr
ectio
n fa
ctor
5 10 15 LimitlessMax. size of subgraph nodes
5 10 15 Limitless
PTC(MR) MUTAG ENZYMES
D&D NCI1 NCI41
NCI167 NCI220
103
105
107
102
104
106
102
104
106
105
106
107
108
109
103
105
107
109
103
105
107
109
104105106107108
103
105
107
109
Max. size ofsubgraph nodes
Bonferroni
E�ectiveTestable
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 63 / 76
Significant Subgraph Mining: Runtime
Runn
ing
time
(s)
Runn
ing
time
(s)
Runn
ing
time
(s)
PTC(MR) MUTAG ENZYMES
D&D NCI1 NCI41
NCI167 NCI220
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
10–1
10
103
105
5 10 15 Limitless 5 10 15 Limitless 5 10 15
5 10 15 Limitless 5 10 15 Limitless 5 10 15
Max. size of subgraph nodes5 10 15 Limitless
Max. size of subgraph nodes5 10 15 Limitless
Max. size ofsubgraph nodes
Brute-forceOne-passDecremental
Bisection
E�ective
Incremental
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 64 / 76
Westfall-Young light (Llinares-Lopez et al., KDD 2015)
Dependence between hypotheses
As patterns are often in sub-/superpattern-relationships, they do not constituteindependent hypotheses.
Informally: The underlying number of hypotheses may be much lower than the rawcount.
Westfall-Young-Permutation tests (Westfall and Young, 1993), in which the class labelsare repeatedly permuted to approximate the null distribution, are one strategy to takethis dependence into account.
Computational problem: How to efficiently perform these thousands of permutations?
There is one existing approach, FastWY (Terada et al., ICBB 2013), which suffers fromeither memory or runtime problems.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 65 / 76
Westfall-Young light (Llinares-Lopez et al., KDD 2015)
The Algorithm
1 Input: Transactions D, class labels y, target FWER α, number of permutations jp.
2 Perform jp permutations of the class label y and store each permutation as cj .
3 Initialize θ := 1 and δ∗ := ψ(θ) and p(j)min := 1.
4 Perform a depth first search on the patterns:
Compute the p-value of pattern S across all permutations, update p(j)min if necessary.
Update δ∗ by α-quantile of p(j)min, and increase θ accordingly.
Process all children of S with frequency ≥ ψ−1(δ∗).
5 Output: Corrected significance threshold δ∗.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 66 / 76
Westfall-Young light (Llinares-Lopez et al., KDD 2015)
Speed-up tricks of Westfall-Young light
Follows incremental search strategy rather than decremental search strategy of FastWY
Performs only one iteration of frequent pattern mining
Does not store the occurrence list of patterns
Does not compute the upper 1− α quantile of minimum p-values exactly.
Reduces the number of cell counts that have to be evaluated
Shares the computation of p-values across permutations
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 67 / 76
Westfall-Young light
Runtime
PTC
(MR
)
PTC
(FR
)
PTC
(MM
)
PTC
(FM
)
MU
TA
G
EN
ZYM
ES
D&
D
NC
I1
NC
I41
NC
I109
NC
I167
NC
I220
10-1
101
103
105
107Execu
tion t
ime (
s)
FastWY Westfall-Young light
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 68 / 76
Westfall-Young light
Final frequency threshold (support)
PTC
(MR
)
PTC
(FR
)
PTC
(MM
)
PTC
(FM
)
MU
TA
G
EN
ZYM
ES
D&
D
NC
I1
NC
I41
NC
I109
NC
I167
NC
I220
0
5
10
15
20
25Fi
nal su
pport
FastWY Westfall-Young light
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 69 / 76
Westfall-Young light
Peak memory usage
PTC
(MR
)
PTC
(FR
)
PTC
(MM
)
PTC
(FM
)
MU
TA
G
EN
ZYM
ES
D&
D
NC
I1
NC
I41
NC
I109
NC
I167
NC
I220
101
102
103
104
105M
em
ory
usa
ge (
MB
)
FastWY Westfall-Young light Gaston
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 70 / 76
Westfall-Young light
Better control of the Family-wise error rate (Enzymes)
1000 4000 7000 10000Number of permutations
0.00
0.01
0.02
0.03
0.04
0.05
FW
ER
FWER FWERLAMP
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 71 / 76
Detecting “Genetic Heterogeneity” (Llinares et al., ISMB 2015b)
Cas
esC
ontro
ls
AssociatedInterval
0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0
0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0
0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0
L = Number of SNPs
n =
Num
ber o
f Sam
ples
g(s1[13, 4]) = 1g(s2[13, 4]) = 1g(s3[13, 4]) = 1g(s4[13, 4]) = 0g(s5[13, 4]) = 0g(s6[13, 4]) = 0
Random variable
Detecting intervals significantly associated with phenotypic variation
Find subsequences which tend to contain at least one minor allele in one of the twophenotypic groups
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 72 / 76
Education: Our Marie Curie Initial Training Network
Goal: Enable medical treatment tailored to patients’ molecular properties
Plan: Build a research community at the interface of Machine Learning and data-drivenMedicine
First step: Marie Curie Initial Training Network (ITN)
Topic: Machine Learning for Personalized Medicine (MLPM)Duration: 4 years, 2013-201613 early-stage researchers + 1 postdoc in 12 labs at 10 nodes in 6 countries3.75 million EUR funding for PhD students and training eventsResearch programmes:
Biomarker DiscoveryData IntegrationCausal Mechanisms of DiseaseGene-Environment Interactions
Follow us on mlpm.eu
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 73 / 76
Summary
Machine Learning in Bioinformatics
Classic Bioinformatics
Comparing graphs in Chemoinformatics
Current Bioinformatics
High-dimensional feature selection in Statistical Genetics
Future Bioinformatics
Significant pattern mining for biomarker discovery in Personalized Medicine
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 74 / 76
Thank You
Postdocs and PhD students:
Dean Bodenham
Lukas Folkman
Udo Gieraths
Thomas Gumbsch
Anja Gumpinger
Xiao He
Felipe Llinares Lopez
Laetitia Papaxanthos
Damian Roqueiro
Caroline Weis
Sponsors:
Krupp-Stiftung
Marie-Curie-FP 7
Starting Grant (SNSF’s ERC backupscheme)
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 75 / 76
Thank You
bsse.ethz.ch/mlcb
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 76 / 76
Machine Learning for Personalized Medicine
Karsten Borgwardt
ETH Zurich Fraunhofer-Institut Kaiserslautern, September 30, 2016
Department Biosystems
Main References I
C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, K. M. Borgwardt,Bioinformatics (Oxford, England) 29, i171 (2013).
P. Achlioptas, B. Scholkopf, K. Borgwardt, ACM SIGKDD Conference on KnowledgeDiscovery and Data Mining (KDD) (2011), pp. 726–734.
C. Becker, et al., Nature 480, 245 (2011).
J. Cao, et al., Nature Genetics 43, 956 (2011).
A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, K. M. Borgwardt, Advances inNeural Information Processing Systems 26: 27th Annual Conference on NeuralInformation Processing Systems 2013. (2013), pp. 216–224.
D. Grimm, et al., arXiv:1212.4788 (2012).
D. G. Grimm, et al., Human Mutation 36, 513 (2015).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 2 / 5
Main References II
T. Kam-Thong, et al., Eur J Hum Genet (2010).
T. Kam-Thong, B. Putz, N. Karbalai, B. Muller-Myhsok, K. Borgwardt,Bioinformatics (ISMB) 27, i214 (2011).
T. Kam-Thong, et al., Human Heredity 73, 220 (2012).
T. Karaletsos, O. Stegle, C. Dreyer, J. Winn, K. M. Borgwardt, Bioinformatics 28,1001 (2012).
L. Li, B. Rakitsch, K. Borgwardt, Bioinformatics (Oxford, England) 27, i342 (2011).PMID: 21685091.
F. Llinares-Lopez, M. Sugiyama, L. Papaxanthos, K. M. Borgwardt, Proceedings ofthe 21th ACM SIGKDD International Conference on Knowledge Discovery and DataMining, Sydney, NSW, Australia, August 10-13, 2015 , L. Cao, et al., eds. (ACM,2015), pp. 725–734.
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 3 / 5
Main References III
F. Llinares-Lopez, et al., Bioinformatics 31, 240 (2015).
N. Shervashidze, K. M. Borgwardt, NIPS , Y. Bengio, D. Schuurmans, J. Lafferty,C. K. I. Williams, A. Culotta, eds. (MIT Press, Cambridge, MA, 2009), pp.1660–1668.
M. Sugiyama, K. M. Borgwardt, Advances in Neural Information Processing Systems26: 27th Annual Conference on Neural Information Processing Systems 2013. (2013),pp. 467–475.
M. Sugiyama, C. Azencott, D. Grimm, Y. Kawahara, K. M. Borgwardt, Proceedingsof the 2014 SIAM International Conference on Data Mining, Philadelphia,Pennsylvania, USA, April 24-26, 2014 (2014), pp. 199–207.
M. Sugiyama, F. Llinares-Lopez, N. Kasenburg, K. M. Borgwardt, SIAM Data Mining(2015).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 4 / 5
Main References IV
R. E. Tarone, Biometrics 46, 515 (1990).
A. Terada, M. Okada-Hatakeyama, K. Tsuda, J. Sese, Proceedings of the NationalAcademy of Sciences 110, 12996 (2013).
A. Terada, K. Tsuda, J. Sese, IEEE International Conference on Bioinformatics andBiomedicine (2013), pp. 153–158.
P. H. Westfall, S. S. Young, Statistics in Medicine 13, 1084 (1993).
Department Biosystems Karsten Borgwardt ITWM Kaiserslautern September 30, 2016 5 / 5