Finding patterns, correlations, and descriptors in materials data using subgroup discovery and compressed sensing
Bryan R. Goldsmith, University of Michigan, Ann Arbor, Department of Chemical Engineering
Christopher J. Bartel and Charles Musgrave, CU Boulder, Department of Chemical and Biological Engineering
Chris Sutton, Runhai Ouyang, Luca M. Ghiringhelli, Matthias Scheffler, Fritz Haber Institute of the Max Planck Society, Theory Department
Mario Boley and Jilles Vreeken, Max Planck Institute for Informatics
Predicting advanced materials requires understanding the mechanisms underlying their function
[1] K. Kang et al. Science 311, 977 (2006)
Battery materials
Catalysts
[1]
Identifying physically meaningful descriptors can aid materials discovery
Screen new catalysts
Increase understanding
Descriptor = function(atomic or material features)
Descriptor → Property
Ma, Xianfeng and Hongliang Xin, PRL 118.3 (2017) 036101.
Nørskov, Jens K., et al., PNAS 108.3 (2011) 937-943.
Machine learning tools can accelerate the discovery of descriptors
This talk focuses on two data-analytics tools to find descriptors of materials
1. Compressed sensing to find low-dimensional descriptors
   - Perovskite oxides and halides
2. Subgroup discovery to find local patterns and their descriptions
   - Gold clusters in the gas phase (sizes 5-14 atoms)
   - Octet binary (AB) semiconductors
3. Future work: Compressed sensing and subgroup discovery for catalysis
Compressed sensing allows the construction of sparse models with high accuracy
Original image Sparse in the basis set
Recovered with 10% measurements
Emmanuel Candès, Terence Tao, and David Donoho
Part 1. Compressed sensing to find interpretable descriptors
$\min_{\beta} \|\beta\|_0$ subject to $y = D\beta$
$\beta$: coefficients
$D$: matrix of the materials' features
$y$: target property
$\ell_0$-norm: total number of nonzero coefficients
Ideally use $\ell_0$-norm minimization
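$\ell_0$-norm minimization is too expensive to perform for a large feature matrix $D$!
Instead, one often minimizes the $\ell_1$-norm (LASSO) as an approximation of the $\ell_0$-norm:
$\hat{\beta}_{\mathrm{LASSO}}(\lambda) = \arg\min_{\beta} \tfrac{1}{2}\|y - D\beta\|_2^2 + \lambda\|\beta\|_1$
$\tfrac{1}{2}\|y - D\beta\|_2^2$: squared-error term
$\ell_1$-norm $\|\beta\|_1$: sum of the absolute values of the coefficients
$\lambda$: regularization parameter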
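The LASSO objective above can be minimized in several ways; the following is a minimal sketch using cyclic coordinate descent with soft-thresholding, which is one common choice and not necessarily what was used for the results in this talk. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam (proximal step of the l1 penalty)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(D, y, lam, n_sweeps=200):
    """Minimize 0.5*||y - D beta||_2^2 + lam*||beta||_1 by cyclic coordinate descent."""
    n_features = D.shape[1]
    beta = np.zeros(n_features)
    col_sq = (D ** 2).sum(axis=0)  # squared norm of each feature column
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with the current contribution of feature j removed.
            r_j = y - D @ beta + D[:, j] * beta[j]
            beta[j] = soft_threshold(D[:, j] @ r_j, lam) / col_sq[j]
    return beta

# Synthetic, made-up data: y depends on only 2 of 10 features.
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[[2, 7]] = [3.0, -2.0]
y = D @ beta_true

beta_hat = lasso_coordinate_descent(D, y, lam=1.0)
support = np.flatnonzero(np.abs(beta_hat) > 1e-8)
print("selected features:", support, "coefficients:", beta_hat[support])
```

The $\ell_1$ penalty sets most coefficients exactly to zero, and the surviving coefficients are slightly shrunk relative to the true values; this shrinkage is one reason LASSO is often followed by an $\ell_0$ or least-squares refit on the selected features.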
Example of LASSO+$\ell_0$: find a descriptor that predicts the crystal-structure energy differences between the 82 octet AB compounds
LASSO+$\ell_0$: L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler, PRL 114, 105503 (2015)
Unfortunately, LASSO has stability issues for a huge feature space of correlated features
This issue has been solved recently using the Sure Independence Screening and Sparsifying Operator (SISSO) algorithm
R. Ouyang et al., “SISSO: a compressed-sensing method for systematically identifying efficient physical models of materials properties”, arXiv (2017)
Runhai Ouyang
Managing high dimensional and correlated feature spaces by combining screening and compressed sensing
SISSO overview
Step 1. Systematically construct a huge feature space
Primary features → feature transformation (operator set) → huge feature space (billions)
Operator set: $\hat{H} = \{+, -, \times, \div, \exp, \log, {}^{-1}, {}^{2}, {}^{3}, \sqrt{\ }, |-|\}$
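As a rough illustration of Step 1, the sketch below applies each operator once to a handful of primary features. The feature names and values are made up, and the code ignores details a real implementation would need, such as unit consistency, removal of redundant features, and domain checks when the expansion is applied repeatedly (for example, the logarithm of a negative difference).

```python
import itertools
import math

# Hypothetical primary features for one material (names and values are made up).
primary = {"rA": 1.44, "rB": 0.605, "rX": 1.40}

unary_ops = {
    "exp": math.exp,
    "log": math.log,
    "inv": lambda v: 1.0 / v,
    "sq": lambda v: v ** 2,
    "cube": lambda v: v ** 3,
    "sqrt": math.sqrt,
}
binary_ops = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "|-|": lambda a, b: abs(a - b),
}

def expand_once(features):
    """Apply every operator once to the current features and return the enlarged set."""
    new = dict(features)
    for name, value in features.items():
        for op, fn in unary_ops.items():
            new[f"{op}({name})"] = fn(value)
    # Ordered pairs, so symmetric operators produce duplicate values (not removed here).
    for (n1, v1), (n2, v2) in itertools.permutations(features.items(), 2):
        for op, fn in binary_ops.items():
            new[f"({n1}{op}{n2})"] = fn(v1, v2)
    return new

level1 = expand_once(primary)
print(len(primary), "primary features ->", len(level1), "features after one expansion")
```

Even this single pass multiplies the number of candidates many times over, which is why a few rounds of expansion on a realistic set of primary features reach the billions quoted above.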
Step 2. Select top-ranked features using sure independence screening (SIS) [1]
Sure independence screening: select the N largest components (in magnitude) of $D^{T}y$
$y$: target property
$D$: matrix of the features for each material
[1] J. Fan and J. Lv, J. R. Statist. Soc. B 70, 849 (2008)
Results in a subspace of N features that are most correlated with the property of interest
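A minimal sketch of the screening step, assuming the feature columns are standardized first so that the components of $D^{T}y$ behave like correlations. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def sure_independence_screening(D, target, n_keep):
    """Return indices of the n_keep columns of D with the largest |D^T target|."""
    # Standardize columns so the inner product acts like a correlation.
    D_std = (D - D.mean(axis=0)) / D.std(axis=0)
    scores = np.abs(D_std.T @ (target - target.mean()))
    return np.argsort(scores)[::-1][:n_keep]

# Synthetic, made-up data: 500 candidate features, only two of them matter.
rng = np.random.default_rng(1)
D = rng.normal(size=(100, 500))
y = 2.0 * D[:, 42] - 1.5 * D[:, 7]

top = sure_independence_screening(D, y, n_keep=10)
print("top-ranked feature indices:", top)
```

The cost is a single matrix-vector product, which is what makes screening feasible on feature spaces far too large for a direct sparse solver.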
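Step 3. Iteratively apply sure independence screening with a sparse approximation algorithm
Huge feature space $D$ (billions)
$\mathrm{SIS}_{y}(D) \rightarrow S_1 \rightarrow$ minimize $\ell_0$ $\rightarrow$ 1D descriptor
$\mathrm{SIS}_{R_1}(D) \rightarrow S_1 \cup S_2 \rightarrow$ minimize $\ell_0$ $\rightarrow$ 2D descriptor
…
$\mathrm{SIS}_{R_{n-1}}(D) \rightarrow S_n \cup \bigcup_{i=1}^{n-1} S_i \rightarrow$ minimize $\ell_0$ $\rightarrow$ $n$D descriptor
SIS = sure independence screening
$S_i$ = feature subspace
$y$ = target property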
Size labels in the diagram: $10^{12}$ and ~3000
$R_i$ = residual of the target property using the previous iteration's least-squares prediction
SISSO: R. Ouyang et al., arXiv (2017)
Domain overlap can also be used for classification problems
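The iteration in Step 3 can be sketched as follows. This is a simplified illustration under stated assumptions (no intercept, unnormalized $D^{T}R$ scores, exhaustive least-squares search as the $\ell_0$ step), not the published SISSO code; all function names are hypothetical.

```python
import itertools
import numpy as np

def sis(D, target, n_keep, exclude):
    """Indices of the n_keep not-yet-selected columns with the largest |D^T target|."""
    scores = np.abs(D.T @ target)
    if exclude:
        scores[list(exclude)] = -np.inf
    return [int(i) for i in np.argsort(scores)[::-1][:n_keep]]

def best_l0_model(D, y, candidates, dim):
    """l0 step: least squares on every dim-tuple of candidate columns, keep the best."""
    best = (np.inf, None, None)
    for combo in itertools.combinations(candidates, dim):
        X = D[:, list(combo)]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        err = float(np.sum((y - X @ coef) ** 2))
        if err < best[0]:
            best = (err, combo, coef)
    return best

def sisso_sketch(D, y, max_dim=2, n_keep=20):
    subspace, residual, models = [], y.copy(), []
    for dim in range(1, max_dim + 1):
        # Screen against the current residual and grow S_1 U ... U S_dim.
        subspace += sis(D, residual, n_keep, exclude=subspace)
        err, combo, coef = best_l0_model(D, y, subspace, dim)
        residual = y - D[:, list(combo)] @ coef  # R_dim for the next round
        models.append((combo, coef, err))
    return models

# Synthetic, made-up data: the target is an exact 2D combination of features 5 and 17.
rng = np.random.default_rng(2)
D = rng.normal(size=(60, 300))
y = 3.0 * D[:, 5] - 2.0 * D[:, 17]
models = sisso_sketch(D, y)
for combo, coef, err in models:
    print(len(combo), "D descriptor:", combo, "squared error:", round(err, 6))
```

Because the exhaustive search only runs over the screened subspace (tens of features here), the combinatorial $\ell_0$ step stays tractable even when the full feature space is enormous.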
Perovskites – promising functional materials
Perovskites are a class of ABX3 materials
A: typically group 1, group 2, or lanthanide
B: typically transition metal
X: typically chalcogen or halogen
Special Issue: Perovskites, Science 2017
SISSO applied to perovskites to find a descriptor for their stability
Goldschmidt’s tolerance factor (t) to predict stability
$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$
If cubic, $a = 2(r_B + r_X) = \sqrt{2}\,(r_A + r_X)$; $t = 1$
Viktor Goldschmidt (1926)
Can we find a better descriptor using SISSO?
Correa-Baena et al., Science 2017
Dataset of experimentally characterized ABX3
576 ABX3 with experimental XRD
313 perovskites and 263 nonperovskites
75 different cations (A, B)
5 different anions (X)
Experimental results compiled from:
H. Zhang, N. Li, K. Li, D. Xue, Acta Cryst. B 2007
C. Li, X. Lu, W. Ding, L. Feng, Y. Gao, Z. Guo, Acta Cryst. B 2008
W. Travis, E. Glover, H. Bronstein, D. Scanlon, R. Palgrave, Chem. Sci. 2016
t is often insufficient, especially for halides
$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$
Only 74% accuracy on experimental set
t is close to guessing for the heavier halides
New tolerance factor discovered with SISSO (compressed sensing)
$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln(r_A/r_B)}\right)$
92% accuracy
requires the same inputs as for the calculation of t
from ~3,000,000,000 potential descriptors
τ is general for oxides and halides
τ: roughly uniform performance across oxides and halides, 92% accuracy
Comparing the forms of t and τ
$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln(r_A/r_B)}\right)$
$t = \frac{r_A + r_X}{\sqrt{2}\,(r_B + r_X)}$
$0.825 < t < 1.059$ → perovskite (74% accuracy)
$\tau < 4.18$ → perovskite (92% accuracy)
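Both tolerance factors and their decision rules are simple enough to evaluate directly. The sketch below takes $n_A$ as the oxidation state of A, following the cited preprint, and uses illustrative ionic radii for SrTiO3 that are assumptions of this example rather than values taken from the talk.

```python
import math

def goldschmidt_t(r_a, r_b, r_x):
    """Goldschmidt tolerance factor t = (rA + rX) / (sqrt(2) * (rB + rX))."""
    return (r_a + r_x) / (math.sqrt(2.0) * (r_b + r_x))

def tau(r_a, r_b, r_x, n_a):
    """New tolerance factor tau = rX/rB - nA * (nA - (rA/rB) / ln(rA/rB))."""
    ratio = r_a / r_b
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))

def is_perovskite_by_t(t):
    return 0.825 < t < 1.059

def is_perovskite_by_tau(tau_value):
    return tau_value < 4.18

# Illustrative inputs for SrTiO3 (assumed radii in angstrom; Sr oxidation state 2).
r_a, r_b, r_x, n_a = 1.44, 0.605, 1.40, 2
t_val = goldschmidt_t(r_a, r_b, r_x)
tau_val = tau(r_a, r_b, r_x, n_a)
print(f"t = {t_val:.3f}, perovskite by t: {is_perovskite_by_t(t_val)}")
print(f"tau = {tau_val:.3f}, perovskite by tau: {is_perovskite_by_tau(tau_val)}")
```

Note that τ is only defined for $r_A > r_B$, since the logarithm in the denominator vanishes when the two radii are equal.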
Decision tree outputs only (-1, 1)
Mapping decision tree outputs to logistic regression [1] yields ℘(τ)
Monotonic perovskite probabilities: ℘(τ)
[1] J. Platt, Advances in Large Margin Classifiers (1999)
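The slides do not give the fitting details, so the following is only a sketch of the general idea behind Platt-style calibration: fit a sigmoid to labeled examples so that a hard yes/no rule becomes a smooth probability. It assumes the sigmoid is fitted directly on τ shifted by the 4.18 threshold, and the τ values and labels are made up for illustration.

```python
import math

def fit_sigmoid(scores, labels, lr=0.1, n_iter=5000):
    """Fit P(label=1 | s) = 1 / (1 + exp(a*s + b)) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            # For this parameterization, d(log loss)/d(a*s + b) = t - p.
            grad_a += (t - p) * s
            grad_b += (t - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Made-up tau values and labels (1 = perovskite, 0 = nonperovskite), with some overlap.
taus = [3.2, 3.6, 3.9, 4.0, 4.3, 4.1, 4.4, 4.8, 5.2, 3.95]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
a, b = fit_sigmoid([tv - 4.18 for tv in taus], labels)

def perovskite_probability(tau_value):
    return 1.0 / (1.0 + math.exp(a * (tau_value - 4.18) + b))

for tv in (3.5, 4.18, 5.0):
    print(f"tau = {tv:.2f} -> P(perovskite) = {perovskite_probability(tv):.2f}")
```

Because the fitted sigmoid is monotonic in τ, lower τ always maps to a higher perovskite probability, which matches the monotonic ℘(τ) described above.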
℘(τ) compares well with DFT-GGA ΔHdec
ΔHdec > 0 → stable in cubic structure
88% agreement
℘ correlates with ΔHdec
τ can be more powerful (CaZrO3, CaHfO3)
Decomposition enthalpies from:
X.-G. Zhao, D. Yang, Y. Sun, T. Li, L. Zhang, L. Yu, A. Zunger, JACS 2017
Q. Sun, W.-J. Yin, JACS 2017
Double perovskites for emerging solar absorbers
A2BB’X6
9 recently synthesized double perovskite halides (2016 and 2017), all predicted to be perovskite by τ
… and more every month
Lower triangle: Cs2BB’Cl6
Upper triangle: (CH3NH3)2BB’Br6
τ applied to 259,296 A2BB’X6 compounds
New tolerance factor for perovskite stability using SISSO
SISSO finds a new descriptor, τ, which improves upon Goldschmidt’s t
$\tau = \frac{r_X}{r_B} - n_A\left(n_A - \frac{r_A/r_B}{\ln(r_A/r_B)}\right)$
τ yields meaningful probabilities
Stability elucidated as ℘(A, B, X)
C. Bartel et al., under review (https://arxiv.org/abs/1801.07700)
github.com/CJBartel/perovskite-stability
SISSO should be applicable to catalysis problems
Typically one focuses on creating a global prediction model for some property of interest (e.g., SISSO, Kernel Ridge Regression, Neural Networks)
Underlying mechanisms can change across materials
Relations between subsets of data may be important
Figure axes: material property 2 (y2) vs. material property 1 (y1)
Part 2. Subgroup discovery to find local patterns and their descriptions
The periodic table has subgroups
Subgroup discovery: find meaningful local descriptors of a target property in materials-science data
Review: M. Atzmueller, WIREs Data Min. Knowl. Disc. 5 (2015)
B. R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, L. M. Ghiringhelli, New J. Phys. 19 (2017)
M. Boley, B. R. Goldsmith, L. M. Ghiringhelli, J. Vreeken, Data Min. Knowl. Disc. 1391 (2017)
Subgroup discovery focuses on local observations
Decision trees Subgroup discovery
Subgroup discovery: how it works
Descriptive features $a_1, \dots, a_m \in A$
e.g., d-band center, coordination number, atomic radii
Target features $y_1, \dots, y_n \in Y$
e.g., adsorbate binding energy
Basic selectors $c_1, \dots, c_k \in C$, each mapping a data point to {false, true}
e.g., Is the d-band center low? Is the atom coordination number < 3?
Find the selector $\sigma(\cdot) = c_1(\cdot) \wedge \dots \wedge c_l(\cdot)$ that maximizes the quality
$q(\sigma) = \left(\frac{|\mathrm{ext}(\sigma)|}{|P|}\right)^{\alpha} u(Y_\sigma)^{1-\alpha}$
$|\mathrm{ext}(\sigma)|/|P|$ is the coverage: the fraction of data points where $\sigma$ is true
$u(Y_\sigma)$ is the utility function (optimization criterion)
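The search described above can be sketched as an exhaustive scan over short conjunctions of basic selectors. This is only an illustration: practical subgroup discovery tools use smarter search strategies, the utility here is assumed to be the relative reduction in standard deviation, and the cluster sizes and gap values are made up.

```python
import itertools
import statistics

def variation_reduction(y_all, y_sub):
    """Utility u(Y') = (std(Y) - std(Y')) / std(Y)."""
    std_all = statistics.pstdev(y_all)
    return (std_all - statistics.pstdev(y_sub)) / std_all

def quality(selector, data, targets, alpha, utility):
    """q = coverage^alpha * u(Y_sigma)^(1 - alpha); zero if the utility is not positive."""
    covered = [t for row, t in zip(data, targets) if all(c(row) for c in selector)]
    if len(covered) < 2:
        return 0.0
    u = utility(targets, covered)
    if u <= 0:
        return 0.0
    return (len(covered) / len(data)) ** alpha * u ** (1 - alpha)

def best_subgroup(data, targets, basic_selectors, max_len=2, alpha=0.5,
                  utility=variation_reduction):
    """Try every conjunction of up to max_len basic selectors and keep the best one."""
    best = (0.0, ())
    for length in range(1, max_len + 1):
        for names in itertools.combinations(basic_selectors, length):
            sel = [basic_selectors[n] for n in names]
            q = quality(sel, data, targets, alpha, utility)
            if q > best[0]:
                best = (q, names)
    return best

# Made-up example: cluster size N as the only descriptive feature, gap (eV) as target.
data = [{"N": n} for n in range(5, 15)]
gaps = [0.4, 2.0, 0.5, 1.5, 0.4, 1.2, 0.5, 1.4, 0.45, 1.3]
basic_selectors = {
    "odd(N)": lambda row: row["N"] % 2 == 1,
    "even(N)": lambda row: row["N"] % 2 == 0,
    "N>=7": lambda row: row["N"] >= 7,
}
best_q, best_names = best_subgroup(data, gaps, basic_selectors)
print("best selector:", " AND ".join(best_names), "quality:", round(best_q, 3))
```

The exponent $\alpha$ trades coverage against utility: a larger $\alpha$ favors big subgroups, while a smaller $\alpha$ favors small subgroups with a very homogeneous target.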
Two tutorial applications of subgroup discovery presented
1. Gas-phase gold clusters (Au5-Au14): 24,400 configurations in total; display interesting catalytic properties
2. Classification and description of the 82 octet binary semiconductors: rocksalt vs. zincblende
24,400 gold cluster configurations (Au5-Au14) in the gas phase
Figure axes: HOMO-LUMO energy gap (eV) vs. (relative) total energy of gold cluster (eV)
Three main subgroups identified
Figure axes: probability vs. HOMO-LUMO energy gap (eV)
$\mathrm{even}(N) \wedge N \geq 7$
$N = 6$
$\mathrm{odd}(N)$
Rediscover simple insight about HOMO-LUMO gap
Choose target property: HOMO-LUMO energy gap
Choose variation reduction utility function: $u(Y') = (\mathrm{std}(Y) - \mathrm{std}(Y'))/\mathrm{std}(Y)$
Find descriptors that predict crystal structures for the 82 octet AB-type materials
Target property: sign(E_rocksalt − E_zincblende)
Input candidate descriptors into subgroup discovery from DFT calculations
• Radii of s, p, d orbitals of free atoms
• Electron affinity
• Ionization potential
… and others
B. R. Goldsmith et al., New J. Phys. 19 (2017)
M. Boley et al., Data Min. Knowl. Disc. 1391 (2017)
Rocksalt vs. zincblende
Subgroup discovery classifies 79 of the 82 compounds using a two-dimensional descriptor
Selectors found:
$r_p^A - r_p^B \geq 0.91$ Å and $r_s^A \geq 1.22$ Å
$r_p^A - r_p^B \leq 1.16$ Å and $r_s^A \leq 1.27$ Å
Figure axes: $r_s^A$ vs. $r_p^A - r_p^B$; regions labeled zincblende subgroup and rocksalt subgroup; labeled compounds: MgTe, AgI, CuF
Looking ahead: Can we use data analytics to extract catalyst descriptors?
Diagram labels: computational screening; first-principles microkinetic modeling; catalyst design; catalyst characterization; computational spectroscopy; first-principles thermochemistry & kinetic parameters; reaction mechanism; mechanistic understanding; active sites
Descriptor: data analytics (SISSO, subgroup discovery, …)
Machine learning for catalyst design and discovery
Heterogeneous catalyst space (support, composition, size) → (1) → feature space (atomic properties, electronic structure, physical properties) → (2) → figures of merit (activity, stability, selectivity)
Machine learning: e.g., use subgroup discovery and compressed sensing to relate features to figures of merit
Acknowledgements
Mario Boley, Runhai Ouyang, Christopher Sutton, Jilles Vreeken
Matthias Scheffler, Luca M. Ghiringhelli, Charles Musgrave
Christopher Bartel
Subgroup discovery and SISSO tutorials are online: https://www.nomad-coe.eu/