
A Data-driven Model of Nucleosynthesis with Chemical Tagging in a Lower-dimensional Latent Space

Andrew R. Casey1,2,3, John C. Lattanzio1, Aldeida Aleti2, David L. Dowe2, Joss Bland-Hawthorn4,5,6, Sven Buder3,7,8, Geraint F. Lewis4, Sarah L. Martell9, Thomas Nordlander5,7, Jeffrey D. Simpson9, Sanjib Sharma4, and Daniel B. Zucker10

1 School of Physics & Astronomy, Monash University, Wellington Road, Clayton 3800, Victoria, Australia; [email protected]
2 Faculty of Information Technology, Monash University, Wellington Road, Clayton 3800, Victoria, Australia
3 ARC Centre of Excellence for All Sky Astrophysics in 3 Dimensions (ASTRO 3D), Canberra, ACT 2611, Australia
4 Sydney Institute for Astronomy, School of Physics, A28, The University of Sydney, NSW 2006, Australia
5 Center of Excellence for Astrophysics in Three Dimensions (ASTRO-3D), Australia
6 Miller Professor, Miller Institute, UC Berkeley, Berkeley, CA 94720, USA
7 Research School of Astronomy and Astrophysics, Australian National University, Canberra, ACT 2611, Australia
8 Max Planck Institute for Astronomy (MPIA), Koenigstuhl 17, 69117 Heidelberg, Germany
9 School of Physics, University of New South Wales, Sydney, NSW 2052, Australia
10 Department of Physics and Astronomy, Macquarie University, Sydney, NSW 2109, Australia

Received 2019 July 22; revised 2019 September 30; accepted 2019 October 17; published 2019 December 11

Abstract

Chemical tagging seeks to identify unique star formation sites from present-day stellar abundances. Previous techniques have treated each abundance dimension as being statistically independent, despite theoretical expectations that many elements can be produced by more than one nucleosynthetic process. In this work, we introduce a data-driven model of nucleosynthesis, where a set of latent factors (e.g., nucleosynthetic yields) contribute to all stars with different scores, and clustering (e.g., chemical tagging) is modeled by a mixture of multivariate Gaussians in a lower-dimensional latent space. We use an exact method to simultaneously estimate the factor scores for each star, the partial assignment of each star to each cluster, and the latent factors common to all stars, even in the presence of missing data entries. We use an information-theoretic Bayesian principle to estimate the number of latent factors and clusters. Using the second Galah data release, we find that six latent factors are preferred to explain N=2566 stars with 17 chemical abundances. We identify the rapid- and slow neutron-capture processes, as well as latent factors consistent with Fe-peak and α-element production, and another where K and Zn dominate. When we consider N ∼ 160,000 stars with missing abundances, we find another seven factors, as well as 16 components in latent space. Despite these components showing separation in chemistry, which is explained through different yield contributions, none show significant structure in their positions or motions. We argue that more data and joint priors on cluster membership that are constrained by dynamical models are necessary to realize chemical tagging at a galactic scale. We release accompanying software that scales well with the available data, allowing for the model's parameters to be optimized in seconds given a fixed number of latent factors, components, and ∼10^7 abundance measurements.

Unified Astronomy Thesaurus concepts: Bayesian statistics (1900); Chemical abundances (224); Galaxy chemical evolution (580)

1. Introduction

The detailed chemical abundances that are observable in a star's photosphere provide a fossil record that carries with it information about where and when that star formed (Freeman & Bland-Hawthorn 2002). While the photospheric abundances remain largely unchanged throughout a star's lifetime (however, see Dotter et al. 2017; Ness et al. 2018), the dynamical dissipation timescale of open clusters in the Milky Way disk is of the order of a few gigayears (Portegies Zwart et al. 1998). This makes chemical tagging an attractive approach to identify star formation sites long after those stars are no longer gravitationally bound to each other.

Gravitationally bound star clusters have been useful laboratories for testing the limits and utility of chemical tagging. Although biases arise when only considering star clusters that are still gravitationally bound, the chemical homogeneity of open clusters provides an empirical measure of how similar stars would need to be before they could be tagged as belonging to the same star formation site (Bland-Hawthorn et al. 2010a, 2010b; Mitschang et al. 2014). However, there are analysis issues in

understanding precisely how these chemical abundances can be measured (Bovy 2016) and how chemically similar stars may not have formed together (doppelgängers; Ness et al. 2018). If open clusters were truly chemically homogeneous, then under idealistic assumptions our ability to chemically tag the Milky Way would depend primarily on the precision with which we can measure those chemical abundances in stars. Although data-driven approaches to modeling stellar spectra can improve upon this precision (Ness et al. 2015, 2018; Casey et al. 2016a, 2017; Ho et al. 2017a, 2017b; Leung & Bovy 2019; Ness 2018; Ting et al. 2019), more work is needed because astronomers have not yet developed unbiased estimators of chemical abundances that saturate the Cramér-Rao bound (Rao 1945; Cramér 1946). Chemical tagging experiments require a catalog of precise

chemical abundance measurements for a large number of stars, where those chemical abundances trace different nucleosynthetic pathways. This is the primary goal of the Galactic Archaeology with HERMES (Galah) survey (De Silva et al. 2015; Martell et al. 2017; Buder et al. 2018), which is a stellar spectroscopic survey that uses the High Efficiency and Resolution Multi-Element Spectrograph (HERMES; Sheinis et al. 2015) on the

The Astrophysical Journal, 887:73 (18pp), 2019 December 10. https://doi.org/10.3847/1538-4357/ab4fea
© 2019. The American Astronomical Society. All rights reserved.


3.9 m Anglo-Australian Telescope (AAT). Galah will observe up to $10^6$ stars in the Milky Way, and it will measure up to 30 chemical abundances for each star (Bland-Hawthorn & Sharma 2016). This includes light odd-Z elements (e.g., Na, K), elements produced through alpha-particle capture (e.g., Mg, Ca), and elements produced through the slow (e.g., Ba) and rapid (e.g., Eu) neutron-capture processes. No other current or planned spectroscopic survey provides an equivalent set of chemical abundances for a comparable number of stars.

Given these data and the most favorable assumptions in chemical tagging (i.e., that star clusters are truly chemically homogeneous, that we can measure those abundances with infinite precision, and that those abundances are differentiable between star clusters), chemical tagging becomes a clustering problem. All clustering techniques applied to chemical tagging thus far have assumed that the data dimensions are independent. In other words, adding a dimension of, say, [Ni/H] provides independent information that could not have been predicted from other elemental abundances. However, theory and observations agree that this cannot be true. Nucleosynthetic processes produce multiple elements in varying quantities, and the effective dimensionality of stellar abundance data sets has been shown to be lower than the actual number of abundance dimensions (Ting et al. 2012; Milosavljevic et al. 2018; Price-Jones & Bovy 2018). Any clustering approach that treats each new elemental abundance as an independent axis of information will therefore arrive at biased inferences about the star formation history of our Galaxy.

It is not trivial to confidently estimate the nucleosynthetic yields that have contributed to the chemical abundances of each star. There are qualitative statements that can be made for large numbers of stars, or particular types of stars, but quantifying the precise contribution of different processes to each star is an unsolved problem. For example, the so-called [α/Fe] "knee" in abundance ratios in the Milky Way can qualitatively be explained by core-collapse supernovae being the predominant nucleosynthetic process in the early Milky Way before Type Ia supernovae made a significant contribution, but efforts to date have not sought to explain the detailed abundances of stars as a contribution of yields from different systems (however, see West & Heger 2013). This is in part because of the challenging and degenerate nature of the problem as described, and is complicated by the differences in yield predictions that arise from the prescriptions used in different theoretical models.

New approaches to chemical tagging are clearly needed. Immediate advances would include methods that take the dependence among chemical elements into account within some generative model, or techniques that combine chemical abundances with dynamical constraints to place joint prior probabilities on whether any two stars could have formed from the same star cluster, given some model of the Milky Way.

In this work, we focus on the former. Here, we present a new approach to chemical tagging that allows us to identify the latent (unobserved) factors that contribute to the chemical abundances of all stars (e.g., nucleosynthetic yields), while simultaneously performing clustering in the latent space. Notwithstanding caveats that we will discuss in detail, this allows us to infer nucleosynthetic yields rather than strictly prescribe them from models. Moreover, the scale of the clustering problem reduces by a significant fraction because the clustering is performed in a lower-dimensional latent space

instead of the higher-dimensional data space. In Section 2, we describe the model and the methods we use to estimate the model parameters. Section 3 describes the experiments performed using generated and real data sets. We discuss the results of these experiments in Section 4, including the caveats with the model as described. We conclude in Section 5.

2. Methods

Latent factor analysis is a common statistical approach for describing correlated observations with a lower number of latent variables (e.g., Thompson 2004). Related techniques include principal component analysis (Hotelling 1933) and its variants (Tipping & Bishop 1999), singular value decomposition (Golub & Reinsch 1970), and other matrix factorization methods. While factor analysis on its own is a useful dimensionality reduction tool to identify latent factors that contribute to the chemical abundances of stars (e.g., Ting et al. 2012; Milosavljevic et al. 2018; Price-Jones & Bovy 2018), factor analysis cannot describe clustering in the data (or latent) space. Consequently, some works have performed clustering and then required different latent factors for each (totally assigned) component (e.g., Edwards & Dowe 1998). Similarly, clustering techniques applied to chemical abundances to date (e.g., Hogg et al. 2016) do not account for the lower effective dimensionality in elemental abundances. Here we expand on a variant of factor analysis that is known

elsewhere as a mixture of common factor analyzers (Baek et al. 2010), where the data are generated by a set of latent factors that are common to all data but the scoring (or extent) of those factors is different for each data point, and the data can be modeled as a mixture of multivariate normal distributions in the latent space (factor scores). In this work, the data $\mathbf{X}$ is a $D \times N$ matrix, where $N$ is the number of stars and $D$ is the number of chemical abundances measured for each star. We assume a generative model for the data

\[
\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{S} + \mathbf{e} \qquad (1)
\]

where $\mathbf{L}$ is a $D \times J$ matrix of factor loads that is common to all data points, $J$ is the number of latent factors, and the factor scores for the $n$th data point

\[
\mathbf{S}_n \sim \mathcal{N}\!\left(\boldsymbol{\xi}_k,\,\boldsymbol{\Omega}_k\right) \qquad (2)
\]

are drawn from11 the $k$th multivariate normal distribution, where each $k$th component has a $J$-dimensional mean and a dense $J \times J$ covariance matrix. The mean vector $\boldsymbol{\mu}$ describes the mean datum in each dimension. The factor scores for all data points $\mathbf{S}$ are then a $J \times N$ matrix, where each data point has a partial association to the components in latent space. We assume that $\mathbf{e} \sim \mathcal{N}(\mathbf{0}, \mathrm{diag}(\mathbf{D}))$ is independent of the latent space, and $\mathbf{D}$ is a vector of variances in each of the $D$ abundance dimensions. In this model, each data point can be represented as being drawn from a mixture of multivariate normal components, except the components are clustered in the latent space $\mathbf{S}$ and projected into the data space by the factor loads $\mathbf{L}$. In a sense, we use latent factor analysis as a form of dimensionality reduction while we simultaneously perform clustering in latent space (Figure 1).
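The generative model above can be made concrete with a short simulation. The following is a minimal, numpy-only sketch (the sizes, seed, and variable names are our own illustrative choices, not those of the accompanying software): it draws component memberships and factor scores in the latent space, then projects them into abundance space via Equation (1).

```python
import numpy as np

rng = np.random.default_rng(42)
N, D, J, K = 1000, 17, 3, 4   # stars, abundances, latent factors, components

# Common factor loads L (D x J): orthonormal columns via QR,
# standing in for mean nucleosynthetic yields.
L, _ = np.linalg.qr(rng.normal(size=(D, J)))

# Mixture in latent space: weights pi, component means xi, covariances Omega.
pi = rng.dirichlet(np.ones(K))
xi = rng.normal(scale=2.0, size=(J, K))
Omega = np.stack([0.1 * np.eye(J) for _ in range(K)])

# Draw component memberships, then the factor scores S (J x N).
z = rng.choice(K, size=N, p=pi)
S = np.stack([rng.multivariate_normal(xi[:, k], Omega[k]) for k in z]).T

# Project into data space: X = mu + L S + e  (Equation (1)).
mu = rng.normal(size=D)
D_var = np.full(D, 0.01)                       # specific variances
e = rng.normal(scale=np.sqrt(D_var)[:, None], size=(D, N))
X = mu[:, None] + L @ S + e
assert X.shape == (D, N)
```

Fitting the model then amounts to recovering $\mathbf{L}$, the mixture parameters, and $\mathbf{D}$ from $\mathbf{X}$ alone.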

11 To clarify nomenclature across disciplines, the notation $z \sim \mathcal{N}(0, 1)$ indicates that the variable $z$ is drawn from a standard normal distribution.


We assume that the latent space is of lower dimensionality than the data space (i.e., $J < D$). Within the context of stellar abundances, the factor loads $\mathbf{L}$ can be thought of as the mean yields of nucleosynthetic events (e.g., s-process production from AGB stars averaged over the initial mass function and star formation history), and the factor scores are analogous to the relative counts of those nucleosynthetic events. The clustering in factor scores achieves the same effect as a clustering procedure in data space, except that we simultaneously estimate the latent processes that are common to all stars (the so-called factor loads, analogous to nucleosynthetic yields). Within this framework, a rare nucleosynthetic event can still be described as a "factor load" $\mathbf{L}_j$, but its rarity would be represented by the associated factor scores being zero for most stars and thus having no contribution to the observed abundances. In practice, the factor loads can only be identified up to orthogonality and cannot be expressly interpreted as nucleosynthetic yields because they have limited physical meaning (we discuss this further in Section 4), but this description of typical yields and relative event rates should help build intuition for the model parameters and provide context within the astrophysical problem where it is being applied.

Including latent factors in the model description allows us to account for processes that affect multiple elemental abundances. In this way, we are able to account for the fact that the data dimensions are not independent of each other. Another benefit is the scaling with computational cost. If we considered data sets of order $10^{7.5}$ entries (e.g., 30 chemical abundances for $10^6$ stars) purely as a clustering problem, then even the most efficient clustering algorithms would incur a significant cumulative computational overhead by searching the parameter space for the number of clusters and the optimal model parameters given that number of components. However, because the mixture of factor analyzers approach assumes that there is a lower-dimensional latent space in which the data are clustered and that clustering is projected into real space by common factor loads, the dimensionality of the clustering problem is reduced from $N \times D$ to $N \times J$. This reduces

computational cost through faster execution of each optimization step, and on average fewer optimization steps are needed to reach a specified convergence threshold. From a statistical standpoint, the primary advantage to using

a mixture of factor analyzers is that we can simultaneously estimate latent factors (e.g., infer nucleosynthetic yields) and perform clustering (e.g., chemical tagging) within a statistically consistent framework. In particular, we have a generative data-driven model that can quantitatively describe nucleosynthetic yields; the factor scores can explain the variance in turbulence and gas mixing, or star formation efficiency; and the parameters of this model can be simultaneously estimated in a self-consistent way with a single scalar-justified objective function. Without loss of generality, the density of the mean-subtracted

data $\mathbf{X} - \boldsymbol{\mu}$ (which we hereafter will refer to simply as $\mathbf{Y}$) can be described as

\[
f\!\left(\mathbf{Y}_n;\,\boldsymbol{\Psi}\right) = \sum_{k=1}^{K} \pi_k\,\phi\!\left(\mathbf{Y}_n;\;\mathbf{L}\boldsymbol{\xi}_k,\;\mathbf{L}\boldsymbol{\Omega}_k\mathbf{L}^\top + \mathrm{diag}(\mathbf{D})\right) \qquad (3)
\]

given $J$ common factor loadings and $K$ components clustered in the latent (factor score) space. Here, the parameter vector $\boldsymbol{\Psi}$ includes $\{\boldsymbol{\pi}, \mathbf{L}, \boldsymbol{\xi}, \boldsymbol{\Omega}, \mathbf{D}\}$, $\phi(\mathbf{Y}; \boldsymbol{\theta})$ describes the density of a multivariate Gaussian distribution, and $\pi_k$ describes the relative weighting of the $k$th component in latent space, with $\sum_k \pi_k = 1$. The log likelihood is then given by

\[
\log\mathcal{L}\!\left(\boldsymbol{\Psi}\,|\,\mathbf{Y}\right) = \sum_{n=1}^{N} \log f\!\left(\mathbf{Y}_n;\,\boldsymbol{\Psi}\right) \qquad (4)
\]

The model as described is indeterminate in that there is no unique solution for the factor loads $\mathbf{L}$ and scores $\mathbf{S}$. These quantities can only be determined up to orthogonality in $\mathbf{L}$. However, as we will describe in Section 2.2, with suitable priors on $\boldsymbol{\Psi}$ one can efficiently estimate the model parameters using the expectation-maximization algorithm (Dempster et al. 1977).

Figure 1. Schematic that visualizes the model components. The data are shown on the left-hand side in grayscale, and magnified on the right-hand side, where each star (row) is colored by its identified component in latent space. For each chemical abundance (column) there is a mean value and a variance that is independent of the latent space. The latent factors $\mathbf{L}$ are analogous to nucleosynthetic yields and are common to all stars. The factor scores $\mathbf{S}$ have an entry for each yield, for each star (row). The latent scores are modeled by a mixture of multivariate normal distributions of $K$ components, which are colored accordingly. The matrices of factor scores $\mathbf{S}$ and factor loads $\mathbf{L}$ are visualized at the bottom. For clarity, here we show transposed matrices (e.g., see Equation (1)).

3

The Astrophysical Journal, 887:73 (18pp), 2019 December 10 Casey et al.

Page 4: A Data-driven Model of Nucleosynthesis with Chemical ... · A Data-driven Model of Nucleosynthesis with Chemical Tagging in a Lower-dimensional Latent Space Andrew R. Casey 1,2,3

2.1. Initialization

Here we describe how the model parameters are initialized.12

To initialize the factor loads $\mathbf{L}$, we start by randomly drawing a $D \times D$ matrix from a Haar distribution (Haar 1933), which is uniform on the special orthogonal group $SO(n)$ and therefore guaranteed to return an orthogonal matrix with a determinant of unity (Stewart 1980). We take the left-most $D \times J$ region13 of this matrix as our initial guess of $\mathbf{L}$, which provides a set of mutually orthogonal vectors.

We then initially assign each data point as belonging to one of the $K$ components using the k-means++ algorithm (Arthur & Vassilvitskii 2007) in the latent space. Given the initial factor loads and assignments, we then estimate the relative weights $\boldsymbol{\pi}$, the mean factor scores of each component $\boldsymbol{\xi}$, and the covariance matrix of factor scores of each component $\boldsymbol{\Omega}$. Finally, we initialize the specific variance $\mathbf{D}$ in each dimension as the variance in each data dimension. Other initialization methods for the latent factors include singular value decomposition (Golub & Reinsch 1970) or generating random noise with orthogonal constraints. Random assignment is an alternative method that is available for initializing assignments.

Throughout this work, we repeat this initialization procedure 25 times for every trial of $J$ and $K$ for a given data set. We then run expectation-maximization (Section 2.2) from each initialization until the log likelihood improves by less than $10^{-5}$ per step, and we adopt the model with the highest log likelihood as the preferred model given that trial of $J$, $K$, and the data. Although this optimization procedure is not convex, in practice it is normally sufficient to initialize from many points to avoid sub-optimal local maxima.
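The initialization steps described above can be sketched with numpy alone. This is our own minimal rendition (the function name, the sign-corrected QR draw of a Haar-distributed orthogonal matrix, the crude projection $\mathbf{L}^\top\mathbf{Y}$ for initial factor scores, and the simplified k-means++-style seeding are all illustrative assumptions; the accompanying software may differ in detail):

```python
import numpy as np

def initialize(Y, J, K, rng):
    """One random initialization of the model parameters.
    Y is the (D x N) mean-subtracted data matrix."""
    D, N = Y.shape

    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix,
    # sign-corrected so the distribution is uniform on the orthogonal group.
    # Keep the left-most J columns as the initial factor loads.
    A = rng.normal(size=(D, D))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))
    L = Q[:, :J]

    S = L.T @ Y                               # crude initial factor scores

    # k-means++-style seeding in the latent space, then hard assignment.
    centers = [S[:, rng.integers(N)]]
    for _ in range(K - 1):
        d2 = np.min([((S - c[:, None]) ** 2).sum(0) for c in centers], axis=0)
        centers.append(S[:, rng.choice(N, p=d2 / d2.sum())])
    centers = np.array(centers)               # (K x J)
    assign = np.argmin(((S.T[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)

    # Initial weights, component means/covariances, and specific variances.
    # (Assumes every component receives at least two points.)
    pi = np.bincount(assign, minlength=K) / N
    xi = np.stack([S[:, assign == k].mean(1) for k in range(K)], axis=1)
    Omega = np.stack([np.cov(S[:, assign == k]) for k in range(K)])
    D_var = Y.var(axis=1)
    return L, pi, xi, Omega, D_var
```

In practice one would call this 25 times with different random draws and keep the expectation-maximization run that reaches the highest log likelihood.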

2.2. Expectation-maximization

We use the expectation-maximization algorithm to estimate the model parameters (Dempster et al. 1977). With each expectation step, we evaluate the log likelihood given the model parameters14 $\boldsymbol{\Psi}$, the message length, and the $N \times K$ responsibility matrix $\boldsymbol{\tau}$, whose entries are the posterior probability that the $n$th data point is associated with the $k$th component, given the data $\mathbf{Y}$ and the current estimate of the parameter vector $\boldsymbol{\Psi}$:

\[
\tau_{nk} = \frac{\pi_k\,\phi\!\left(\mathbf{Y}_n;\;\mathbf{L}\boldsymbol{\xi}_k,\;\mathbf{L}\boldsymbol{\Omega}_k\mathbf{L}^\top + \mathrm{diag}(\mathbf{D})\right)}{\sum_{g=1}^{K}\pi_g\,\phi\!\left(\mathbf{Y}_n;\;\mathbf{L}\boldsymbol{\xi}_g,\;\mathbf{L}\boldsymbol{\Omega}_g\mathbf{L}^\top + \mathrm{diag}(\mathbf{D})\right)} \qquad (5)
\]
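Equation (5) is most stably evaluated in the log domain. A sketch with numpy and scipy (function and variable names are our own, not those of the accompanying software):

```python
import numpy as np
from scipy.stats import multivariate_normal

def expectation(Y, L, pi, xi, Omega, D_var):
    """E-step: responsibility matrix tau (N x K) per Equation (5),
    plus the log likelihood of Equation (4).  Y is (D x N)."""
    D, N = Y.shape
    K = pi.size
    log_prob = np.empty((N, K))
    for k in range(K):
        mean = L @ xi[:, k]                            # D-vector
        cov = L @ Omega[k] @ L.T + np.diag(D_var)      # D x D
        log_prob[:, k] = np.log(pi[k]) + \
            multivariate_normal(mean, cov).logpdf(Y.T)
    # Normalize in the log domain for numerical stability.
    log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
    tau = np.exp(log_prob - log_norm)
    log_likelihood = log_norm.sum()
    return tau, log_likelihood
```

Each row of the returned responsibility matrix sums to one, giving the partial assignment of each star to each component.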

In the maximization step, we update our estimates of the parameters $\boldsymbol{\Psi}$, conditioned on the data $\mathbf{Y}$ and the responsibility matrix $\boldsymbol{\tau}$. The updated parameter estimates are found by setting the first derivative of the log likelihood (Equation (4)) to zero and solving for the parameter values.15 This guarantees that every updated estimate of the model parameters will increase the log likelihood. Although there are no guarantees against converging on local maxima, in practice it is sufficient to run expectation-maximization from multiple initializations (as we do) to ensure that the global maximum is reached. At the maximization step, we first update our estimate of the relative weights $\boldsymbol{\pi}^{(t+1)}$ given the responsibility matrix $\boldsymbol{\tau}$:

\[
\pi_k^{(t+1)} = \frac{1}{N}\sum_{n=1}^{N}\tau_{nk} \qquad (6)
\]

where the $(t)$ superscript refers to the current parameter estimates and $(t+1)$ refers to the updated estimate for the next iteration. The updated estimates of the mean factor scores $\boldsymbol{\xi}_k^{(t+1)}$

for each component are then given by

\[
\boldsymbol{\xi}_k^{(t+1)} = \boldsymbol{\xi}_k^{(t)} + \frac{\mathbf{G}_k^{(t)}\left(\mathbf{Y} - \mathbf{L}\boldsymbol{\xi}_k^{(t)}\right)\boldsymbol{\tau}_k}{N\,\pi_k^{(t+1)}} \qquad (7)
\]

where:

\[
\mathbf{W}_k^{(t)} = \left(\boldsymbol{\Omega}_k^{(t)}\right)^{-1} \qquad (8)
\]
\[
\mathbf{V}^{(t)} = \mathrm{diag}\!\left(\mathbf{D}^{(t)}\right)^{-1} \qquad (9)
\]
\[
\mathbf{C}_k^{(t)} = \left(\mathbf{W}_k^{(t)} + \mathbf{L}^\top\mathbf{V}\mathbf{L}\right)^{-1} \qquad (10)
\]
\[
\mathbf{G}_k^{(t)} = \boldsymbol{\Omega}_k^{(t)}\,\mathbf{L}^\top\left[\mathbf{V} - \mathbf{V}\mathbf{L}\,\mathbf{C}_k^{(t)}\left(\mathbf{V}\mathbf{L}\right)^\top\right] \qquad (11)
\]

The covariance matrices of the components of factor scores $\boldsymbol{\Omega}^{(t+1)}$ are updated next,

\[
\boldsymbol{\Omega}_k^{(t+1)} = \left(\mathbf{I} - \mathbf{G}_k^{(t)}\mathbf{L}\right)\boldsymbol{\Omega}_k^{(t)} + \frac{\mathbf{G}_k^{(t)}\,\mathbf{Z}_k^{(t)}\,\mathrm{diag}\!\left(\boldsymbol{\tau}_k\right)\mathbf{Z}_k^{(t)\top}\,\mathbf{G}_k^{(t)\top}}{N\,\pi_k^{(t+1)}} \qquad (12)
\]

where

\[
\mathbf{Z}_k^{(t)} = \mathbf{Y} - \mathbf{L}\boldsymbol{\xi}_k^{(t)} \qquad (13)
\]

After some linear algebra, updated estimates of the common factor loads $\mathbf{L}^{(t+1)}$ can be found from

\[
\mathbf{L}^{(t+1)} = \mathbf{L}_a\,\mathbf{L}_b^{-1} \qquad (14)
\]

where:

\[
\mathbf{L}_a = \sum_{k=1}^{K}\left[\mathbf{Y}\,\boldsymbol{\tau}_k\,\boldsymbol{\xi}_k^{(t)\top} + \mathbf{Y}\,\mathrm{diag}\!\left(\boldsymbol{\tau}_k\right)\mathbf{Z}_k^{(t)\top}\,\mathbf{G}_k^{(t)\top}\right] \qquad (15)
\]

\[
\mathbf{L}_b = \sum_{k=1}^{K} N\,\pi_k^{(t+1)}\left[\boldsymbol{\Omega}_k^{(t+1)} + \boldsymbol{\xi}_k^{(t+1)}\,\boldsymbol{\xi}_k^{(t+1)\top}\right] \qquad (16)
\]

Finally, the updated estimates of the specific variances $\mathbf{D}^{(t+1)}$ are given by

\[
\mathbf{D}^{(t+1)} = \frac{1}{N}\left[\sum_{k=1}^{K}\left(\mathbf{Y}\odot\mathbf{Y}\right)\boldsymbol{\tau}_k - \mathrm{diag}\!\left(\mathbf{L}^{(t+1)}\,\mathbf{L}_b\,\mathbf{L}^{(t+1)\top}\right)\right] \qquad (17)
\]

where $\odot$ denotes the entry-wise product. Throughout this work, we assume that the data are noiseless and we do not add any observed errors to the constructed covariance matrices.
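As an illustration of the maximization step, the weight and mean-score updates (Equations (6)-(11) and (13)) might be sketched as below. This is our own partial rendition for building intuition, not the accompanying software, and the remaining updates (Equations (12) and (14)-(17)) follow the same pattern:

```python
import numpy as np

def maximization_step(Y, tau, L, xi, Omega, D_var):
    """Sketch of the pi and xi updates.  Y is the (D x N) mean-subtracted
    data; tau is the (N x K) responsibility matrix; xi is (J x K);
    Omega is (K x J x J); D_var holds the D specific variances."""
    D, N = Y.shape
    K = tau.shape[1]
    V = np.diag(1.0 / D_var)                       # Eq. (9): diag(D)^(-1)
    pi_new = tau.sum(axis=0) / N                   # Eq. (6)

    xi_new = np.empty_like(xi)
    for k in range(K):
        W_k = np.linalg.inv(Omega[k])              # Eq. (8)
        C_k = np.linalg.inv(W_k + L.T @ V @ L)     # Eq. (10)
        # Eq. (11): Woodbury form of Omega_k L^T (L Omega_k L^T + diag(D))^(-1)
        G_k = Omega[k] @ L.T @ (V - V @ L @ C_k @ (V @ L).T)
        Z_k = Y - (L @ xi[:, k])[:, None]          # Eq. (13)
        # Eq. (7)
        xi_new[:, k] = xi[:, k] + (G_k @ Z_k @ tau[:, k]) / (N * pi_new[k])
    return pi_new, xi_new
```

Note that, by the Woodbury identity, $\mathbf{G}_k$ equals $\boldsymbol{\Omega}_k\mathbf{L}^\top(\mathbf{L}\boldsymbol{\Omega}_k\mathbf{L}^\top + \mathrm{diag}(\mathbf{D}))^{-1}$, so the maximization step never inverts a dense $D \times D$ matrix; only $J \times J$ inverses and a diagonal inverse are required, which is where much of the computational saving comes from.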

2.3. Missing Data

The expectation-maximization procedure as described requires that there be no missing data entries to update our estimates of the responsibility matrix $\boldsymbol{\tau}$ and the model parameters $\boldsymbol{\Psi}$. In practice, however, there will often be abundance measurements that are missing for some subset of stars. There are many potential reasons for this, including astrophysical explanations (e.g., an absorption line was not present above the noise),

12 This describes the default initialization approach. Other approaches are available in the accompanying software.
13 The region choice is arbitrary. All that is required is that the randomly generated matrix have mutually orthogonal vectors.
14 When evaluating the log likelihood, we use the precision (sparse inverse) matrix of the Cholesky decomposition of the covariance matrix for computational efficiency and stability.
15 Strictly this introduces a statistical inconsistency in that we should update our parameter estimates by setting the first derivative of our information-theoretic objective function (Equation (35)) to zero instead of the log likelihood, but this inconsistency only becomes serious with small N (e.g., ≈30), precisely the opposite situation of chemical tagging!


observational limitations (e.g., the signal-to-noise ratio was too low, or contamination by a cosmic ray), or various other reasons that cannot be inferred from the available information.

In this work, we will assume that any missing data measurements are missing at random. The missing data points can then be treated as unknown parameters that must be solved for (and updated) at each iteration. Initially we impute zeros for missing data entries in $\mathbf{Y}$, and at each iteration we update these imputed values with our estimate of what the missing data values are given the current model parameters. This ensures that the log likelihood increases with each iteration. Similarly, with each update we inflate our estimates of the specific variances based on the fraction of missing data points in each dimension

\[
\mathbf{D}_d^{(t+1)} \leftarrow \mathbf{D}_d^{(t+1)}\left(\frac{N}{N - M_d}\right) \qquad (18)
\]

where $M_d$ is the number of missing data entries in the $d$th dimension. In Section 3.2, we show with a toy model that the latent factor loads and scores can be reliably estimated even in the presence of high fractions of missing data (e.g., 40%), conditioned on our assumption that the data are missing at random.
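The per-iteration imputation and the variance inflation of Equation (18) can be sketched as follows (the function name, the boolean mask handling, and the use of the current factor-score estimates for prediction are our own illustrative choices):

```python
import numpy as np

def update_missing(Y, is_missing, L, S_hat, D_var):
    """Sketch: re-impute missing entries of the (D x N) mean-subtracted data
    from the current model, then inflate the specific variances per
    Equation (18).  is_missing is a (D x N) boolean mask; S_hat is the
    current (J x N) estimate of the factor scores."""
    Y_pred = L @ S_hat                       # model prediction for every entry
    Y = np.where(is_missing, Y_pred, Y)      # overwrite only the missing cells

    N = Y.shape[1]
    M = is_missing.sum(axis=1)               # M_d: missing entries per dimension
    D_inflated = D_var * (N / (N - M))       # Eq. (18)
    return Y, D_inflated
```

Because the imputed values are replaced by the model's own predictions at every iteration, each expectation-maximization step still cannot decrease the log likelihood.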

2.4. Model Selection

The expectation-maximization algorithm as described requires a specified number of latent factors $J$ and components $K$. In the next section, we describe a toy model using generated data where we will assume that the true number of latent factors and components is not known. We require some heuristic to decide how many latent factors and components are preferred given some data. An increasing number of factors and components will undoubtedly increase the log likelihood of the model given the data, but the log likelihood does not account for the increased model complexity that is afforded by those additional latent factors and components.

One criterion commonly employed for evaluating a class of models is the Bayesian Information Criterion (BIC; Schwarz 1978)

$$\mathrm{BIC} = Q\log N - 2\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y}) \tag{19}$$

where Q is the number of parameters in this model

$$Q = \frac{J}{2}\left[2(D - J) + K(J + 3)\right] + D + K - 1 \tag{20}$$

which includes the $K-1$ weights (as $\sum_{k=1}^{K}\pi_k = 1$), $D$ specific variances, the $DJ - J^2$ free parameters needed to uniquely define the mutually orthogonal factor loads matrix $\mathbf{L}$ (Baek et al. 2010), $K \times J$ parameters for the mean scores $\boldsymbol{\xi}$, and $KJ(J+1)/2$ parameters to encode the $K$ full $J \times J$ covariance matrices $\boldsymbol{\Omega}$.

While the BIC does include a penalization term for the number of parameters (which scales with $\log N$), it does not account for the increased flexibility that is afforded by the addition of those parameters. For example, adding one parameter to a model will increase the BIC by at most $\log N$, but there are different ways for a single parameter to be introduced. In a fictitious model $y = f(x)$, a parameter $b$ could be added as a scalar multiple of $x$, or it could be introduced as $x^b$. Despite the difference in model complexity, the same

penalization occurs in the BIC. Even if the log likelihood were only to improve marginally in both cases, the difference in model complexity is not captured by the BIC. In other words, there are situations where we are more interested in balancing the model complexity (or the expected Fisher information and similar properties) with the goodness of fit, instead of penalizing the number of parameters.

For these reasons, we use the Minimum Message Length (MML; Wallace 2005) principle as a criterion for model selection and evaluation. The classically described principle of MML is that the best explanation of the data given a model is the one that leads to the shortest so-called two-part message (Wallace 2005), where a message takes into account both the complexity of the model and its explanatory power. The complexity of the model is described through the first part of the message, and the second part of the message describes its explanatory power. The length of each message part is quantified (or estimated) using information theory, allowing for a fair evaluation between different models of varying complexity or explanatory power. MML has been shown to perform well on a variety of empirical analyses (see, e.g., Wallace & Dowe 1994; Edwards & Dowe 1998; Viswanathan et al. 1999; Wallace & Dowe 2000; Fitzgibbon et al. 2004; Wallace 2005; Dowe et al. 2007; Dowe 2008, 2011). Arguments for the statistical consistency of MML (i.e., as the number of data points increases, the distributions of the estimates become increasingly concentrated near the true value) are given in Dowe & Wallace (1997) and Dowe (2011). The MML principle requires that we explicitly specify our prior beliefs on the model parameters, providing a Bayesian optimization approach that can be applied across entire classes of models.

The message must encode two parts: the model, and the data

given the model. The encoding of the message is based on Shannon's information theory (Shannon 1948). The information gained from an event $e$ occurring, where $p(e)$ is the probability of that event, is $I(e) = -\log_2 p(e)$. The information content is largest for improbable outcomes, and smallest for outcomes that we are almost certain about. In other words, an outcome that has a probability close to unity has nearly zero information content because almost nothing new is learned from it, whereas rarer events convey a much higher information content.

In practice, calculating the message length can be a non-trivial task, especially for models that are reasonably complex. This can make the strict MML principle intractable in many cases and necessitates approximations to the message length (however, see Wallace & Freeman 1987; Wallace & Dowe 1999; Wallace 2005). Using a Taylor expansion, a generalized scheme can be derived to estimate the parameter vector $\boldsymbol{\Psi}$ that minimizes the message length $I(\boldsymbol{\Psi}, \boldsymbol{Y})$ (Wallace & Freeman 1987),

$$I(\boldsymbol{\Psi}, \boldsymbol{Y}) = -\log p(\boldsymbol{\Psi}) + \frac{1}{2}\log|\mathcal{F}(\boldsymbol{\Psi})| - \log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y}) + \frac{Q}{2}\log\kappa_Q + \frac{Q}{2} \tag{21}$$

where $\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y})$ is the familiar log likelihood, $p(\boldsymbol{\Psi})$ is the joint prior density on $\boldsymbol{\Psi}$, and $\mathcal{F}(\boldsymbol{\Psi})$ is the matrix whose entries are the expected second-order partial derivatives of the log likelihood,


commonly referred to as the expected Fisher information matrix,

$$\mathcal{F}(\boldsymbol{\Psi}) = -E\left[\frac{\partial^2}{\partial\boldsymbol{\Psi}^2}\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y})\right] \tag{22}$$

and as before $Q$ is the number of model parameters. Continuous parameters can only be stated to finite precision, which leads to the $\frac{Q}{2}\log\kappa_Q$ term that gives a measure of the volume of the region of uncertainty in which the parameters $\boldsymbol{\Psi}$ are centered. The $\log\kappa_Q$ term can be reasonably approximated by

$$\log\kappa_Q \approx -\log 2\pi + \frac{1}{Q}\left(\log Q\pi - \gamma\right) - 1 \tag{23}$$

where $\gamma$ is Euler's constant.

Like the BIC, the message length is penalized by the number of model parameters through the $\log\kappa_Q$ term. However, the model complexity is also described through the priors and the Fisher information matrix, which describes the curvature of the log likelihood with respect to the model parameters. For these reasons, MML provides a more accurate description of the model complexity (or flexibility) because it naturally includes the curvature of the log likelihood with respect to the model parameters rather than only penalizing models based on the number of parameters.
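As a sketch, the approximation in Equation (23) is cheap to evaluate. Note that the formula below follows our reconstruction of that equation (the extracted text is ambiguous), so it should be checked against Wallace (2005) before reuse:

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler's constant, gamma

def log_kappa(Q):
    """Approximate log of the Q-dimensional quantization constant kappa_Q.

    Implements log kappa_Q ~= -log(2 pi) + (log(Q pi) - gamma)/Q - 1,
    our reading of Equation (23). For large Q this tends to
    log(1 / (2 pi e)), the usual asymptotic value.
    """
    return -np.log(2 * np.pi) + (np.log(Q * np.pi) - EULER_GAMMA) / Q - 1
```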

We will describe the contributions to the message length in parts. We assume the priors on the number of latent factors $J$ and the number of components $K$ to be $p(J) \propto 2^{-J}$ and $p(K) \propto 2^{-K}$, respectively, such that fewer numbers are preferred. The optimal lossless message to encode each is (Section 6.8.2; Knorr-Held 2000),

$$I(J) = -\log p(J) = J\log 2 + \mathrm{constant} \tag{24}$$

$$I(K) = -\log p(K) = K\log 2 + \mathrm{constant.} \tag{25}$$

Only $K-1$ of the relative weights $\boldsymbol{\pi}$ need encoding because $\sum_{k=1}^{K}\pi_k = 1$. We assume a uniform prior on individual weights,

$$p(\boldsymbol{\pi}) = (K-1)! \tag{26}$$

and the Fisher information is

$$|\mathcal{F}(\boldsymbol{\pi})| = N^{K-1}\prod_{k=1}^{K}\pi_k^{-1} \tag{27}$$

which gives the message length of the relative weights $I(\boldsymbol{\pi})$ to be

$$I(\boldsymbol{\pi}) = -\log p(\boldsymbol{\pi}) + \frac{1}{2}\log|\mathcal{F}(\boldsymbol{\pi})| = \frac{K-1}{2}\log N - \frac{1}{2}\sum_{k=1}^{K}\log\pi_k - \log\Gamma(K). \tag{28}$$
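Equation (28) reduces to a few terms that are cheap to evaluate; a sketch using the log-gamma function (since $(K-1)! = \Gamma(K)$), with illustrative names of our choosing:

```python
import numpy as np
from math import lgamma

def weights_message_length(pi, N):
    """Message length of the relative weights (Equation (28)):
    I(pi) = (K-1)/2 log N - 1/2 sum_k log(pi_k) - log Gamma(K).
    """
    pi = np.asarray(pi, dtype=float)
    K = pi.size
    return 0.5 * (K - 1) * np.log(N) - 0.5 * np.sum(np.log(pi)) - lgamma(K)
```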

We assume uniform priors for the component means in latent space $\boldsymbol{\xi}$, where the bounds are large enough outside the range of observable values such that those priors are proper (integrable), a necessary condition for the MML principle, and only add constant terms to the message length, which can be ignored. We assume a conjugate inverted Wishart prior for the component covariance matrices $\boldsymbol{\Omega}$ (Wallace & Dowe 1994, 2000; Knorr-Held 2000),

$$p(\boldsymbol{\xi}_k, \boldsymbol{\Omega}_k) \propto |\boldsymbol{\Omega}_k|^{(J+1)/2}. \tag{29}$$

We approximate the determinant of the Fisher information of a multivariate normal, $|\mathcal{F}(\boldsymbol{\xi}_k, \boldsymbol{\Omega}_k)|$, as $|\mathcal{F}(\boldsymbol{\xi}_k)||\mathcal{F}(\boldsymbol{\Omega}_k)|$ (Oliver et al. 1996; Figueiredo & Jain 2002), where

$$|\mathcal{F}(\boldsymbol{\xi}_k)| = (\pi_k N)^{J}\,|\boldsymbol{\Omega}_k|^{-1} \tag{30}$$

$$|\mathcal{F}(\boldsymbol{\Omega}_k)| = (\pi_k N)^{J(J+1)/2}\,2^{-J}\,|\boldsymbol{\Omega}_k|^{-(J+1)} \tag{31}$$

such that

$$I(\boldsymbol{\xi}, \boldsymbol{\Omega}) = -\sum_{k=1}^{K}\log p(\boldsymbol{\xi}_k, \boldsymbol{\Omega}_k) + \frac{1}{2}\sum_{k=1}^{K}\log|\mathcal{F}(\boldsymbol{\xi}_k, \boldsymbol{\Omega}_k)| = \frac{J(J+3)}{4}\sum_{k=1}^{K}\log(\pi_k N) - \frac{KJ}{2}\log 2 - \frac{2J+3}{2}\sum_{k=1}^{K}\log|\boldsymbol{\Omega}_k|. \tag{32}$$

Previous work on multiple latent factor analysis within the context of MML has addressed the indeterminacy between the factor loads and factor scores by placing a joint prior on the product of factor loads and scores (Wallace 1995; Edwards & Dowe 1998; Wallace 2005). However, adopting the same prior density in our model is not practical because it would require the priors $p(\boldsymbol{\xi}\,|\,\boldsymbol{\tau}, \boldsymbol{\pi})$ and $p(\boldsymbol{\Omega}\,|\,\boldsymbol{\tau}, \boldsymbol{\pi})$; that is, we would require a prior density on both the means $\boldsymbol{\xi}$ and covariance matrices $\boldsymbol{\Omega}$ in latent space that requires knowledge about the responsibility matrix $\boldsymbol{\tau}$ and relative weights $\boldsymbol{\pi}$ in order to estimate the effective scores $\mathbf{S}$ for each data point and calculate a joint prior on the product of the factor loads $\mathbf{L}$ and factor scores $\mathbf{S}$. Instead we address this indeterminacy by placing a prior on $\mathbf{L}$ that ensures it is mutually orthogonal. Specifically, we adopt a Wishart distribution with scale matrix $\mathbf{W}$ and $D$ degrees of freedom for the $J \times J$ matrix $\mathbf{M} = \mathbf{L}^\top\mathbf{L}$. In other words, $\mathbf{M} \sim \mathcal{W}_J(\mathbf{W}, D)$ with $\mathbf{W} = \mathrm{Cov}(\mathbf{L})$. This Wishart joint prior density gives highest support for mutually orthogonal vectors,

$$p(\mathbf{L}) = \frac{|\mathbf{L}^\top\mathbf{L}|^{(D-J-1)/2}}{2^{DJ/2}\,|\mathbf{W}|^{D/2}\,\Gamma_J\!\left(\frac{D}{2}\right)}\exp\left[-\frac{1}{2}\mathrm{Tr}\left(\mathbf{W}^{-1}\mathbf{L}^\top\mathbf{L}\right)\right]. \tag{33}$$

Thus the message length to encode $\mathbf{L}$ is given by

$$I(\mathbf{L}) = \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}(\mathbf{L})^{-1}\mathbf{L}^\top\mathbf{L}\right) - \frac{D-J-1}{2}\log|\mathbf{L}^\top\mathbf{L}| + \frac{DJ}{2}\log 2 + \frac{D}{2}\log|\mathrm{Cov}(\mathbf{L})| + \log\Gamma_J\!\left(\frac{D}{2}\right). \tag{34}$$


Combining Equations (24), (25), (28), (32), and (34) with Equation (21) leads to the full message length:

$$
\begin{aligned}
I(\boldsymbol{\Psi}, \boldsymbol{Y}) = {} & -\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y}) + (J + K)\log 2 \\
& + \frac{K-1}{2}\log N - \frac{1}{2}\sum_{k=1}^{K}\log\pi_k - \log\Gamma(K) \\
& + \frac{J(J+3)}{4}\sum_{k=1}^{K}\log(\pi_k N) - \frac{KJ}{2}\log 2 - \frac{2J+3}{2}\sum_{k=1}^{K}\log|\boldsymbol{\Omega}_k| \\
& + \frac{1}{2}\mathrm{Tr}\left(\mathrm{Cov}(\mathbf{L})^{-1}\mathbf{L}^\top\mathbf{L}\right) - \frac{D-J-1}{2}\log|\mathbf{L}^\top\mathbf{L}| + \frac{DJ}{2}\log 2 + \frac{D}{2}\log|\mathrm{Cov}(\mathbf{L})| + \log\Gamma_J\!\left(\frac{D}{2}\right) \\
& + \frac{Q}{2}\log\kappa_Q + \frac{Q}{2}.
\end{aligned} \tag{35}
$$
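The full message length is a direct sum of its parts. The sketch below transcribes Equations (20), (28), (32), (34), and the quantization terms of Equation (21) as reconstructed in this section; all function and argument names are illustrative (the released software may organize this differently), and the log kappa_Q approximation should be checked against Wallace (2005):

```python
import numpy as np
from math import lgamma, log, pi as PI

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def multigammaln(a, J):
    # Log of the multivariate gamma function Gamma_J(a).
    return 0.25 * J * (J - 1) * log(PI) \
        + sum(lgamma(a + (1 - j) / 2.0) for j in range(1, J + 1))

def number_of_parameters(D, J, K):
    # Equation (20): Q = (J/2)[2(D - J) + K(J + 3)] + D + K - 1.
    return int(J * (2 * (D - J) + K * (J + 3)) // 2 + D + K - 1)

def message_length(neg_log_likelihood, pi_weights, omegas, L, L_cov, N):
    """Assemble the full message length (Equation (35)) from its parts."""
    D, J = L.shape
    K = len(pi_weights)
    Q = number_of_parameters(D, J, K)

    I = neg_log_likelihood
    I += (J + K) * log(2)                                   # I(J) + I(K)
    I += 0.5 * (K - 1) * log(N) \
        - 0.5 * np.sum(np.log(pi_weights)) - lgamma(K)      # I(pi), Eq. (28)
    # I(xi, Omega), Eq. (32)
    I += 0.25 * J * (J + 3) * np.sum(np.log(pi_weights * N)) \
        - 0.5 * K * J * log(2) \
        - 0.5 * (2 * J + 3) * sum(np.linalg.slogdet(O)[1] for O in omegas)
    # I(L), Eq. (34)
    M = L.T @ L
    I += 0.5 * np.trace(np.linalg.solve(L_cov, M)) \
        - 0.5 * (D - J - 1) * np.linalg.slogdet(M)[1] \
        + 0.5 * D * J * log(2) \
        + 0.5 * D * np.linalg.slogdet(L_cov)[1] \
        + multigammaln(0.5 * D, J)
    # Quantization terms from Eq. (21), with the Eq. (23) approximation.
    log_kappa_Q = -log(2 * PI) + (log(Q * PI) - EULER_GAMMA) / Q - 1
    I += 0.5 * Q * log_kappa_Q + 0.5 * Q
    return I
```

For the toy model of the next section ($D=15$, $J=5$, $K=10$), Equation (20) gives $Q = 274$ parameters.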

3. Experiments

3.1. A Toy Model

Here we introduce a toy model where we use generated data to verify that we recover the true model parameters given some data and to ensure that the expectation-maximization method yields consistent results. We generated a dataset with $N = 100{,}000$ data points, each with $D = 15$ dimensions. We adopted a latent dimensional space of $J = 5$ factor loads such that the matrix $\mathbf{L}$ has shape $D \times J$, with $K = 10$ clusters in the latent space. We generated the random factor loads in the same way that we initialize the optimization (Section 2.1). The relative weights $\boldsymbol{\pi}$ are drawn from a multinomial distribution and the means of the clusters in factor scores $\boldsymbol{\xi}$ are drawn from a standard normal distribution. The diagonal entries of the covariance matrices in factor scores $\boldsymbol{\Omega}$ are drawn from a gamma distribution, $\Omega_{k,i,i} \sim \Gamma(1)$. The specific variance in each dimension is also drawn from $\Gamma(1)$. The $n$th data point (which belongs to the $k$th cluster) is then generated by drawing $\mathbf{s}_n \sim \mathcal{N}(\boldsymbol{\xi}_k, \boldsymbol{\Omega}_k)$, projecting by the factor loads $\mathbf{L}$, and adding noise with variance $\mathbf{D}$.
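A minimal sketch of this generating process is below. It assumes diagonal latent covariance matrices and standard-normal factor loads (our reading of the initialization described above); variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, J, K = 100_000, 15, 5, 10

# Random factor loads (D x J), as used to initialize the optimization.
L = rng.normal(size=(D, J))

# Relative weights, cluster means, and latent covariances (diagonal here).
pi = rng.multinomial(N, np.ones(K) / K) / N       # relative weights
xi = rng.standard_normal(size=(K, J))             # cluster means in latent space
omega_diag = rng.gamma(1, size=(K, J))            # Omega_{k,i,i} ~ Gamma(1)
D_spec = rng.gamma(1, size=D)                     # specific variances ~ Gamma(1)

# Draw a cluster for each point, then scores; project and add specific noise.
assignments = rng.choice(K, size=N, p=pi)
S = xi[assignments] + rng.standard_normal((N, J)) * np.sqrt(omega_diag[assignments])
Y = S @ L.T + rng.standard_normal((N, D)) * np.sqrt(D_spec)
```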

We treat the generated dataset as if the number of latent factors and components are not known. Starting with $J = 1$ and $K = 1$, we trialled each permutation of $J$ and $K$ until $J_\mathrm{max} = 10$ and $K_\mathrm{max} = 20$ (e.g., twice the true values of $J_\mathrm{true}$ and $K_\mathrm{true}$).

We recorded the negative log likelihood, the BIC, and the message length16 for each permutation of J and K. These metrics are shown in Figure 2. Unsurprisingly, the negative log likelihood decreases with increasing numbers of latent factors J and increasing numbers of components K. The lowest BIC value and message length is found at J = 5 and K = 10, which is identical to the true values. We repeated this toy model experiment using a smaller sample size (N = 5000) to be more representative of the sample sizes in later Galah experiments (Section 3.3). The results of the grid search are also shown in Figure 2. Here the BIC estimates the true number of latent factors correctly but tends to underestimate the true number of clusters, more so than the message length. Although the difference between the true number of components and that given by the shortest message length is not large, this serves to illustrate that in this example a larger number of data points is required to "resolve" the true number of components in latent space.

It is clear from Figure 2 that a combination of latent factors and clustering in the latent space provides a better description of the (generated) data than a Gaussian mixture model without latent factors. Adding components to the model does improve the log likelihood, even with a single latent factor, but the addition of just one latent factor improves the log likelihood more than adding twenty components does. Not much more can be said for this example because the true data generating process is known, although this toy model has illustrated how clustering in high dimensional data can be better described by latent factors with clustering in the lower-dimensional latent space.

Some technical background is warranted before we compare

our estimated model parameters to the true values. We previously stated that the latent factors in this model are only identifiable up to an orthogonal rotation. In other words, if the data were truly generated by latent factors $\mathbf{L}_\mathrm{true}$, then our estimates of those latent factors $\mathbf{L}_\mathrm{est}$ do not need to be identical to the true values. For example, the ordering of the estimated factors could be different from the true factors, and the ordering of the dimensionality in latent space would then be accordingly different. Since no constraint is placed on the ordering of the factor loads during expectation-maximization, there is no assurance (or requirement) that our factor loads match the true factor loads.

Another possibility is that the estimated factor loads could be

flipped in sign relative to the true factor loads, and the scores would similarly be flipped. In both of these situations (reordering or flipped signs) the log likelihood given the data and the estimated factor loads $\mathbf{L}_\mathrm{est}$ would be identical to the log likelihood given the data and the true factor loads $\mathbf{L}_\mathrm{true}$, despite the difference in ordering and sign. The same can be said for any other scalar metric (e.g., the Kullback–Leibler divergence; Kullback & Leibler 1951). These examples serve to illustrate a more general property: the factor loads and factor scores can be orthogonally rotated by any valid rotation matrix17 $\mathbf{R}$. The estimated factor loads $\mathbf{L}_\mathrm{est}$ could therefore appear very different from the true values, but they only differ by an orthogonal rotation. We discuss the impact of this limitation on real data in more detail in Section 4.

We took the model with the preferred number of latent

factors and components found from a grid search ($J = 5$, $K = 10$; which are also the true values) and applied an orthogonal rotation to the latent space to be as close as possible to the true values. The rotation matrix $\mathbf{R}$ was found by solving for $J$ unknown angle parameters, each of which is used to construct a Givens rotation matrix (Givens 1958), and we then take the product of those Givens matrices to produce a valid rotation matrix $\mathbf{R}$. This process reduces to Euler angle rotation in three or fewer dimensions. This process rotates the latent space ($\mathbf{L}$, $\boldsymbol{\xi}$, $\boldsymbol{\Omega}$) but has no effect on the model's predictive power: the evaluated log likelihood or the Kullback–Leibler divergence (Kullback & Leibler 1951) under the rotated model is indistinguishable from the unrotated model. In Figure 3 we show the estimated factor loads $\mathbf{L}$, factor scores $\mathbf{S}$, and specific variances $\mathbf{D}$ compared to the true values. The agreement is excellent in all model parameters.
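The construction of a valid rotation matrix as a product of Givens rotations can be sketched as follows. The angle-fitting step (minimizing the distance to the target loads) is omitted, and the pairing of angles to axis pairs is our assumption:

```python
import numpy as np

def givens_rotation_matrix(J, angles):
    """Build a J x J rotation matrix as a product of Givens rotations,
    one per angle (R R^T = I by construction). The paper solves for the
    angles that bring the estimated loads closest to the target loads;
    here we only sketch the construction itself."""
    R = np.eye(J)
    # Axis pairs for each Givens rotation; this pairing is illustrative.
    pairs = [(i, j) for i in range(J) for j in range(i + 1, J)]
    for (i, j), theta in zip(pairs, angles):
        G = np.eye(J)
        c, s = np.cos(theta), np.sin(theta)
        G[i, i] = G[j, j] = c
        G[i, j], G[j, i] = -s, s
        R = R @ G
    return R
```

In three dimensions this reduces to the familiar Euler angle construction, as noted above.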

16 Omitting constant terms such that negative message lengths are allowed.
17 Recall that a rotation matrix is valid if $\mathbf{R}\mathbf{R}^\top = \mathbf{I}$.


Figure 2. Metrics from our grid search for the toy model for two sample sizes: N = 100,000 (left) and N = 5000 (right). The top panels show the negative log likelihood $-\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y})$ evaluated at each combination of latent factors J and number of clusters K using the generated data in our toy model. The middle panels show the BIC (Equation (19)) for those combinations. The lower panels show the message length. The white marker indicates the lowest value in each panel. A line connects to the true value (black point) to guide the eye.


3.2. A Toy Model with Data Missing at Random

Here we repeat the toy model used in the previous experiment, but we discard an increasing fraction of the data and evaluate the performance and accuracy of our method in the presence of incomplete data. We considered missing data fractions from 1% to 40%. In each case we treated the model parameters as unknown, assumed the missing data points were missing at random, and initialized the model as per Section 3.1.

In Figure 4, we show the results of this experiment for our worst considered case, where 40% of the data entries are randomly discarded. We find that despite the high fraction of missing entries, our estimates of the model parameters remain unbiased in this example using a toy model. The corrections to our estimates of the specific variances are sufficient in that the specific variance in each dimension is not systematically underestimated from the true values, even though 40% of the data entries are missing.

3.3. The Galah Survey

In this experiment we perform blind chemical tagging using the photospheric abundances released as part of the second Galah data release (Buder et al. 2018). This dataset includes up to 23 chemical abundances reported for 342,682 stars. In this example we chose to restrict ourselves to stars with a complete set of abundance measurements for a subset of those 23 elements (i.e., no missing data entries). For example, here we will exclude lithium and carbon abundances because the photospheric values will vary throughout a star's lifetime (e.g., Casey et al. 2016b, 2019). This is true to a small degree for many elements (e.g., Dotter et al. 2017), but for the purposes of this experiment we assume that all other photospheric abundances remain constant throughout a star's lifetime.

We first selected stars with flag_cannon=0 to exclude stars where there is reason to suspect that the stellar parameters (e.g., $T_\mathrm{eff}$, $\log g$) are unreliable, and as a result the detailed chemical abundances would be untrustworthy. We then required all stars to have a signal-to-noise ratio exceeding 30 per pixel in the blue arm (snr_c1>30). This is equivalent to a signal-to-noise ratio of about 140 per resolution element in the third HERMES CCD ($\lambda_\mathrm{central} \approx 5750$ Å). We required that stars have no erroneous flags in all of the following abundances: Mg, Na, Al, Si, K, Ca, Sc, Ti, Mn, Fe, Ni, Cu, Zn, Y, Ba, La, and Eu. These elements were chosen because they trace multiple nucleosynthetic pathways and they are more commonly reported in the Galah data release, which allows for a larger number of stars with a complete abundance inventory. There are 2566 stars that met these criteria. We note that while our signal-to-noise ratio cut is arbitrary, it is in part motivated by the point where systematic uncertainties start to dominate in Galah results (Figure 15 of Buder et al. 2018). Systematic uncertainties per abundance can be captured by the specific variances $\mathbf{D}$ in our model. A more restrictive signal-to-noise cut would reduce the sample size and restrict our ability

Figure 3. Estimated factor loads L (left), factor scores S (middle), and specific variances D (right) compared to the true data generating values for Experiment 1 (Section 3.1). The agreement is excellent.

Figure 4. Estimated factor loads L (left), factor scores S (middle), and specific variances D (right) compared to the true data generating values for Experiment 2 (Section 3.2). Here 40% of the data are missing at random. The agreement remains excellent, despite the large fraction of missing data entries. Note that the scales on the top panels are 2–10 times larger than those in Figure 3.


to infer latent factors and components, whereas a more relaxed signal-to-noise cut would still require that there are no erroneous flags in abundances.

We executed a grid search for the number of latent factors J and the number of components K that were preferred by the data. Starting with J = 1 and K = 1, we trialled each permutation of J and K up until J = 7 and K = 5. The results of this grid search are shown in Figure 5, where we show the negative log likelihood, the BIC, and the message length found for each permutation. The behavior of the BIC and the message length are very different here, unlike what was observed in our toy model. Here the BIC behaves like the negative log likelihood in that it prefers more components and latent factors than the extent of the grid (e.g., J > 7 and K > 5). Indeed, if we were to trial higher values of J and K, then the negative log likelihood would continue to decrease. The model with six latent factors and three components (J = 6, K = 3) is found to have the shortest message length, which we take as our preferred model for these data.

Earlier, we described how the latent factors we estimate can only be identified up to an orthogonal rotation. If we want to interpret the latent factors estimated from Galah data, then we must specify some target factor loads such that we can identify which factors are most similar to the yields we expect. We specified the following target latent factors, where:

1. The first factor load should have non-zero entries in Eu and La (e.g., the r-process).

2. The second factor load should have non-zero entries in Ba, Y, and La (e.g., the s-process).

3. The third factor load should have non-zero entries in the Fe-peak elements Mn, Fe, Ni, Zn, and Ti.

4. The fourth factor load should have non-zero entries in the odd-Z elements Na, Al, K, Sc, and Cu.

5. The fifth factor load should have non-zero entries in the α-element tracers Si, Ca, and Mg.

We initially set each non-zero entry in these target factor loads $\mathbf{L}_\mathrm{target}$ to $E^{-1/2}$, where $E$ is the number of non-zero entries in that factor load, to ensure that $\mathbf{L}_\mathrm{target}$ is mutually orthogonal. We solved for the $J$ unknown angles to produce a valid rotation matrix $\mathbf{R}$ that would make our estimated loads $\mathbf{L}$ as close as possible to the target loads $\mathbf{L}_\mathrm{target}$, and then applied that rotation to the model. The target loads and (rotated) estimated loads are shown in Figure 6. Note that the purpose of this procedure is not to "find" the target loads that we expect but to provide as little information as needed to identify and describe all factor loads within an astrophysical context. This procedure still requires that the factors be mutually orthogonal and that they describe the data. For these reasons, we will not always recover the exact target loads that we seek: we will only be able to identify factor loads that are closest to the target loads.

This is demonstrated in Figure 6, where some estimated

factor loads match closely to the target load (e.g., $\mathbf{L}_2$, which we identify as the s-process) and some barely match at all (e.g., $\mathbf{L}_5$). Here we show the absolute entries of the factor loads because even if an entry is negative, the corresponding factor scores could also be negative and their product will contribute to the observed abundances. For this reason, the sign does not matter here.

Some of these factor load entries may be non-zero because

we require the latent factors to be mutually orthogonal, and not because they truly contribute to the data. To try and disentangle these possibilities, we calculate the fractional contribution that each factor load makes to the observed abundances relative to other factor loads. We define the fractional contribution of the $j$th factor load to the $d$th data dimension as:

$$C_{d,j} = \frac{\sum_{n}^{N} |L_{d,j}\,S_{j,n}|}{\sum_{j}^{J}\sum_{n}^{N} |L_{d,j}\,S_{j,n}|}. \tag{36}$$
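Equation (36) is computed directly from the estimated loads and scores; a brief sketch (the array conventions are our assumptions):

```python
import numpy as np

def fractional_contributions(L, S):
    """Fractional contribution C[d, j] of the j-th factor load to the
    d-th data dimension (Equation (36)), summed over the N data points.

    L : (D, J) factor loads
    S : (N, J) factor scores
    """
    A = np.abs(L[None, :, :] * S[:, None, :])  # (N, D, J) terms |L_dj S_jn|
    C = A.sum(axis=0)                          # sum over n -> (D, J)
    return C / C.sum(axis=1, keepdims=True)    # normalize over the J loads
```

Each row of the result sums to one, so the entries can be read directly as the relative share each factor contributes to that abundance dimension.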

The fractional contributions to each element are shown on the right-hand side of Figure 6. We identify the first factor $\mathbf{L}_1$ as being most similar to the r-process, and here it is the dominant contributor to Eu, a typical r-process tracer. Surprisingly, we also find that this factor load is a reasonable contributor to the odd-Z element Sc. The specific scatter in Sc is 0.03 dex (Figure 7), which suggests that the Sc abundances are well described by this latent factor model.

The second latent factor $\mathbf{L}_2$ here is most representative of

the slow neutron-capture process (s-process), with dominant contributions to Ba and Y. This factor has some support at other elements, notably K. $\mathbf{L}_3$ is the primary contributor to nearly all Fe-peak elements, with close to negligible contributions from other factors. The exception here is Cu, where a near-equal contribution comes from $\mathbf{L}_4$. The fifth latent factor $\mathbf{L}_5$ is the dominant contributor to the α-element tracers Si, Ca, and Mg, and surprisingly, Al. The specific scatter after accounting for these latent factors is smallest for Fe (0.01 dex) and largest for K (0.13 dex; Figure 7). The typical scatter in most elements is about 0.05 dex.

In Figure 8 we show the inferred clustering in latent space,

where the separation between components is arguably best seen in the splitting between $\mathbf{S}_6$ with respect to $\mathbf{S}_2$ or $\mathbf{S}_3$. When projected to data space (Figure 9) the third component (light green) is seen to have relatively higher abundance ratios of

Figure 5. Top panel shows the negative log likelihood $-\log\mathcal{L}(\boldsymbol{\Psi}\,|\,\boldsymbol{Y})$ evaluated at each combination of latent factors J and number of clusters K using Galah data in Experiment 3. The middle panel shows the BIC for those combinations. The lower panel shows the message length. The white marker indicates the lowest value in each panel, showing the preferred number of latent factors and components.


[K, Ba, Zn/Fe] at a given [Fe/H]. This is consistent with clustering in latent space.

3.4. Galah Survey Data with an Increasing Number of Stars with Missing Data Entries

Here we extend our experiment in Section 3.3 to progressively include more stars, even though those stars have some abundance measurements missing. Specifically, we started with the same subset of 2566 stars in Section 3.3 and added a random set of stars that met our criteria of flag_cannon=0 and snr_c1>30. We initially added 1000 stars to give a sample of N=3566, then repeated the grid search for the number of latent factors and components, and recorded the model with the lowest message length. We then repeated this procedure using 10,000 stars (N=12,566), and finally using all 157,242 stars that met the criteria of flag_cannon=0 and snr_c1>30 to give a total sample size of N=159,808 stars. For sample sizes up to N∼3566 we found that six latent factors were preferred and these factors shared common features (Figure 10). This illustrates that the first18 set of factor loads inferred from a smaller, complete dataset remain largely unchanged despite the increasing sample size and the increasing number of missing data entries. When

Figure 6. Latent factors inferred from 2566 stars in Galah (Buder et al. 2018, thick lines) with 17 abundance measurements. The left-hand panels show the absolute entries for each factor load, where the thin lines indicate the target latent factors (see Section 3.3). On the right-hand side, we show the absolute fractional contributions to each element, ordered by the loads that contribute most.

18 "First" has no meaning here in terms of factor load ordering, but for the purposes of comparing inferred loads from different datasets we have ordered the loads to be as close as possible to those inferred in Section 3.3.


the sample size reaches N=12,566, we find that another latent factor is required to best explain the data. When N∼159,808, the preferred number of latent factors rises to 10 (J = 10).

4. Discussion

We have introduced a model to simultaneously account for the lower effective dimensionality of chemical abundance space and perform clustering in that lower-dimensional space. This provides a data-driven model of nucleosynthesis yields and chemical tagging that allows us to simultaneously estimate the latent factors that contribute to all stars and cluster those stars by their relative contributions from each factor. The results are encouraging in that we find latent factors that are representative of the expected yields from dominant nucleosynthetic channels. However, the model that we describe is very likely not the correct model to use to represent chemical abundances of stars. In this section, we discuss the limitations of our model in detail.

We require latent factors to be mutually orthogonal to resolve an indeterminacy. This implies an astrophysical context where the mean nucleosynthetic yields (integrated over all stellar masses and star formation histories) of various nucleosynthetic processes (e.g., r-process, s-process) are mutually orthogonal to each other. Clearly this assumption is likely to be incorrect: the nuclear physics of one environment where elements are produced will be very different from others, and there is no astrophysical constraint that those yields (or latent factors) should be mutually orthogonal. In principle, one could represent the latent factors using a hierarchical data-driven model where the yields contribute as a function of stellar mass, metallicity, and other factors, but resolving the indeterminacy in this model would still require mutual orthogonality of the mean yields. Introducing a constraint on the factor scores that resolves this indeterminacy and allows for more flexible latent factors would be a worthy extension to this work.

The constraint of mutual orthogonality limits the inferences

we want to make about stellar nucleosynthetic yields. For example, after accounting for all known sources of potassium production in the Milky Way, galactic chemical evolution models under-predict the level of K in the Milky Way by more than an order of magnitude (Kobayashi et al. 2006). From our inferences using Galah data, we find that $\mathbf{L}_2$ (the factor that we identify as the s-process) is the dominant contributor to potassium. This latent factor persists even in the presence of missing data and a sample size two orders of magnitude larger. Does this suggest that the production of K is linked to the production of much heavier nuclei? If our model could confidently and reliably associate the production of K with other elements or sites, then it could help explain the peculiar abundances of stars enhanced in K and depleted in Mg (Cohen & Kirby 2012; Mucciarelli et al. 2012), a chemical abundance pattern that currently lacks explanation (Iliadis et al. 2016; Kemp et al. 2018). In the Cohen & Kirby (2012) sample, their high [K/Fe] stars also tend to be high in heavier elements, but there are also numerous abundance correlations present. However, is the K contribution that we infer physically realistic? Or is it a consequence of requiring that the latent factors are mutually orthogonal? Distinguishing these possibilities is non-trivial, which is in part why caution is warranted when trying to interpret latent factor models. In this situation, it is worth commenting that K has the largest specific scatter (Figure 7), which suggests that the contributions to K are perhaps not as well described by the latent factor model as other elements. This could in part be due to the non-trivial and significant effects that the assumption of local thermodynamic equilibrium (LTE) has on our inferred K abundances. These non-LTE effects are of order 0.5 dex and will be accounted for in the upcoming Galah data release (S. Buder 2019, private communication).

A similar argument could be made for Sc, where $\mathbf{L}_1$ (a factor load that we identify as the r-process) is the primary contributor. Sc is under-produced in galactic chemical evolution models relative to observations (Kobayashi et al. 2006; Casey & Schlaufman 2015). Based on this work, is the production of Sc linked to the production of heavy nuclei? Unlike K, the specific scatter in Sc is remarkably low: just 0.03 dex, among the best-described elements after Ti and Fe (0.01 dex). This would suggest that the latent factor model is a very good description for the production of Sc, but it does not prove that it is the description for the production of Sc.

There are other issues in our model that relate to our

assumption of mutual orthogonality. Even if nucleosyntheticyields were truly mutually orthogonal, then the latent factorsthat we infer are only identifiable up until an orthogonal basis.As we have seen in our experiments, the ordering and sign ofthe latent factors is not described a priori. This is both a featureand a bug: unrestricted ordering and signs allow for the modelparameters to be estimated more efficiently because they canfreely rotate as the model parameters are updated, but it doesmean that we must “assign” the latent factors which we infer asbeing described by an astrophysical process (e.g., the first latentfactor is r-process). A more general limitation of this is that thelatent factors can be multiplied by some arbitrary rotation matrix,leading to latent factor loads that are very different from what

Figure 7. Specific scatter (e.g., D) remaining in the Galah data (Buder et al. 2018) after accounting for the contributions by all latent factors.


The Astrophysical Journal, 887:73 (18pp), 2019 December 10 Casey et al.


was estimated by the model but still lead to the exact same data (or log likelihood, or Kullback–Leibler divergence, etc.). Consequently, we can only "identify" latent factors up to this rotation. We have sought to address this by constructing rotation matrices where the entries for each latent factor correspond to our expectations from astrophysical processes (while remaining orthogonal), but here we are limited by what astrophysical processes we expect to find within the constraint of mutual orthogonality.
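This rotational degeneracy can be verified numerically. The sketch below uses hypothetical dimensions and random loads (not the fitted values from this work) to show that rotating the factor loads by any orthogonal matrix leaves the model covariance, and therefore the likelihood, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
D, J = 17, 6  # e.g., 17 abundances and 6 latent factors

L = rng.normal(size=(D, J))                    # latent factor loads
psi = np.diag(rng.uniform(0.01, 0.1, size=D))  # specific scatter

# Draw a random orthogonal matrix via the QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(J, J)))
assert np.allclose(R @ R.T, np.eye(J))

L_rot = L @ R  # individual entries now look very different...

# ...but the implied data covariance (and likelihood) is identical
print(np.allclose(L @ L.T + psi, L_rot @ L_rot.T + psi))  # True
```

Any such R maps one valid set of loads to another, which is why the factors can only be identified up to this rotation.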

This in part constrains our ability to identify new nucleosynthetic processes. For example, let us consider a hypothetical situation where we would only expect four nucleosynthetic processes to predominantly contribute to the observed Galah abundances, but in practice we found that the data are best explained with five latent factors. We construct a rotation matrix where the first four latent factors describe the nucleosynthetic processes that we expect to find. What of the fifth latent factor? We can constrain the possible values of the fifth latent factor conditioned on the requirement that all factors remain mutually orthogonal, but one can imagine that some (or perhaps many) elements have near-zero or zero entries in the fifth latent factor. Even if the mean

Figure 8. Factor scores S estimated in Experiment 3 (Section 3.3) using N=2566 stars in the Galah data (Buder et al. 2018) that have 17 abundance measurements. Here, each star is colored by its inferred Kth component.


nucleosynthetic yields are mutually orthogonal, there are scenarios one can imagine where there is a limited amount that we can say with confidence about that new nucleosynthetic process (see also Milosavljevic et al. 2018). Similar limitations arise due to our assumption about the clustering in latent space. There is no justified reason why the factor scores should be well described by multivariate normal distributions. If the true underlying scores were not distributed as multivariate normals, then one can imagine similar outcomes to those seen when directly fitting data with a mixture of Gaussian distributions: additional components would be required to describe complex (non-Gaussian) shapes in data space. This kind of model mismatch is more extreme when fitting the data directly rather than with the model described here, because some of the data complexity will be described by the orthogonal latent factors. However, the picture is qualitatively the same: when the true underlying distribution of factor scores is not described by multivariate normals, additional components will likely be introduced in order to describe non-Gaussian features.
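The effect of this kind of mismatch is easy to reproduce. In the toy example below (unrelated to the Galah fits), one-dimensional "scores" are drawn from a skewed, non-Gaussian distribution, and selecting the number of Gaussian mixture components favours more than one component purely to describe the non-Gaussian shape:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Skewed, non-Gaussian one-dimensional "factor scores"
scores = rng.exponential(scale=1.0, size=(2000, 1))

# Score the number of mixture components by BIC (lower is better)
bic = {k: GaussianMixture(n_components=k, random_state=0)
           .fit(scores).bic(scores)
       for k in (1, 2, 3)}

best_k = min(bic, key=bic.get)
print(best_k > 1)  # True: extra components absorb the skewness
```

Here BIC stands in as a generic model-selection criterion for illustration; the model in this paper selects components by MML.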

Notwithstanding these issues, we have shown that a latent factor model that allows for clustering in latent space can adequately describe chemical abundance data. We find six latent factors from a small subset of Galah data with complete abundances, and those latent factors can qualitatively be described within the context of astrophysical yields. These latent factors are recovered in larger samples where the data are incomplete. However, that did not have to be the case: the mutually orthogonal latent factors could have been entirely different from our expectations of nucleosynthetic yields. Indeed, the inferred factors, even after a valid rotation, could have made no astrophysical sense whatsoever. For this reason, it is encouraging that there is some interpretability in the latent factors. Indeed, the elements where we find surprising associations (e.g., Sc and K) are the elements where galactic chemical evolution models are most discrepant from observations, even after accounting for systematic errors in abundance measurements (e.g., violations of the assumption of LTE).

In the subset of Galah data with complete abundances, we find that three components are preferred. These components can be described as those with (1) low- and (2) high-[α/Fe] abundance ratios, and another (3) primarily differing in K, Ba, and Zn abundances at a given [Fe/H] and [α/Fe] abundance ratio. When we include ∼160,000 stars with up to 17 abundances, and assume that the incomplete abundances are missing at random, we find that 16 components in latent space

Figure 9. Detailed chemical abundances from the N=2566 stars in Galah (Buder et al. 2018) that have 17 chemical abundances (Section 3.3). Each star is colored by its Kth inferred component from the lower-dimensional latent space, with the same coloring as in Figure 8.


Figure 10. Latent factors found in Section 3.4 using different subsets of Galah data (Buder et al. 2018). The thin dotted line shows the result from Section 3.3 with N=2566 stars with 17 abundances and no missing data. The thicker dotted–dashed line indicates N=12,566 stars, of which 10,000 have missing data entries. The thickest line has N=159,808 stars, where ten latent factors are preferred.


are preferred to explain the data. By construction, these components are structured in their chemical abundances because of the projection from the latent space. Furthermore, by extension of each component having similar chemistry, each component occupies realistic locations in a Hertzsprung–Russell diagram. When we project these component associations to the data space, we find that none of the inferred components are structured or coherent in their positions or motions. However, in this sample of stars there are no gravitationally bound clusters where a reasonable (e.g., ∼30) number of stars have been observed. Clearly, more data would help to resolve a higher number of components.

Perhaps it is not so discouraging that none of the inferred components are structured in their positions or motions, because there are no gravitationally bound clusters in the data, but there is clearly more that can be done in chemical tagging. Some components we infer include stars whose positions and galactic orbits imply that they cannot have formed in the same star cluster. In these situations, there is likely significant value in including joint probabilities on whether two stars could be associated with the same star formation site based on their dynamical properties. Similarly, although stellar ages are historically difficult to estimate precisely, can this imprecise information help inform weak priors or probabilities of two stars having the same association? There is an incredible amount of dynamical information available from Gaia, particularly for stars in the Galah survey, and weakly informative priors might be sufficient to help improve the granularity of chemical tagging without being overly constraining on the dynamical and star formation history that we seek to infer.

5. Conclusions

We have introduced a data-driven model of nucleosynthesis by incorporating latent factors that are common to all stars and allowing for clustering in the lower-dimensional latent space. This approach simultaneously allows us to efficiently tag stars based on their chemical abundances and to infer the contributions that are common to all stars (e.g., nucleosynthetic yields). Experiments with generated data demonstrate that MML is a useful principle for selecting the appropriate number of latent factors and components. Experiments with Galah data reveal latent factors that are qualitatively and quantitatively similar to expected nucleosynthetic yields (e.g., products from the s-process, r-process, etc.). Interestingly, we find that deviations from expected yields occur in elements where observations and galactic chemical evolution models are most discrepant (e.g., K, Sc). While we advise caution in directly interpreting these latent factors as being nucleosynthetic yields, our model does provide the first data-driven approach to nucleosynthesis and chemical tagging. We advocate that more data, and the inclusion of weakly informative priors (joint probabilities using astrometry and a simplified model of the Milky Way), would help in realizing the full potential of chemical tagging.

We thank the anonymous referee for a detailed review. We acknowledge support from the Australian Research Council through Discovery Project DP160100637. J.B.H. is supported by a Laureate Fellowship from the Australian Research Council. Parts of this research were supported by the Australian Research Council (ARC) Centre of Excellence for All Sky Astrophysics in 3 Dimensions (ASTRO 3D), through project number CE170100013. S.B. acknowledges funds from the Alexander von Humboldt Foundation in the framework of the Sofja Kovalevskaja Award endowed by the Federal Ministry of Education and Research. S.B. is supported by the Australian Research Council (grants DP150100250 and DP160103747). S.L.M. acknowledges the support of the UNSW Scientia Fellowship program. J.D.S., S.L.M., and D.B.Z. acknowledge the support of the Australian Research Council through Discovery Project grant DP180101791. The Galah survey is based on observations made at the Australian Astronomical Observatory, under programmes A/2013B/13, A/2014A/25, A/2015A/19, and A/2017A/18. We acknowledge the traditional owners of the land on which the AAT stands, the Gamilaraay people, and pay our respects to elders past and present. This research has made use of NASA's Astrophysics Data System.

Software: Astropy (Astropy Collaboration et al. 2013, 2018), IPython (Pérez & Granger 2007), Jupyter Notebooks (Kluyver et al. 2016), matplotlib (Hunter 2007), numpy (van der Walt et al. 2011), scipy (Jones et al. 2001), TOPCAT (Taylor 2005).


Figure 11. Example code to generate artificial data and fit those data using a mixture of common factor analyzers.

Appendix

Documentation for the software that accompanies this paper is available at https://mcfa.rtfd.io. In Figure 11 we provide code that generates fictitious data from a toy model and fits it.
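The listing in Figure 11 is an image and is not reproduced here. As a stand-in, the sketch below generates data from a toy mixture-of-common-factor-analyzers model and fits it in two stages with scikit-learn (FactorAnalysis for the latent subspace, then GaussianMixture on the scores). This is an approximation for illustration only: the accompanying mcfa package estimates the factors and the clustering jointly, and its API (documented at https://mcfa.rtfd.io) differs from what is shown here.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
N, D, J, K = 1000, 10, 3, 2  # stars, abundances, latent factors, components

# Toy generative model: clusters in latent space, projected to data space
A = np.linalg.qr(rng.normal(size=(D, J)))[0]   # orthonormal factor loads
means = np.array([[-3.0, 0.0, 0.0],
                  [+3.0, 0.0, 0.0]])           # component means (latent space)
truth = rng.integers(K, size=N)
scores = means[truth] + rng.normal(scale=0.3, size=(N, J))
X = scores @ A.T + rng.normal(scale=0.05, size=(N, D))  # + specific scatter

# Two-stage fit: infer the latent subspace, then cluster the scores
S = FactorAnalysis(n_components=J).fit_transform(X)
labels = GaussianMixture(n_components=K, random_state=0).fit_predict(S)

# Recovered labels should match the truth up to relabeling
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(agreement > 0.95)  # True: clusters recovered
```

With well-separated component means, the two-stage fit recovers the true assignments; the joint estimation in mcfa serves the same purpose while also handling missing data entries.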


ORCID iDs

Andrew R. Casey https://orcid.org/0000-0003-0174-0564
John C. Lattanzio https://orcid.org/0000-0003-2952-859X
Joss Bland-Hawthorn https://orcid.org/0000-0001-7516-4016
Sven Buder https://orcid.org/0000-0002-4031-8553
Geraint F. Lewis https://orcid.org/0000-0003-3081-9319
Sarah L. Martell https://orcid.org/0000-0002-3430-4163
Thomas Nordlander https://orcid.org/0000-0001-5344-8069
Jeffrey D. Simpson https://orcid.org/0000-0002-8165-2507
Sanjib Sharma https://orcid.org/0000-0002-0920-809X
Daniel B. Zucker https://orcid.org/0000-0003-1124-8477

References

Arthur, D., & Vassilvitskii, S. 2007, in Proc. Eighteenth Annual ACM-SIAM Symp. on Discrete Algorithms (Philadelphia, PA: Society for Industrial and Applied Mathematics), 1027
Astropy Collaboration, Price-Whelan, A. M., & Sipőcz, B. M. 2018, AJ, 156, 123
Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33
Baek, J., McLachlan, G. J., & Flack, L. K. 2010, ITPAM, 32, 1298
Bland-Hawthorn, J., Karlsson, T., Sharma, S., Krumholz, M., & Silk, J. 2010a, ApJ, 721, 582
Bland-Hawthorn, J., Krumholz, M. R., & Freeman, K. 2010b, ApJ, 713, 166
Bland-Hawthorn, J., & Sharma, S. 2016, AN, 337, 894
Bovy, J. 2016, ApJ, 817, 49
Buder, S., Asplund, M., Duong, L., et al. 2018, MNRAS, 478, 4513
Casey, A. R., Hawkins, K., Hogg, D. W., et al. 2017, ApJ, 840, 59
Casey, A. R., Ho, A. Y. Q., Ness, M., et al. 2019, ApJ, 880, 125
Casey, A. R., Hogg, D. W., Ness, M., et al. 2016a, arXiv:1603.03040
Casey, A. R., Ruchti, G., Masseron, T., et al. 2016b, MNRAS, 461, 3336
Casey, A. R., & Schlaufman, K. C. 2015, ApJ, 809, 110
Cohen, J. G., & Kirby, E. N. 2012, ApJ, 760, 86
Cramér, H. 1946, Mathematical Methods of Statistics (Princeton, NJ: Princeton Univ. Press)
Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39, 1
De Silva, G. M., Freeman, K. C., Bland-Hawthorn, J., et al. 2015, MNRAS, 449, 2604
Dotter, A., Conroy, C., Cargile, P., & Asplund, M. 2017, ApJ, 840, 99
Dowe, D. L. 2008, CompJ, 51, 523
Dowe, D. L. 2011, in Handbook of the Philosophy of Science, Vol. 7, Philosophy of Statistics, ed. P. S. Bandyopadhyay & M. R. Forster (Amsterdam: Elsevier), 901
Dowe, D. L., Gardner, S., & Oppy, G. 2007, The British Journal for the Philosophy of Science, 58, 709
Dowe, D. L., & Wallace, C. S. 1997, in Proc. Computing Science and Statistics 28, 28th Symp. Interface, 614
Edwards, R. T., & Dowe, D. L. 1998, in Proc. 2nd Pacific-Asia Conf. on Research and Development in Knowledge Discovery and Data Mining (PAKDD-98), Lecture Notes in Artificial Intelligence (LNAI) 1394, ed. X. Wu, R. Kotagiri, & K. B. Korb (Berlin: Springer), 96
Figueiredo, M. A. T., & Jain, A. K. 2002, ITPAM, 24, 381
Fitzgibbon, L. J., Dowe, D. L., & Vahid, F. 2004, in Proc. Int. Conf. IEEE, Intelligent Sensing and Information Processing, 439
Freeman, K., & Bland-Hawthorn, J. 2002, ARA&A, 40, 487
Givens, W. 1958, J. SIAM, 26
Golub, G. H., & Reinsch, C. 1970, NuMat, 14, 403
Haar, A. 1933, The Annals of Mathematics, 34, 147
Ho, A. Y. Q., Ness, M. K., Hogg, D. W., et al. 2017a, ApJ, 836, 5
Ho, A. Y. Q., Rix, H.-W., Ness, M. K., et al. 2017b, ApJ, 841, 40
Hogg, D. W., Casey, A. R., Ness, M., et al. 2016, ApJ, 833, 262
Hotelling, H. 1933, Journal of Educational Psychology, 24, 417
Hunter, J. D. 2007, CSE, 9, 90
Iliadis, C., Karakas, A. I., Prantzos, N., Lattanzio, J. C., & Doherty, C. L. 2016, ApJ, 818, 98
Jones, E., Oliphant, T., Peterson, P., et al. 2001, SciPy: Open source scientific tools for Python, http://www.scipy.org/
Kemp, A. J., Casey, A. R., Miles, M. T., et al. 2018, MNRAS, 480, 1384
Kluyver, T., Ragan-Kelley, B., Pérez, F., et al. 2016, in Positioning and Power in Academic Publishing: Players, Agents and Agendas, ed. F. Loizides & B. Schmidt (IOS Press), 87
Knorr-Held, L. 2000, Statistics in Medicine, 19, 1006
Kobayashi, C., Umeda, H., Nomoto, K., Tominaga, N., & Ohkubo, T. 2006, ApJ, 653, 1145
Kullback, S., & Leibler, R. A. 1951, The Annals of Mathematical Statistics, 22, 79
Leung, H. W., & Bovy, J. 2019, MNRAS, 483, 3255
Martell, S. L., Sharma, S., Buder, S., et al. 2017, MNRAS, 465, 3203
Milosavljevic, M., Aleo, P. D., Hinkel, N. R., & Vikalo, H. 2018, arXiv:1809.02660
Mitschang, A. W., De Silva, G., Zucker, D. B., et al. 2014, MNRAS, 438, 2753
Mucciarelli, A., Bellazzini, M., Ibata, R., et al. 2012, MNRAS, 426, 2889
Ness, M. 2018, PASA, 35, e003
Ness, M., Hogg, D. W., Rix, H. W., Ho, A. Y. Q., & Zasowski, G. 2015, ApJ, 808, 16
Ness, M., Rix, H. W., Hogg, D. W., et al. 2018, ApJ, 853, 198
Oliver, J. J., Baxter, R. A., & Wallace, C. S. 1996, ICML, 364
Pérez, F., & Granger, B. E. 2007, CSE, 9, 21
Portegies Zwart, S. F., Hut, P., Makino, J., & McMillan, S. L. W. 1998, A&A, 337, 363
Price-Jones, N., & Bovy, J. 2018, MNRAS, 475, 1410
Rao, C. R. 1945, Bulletin of the Calcutta Mathematical Society, 37, 81, http://repository.ias.ac.in/71593/
Schwarz, G. 1978, The Annals of Statistics, 6, 461
Shannon, C. E. 1948, Bell System Technical Journal, 27, 379
Sheinis, A., Anguiano, B., Asplund, M., et al. 2015, JATIS, 1, 035002
Stewart, G. W. 1980, SJNA, 17, 403
Taylor, M. B. 2005, in ASP Conf. Ser. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert (San Francisco, CA: ASP), 29
Thompson, B. 2004, Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications (American Psychological Association)
Ting, Y.-S., Conroy, C., Rix, H.-W., & Cargile, P. 2019, ApJ, 879, 69
Ting, Y. S., Freeman, K. C., Kobayashi, C., De Silva, G. M., & Bland-Hawthorn, J. 2012, MNRAS, 421, 1231
Tipping, M. E., & Bishop, C. M. 1999, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61, 611
van der Walt, S., Colbert, S. C., & Varoquaux, G. 2011, CSE, 13, 22
Viswanathan, M., Wallace, C., Dowe, D. L., & Korb, K. 1999, in Proc. 12th Australian Joint Conf. on Artificial Intelligence, Vol. 1747, Sydney, Australia, 405
Wallace, C. S. 1995, Technical report, Department of Computer Science, Monash University, 95, 21
Wallace, C. S. 2005, Statistical and Inductive Inference by Minimum Message Length (Berlin: Springer)
Wallace, C. S., & Dowe, D. L. 1994, in Proc. 7th Australian Joint Conf. on Artificial Intelligence (Singapore: World Scientific), 37
Wallace, C. S., & Dowe, D. L. 1999, CompJ, 42, 270
Wallace, C. S., & Dowe, D. L. 2000, Statistics and Computing, 10, 73
Wallace, C. S., & Freeman, P. R. 1987, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 49, 240
West, C., & Heger, A. 2013, ApJ, 774, 75
