
Learning the Semantics of Discrete Random Variables: Ordinal or Categorical?

José Miguel Hernández-Lobato (1,3,*), James Robert Lloyd (3,*), Daniel Hernández-Lobato (2) and Zoubin Ghahramani (3)

(1) Harvard University. (2) Universidad Autónoma de Madrid. (3) Cambridge University. (*) Equal contributors.

Neural Information Processing Systems Foundation

1. Introduction

Motivation: When specifying a probabilistic model of data, the form of the model will typically depend on the spaces in which random variables take their values.

Problem: Automatic data analysis techniques must identify the type of data without supervision. It is not trivial to distinguish between categorical and ordinal data. Furthermore, inferring the ordering of the labels in the case of ordinal data is difficult.

[Figure: examples of each data type with their empirical densities]

Continuous Data:  -0.90, 0.18, 1.59, -1.13, -0.08, ...
Count Data:       12, 10, 5, 7, 12, 11, 4, 8, 11, 4, ...
Categorical Data: 2, 2, 2, 4, 1, 4, 2, 4, 2, 2, ...
Ordinal Data:     3, 3, 3, 3, 1, 3, 3, 3, 3, 1, ...

1:"Physics"2:"Statistics"3:"Algebra"4:"Calculus"5:"Operating Systems"

Categorical Labels:

1:"very low"2:"low"3:"medium"4:"high"5:"very high"

Oridinal Labels:

Solution: We present first attempts at this problem by fitting ordinal regression and multi-class classification models and then evaluating their quality of fit. Our ordinal regression models can learn the true ordering in ordinal data.

4. Multi-class Classification for Categorical Data

We have that y_i = argmax_{k ∈ L} f_k(x_i), where f_{v_1}, . . . , f_{v_L} are latent functions sampled from a GP. Define f = (f_{v_1}(x_1), . . . , f_{v_1}(x_n), . . . , f_{v_L}(x_1), . . . , f_{v_L}(x_n))^T. The likelihood of f given y = (y_1, . . . , y_n)^T and X = (x_1, . . . , x_n)^T is

p(y | X, f) = ∏_{i=1}^{n} ∏_{k ≠ y_i} Θ( f_{y_i}(x_i) − f_k(x_i) ).

Define f_{v_l} = (f_{v_l}(x_1), . . . , f_{v_l}(x_n))^T. The prior for f is p(f) = ∏_{l=1}^{L} N(f_{v_l} | 0, K_{v_l}).

EP approximates the posterior p(f | D) as q(f). The predictive distribution for y⋆ is

p(y⋆ | x⋆, D) ≈ ∫ p(y⋆ | f⋆) p(f⋆ | f) q(f) df⋆ df ,

where f⋆ = (f_{v_1}(x⋆), . . . , f_{v_L}(x⋆))^T and p(y⋆ | f⋆) = ∏_{k ≠ y⋆} Θ( f_{y⋆}(x⋆) − f_k(x⋆) ). This can be computed by solving a one-dimensional numerical integral.
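As an illustration of that one-dimensional integral, the sketch below approximates the predictive class probabilities when the approximate posterior over f⋆ is taken to be a product of independent Gaussians per class; this factorisation, the function name and the trapezoidal grid are assumptions of the sketch, not the poster's implementation.

```python
import numpy as np
from scipy.stats import norm

def predictive_class_probs(mu, var, n_grid=201, width=8.0):
    # Approximate p(y* = c | x*, D) for each class c, assuming q(f*) factorises
    # into independent Gaussians N(mu[k], var[k]) per latent function (an
    # assumption of this sketch).  The probability of class c reduces to
    #   integral of N(u | mu_c, var_c) * prod_{k != c} Phi((u - mu_k)/sd_k) du,
    # evaluated here with the trapezoidal rule.
    mu = np.asarray(mu, dtype=float)
    sd = np.sqrt(np.asarray(var, dtype=float))
    L = len(mu)
    probs = np.empty(L)
    for c in range(L):
        u = np.linspace(mu[c] - width * sd[c], mu[c] + width * sd[c], n_grid)
        integrand = norm.pdf(u, loc=mu[c], scale=sd[c])
        for k in range(L):
            if k != c:
                integrand = integrand * norm.cdf((u - mu[k]) / sd[k])
        probs[c] = np.trapz(integrand, u)
    return probs / probs.sum()  # renormalise away quadrature error

# Example: the third latent function has the largest mean, so class 3 dominates.
print(predictive_class_probs(mu=[0.1, -0.5, 1.2], var=[0.3, 0.3, 0.3]))
```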

[Figure: synthetic multi-class data on [-1, 1] x [-1, 1] with class labels 1-5 (panel "Labels"), and the posterior means of the latent functions f_1(x) and f_5(x).]

6. Results for Identifying the Type of Data

We compare OR-L and OR-SE with a multi-class classifier (MC) based on linear (MC-L) and squared exponential (MC-SE) covariance functions.
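For reference, a minimal sketch of the two covariance functions behind the -L and -SE variants; the amplitude and length-scale parameterisation shown here is an assumption of the sketch (in the poster the hyper-parameters are tuned by maximising the EP evidence).

```python
import numpy as np

def linear_cov(X1, X2, sigma2=1.0):
    # Linear covariance used by the -L variants: k(x, x') = sigma2 * x . x'
    return sigma2 * X1 @ X2.T

def squared_exponential_cov(X1, X2, sigma2=1.0, lengthscale=1.0):
    # Squared exponential covariance used by the -SE variants:
    # k(x, x') = sigma2 * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return sigma2 * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale ** 2)

X = np.random.default_rng(0).normal(size=(4, 2))
print(linear_cov(X, X).shape, squared_exponential_cov(X, X).shape)  # (4, 4) (4, 4)
```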

We consider the ordinal regression tasks and four additional multi-class datasets: Glass, Iris, New Thyroid, and Wine with 6, 3, 2 and 3 class labels, respectively.

Method   OR-L     OR-SE    MC-L     MC-SE
Auto     -0.679   -0.726   -0.874   -0.706
Boston   -0.901   -0.795   -0.957   -0.856
Fires    -1.044   -1.070   -1.050   -1.084
Yacht    -0.181   -0.180   -0.897   -0.207
Wins     4 (OR models)     0 (MC models)

Table: Avg. Test LL. Ordinal Tasks.

Method    OR-L     OR-SE    MC-L     MC-SE
Glass     -0.224   -0.133   -1.264   -0.096
Iris      -0.079   -0.092   -0.331   -0.112
Thyroid   -0.065   -0.077   -0.187   -0.066
Wine      -0.205   -0.113   -0.076   -0.103
Wins      1.5 (OR models)   2.5 (MC models)

Table: Avg. Test LL. Multi-class Tasks.

2. Ordinal Regression when the Ordering is Known

Assume a dataset D = {x_i, y_i}_{i=1}^{n}, where y_i ∈ L = {1, . . . , L} and σ is a permutation of 1, . . . , L such that σ(1), . . . , σ(L) is correctly ordered.

A sample f from a Gaussian process (GP) maps the x_i to the real line, which is split into L contiguous intervals with boundaries b_0 < . . . < b_L, where b_0 = −∞ and b_L = ∞.

Let f_i = f(x_i). The likelihood for f_i and b = (b_1, . . . , b_{L−1}) given y_i is then

p(y_i | f_i, b, σ) = ∏_{l=1}^{L−1} Θ[ sign(σ(y_i) − l − 0.5) (f_i − b_l) ] ,   (1)

where Θ is the Heaviside step function, p(b) = ∏_{l=1}^{L−1} N(b_l | m_l^0, v_l^0) and p(f) = N(f | m, K).

Expectation propagation (EP) approximates the exact posterior p(f, b | D) as q(f, b). The predictive distribution for the label y⋆ of a new vector x⋆ is approximated as

p(y⋆ | x⋆, D) = ∫ p(y⋆ | f⋆, b) p(f⋆ | f) p(f, b | D) df⋆ df db ≈ ∫ p(y⋆ | f⋆, b) p(f⋆ | f) q(f, b) df⋆ df db ,   (2)

where f⋆ = f(x⋆) and p(f⋆ | f) is the GP predictive distribution for f⋆ given f.
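As a minimal illustration of Eq. (1), the sketch below evaluates the hard ordinal likelihood of a single observation for fixed boundaries and a fixed ordering; the function name and toy values are illustrative only (in the full model b is random with the Gaussian prior above and the integrals in Eq. (2) are handled by EP).

```python
import numpy as np

def ordinal_likelihood(f_i, y_i, b, sigma):
    # Hard ordinal likelihood of Eq. (1) for one observation.  `b` holds the
    # interior boundaries b_1 < ... < b_{L-1} (b_0 = -inf, b_L = +inf are
    # implicit) and `sigma[y-1]` gives the position of label y in the ordering.
    # The product of Heaviside factors equals 1 exactly when f_i falls inside
    # the interval assigned to sigma(y_i), and 0 otherwise.
    L = len(b) + 1
    heaviside = lambda t: 1.0 if t > 0 else 0.0
    p = 1.0
    for l in range(1, L):  # l = 1, ..., L-1
        p *= heaviside(np.sign(sigma[y_i - 1] - l - 0.5) * (f_i - b[l - 1]))
    return p

# With boundaries (-1, 0, 1) and the identity ordering, a latent value of 0.4
# is only compatible with label 3, i.e. the interval (0, 1).
b = np.array([-1.0, 0.0, 1.0])
identity = [1, 2, 3, 4]
print([ordinal_likelihood(0.4, y, b, identity) for y in (1, 2, 3, 4)])  # [0.0, 0.0, 1.0, 0.0]
```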

[Figure: synthetic ordinal data on [-1, 1] x [-1, 1] with labels 1-5 (panel "Labels"), the posterior mean of the latent function f(x), and the learned interval boundaries on the real line with their associated labels 1-5.]

3. A Search Algorithm for Finding the True Ordering

The EP approximation of the model evidence can be maximized with respect to the hyper-parameters. Let z(σ) be the value of the maximized approximation given σ. We can infer σ by further optimizing z(σ) as follows:

Require: Dataset D = {x_i, y_i}_{i=1}^{n} with y_i ∈ L = {v_1, . . . , v_L}.
 1: Select σ uniformly at random and compute z(σ).
 2: Generate the set P with all the 2-element subsets of {1, . . . , L}.
 3: finished ← False.
 4: while not finished do
 5:   finished ← True.
 6:   for every subset {i, j} contained in P do   (these evaluations run in parallel)
 7:     Generate σ_{i,j} by swapping the elements i and j in σ.
 8:     Compute z(σ_{i,j}).
 9:   end for
10:   Find indexes {k, l} such that z(σ_{k,l}) ≥ z(σ_{i,j}) for any {i, j}.
11:   if z(σ_{k,l}) > z(σ) then
12:     finished ← False, σ ← σ_{k,l}, z(σ) ← z(σ_{k,l}).
13:   end if
14: end while
15: return σ
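A minimal Python sketch of this pairwise-swap search follows; the `evidence` callback stands in for z(σ), i.e. the EP model evidence maximised over hyper-parameters, which is not implemented here, and the surrogate objective in the toy check is purely for demonstration.

```python
import itertools
import random

def search_ordering(L, evidence, seed=0):
    # Greedy pairwise-swap search over orderings sigma of the labels 1..L.
    # `evidence(sigma)` stands in for z(sigma), the EP model evidence of the
    # ordinal model maximised over hyper-parameters (not implemented here).
    rng = random.Random(seed)
    sigma = list(range(1, L + 1))
    rng.shuffle(sigma)                                   # step 1: random start
    best_z = evidence(sigma)
    pairs = list(itertools.combinations(range(L), 2))    # step 2: all 2-subsets
    finished = False
    while not finished:                                  # steps 4-14
        finished = True
        candidates = []
        for i, j in pairs:       # step 6: independent evaluations (parallelisable)
            swapped = list(sigma)
            swapped[i], swapped[j] = swapped[j], swapped[i]
            candidates.append((evidence(swapped), swapped))
        z_best, sigma_best = max(candidates, key=lambda c: c[0])  # step 10
        if z_best > best_z:                              # steps 11-13: accept if better
            sigma, best_z, finished = sigma_best, z_best, False
    return sigma

# Toy check with a surrogate evidence that peaks at the identity ordering.
z = lambda s: -sum(abs(a - b) for a, b in zip(s, [1, 2, 3, 4, 5]))
print(search_ordering(L=5, evidence=z))  # [1, 2, 3, 4, 5]
```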

5. Results for Learning the Ordering

Regression problems from the UCI repository. The target variable is discretized using equal-probability binning: Boston Housing, Forest Fires, Auto MPG and Yacht.
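A small sketch of this equal-probability discretization, assuming quantile-based bin edges (the helper name and use of NumPy are illustrative, not the poster's code):

```python
import numpy as np

def equal_probability_bins(y, L=5):
    # Discretise a continuous target into L labels with (roughly) equal counts
    # by cutting at the empirical quantiles of y; returned labels are 1..L.
    edges = np.quantile(y, np.linspace(0.0, 1.0, L + 1)[1:-1])  # L-1 interior edges
    return np.digitize(y, edges) + 1

y = np.random.default_rng(0).normal(size=1000)
labels = equal_probability_bins(y, L=5)
print(np.bincount(labels)[1:])  # roughly 200 observations per label
```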

We fix L = 5, except in Forest Fires, where we fix L = 3. The accuracy of each method is computed in terms of the absolute value of Kendall's tau correlation coefficient between the true ranking of the labels and the ranking discovered by our algorithm.
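For example, the metric can be computed with `scipy.stats.kendalltau` applied to the true and recovered rankings of the labels; the rankings below are hypothetical.

```python
from scipy.stats import kendalltau

true_ranks = [1, 2, 3, 4, 5]     # true rank of each label (by construction of the bins)
found_ranks = [5, 4, 3, 1, 2]    # hypothetical ranks recovered by the search

tau, _ = kendalltau(true_ranks, found_ranks)
print(abs(tau))  # 0.8 here; |tau| = 1 means the ordering (or its reverse) is recovered exactly
```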

We use linear (OR-L) and squared exponential (OR-SE) covariance functions.

Table: Average Kendall's tau.

Method   Auto    Boston   Fires   Yacht
OR-L     1.000   1.000    0.333   1.000
OR-SE    0.840   0.968    0.427   1.000

7. Conclusions

- We have focused on distinguishing categorical data from ordinal data.
- Our solution works by evaluating the fit of ordinal and multi-class models.
- We can find the label ranking using a search procedure.
- Linear models correctly identify the true ranking most of the time, while non-linear models are less accurate.
- The test log-likelihood can be used to correctly identify the data type.

http://jmhl.org/ jmh@seas.harvard.edu
