A linear least squares framework for learning ordinal classes Ioannis Mariolis, PhD


Outline

• Introduction to Ordinal Data Modeling
• Generalized Linear Models
  – Ordinary Least Squares (OLS) Regression
  – Ordinal Logistic Regression (OLR)
• Linear Classifier of Ordinal Classes
  – learns a linear model
    • modifies OLS regression
• Experimental Results
  – synthetic datasets
  – real datasets
    • visual features
    • textile seam quality control
• Conclusions

Intro: Ordinal Data Modeling

• Collection of measurements called data
• Building a model to fit the data
• The term ordinal refers to the scale of measurement of the data

Intro: Scales of Measurement

• Measurement is the assignment of numbers to objects or events in a systematic fashion
• Four levels of measurement scales are commonly distinguished
  – Nominal
  – Ordinal
  – Interval
  – Ratio

Intro: Nominal Scale

• Nominal measurement consists of assigning items to groups or categories
• No quantitative information is conveyed and no ordering of the items is implied
  – qualitative rather than quantitative
• Variables measured on a nominal scale are often referred to as categorical or qualitative variables

Intro: Ordinal Scale

• Measurements with ordinal scales are ordered
  – higher numbers represent higher values
• The intervals between the numbers are not necessarily equal
• There is no "true" zero point for ordinal scales
  – the zero point is chosen arbitrarily

Intro: Interval Scale

• On interval scales, one unit represents the same magnitude across the whole range of the scale
• Interval scales do not have a "true" zero point
• It is not possible to make statements about how many times higher one score on that scale is than another
  – e.g. the Celsius scale for temperature
    • equal differences on this scale represent equal differences in temperature
    • but a temperature of 30 degrees is not twice as warm as one of 15 degrees

Intro: Ratio Scale

• Ratio scales are like interval scales except they have true zero points
  – e.g. the Kelvin scale of temperature
    • this scale has an absolute zero
    • a temperature of 300 Kelvin is twice as high as a temperature of 150 Kelvin


Earth’s mean temperature is about 14 °C (287 K), and it is inversely proportional to the square root of the Earth-Sun distance. Thus, doubling the distance lowers the temperature by a factor of about 1.4. The calculation must be made in Kelvin: 287/1.4 = 205 K, a drop of 82 degrees. The new temperature would therefore be −68 °C, and not 14/1.4 = 10 °C.
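The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name and the 273.15 offset are choices made here, and the exact factor √2 is used instead of the rounded 1.4, so the result comes out near −70 °C rather than the −68 °C quoted on the slide.

```python
import math

def scaled_temperature_celsius(t_celsius, distance_factor):
    """Scale a temperature by 1/sqrt(distance_factor) on the Kelvin (ratio) scale.

    Hypothetical helper for illustration; assumes temperature is inversely
    proportional to the square root of the Earth-Sun distance.
    """
    t_kelvin = t_celsius + 273.15              # move to a scale with a true zero
    new_kelvin = t_kelvin / math.sqrt(distance_factor)
    return new_kelvin - 273.15                 # report the result in Celsius again

correct = scaled_temperature_celsius(14.0, 2.0)  # ratio taken on the Kelvin scale
wrong = 14.0 / math.sqrt(2.0)                    # ratio taken directly on Celsius
print(round(correct, 1), round(wrong, 1))
```

Dividing the Celsius value directly treats an interval scale as if it were a ratio scale, which is exactly the mistake the slide warns about.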

Intro: Classification to Ordinal Classes

• Pattern classification addresses the issue of assigning objects to different categories called classes
• Most often those classes are of nominal scale
  – discrete classes
  – with no established relationship among them
• In some cases, additional information regarding the arrangement of the classes is available
  – e.g. an order among the classes is exhibited
  – in that case the predicted classes are of ordinal scale
  – classification is bridged to metric regression in a setting called ranking learning or ordinal regression


Note: metric regression is applied to variables measured on interval or ratio scales

Intro: State of the Art

• Ordinal regression problems have been addressed in both the machine learning and statistics domains
• In Frank (2001) the ordering of the classes was encoded by a set of nested binary classifiers
  – the classification results were organized for prediction accordingly
• A constrained classification approach, based on binary classifiers, was proposed in Har-Peled (2003)
• A loss function between pairs of ranks was used in Herbrich (2000)
  – employing distribution-independent methods
• Modifications of support vector machines have been proposed in Shashua (2003), Chu (2005), Pelckmans (2006)
  – incorporating information regarding the order of the classes in the design of SVMs
• A probabilistic kernel approach to ordinal regression was proposed by Chu (2005)
• In McCullagh (1980) multinomial logistic regression is extended to ordinal data by using cumulative probabilities
  – proportional odds model
  – proportional hazards model
• In Tutz (2003) generalized additive models were extended into a semi-parametric approach
  – based on the maximization of penalized log likelihood
  – choice of parameters based on minimization of the Akaike criterion
• In Johnson (1999) sampling techniques were employed in order to apply Bayesian inference on parametric models for ordinal data
• In Krammer (2001) and Torra (2006) the ordinal values are transformed into numeric values, and then standard metric regression analysis is performed


















Approach categories:

• Extending Binary Classifiers
• Extending SVM Classifiers
• Explicitly Ordinal Approach
• Treat Ordinal Data as Numeric

Ordinary Least Squares will be implied when referring to Metric Regression

GLMs: Generalized Linear Models

• GLMs are a generalization of OLS regression
  – they were formulated as a way of unifying under one framework
    • linear regression
    • logistic regression
    • Poisson regression
  – a general algorithm for maximum likelihood estimation in all these models has been developed
• According to GLM theory
  – a linear predictor is related to the distribution function of the dependent variables through a link function
  – each outcome of the dependent variables, Y, is assumed to be generated from a particular exponential-type probability density function
    • Normal, Binomial, Poisson distributions, etc.
  – the mean, μ, of the distribution depends on the independent variables, x, through:

    $E\{Y\} = g^{-1}(\mathbf{x}\mathbf{b})$

    where E{Y} is the expected value of Y; g is the link function; b are the unknown weights of the linear model
  – the unknown weights b, also called regression coefficients, are typically estimated with maximum likelihood or Bayesian techniques










In case Y follows the Normal distribution and g is the identity function, the GLM is the standard linear regression model










In the context of this presentation x corresponds to feature vectors and Y to classes

GLMs: Ordinary Least Squares

• The simplest GLM, and a very popular one
• The distribution function is the normal distribution with constant variance and the link function is the identity

  $E\{Y\} = \mathbf{x}\mathbf{b}$

• Unlike most other GLMs, the maximum likelihood estimates of the linear weights are provided in a closed-form solution
• X is the matrix consisting of all available feature vectors x
• Y is the vector consisting of the observed values of the dependent variables Y
• The model’s linear weights b are given by

  $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$
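The closed-form OLS estimate can be sketched in NumPy as follows. The function name `ols_fit` is introduced here for illustration only; the normal equations are solved with `np.linalg.solve` rather than by forming the inverse explicitly, which gives the same b when X has full column rank.

```python
import numpy as np

def ols_fit(X, Y):
    """Return b = (X^T X)^{-1} X^T Y for a design matrix X of shape (N, d)."""
    # Solve (X^T X) b = X^T Y; assumes X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Small usage example with noise-free targets (made-up data).
rng = np.random.default_rng(0)
X = np.hstack([rng.uniform(size=(50, 2)), np.ones((50, 1))])  # last column: intercept
b_true = np.array([2.0, -1.0, 0.5])
b_hat = ols_fit(X, X @ b_true)
print(b_hat)
```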

GLMs: Ordinary Least Squares (cont.)

• OLS is designed to process interval or ratio variables
• OLS estimates are likely to be satisfactory from a statistical perspective when an ordinal level variable is examined
  – if it is measured in a relatively high number of ascending categories
  – if it can be assumed that the interval each category represents is the same as the prior interval
• Thus, OLS can be applied to ordinal measurements treated as if they were interval
  – it is most likely that some of the assumptions of the Gauss-Markov theorem are not met and the regression is not the Best Linear Unbiased Estimator

GLMs: Ordinal Logistic Regression

• Explicitly takes into account an ordered categorical dependent variable and does not assume any specific distance among the categories
• Different regression models that can be applied in case of ordinal measurements are proposed
  – the proportional odds model is assumed
• Like in multinomial logistic regression (MLR), in OLR
  – a multinomial distribution is assumed
  – the logit is selected as the link function
• The main difference between MLR and OLR is that rather than estimating the probability of a single category, OLR estimates a cumulative probability
  – i.e. the probability that the outcome is equal to or less than the category of interest c

  $P(Y \le c) = \sum_{i=1}^{c} P(Y = i)$


c denotes the integer values used to label the classes

GLMs: Ordinal Logistic Regression (cont.)

• The proportional odds model employs the cumulative probability’s logit equation

  $\operatorname{logit}\left[P(Y \le c)\right] = \ln\frac{P(Y \le c)}{1 - P(Y \le c)} = \theta_c - \mathbf{x}\mathbf{b}$

• The threshold values $\theta_c$ are different for each category
• The weights of the linear model contained in vector b are assumed to remain constant for every category
• A Log-Likelihood function (LL) is created and the parameter values that maximize that function are estimated using computational methods

Using the logit equation, the probabilities for each instance belonging to each class can be estimated
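As a sketch of that last step, the snippet below turns cumulative logits into per-class probabilities. It assumes the sign convention logit P(Y ≤ c) = θ_c − xb with increasing thresholds θ_1 < ... < θ_{K−1}; the function name and the example numbers are made up for illustration, and the thresholds and weights would normally come from the maximum likelihood fit.

```python
import numpy as np

def olr_class_probabilities(x, b, thresholds):
    """Class probabilities P(Y = 1..K) under a proportional odds model.

    Assumes logit P(Y <= c) = theta_c - x.b for c = 1..K-1 (a convention
    chosen here) and that `thresholds` is sorted in increasing order.
    """
    eta = np.asarray(thresholds, dtype=float) - float(np.dot(x, b))
    cumulative = 1.0 / (1.0 + np.exp(-eta))   # P(Y <= c) for c = 1..K-1
    cumulative = np.append(cumulative, 1.0)   # P(Y <= K) = 1
    return np.diff(cumulative, prepend=0.0)   # P(Y = c) = P(Y <= c) - P(Y <= c-1)

probs = olr_class_probabilities(x=[0.4, 1.0], b=[1.5, -0.2], thresholds=[-1.0, 0.5, 2.0])
print(probs, probs.sum())
```

The predicted class is then the one with the largest probability, e.g. `np.argmax(probs) + 1`.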

LCOC: Linear Classifier of Ordinal Classes

• Numerical mapping of the K ordered classes $\omega_1 \prec \omega_2 \prec \dots \prec \omega_K$ into real numbers $z_1 < z_2 < \dots < z_K$
• Classification is based on the assumption of a linear relationship between
  – the numerical input vectors and
  – the numerical values assigned to the ordered classes
• A linear output y is produced as the dot product of input vector x and vector b containing the weights of the linear model
• The output o is the class $\omega_j$ whose assigned numerical value $z_j$ is the nearest to the linear output y; j is given by

  $j = \arg\min_k |y - z_k|, \quad k = 1, 2, \dots, K$

In case of metric regression a numerical mapping is needed and the results do not correspond to probabilities
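The decision rule can be written in a few lines of NumPy. This is a minimal sketch with names chosen here (`lcoc_predict`); classes are returned as integers 1..K, and ties are broken in favour of the lower class because `argmin` returns the first minimum.

```python
import numpy as np

def lcoc_predict(X, b, z):
    """Assign each row of X to the class whose value z_k is nearest to y = x.b."""
    y = X @ b                                       # linear outputs, shape (N,)
    distances = np.abs(y[:, None] - z[None, :])     # |y_i - z_k|, shape (N, K)
    return np.argmin(distances, axis=1) + 1         # class indices in 1..K

# Non-uniform class values: the gap between classes 2 and 3 is wider.
z = np.array([1.0, 2.0, 4.0])
X = np.array([[1.0], [2.9], [3.1]])
print(lcoc_predict(X, np.array([1.0]), z))
```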


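LCOC: Training LCOC - the naïve case

• Arbitrary consecutive numbers are assigned to the ordered classes:

  $\text{if } \omega^{(i)} = \omega_k \text{ then } t_i = k, \quad k = 1, 2, \dots, K$

  where $\omega^{(i)}$ denotes the class of the i-th training sample
• The linear output of the classifier is xb, where vector b has been estimated by minimizing the Sum of Squared Errors (SSE)

  $\text{SSE} = (\mathbf{t} - \mathbf{X}\mathbf{b})^T(\mathbf{t} - \mathbf{X}\mathbf{b})$

  matrix X is the design matrix consisting of all available input vectors, and t denotes the vector of the corresponding targets
• Then

  $\mathbf{b} = \arg\min_{\mathbf{b}} \text{SSE} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$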
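A short sketch of the naïve case, with function names chosen here for illustration. Because the class values are the consecutive integers 1..K, picking the nearest class value amounts to rounding the linear output and clipping it to the range [1, K].

```python
import numpy as np

def naive_fit(X, labels):
    """OLS weights when class k is given the target value t = k."""
    t = np.asarray(labels, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ t)   # b = (X^T X)^{-1} X^T t

def naive_predict(X, b, K):
    """Nearest integer class value in 1..K to the linear output."""
    y = X @ b
    return np.clip(np.rint(y), 1, K).astype(int)

# Toy data (made up): the grade equals the first feature exactly.
labels = np.tile(np.arange(1, 6), 4)
X = np.column_stack([labels.astype(float), np.ones(labels.size)])
b = naive_fit(X, labels)
print(b, naive_predict(X, b, K=5))
```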

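LCOC: Training LCOC - the proposed case

• Target vector t is decomposed into a product of
  – a known matrix S coding the target classes of the training samples
  – and a parameter vector z whose elements are the unknown numerical values assigned to the K classes
• SSE becomes

  $\text{SSE} = (\mathbf{S}\mathbf{z} - \mathbf{X}\mathbf{b})^T(\mathbf{S}\mathbf{z} - \mathbf{X}\mathbf{b})$

• where

  $S_{ij} = \begin{cases} 1 & \text{if } \omega^{(i)} = \omega_j \\ 0 & \text{otherwise} \end{cases}$

  and $\omega^{(i)}$ denotes the class of the i-th training sample
• SSE minimization revisited

  $\{\boldsymbol{\zeta}, \mathbf{b}\} = \arg\min_{\boldsymbol{\zeta}, \mathbf{b}} (\mathbf{S}\mathbf{z} - \mathbf{X}\mathbf{b})^T(\mathbf{S}\mathbf{z} - \mathbf{X}\mathbf{b})$

  $\mathbf{z} = [A_1, \boldsymbol{\zeta}^T, A_K]^T$ and $\boldsymbol{\zeta} = [z_2, z_3, \dots, z_{K-1}]^T$

Least Squares Ordinal Classification (LSOC)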


The selection of the bounding values $A_1$ and $A_K$ does not affect the classification results

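LCOC: Training LCOC - the proposed case (cont.)

• Since SSE is quadratic with respect to b and z, setting the partial derivatives of SSE to zero results in

  $\mathbf{b} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{S}\mathbf{z}, \qquad \mathbf{P}\boldsymbol{\zeta} - \tilde{\mathbf{S}}^T\mathbf{X}\mathbf{b} = \mathbf{0}$

• where $\tilde{\mathbf{S}} = [\mathbf{S}_2\ \mathbf{S}_3\ \cdots\ \mathbf{S}_{K-1}]$, $\mathbf{P} = \operatorname{diag}(p_2, p_3, \dots, p_{K-1})$
  – $\mathbf{S}_k$ is the k-th column of S and $p_k$ is the number of training samples in class $\omega_k$
• if the estimated z parameters were also employed by OLS, the same b parameters would have been estimated by both training methods
• the estimated ζ values are in fact the intra-class average values of the linear outputs
• By substituting the b vector given by the first equation into the second, the system of linear equations becomes

  $(\mathbf{P} - \tilde{\mathbf{S}}^T\mathbf{H}\tilde{\mathbf{S}})\,\boldsymbol{\zeta} = A_1\tilde{\mathbf{S}}^T\mathbf{H}\mathbf{S}_1 + A_K\tilde{\mathbf{S}}^T\mathbf{H}\mathbf{S}_K, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$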
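The closed-form training step can be sketched in NumPy as below. This is an illustration under stated assumptions, not the author's implementation: the function name `lsoc_fit`, the default bounding values A_1 = 1 and A_K = K, and the requirements that X contains a column of ones, that K ≥ 3, and that every class has at least one training sample are all choices made here.

```python
import numpy as np

def lsoc_fit(X, labels, K, A1=1.0, AK=None):
    """Estimate weights b and class values z = [A1, zeta, AK] in closed form.

    X: (N, d) design matrix; labels: integer array with values in 1..K.
    """
    AK = float(K) if AK is None else float(AK)
    N = X.shape[0]
    S = np.zeros((N, K))
    S[np.arange(N), labels - 1] = 1.0            # S_ij = 1 if sample i is in class j
    S_mid = S[:, 1:K - 1]                        # columns S_2 .. S_{K-1}
    P = S_mid.T @ S_mid                          # diag(p_2, ..., p_{K-1})
    # H S with H = X (X^T X)^{-1} X^T, without forming the N x N matrix H.
    HS = X @ np.linalg.solve(X.T @ X, X.T @ S)
    lhs = P - S_mid.T @ HS[:, 1:K - 1]
    rhs = A1 * (S_mid.T @ HS[:, 0]) + AK * (S_mid.T @ HS[:, K - 1])
    zeta = np.linalg.solve(lhs, rhs)             # interior class values z_2..z_{K-1}
    z = np.concatenate(([A1], zeta, [AK]))
    b = np.linalg.solve(X.T @ X, X.T @ (S @ z))  # b = (X^T X)^{-1} X^T S z
    return b, z
```

New samples would then be classified by assigning each linear output xb to the class with the nearest estimated value in z.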

LCOC: Invariant Error Measure

• When the numerical values of the classes are not fixed, the classification results do not depend only on the magnitude of the error, but also on the distance among the classes
• Proposed measure that is also minimized by the LSOC training method

  $r_z^2 = \frac{\text{MSE}}{\text{MD}^2} = \frac{\frac{1}{M}\text{SSE}}{\text{MD}^2}, \qquad \text{MD} = \frac{A_K - A_1}{K - 1} \text{ for monotone } z \text{ values}$

  where M is the number of training samples and MD is the mean distance among the classes
• However, unlike SSE
  – it takes into account the distance between the classes
  – it is invariant to the selection of the bounding values $A_1$ and $A_K$, since $\text{SSE} \propto (A_K - A_1)^2$

Exper: Experimental Evaluation

• Both synthetic and real datasets are examined
• Synthetic input vectors were produced by means of a random number generator
  – an arbitrary linear model produces linear targets
  – quantizing the linear targets produces class targets
    • quantization levels correspond to ordered classes
  – initially, error is introduced into the linear model only by quantization
  – the performance of the proposed training method was also assessed in case of weaker linear dependency
    • Additive White Gaussian Noise (AWGN) was introduced into the linear model before quantization
• Real datasets involve visual inspection of seam specimens classified into five grades of quality
  – the critical assumption of linear dependency is unverified
    • if not valid, the classification accuracy of LSOC is anticipated to be as poor as that of OLS, or even worse
  – the produced results were also compared to those of Ordinal Logistic Regression (OLR)
    • OLR is a good choice for comparison, since its model employs the same number of parameters as that of LSOC
    • however, OLR relies on computational methods to estimate these parameters, whereas LSOC employs a closed-form solution

Exper: Synthetic Datasets

• The data were generated artificially using a uniform random number generator
  – 1000 5-dimensional input vectors
  – the vectors were augmented by adding an extra unit element
  – grouped into a design matrix of size 1000×6
  – 6 arbitrary values were randomly selected as the weights of the linear model
  – the design matrix was multiplied by the weight vector to create the vector of linear targets
    • consisting of 1000 values linearly dependent on the corresponding input vectors
  – the elements of the linear targets’ vector were placed in monotonically increasing order by rearranging the rows of the design matrix accordingly
• The 1st synthetic dataset contains 10 ordered classes with 100 input vectors in each class
  – the 1000 input vectors were grouped together in hundreds
    • the first 100 input vectors of the matrix were assigned to the first class, and so on until the 10th class
• The 2nd synthetic dataset used the same design matrix and vector of linear weights
  – the 1st and the 2nd class were assigned 300 input vectors each
  – the 8 remaining classes were assigned 50 vectors each
  – the class targets of the input vectors are different for the second dataset
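The construction described above could be reproduced along the following lines. The seed, the [0, 1) sampling range and the function name `make_synthetic` are assumptions made for this sketch; the slides do not state them.

```python
import numpy as np

def make_synthetic(class_sizes, seed=0):
    """Build a sorted design matrix, linear targets and ordered class labels."""
    rng = np.random.default_rng(seed)
    n = sum(class_sizes)
    # n 5-dimensional uniform vectors, augmented with a unit element.
    X = np.hstack([rng.uniform(size=(n, 5)), np.ones((n, 1))])
    b = rng.uniform(size=6)                      # arbitrary linear weights
    y = X @ b                                    # linear targets
    order = np.argsort(y)                        # sort rows by increasing target
    X, y = X[order], y[order]
    # Consecutive blocks of rows form the ordered classes 1..K.
    labels = np.repeat(np.arange(1, len(class_sizes) + 1), class_sizes)
    return X, y, labels, b

# 1st dataset: 10 classes of 100; 2nd dataset: 300, 300 and eight classes of 50.
X1, y1, labels1, b1 = make_synthetic([100] * 10)
X2, y2, labels2, b2 = make_synthetic([300, 300] + [50] * 8)
```

With the same seed and the same total of 1000 vectors, both calls draw identical inputs and weights, so only the class targets differ between the two datasets.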

Exper: Synthetic Datasets

Euclidean distance of z values from the normalized centers

• 1st dataset
  – LSOC: 0.05
  – OLS: 0.32
• 2nd dataset
  – LSOC: 0.54
  – OLS: 0.90


















• R² denotes the coefficient of determination
• CA denotes Classification Accuracy
• V denotes 10-fold Cross-Validation


• AWGN has been introduced into the estimation of the linear targets
• The Mean Distance (MD) among the classes has been calculated
• The standard deviation of the added noise was set from 5% of MD to 100% of MD
  – in increments of 5% of MD
• Thus, for each dataset 20 different cases with increasing noise ratios were constructed and tested

[Figure panels: 1st synthetic dataset; 2nd synthetic dataset]
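A possible sketch of the noise protocol is given below. How MD was computed is not stated on the slides; here it is assumed to be the mean gap between consecutive class means of the linear targets, and all function names are hypothetical.

```python
import numpy as np

def mean_class_distance(y, labels):
    """Assumed definition of MD: mean gap between consecutive class means."""
    centers = np.array([y[labels == k].mean() for k in np.unique(labels)])
    return np.mean(np.diff(centers))

def noise_levels(md):
    """Standard deviations from 5% to 100% of MD in 5% steps (20 cases)."""
    return md * np.linspace(0.05, 1.0, 20)

def add_awgn(y, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to the targets."""
    return y + rng.normal(0.0, sigma, size=y.shape)

# Toy targets (made up): 10 classes of 100 equally spaced values.
y = np.arange(1000, dtype=float)
labels = np.repeat(np.arange(1, 11), 100)
md = mean_class_distance(y, labels)
noisy_cases = [add_awgn(y, s, np.random.default_rng(i)) for i, s in enumerate(noise_levels(md))]
```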

Exper: Real Datasets

• Image database of 325 seam specimens, belonging to three different types of fabric
• Specimen size approximately 20×4 cm
• A committee of three experts labelled each specimen by assigning a grade denoting the quality of the seam
  – 1 (worst) to 5 (best)
• For each specimen three ratings are assigned
  – the median is selected as the actual grade
  – the average agreement of each expert with the median ratings was 80.3% ± 1.8%
• 3 different feature sets, all based on intensity curves
  – Roughness features
  – FFT features
  – Fractal features
• 4 different features in each set

Exper: Textile Seam Quality Control

ISO 7700 Standard

Exper: Pre-process

[Figure: panels (a) to (e)]

Exper: Intensity Curves

[Figure: image regions I(1), I(2), I(3), I(4); axis label: image row]

Mean intensity values (column-wise)

$S_m^{(j)} = \frac{1}{N_j}\sum_{n=1}^{N_j} I_{m,n}^{(j)}$

[Figure: intensity curves S(1), S(2), S(3), S(4)]

Exper: Feature Extraction - Roughness Features

• Moving Average filter

  $MV_m^{(j)} = \frac{1}{W}\sum_{k=m-W/2}^{m+W/2} S_k^{(j)}$

• Intensity Deviation

  $R_j = \frac{1}{M}\sum_{m=1}^{M}\left|S_m^{(j)} - MV_m^{(j)}\right|$
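A roughness feature of this kind, the mean absolute deviation of an intensity curve from its moving average, might be computed as follows. The sketch assumes an odd window length, normalizes by the number of taps, and drops the curve ends where the window does not fit; these edge and normalization choices, like the function name, are assumptions rather than details taken from the slides.

```python
import numpy as np

def roughness(curve, window):
    """Mean absolute deviation of a 1-D intensity curve from its moving average."""
    curve = np.asarray(curve, dtype=float)
    if window % 2 == 0 or window > curve.size:
        raise ValueError("window must be odd and no longer than the curve")
    kernel = np.ones(window) / window
    moving_avg = np.convolve(curve, kernel, mode="valid")  # centered averages
    half = window // 2
    core = curve[half:curve.size - half]                   # samples with a full window
    return np.mean(np.abs(core - moving_avg))

print(roughness([0, 1, 0, 1, 0, 1, 0], window=3))  # zigzag curve: rough
print(roughness(np.linspace(0, 1, 50), window=5))  # linear ramp: smooth
```

A perfectly linear ramp gives zero roughness because a centered average reproduces it exactly, while a zigzag curve deviates from its average at every sample.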

Exper: Feature Extraction - FFT Features

• Using the first 40 FFT coefficients produced from each intensity curve
• Applying averaging using different window centers and sizes
• Selecting the window settings that present the highest correlation with the quality grades

Exper: Feature Extraction - Fractal Features

• The Modified Pixel Dilation method (MPD) is applied to an intensity curve, estimating its fractal dimensions
  – Each intensity curve is treated as a binary image
  – n successive dilation operations are performed
  – The area S(n) occupied by the produced curves and the area E(n) occupied by a single pixel that has been dilated by the same morphological operator are calculated for different values of n
  – The relationship among the fractal dimension D, S(n), and E(n) is given by

    $\frac{S(n)}{E(n)} = r\,E(n)^{-D/2}$

Exper: Roughness Results

• LSOC improves results of the naïve case
  – outperforms OLS if >20 training samples
• LSOC generalizes better than OLR with a limited training set
  – outperforms OLR if <45 training samples

Exper: FFT Results

• Similar to RF results
• LSOC’s performance is even closer to OLR’s
  – indicates a stronger linear relationship between FFT features and quality grades

Exper: Fractal Results

• Different from RF or FFT results
• Both metric methods are outperformed by OLR even for limited training sets
• LSOC’s performance is slightly worse than OLS’s
• Indicates a weak linear relationship between Fractal features and quality grades

Exper: Summarizing Results

• In case of the synthetic datasets
  – the linear dependency between feature vectors and class values is established
  – the proposed method produces significantly better results than the naïve approach
  – the difference in performance is even greater in case of the 2nd synthetic dataset, where the intervals between the classes are less uniform
• In case of real datasets
  – OLR presents the highest performance for all feature sets
    • provided a large number of training samples is available
  – LSOC presents, in almost every case, higher classification accuracy than OLS
  – if the linear relation between the inputs and the outputs is not very strong, the proposed method is not likely to outperform the naïve approach
    • in such cases, however, the performance of both classifiers is very poor anyway, thus other approaches, like OLR, should be considered

Concl: Conclusion

• A common strategy for selecting an appropriate classification method for a specific task is to
  – start with the simplest one and check its performance
  – consider more complex methods if the performance is not adequate
• The OLS regression approach is by far the simplest of all ordinal classification methods
  – presenting computational efficiency
  – ease of implementation
• In the naïve case arbitrary numerical values are assigned to the ordered classes
  – inappropriate numerical mapping can result in poor classification performance
• LSOC estimates an optimal mapping using a novel goodness of fit measure
  – like in OLS, a linear model is employed
  – the model’s parameters derive through a closed-form expression
  – the computational efficiency of the naïve approach is retained

Concl: Conclusion (cont.)

• In the experimental evaluation it was demonstrated that if LSOC is used instead of OLS the classification accuracy can be significantly increased
  – the accuracy of 76% and 39% presented by OLS for the 1st and 2nd synthetic datasets was increased to 93% and 83%, respectively, with LSOC
  – a similar trend was present both when Gaussian noise was added to the synthetic datasets and in case of real datasets
• LSOC was also compared to OLR
  – a more sophisticated method explicitly designed to handle ordinal data
  – even though OLR achieves higher accuracy when a large number of training samples is employed, it is outperformed by LSOC when this number decreases
  – LSOC can be an attractive choice when only a limited number of training samples is available
  – due to its computational simplicity, LSOC is also an attractive choice if speed of calculations is an issue
• In future work the performance of LSOC can be further investigated in case non-linear kernels are applied to the original input vectors
  – transferring them into a higher-dimensional space where linearity holds

Concl

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

George E.P. Box
