82
Data Preprocessing Javier Béjar URL - Spring 2020 CS - MAI

Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data Preprocessing

Javier Béjar

URL - Spring 2020

CS - MAI

Page 2: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Introduction

Page 3: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data representation

• Unstructured datasets:

• Examples described by a flat set of attributes: attribute-valuematrix

• Structured datasets:• Individual examples described by attributes but with relations

among them: sequences (time, spatial, ...), trees, graphs

• Sets of structured examples (sequences, graphs, trees)

URL - Spring 2020 - MAI 1/78

Page 4: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Unstructured data

• Only one table of observations

• Each example represents aninstance of the problem

• Each instance is represented by aset of attributes (discrete,continuous)

A B C · · ·1 3.1 a · · ·1 5.7 b · · ·0 -2.2 b · · ·1 -9.0 c · · ·0 0.3 d · · ·1 2.1 a · · ·...

...... . . .

URL - Spring 2020 - MAI 2/78

Page 5: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Structured data

• One sequential relation amonginstances (Time, Strings)• Several instances with

internal structure• Subsequences of

unstructured instances• One large instance

• Several relations amonginstances (graphs, trees)• Several instances with

internal structure• One large instance

Page 6: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data Streams

• Endless sequence of data• Several streams

synchronized

• Unstructured instances

• Structured instances

• Static/Dynamic model

URL - Spring 2020 - MAI 4/78

Page 7: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data representation

• Most unsupervised learning algorithms are specifically fitted forunstructured data

• The data representation is equivalent to a database table(attribute-value pairs)

• Specialized algorithms have been developed for structured data:Graph clustering, Sequence mining, Frequent substructures

• The representation of these types of data is sometimesalgorithm dependent

URL - Spring 2020 - MAI 5/78

Page 8: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data Preprocessing

Page 9: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Data preprocessing

• Usually raw data is not directly adequate for analysis

• The usual reasons:• The quality of the data (noise/missing values/outliers)

• The dimensionality of the data (too many attributes/too manyexamples)

• The first step of any data task is to assess the quality of thedata

• The techniques used for data preprocessing are usually orientedto unstructured data

URL - Spring 2020 - MAI 6/78

Page 10: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Outliers

• Outliers: Examples with extreme values compared to the restof the data

• Can be considered as examples with erroneous values

• Have an important impact on some algorithms

Page 11: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Outliers

• The exceptional values could appear in all or only a fewattributes

• The usual way to correct this problem is to eliminate theexamples

• If the exceptional values are only in a few attributes these couldbe treated as missing values

URL - Spring 2020 - MAI 8/78

Page 12: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Parametric Outliers Detection

• Assumes a probabilistic distribution for the attributes

• Univariate• Perform Z-test or student’s test

• Multivariate

• Deviation method: reduction in data variance when eliminated

• Angle based: variance of the angles to other examples

• Distance based: variation of the distance from the mean of thedata in different dimensions

URL - Spring 2020 - MAI 9/78

Page 13: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Non parametric Outliers Detection

• Histogram based: Define a multidimensional grid and discardcells with low density

• Distance based: Distance of outliers to their k-nearest neighborsare larger

• Density based: Approximate data density using Kernel Densityestimation or heuristic measures (Local Outlier Factor, LOF)

URL - Spring 2020 - MAI 10/78

Page 14: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Outliers: Local Outlier Factor

• LOF quantifies the outlierness of an example adjusting forvariation in data density

• Uses the distance of the k-th neighbor Dk(x) of an exampleand the set of examples that are inside this distance Lk(x)

• The reachability distance between two data points Rk(x, y)

is defined as the maximum between the distance dist(x, y) andthe y’s k-th neighbour distance

URL - Spring 2020 - MAI 11/78

Page 15: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Reachability distance

URL - Spring 2020 - MAI 12/78

Page 16: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Outliers: Local Outlier Factor

• The average reachability distance ARk(x) with respect ofan example’s neighbourhood (Lk(x)) is defined as the averageof the reachability distances to the neighbourhood

Page 17: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Local Outlier Factor

• The LOF of an example is computed as the mean ratiobetween ARk(x) and the average reachability of its k neighbors:

LOFk(x) =1

k

∑y∈Lk(x)

ARk(x)

ARk(y)

• This value ranks all the examples

URL - Spring 2020 - MAI 14/78

Page 18: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Outliers: Local Outlier Factor

2 0 2 4 6

1

0

1

2

3

4

5

2 0 2 4 6

1

0

1

2

3

4

5

Page 19: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Missing values

• Missing values appear because of errors or omissions during thegathering of the data

• They can be substituted to increase the quality of the dataset(value imputation)• Global constant for all the values

• Mean or mode of the attribute (global central tendency)

• Mean or mode of the attribute but only of the k nearestexamples (local central tendency)

• Learn a model for the data (regression, bayesian) and use it topredict the values

• Problem: changes the statistical distribution of the dataURL - Spring 2020 - MAI 16/78

Page 20: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Missing values

Missing Values Mean substitution 1-neighbor substitution

URL - Spring 2020 - MAI 17/78

Page 21: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Normalization

Normalizations are applied to quantitative attributes in order toeliminate the effect of having different scale measures

• Range normalization: Transform all the values of theattribute to a preestablished scale (e.g.: [0,1], [-1,1])

x− xminxmax − xmin

• Distribution normalization: Transform the data to a specificstatistical distribution with preestablished parameters (e.g.:Gaussian N (0, 1))

x− µxσx

Page 22: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Discretization

Discretization allows transforming quantitative attributes toqualitative attributes

• Equal size bins: Pick the number of values and divide therange of data in equal sized bins

• Equal frequency bins: Pick the number of values and dividethe range of data so each bin has the same number of examples(the size of the intervals will be different)

URL - Spring 2020 - MAI 19/78

Page 23: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Discretization

Discretization allows transforming quantitative attributes toqualitative attributes

• Distribution approximation: Calculate a histogram of thedata and fit a kernel function (KDE), the intervals are wherethe function has its minima

• Other techniques: Apply entropy based measures, MinimumDescription Length (MDL), clustering

URL - Spring 2020 - MAI 20/78

Page 24: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Discretization

Same size

Same Frequency

Histogram

URL - Spring 2020 - MAI 21/78

Page 25: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Python Notebooks

These two Python Notebooks show some examples of the effect ofmissing values imputation and data discretization and normalization

• Missing Values Notebook (click here to go to the url)

• Preprocessing Notebook (click here to go to the url)

If you have downloaded the code from the repository you will beable to play with the notebooks (run jupyter notebook to open thenotebooks)

URL - Spring 2020 - MAI 22/78

Page 26: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Dimensionality Reduction

Page 27: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

The curse of dimensionality

• Problems due to the dimensionality of data

• The computational cost of processing the data

• The quality of the data

• Elements that define the dimensionality of data• The number of examples

• The number of attributes

• Usually the problem of having too many examples can besolved using sampling.

URL - Spring 2020 - MAI 23/78

Page 28: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Reducing attributes

• The number of attributes has an impact on the performance:• Poor scalability

• Inability to cope with irrelevant/noisy/redundant attributes

• Methodologies to reduce the number of attributes:• Dimensionality reduction: Transforming to a space of less

dimensions

• Feature subset selection: Eliminating not relevant attributes

URL - Spring 2020 - MAI 24/78

Page 29: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Dimensionality reduction

• New dataset that preserves most of the information of theoriginal data but with less attributes

• Many techniques have been developed for this purpose• Projection to a space that preserve the statistical distribution of

the data (PCA, ICA)

• Projection to a space that preserves distances among the data(Multidimensional scaling, random projection, nonlinear scaling)

URL - Spring 2020 - MAI 25/78

Page 30: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis

• Principal Component Analysis:

• Data is projected onto a set of orthogonal dimensions(components) that are a linear combination of the originalattributes

• The components are uncorrelated and are ordered by theinformation they have

• We assume data follows gaussian distribution

• Global variance is preserved

URL - Spring 2020 - MAI 26/78

Page 31: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis

Computes a projection matrix where the dimensions are orthogonal(linearly independent) and data variance is preserved

Y

X

w1*Y+w2*Xw3*

Y+w4*X

URL - Spring 2020 - MAI 27/78

Page 32: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis

• Principal components: vectors that are the best linearapproximation of the data

f(λ) = µ+ Vqλ

µ is a location vector in Rp, Vq is a p× q matrix of qorthogonal unit vectors and λ is a q vector of parameters

• The reconstruction error for the data is minimized:

mı́nµ,{λi},Vq

N∑i=1

||xi − µ− Vqλi||2

URL - Spring 2020 - MAI 28/78

Page 33: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis - Computation

• Assuming x̄ = 0 we can obtain the projection matrix bySingular Value Decomposition of the data matrix X

X = UDV T

• U is a N × p orthogonal matrix, its columns are the leftsingular vectors

• D is a p× p diagonal matrix with ordered diagonal values calledthe singular values

• The columns of UD are the principal components

• We can pick the first principal components that account for apercentage of the total variance (e.g. 90%)

URL - Spring 2020 - MAI 29/78

Page 34: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis - Intuition

X

Y

Original Data

Page 35: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis - Intuition

X

Y

First component along the maximum variance of the data

Page 36: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Principal Component Analysis - Intuition

X

Y

Next component maximum variance perpendicular to the othercomponents

Page 37: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Kernel PCA

• PCA is a linear transformation, this means that if data islinearly separable, the reduced dataset will be linearly separable(given enough components)

• We can use the kernel trick to map the original attribute to aspace where non linearly separable data is linearly separable

• Distances among examples are defined as a dot product thatcan be obtained using a kernel:

d(xi, xj) = Φ(xi)TΦ(xj) = K(xi, xj)

URL - Spring 2020 - MAI 33/78

Page 38: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Kernel PCA

• Different kernels can be used to perform the transformation tothe feature space (polynomial, gaussian, ...)

• The computation of the components is equivalent to PCA butperforming the eigen decomposition of the covariance matrixcomputed for the transformed examples

C =1

M

M∑j=1

Φ(xj)Φ(xj)T

• The components are lineal combinations of features in thefeature space

URL - Spring 2020 - MAI 34/78

Page 39: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Kernel PCA

• Pro: Helps to discover patterns that are non linearly separablein the original space

• Con: Does not give a weight/importance for the newcomponents

URL - Spring 2020 - MAI 35/78

Page 40: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Kernel PCA

Page 41: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Sparse PCA

• PCA transforms data to a space of the same dimensionality (alleigenvalues are non zero)

• An alternative is to solve the minimization problem posed bythe reconstruction error using regularization

• A penalization term is added to the objective functionproportional to the norm of the eigenvalues matrix

mı́nU,V‖X − UV ‖22 + α‖V ‖1

• The `-1 norm regularization will encourage sparse solutions(zero eigenvalues)

URL - Spring 2020 - MAI 37/78

Page 42: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Multidimensional Scaling

A transformation matrix transforms a dataset from M dimensions toN dimensions preserving pairwise data distances

[MxN]

URL - Spring 2020 - MAI 38/78

Page 43: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Multidimensional Scaling

• Multidimensional Scaling: Projects the data to a space withless dimensions preserving the pair distances among the data

• A projection matrix is obtained by optimizing a function of thepairwise distances (stress function)

• The actual attributes are not used in the transformation

• Different objective functions that can be used (least squares,Sammong mapping, classical scaling, ...).

URL - Spring 2020 - MAI 39/78

Page 44: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Multidimensional Scaling

• Least Squares Multidimensional Scaling (MDS)

• The distorsion is defined as the square distance between theoriginal distance matrix and the distance matrix of the new data

SD(z1, z2, ..., zn) =

[∑i 6=i′

(dii′ − ‖zi − zi′‖2)2]

• The problem is defined as:

arg mı́nz1,z2,...,zn

SD(z1, z2, ..., zn)

URL - Spring 2020 - MAI 40/78

Page 45: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Multidimensional Scaling

• Several optimization strategies can be used

• If the distance matrix is euclidean it can be solved using eigendecomposition just like PCA

• In other cases gradient descent can be used using the derivativeof SD(z1, z2, ..., zn) and a step α in the following fashion:

1. Begin with a guess for Z2. Repeat until convergence:

Z(k+1) = Z(k) − α∇SD(Z)

URL - Spring 2020 - MAI 41/78

Page 46: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Multidimensional Scaling - Other functions

• Sammong Mapping (emphasis on smaller distances)

SD(z1, z2, ..., zn) =

[∑i 6=i′

(dii′ − ‖zi − zi′‖)2

dii′

]

• Classical Scaling (similarity instead of distance)

SD(z1, z2, ..., zn) =

[∑i 6=i′

(sii′ − 〈zi − z̄, zi′ − z̄〉)2]

• Non metric MDS (assumes a ranking among the distances, noneuclidean space)

SD(z1, z2, ..., zn) =

∑i,i′ [θ(||zi − zi′ ||)− dii′ ]2∑

i,i′ d2i,i′

Page 47: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Random Projection

• A random transformation matrix is generated:• Rectangular matrix N × d

• Columns must have unit length

• Elements are generated from a gaussian distribution

• A matrix generated this way is almost orthogonal

• The projection will preserve the relative distances among pairsof examples

• The Johnson-Lindenstrauss lemma allows to pick a number ofdimensions to obtain the desired approximation

URL - Spring 2020 - MAI 43/78

Page 48: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Nonnegative Matrix Factorization (NMF)

• This formulation assumes that the data is a sum of unknownpositive latent variables

• NMF performs an approximation of a matrix as the product oftwo matrices

V = W ×H

• The main difference with PCA is that the values of the matricesare constrained to be positive

• The positiveness assumption helps to interpret the result• Eg.: In text mining, a document is an aggregation of topics

URL - Spring 2020 - MAI 44/78

Page 49: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Nonlinear scaling

• The previous methods perform a linear transformation betweenthe original space and the final space

• For some datasets this kind of transformation is not enough tomaintain the information of the original data

• Nonlinear transformations methods:• ISOMAP

• Local Linear Embedding

• Local MDS

• t-SNE

URL - Spring 2020 - MAI 45/78

Page 50: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

ISOMAP

• Assumes a low dimensional dataset embedded in a largernumber of dimensions

• The geodesic distance is used instead of the euclidean distance

• The relation of an instance with its immediate neighbors ismore representative of the structure of the data

• The transformation generates a new space that preservesneighborhood relationships

URL - Spring 2020 - MAI 46/78

Page 51: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

ISOMAP

URL - Spring 2020 - MAI 47/78

Page 52: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

ISOMAP

Euclidean

Geodesic

a

b

URL - Spring 2020 - MAI 48/78

Page 53: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

ISOMAP - Algorithm

1. For each data point find its k closest neighbors (points atminimal euclidean distance)

2. Build a graph where each point has an edge to its closestneighbors

3. Approximate the geodesic distance for each pair of points bythe shortest path in the graph

4. Apply a MDS algorithm to the distance matrix of the graph

URL - Spring 2020 - MAI 49/78

Page 54: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

ISOMAP - Example

a

b

a

b

Original Transformed

URL - Spring 2020 - MAI 50/78

Page 55: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Local Linear Embedding

• Performs a transformation that preserves local structure

• Assumes that each instance can be reconstructed by a linearcombination of its neighbors (weights)

• From these weights a new set of data points that preserve thereconstruction is computed for a lower dimensional space

• Different variants of the algorithm exist

URL - Spring 2020 - MAI 51/78

Page 56: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Local Linear Embedding

URL - Spring 2020 - MAI 52/78

Page 57: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Local Linear Embedding - Algorithm

1. For each data point find the K nearest neighbors in the originalspace of dimension p (N (i))

2. Approximate each point by a mixture of the neighbors:

mı́nWik

‖xi −∑k∈N (i)

wikxk‖2

and∑

k∈N (i)wik = 1 and K < p

3. Find points yi in a space of dimension d < p that minimize:

N∑i=0

‖yi −∑k∈N (i)

wikyk‖2

URL - Spring 2020 - MAI 53/78

Page 58: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Local MDS

• Performs a transformation that preserves locality of closerpoints and puts farther away non neighbor points

• Given a set of pairs of points N where a pair (i, i′) belong tothe set if i is among the K neighbors of i′ or viceversa

• Minimize the function:

SL(z1, z2, . . . , zN ) =∑

(i,i′)∈N

(dii′−‖zi−zi′‖)2− τ∑

(i,i′)6∈N

(‖zi−zi′‖)

• The parameters τ controls how much the non neighbors arescattered

URL - Spring 2020 - MAI 54/78

Page 59: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

• t-Stochastic Neighbor Embedding (t-SNE)

• Used as visualization tool

• Assumes distances define a probability distribution

• Obtains a low dimensional space with the closest distribution

• Tricky to use (see this link)

URL - Spring 2020 - MAI 55/78

Page 60: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

• Distances from each example to the rest are scaled to sum one(so it is a probability distribution)

• We want to project the data, so we preserve this probabilitydistribution on a lower dimensionality space

• Examples are distributed in the new space and their distancedistributions are computed

• Examples are iteratively moved to minimize the Kulback-Leiblerdistance among the distribution of the neighbours distances inthe original and in the projected space

URL - Spring 2020 - MAI 56/78

Page 61: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

Each example has a distance probability distribution

Page 62: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

A similarity distribution is obtained

Page 63: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

We distribute the data in a lower dimensionality space

URL - Spring 2020 - MAI 59/78

Page 64: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

t-SNE

Data is moved so the similarity distributions get closer

URL - Spring 2020 - MAI 60/78

Page 65: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control characterization

• Wheelchair with shared control (patient/computer)

• Recorded trajectories of several patients in different situations• Angle/distance to the goal, Angle/distance to the nearest

obstacle from around the chair (210 degrees)

• Characterization about how the computer helps the patientswith different handicaps

• Is there any structure in the trajectory data?

URL - Spring 2020 - MAI 61/78

Page 66: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control characterization

URL - Spring 2020 - MAI 62/78

Page 67: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control characterization

URL - Spring 2020 - MAI 63/78

Page 68: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control (PCA)

URL - Spring 2020 - MAI 64/78

Page 69: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control (SparsePCA)

URL - Spring 2020 - MAI 65/78

Page 70: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control (MDS)

URL - Spring 2020 - MAI 66/78

Page 71: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control (ISOMAPKn=3)

URL - Spring 2020 - MAI 67/78

Page 72: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Application: Wheel chair control (ISOMAPKn=10)

URL - Spring 2020 - MAI 68/78

Page 73: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Unsupervised Attribute Selection

• To eliminate from the dataset all the redundant or irrelevantattributes

• The original attributes are preserved

• Less developed than in Supervised Attribute Selection• Problem: An attribute can be relevant or not depending on the

goal of the discovery process

• There are mainly two techniques for attribute selection:Wrapping and Filtering

URL - Spring 2020 - MAI 69/78

Page 74: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Attribute selection - Wrappers

• A model evaluates the relevance of subsets of attributes

• In supervised learning this is easy, in unsupervised learning it isvery difficult

• Results depend on the chosen model and on how well thismodel captures the actual structure of the data

URL - Spring 2020 - MAI 70/78

Page 75: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Attribute selection - Wrapper Methods

• Clustering algorithms that compute weights for the attributesbased on probability distributions

• Clustering algorithms with an objective function that penalizesthe size of the model

• Consensus clustering

URL - Spring 2020 - MAI 71/78

Page 76: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Attribute selection - Filters

• A measure evaluates the relevance of each attribute individually

• This kind of measures are difficult to obtain for unsupervisedtasks

• The idea is to obtain a measure that evaluates the capacity ofeach attribute to reveal the structure of the data (eg.: classseparability, similarity of instances in the same class)

URL - Spring 2020 - MAI 72/78

Page 77: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Attribute selection - Filter Methods

• Measures of properties of the spatial structure of the data(Entropy, PCA, laplacian matrix)

• Measures of the relevance of the attributes respect the inherentstructure of the data

• Measures of attribute correlation

URL - Spring 2020 - MAI 73/78

Page 78: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Laplacian Score

• The Laplacian Score is a filter method that ranks thefeatures respect to their ability of preserving the naturalstructure of the data.

• This method uses the spectral matrix of the graph computedfrom the near neighbors of the examples

URL - Spring 2020 - MAI 74/78

Page 79: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Laplacian Score

• The Similarity matrix is usually computed using a gaussiankernel (edges not present have a value of 0)

Sij = e||xi−xj ||

2

σ

• The Degree matrix is a diagonal matrix where the elements arethe sum of the rows of S

• The Laplacian matrix is computed as

L = S −D

URL - Spring 2020 - MAI 75/78

Page 80: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Laplacian Score

• The score first computes for each attribute r and their valuesfr the transformation f̃r as:

f̃r = fr −fTr D1

1TD11

• and then the score Lr is computed as:

Lr =f̃Tr Lf̃r

f̃Tr Df̃r

• This gives a ranking for the relevance of the attributes

URL - Spring 2020 - MAI 76/78

Page 81: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Python Notebooks

These two Python Notebooks show some examples dimensionalityreduction and feature selection

• Dimensionality reduction and feature selection Notebook (clickhere to go to the url)

• Linear and non linear dimensionality reduction Notebook (clickhere to go to the url)

If you have downloaded the code from the repository you will able toplay with the notebooks (run jupyter notebook to open thenotebooks)

URL - Spring 2020 - MAI 77/78

Page 82: Data Preprocessing - Computer Science Departmentbejar/URL/material/02-DataPreprocessing.pdf · Data representation Unstructured datasets: Examplesdescribedbyaflatsetofattributes:attribute-value

Python Code

• In the code from the repository inside subdirectoryDimReduction you have the Authors python script

• The code uses the datasets in the directory Data/authors• Auth1 has fragments of books that are novels or philosophy

works• Auth2 has fragments of books written in English and books

translated to English

• The code transforms the text to attribute vectors and appliesdifferent dimensionality reduction algorithms

• Modifying the code you can process one of the datasets andchoose how the text is transformed into vectors

URL - Spring 2020 - MAI 78/78