Notion of Distance

Metric Distance
Binary Vector Distances
Tangent Distance
Distance Measures

• Many pattern recognition/data mining techniques are based on similarity measures between objects
  • e.g., nearest-neighbor classification, cluster analysis, multi-dimensional scaling
• s(i,j): similarity, d(i,j): dissimilarity
• Possible transformations: d(i,j) = 1 - s(i,j) or d(i,j) = sqrt(2(1 - s(i,j)))
• Proximity is a general term to indicate either similarity or dissimilarity
• Distance is used to indicate dissimilarity
Euclidean distance
The nearest-neighbor classifier relies on a distance function between patterns.
The Euclidean formula for distance in d dimensions is given below; the notion of a metric is far more general.

[Figure: two points a and b in a d = 3 space with axes x1, x2, x3.]
Euclidean Distance between Vectors
$$d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}$$
• Euclidean distance assumes the variables are commensurate
  • e.g., each variable is a measure of length
  • If one variable were weight and another length, there would be no obvious choice of units
  • Altering the units would change which variables are important
[Figure: two vectors x = (x1, x2) and y = (y1, y2) whose component-wise differences enter the formula.]
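A minimal numpy sketch of the formula above (the weight/height values are made up), showing how a change of units changes which variable dominates the distance:

```python
import numpy as np

def euclidean(x, y):
    """d_E(x, y) = (sum_k (x_k - y_k)^2)^(1/2)."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

# Hypothetical data: (weight in kg, height in m) for two people.
a = np.array([70.0, 1.80])
b = np.array([72.0, 1.60])
print(euclidean(a, b))        # weight dominates: ~2.01

# Re-expressing height in cm changes which variable matters.
a_cm, b_cm = a * [1, 100], b * [1, 100]
print(euclidean(a_cm, b_cm))  # height dominates: ~20.1
```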
Definition of a Metric
• The notion of a metric is more general than Euclidean distance
• The properties of a metric (listed below) hold for all vectors a, b and c
• Euclidean distance is a metric

[Figure: three points a, b and c in a d = 3 space with axes x1, x2, x3.]
Metric Properties
A metric is a dissimilarity (distance) measure that satisfies the following properties for all i, j, k:

1. d(i,j) ≥ 0, with d(i,j) = 0 iff i = j   (positivity)
2. d(i,j) = d(j,i)   (symmetry/commutativity)
3. d(i,j) ≤ d(i,k) + d(k,j)   (triangle inequality)

[Figure: points i, j and k illustrating the triangle inequality.]
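The following sketch spot-checks the three properties for Euclidean distance on random points; it is a numerical check over a sample, not a proof:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))          # random points in 3-d

d = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance

# Spot-check the three metric properties on all triples.
for i, j, k in product(range(len(pts)), repeat=3):
    a, b, c = pts[i], pts[j], pts[k]
    assert d(a, b) >= 0                          # positivity
    assert np.isclose(d(a, b), d(b, a))          # symmetry
    assert d(a, b) <= d(a, c) + d(c, b) + 1e-9   # triangle inequality
print("Euclidean distance passes all metric checks on this sample")
```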
Scale changes can affect nearest-neighbor classifiers based on Euclidean distance.

[Figure: original space vs. the space scaled by α = 1/3. Point x is closest to the black point in the original space, but closest to the red point in the scaled space.]

Rescaling the data to equalize ranges is equivalent to changing the metric in the original space.
Standardizing the Data (when variables are not commensurate)

• Divide each variable by its standard deviation
• The standard deviation for the kth variable is

$$\sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \mu_k \right)^2 \right)^{1/2}
\qquad \text{where} \qquad
\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

• The updated value that removes the effect of scale is

$$x'_k = \frac{x_k}{\sigma_k}$$
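A small numpy sketch of this standardization (the data values are invented for illustration):

```python
import numpy as np

def standardize(X):
    """Divide each variable (column) by its standard deviation.

    X is an n x p data matrix; returns x'_k = x_k / sigma_k per column.
    """
    sigma = X.std(axis=0)  # population std, matching the 1/n definition
    return X / sigma

# Hypothetical data with wildly different scales per column.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
print(standardize(X).std(axis=0))  # -> [1. 1.]: scale effect removed
```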
Weighted Euclidean Distance
• If we know the relative importance w_k of each variable:

$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$
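A sketch of the weighted distance, with hypothetical weights w:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """d_WE(x, y) = (sum_k w_k (x_k - y_k)^2)^(1/2)."""
    return np.sqrt(np.sum(w * (np.asarray(x) - np.asarray(y)) ** 2))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
w = np.array([0.9, 0.1])             # first variable deemed 9x as important
print(weighted_euclidean(x, y, w))   # sqrt(0.9*9 + 0.1*16) ~ 3.11
```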
Distance between Variables
• Similarities between cups
• Suppose we measure cup height 100 times and diameter only once
  • Height will dominate, although 99 of the height measurements contribute nothing new
  • They are very highly correlated
• To eliminate this redundancy we need a data-driven method
  • The approach is not only to standardize the data in each direction but also to use the covariance between variables
Sample Covariance between Variables X and Y

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
• Covariance is a scalar value that measures how X and Y vary together
• It is obtained by multiplying, for each sample, its mean-centered value of x with its mean-centered value of y, and then adding over all samples
• Large positive value if large values of X tend to be associated with large values of Y, and small values of X with small values of Y
• Large negative value if large values of X tend to be associated with small values of Y
• With p variables we can construct a p × p matrix of covariances; such a covariance matrix is symmetric
Relationship between Covariance Matrix and Data Matrix
• Let X be an n × p data matrix
• The rows of X are the data vectors x(i)
• Definition of covariance between variables i and j (summing over the n samples, indexed by k):

$$\mathrm{Cov}(i, j) = \frac{1}{n} \sum_{k=1}^{n} \left( x_i(k) - \bar{x}_i \right) \left( x_j(k) - \bar{x}_j \right)$$

• If the values of X are mean-centered (i.e., the value of each variable is taken relative to the sample mean of that variable), then V = (1/n) XᵀX is the p × p covariance matrix
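A sketch verifying this relationship on random data (numpy's np.cov with bias=True applies the same 1/n factor):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n=100 samples, p=3 variables

Xc = X - X.mean(axis=0)              # mean-center each variable
V = (Xc.T @ Xc) / len(X)             # p x p covariance matrix

# Agrees with numpy's built-in (bias=True uses the same 1/n factor).
assert np.allclose(V, np.cov(X, rowvar=False, bias=True))
print(V.shape, np.allclose(V, V.T))  # (3, 3) True: symmetric
```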
Correlation Coefficient

The value of the covariance depends on the ranges of X and Y. This dependency is removed by dividing the deviations of X and Y by their respective standard deviations:

$$\rho(X, Y) = \frac{\frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\sigma_x \, \sigma_y}$$
With p variables, can form a p x p correlation matrix
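A sketch constructing the correlation matrix from the covariance matrix, checked against numpy's np.corrcoef (the data are random, with correlation injected between the first two variables):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] += 2 * X[:, 0]              # make variables 0 and 1 correlated

Xc = X - X.mean(axis=0)
V = (Xc.T @ Xc) / len(X)            # covariance matrix
s = np.sqrt(np.diag(V))             # per-variable standard deviations
R = V / np.outer(s, s)              # p x p correlation matrix

assert np.allclose(R, np.corrcoef(X, rowvar=False))
print(np.round(R, 2))               # diagonal is 1; R[0,1] is near 0.9
```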
Correlation Matrix (housing-related variables across city suburbs)

[Figure: the correlation matrix displayed as an image, with a reference scale for -1, 0, +1.]

• Variables 3 and 4 are highly negatively correlated with Variable 2
• Variable 5 is positively correlated with Variable 11
• Variables 8 and 9 are highly correlated
Generalizing Euclidean Distance
The Minkowski or L_λ metric is

$$d_\lambda(i, j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$$

• λ = 2 gives the Euclidean metric
• λ = 1 gives the Manhattan or city-block metric

$$\sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|$$

• λ = ∞ yields

$$\max_{k} \left| x_k(i) - x_k(j) \right|$$
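A compact sketch of the L_λ metric covering the three special cases:

```python
import numpy as np

def minkowski(x, y, lam):
    """L_lambda metric; lam=np.inf gives max_k |x_k - y_k|."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(lam):
        return diff.max()
    return (diff ** lam).sum() ** (1.0 / lam)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))       # 7.0  Manhattan / city-block
print(minkowski(x, y, 2))       # 5.0  Euclidean
print(minkowski(x, y, np.inf))  # 4.0  max-coordinate distance
```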
Minkowski Metric

• The L_k norm
• The L1 norm is the Manhattan (city-block) distance
• The L2 norm is the Euclidean distance

[Figure: each colored surface consists of the points at distance 1.0 from the origin, for different values of k in the Minkowski metric; an inset shows a grid of Manhattan streets.]
Vector Space Representation of Documents
Document-Term Matrix
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

      t1   t2   t3   t4   t5   t6
D1    24   21    9    0    0    3
D2    32   10    5    0    3    0
D3    12   16    5    0    0    0
D4     6    7    2    0    0    0
D5    43   31   20    0    3    0
D6     2    0    0   18    7   16
D7     0    0    1   32   12    0
D8     3    0    0   22    4    2
D9     1    0    0   34   27   25
D10    6    0    0   17    4   23

t1–t3 are terms used in "database"-related documents; t4–t6 are "regression" terms.
Entry d_ij is the number of times term t_j appears in document D_i.
Cosine Distance between Document Vectors

For the document-term matrix above, the cosine distance between documents D_i and D_j is

$$d_c(D_i, D_j) = \frac{\sum_{k=1}^{T} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{T} d_{ik}^2 \; \sum_{k=1}^{T} d_{jk}^2}}$$
• This is the cosine of the angle between the two vectors
• It is equivalent to their inner product after each has been normalized to unit length
• Higher values indicate more similar vectors
• It reflects similarity in terms of the relative distributions of components
• The cosine is not influenced by one document being small compared to the other (as Euclidean distance is)
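A sketch computing the cosine measure between rows of the document-term matrix above (D1 vs D3 within the database cluster, D1 vs D6 across clusters):

```python
import numpy as np

# The 10 x 6 document-term matrix from the slide.
D = np.array([[24, 21,  9,  0,  0,  3], [32, 10,  5,  0,  3,  0],
              [12, 16,  5,  0,  0,  0], [ 6,  7,  2,  0,  0,  0],
              [43, 31, 20,  0,  3,  0], [ 2,  0,  0, 18,  7, 16],
              [ 0,  0,  1, 32, 12,  0], [ 3,  0,  0, 22,  4,  2],
              [ 1,  0,  0, 34, 27, 25], [ 6,  0,  0, 17,  4, 23]],
             dtype=float)

def cosine(a, b):
    """Inner product of the two vectors after unit-length normalization."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(D[0], D[2]))  # D1 vs D3, both database docs: high (~0.98)
print(cosine(D[0], D[5]))  # D1 vs D6, database vs regression: low (~0.11)
```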
Euclidean vs Cosine Distance
[Figure: two 10 × 10 document-by-document distance matrices computed from the document-term matrix above. Left: Euclidean distance (white = 0, black = maximum distance). Right: cosine (white = larger cosine, i.e., smaller angle). The "database" and "regression" document groups are marked along the axes.]

Both show two clusters of light sub-blocks (the database documents and the regression documents).
Under Euclidean distance, documents 3 and 4 are closer to 6–9 than to 5, since 3, 4 and 6–9 are all closer to the origin than 5.
Cosine emphasizes the relative contributions of individual terms.
Binary-Valued Feature Vectors

• Binary vectors are widely used in pattern recognition and information retrieval
• Several distance measures have been defined for them, some of which are non-metric
Distance Measures for Binary Data
Let S_ab denote the number of bit positions in which the first vector has value a and the second has value b.

• The most obvious measure is the Hamming distance normalized by the number of bits:

$$\frac{S_{01} + S_{10}}{S_{00} + S_{01} + S_{10} + S_{11}}$$

• If we don't care about irrelevant properties possessed by neither object, we drop the 0-0 matches and obtain the Jaccard coefficient:

$$\frac{S_{11}}{S_{11} + S_{01} + S_{10}}$$

• The Dice coefficient extends this argument: if 0-0 matches are irrelevant, then 1-0 and 0-1 matches should have half relevance:

$$\frac{2\,S_{11}}{2\,S_{11} + S_{01} + S_{10}}$$
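A sketch computing the S_ab counts and the three measures for a pair of example binary vectors (the vectors are made up):

```python
import numpy as np

def counts(x, y):
    """S_ab = number of positions where x==a and y==b (a, b in {0,1})."""
    x, y = np.asarray(x), np.asarray(y)
    s11 = np.sum((x == 1) & (y == 1)); s00 = np.sum((x == 0) & (y == 0))
    s10 = np.sum((x == 1) & (y == 0)); s01 = np.sum((x == 0) & (y == 1))
    return s11, s00, s10, s01

x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
s11, s00, s10, s01 = counts(x, y)

print((s01 + s10) / (s11 + s00 + s01 + s10))  # normalized Hamming
print(s11 / (s11 + s01 + s10))                # Jaccard coefficient
print(2 * s11 / (2 * s11 + s01 + s10))        # Dice coefficient
```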
Tanimoto Metric
• Used in taxonomy
• n1 and n2 are the numbers of elements in sets S1 and S2, and n12 is the number of elements in both sets
• The Tanimoto metric is the fraction of elements that occur in exactly one of the two sets:

$$D_T(S_1, S_2) = \frac{n_1 + n_2 - 2\,n_{12}}{n_1 + n_2 - n_{12}}$$

• Useful when the features are binary valued (same or different)
  • e.g., 0000 1111 0000 vs 1000 1101 0000
Similarity/Dissimilarity Measures for Binary Vectors

[Table, shown as an image in the original: the standard binary-vector similarity/dissimilarity measures (Jaccard-Needham, Dice, Correlation, Yule, Russell-Rao, Sokal-Michener, Rogers-Tanimoto, Kulzinsky), expressed in terms of the S_ab counts defined above.]
Weighted Dissimilarity Measures for Binary Vectors
• Give unequal importance to '0' matches and '1' matches
• Multiply S_00 by a weight β ∈ [0, 1]
• Examples (the weighted Sokal-Michener and Rogers-Tanimoto dissimilarities, with N the total number of bit positions):

$$D_{sm}(X, Y) = 1 - \frac{S_{11} + \beta \cdot S_{00}}{N}$$

$$D_{rt}(X, Y) = \frac{2 \left( N - S_{11} - \beta \cdot S_{00} \right)}{2N - S_{11} - \beta \cdot S_{00}}$$
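A sketch of the two weighted measures, assuming the reconstructed formulas above; β = 0.35 follows the k-nn results on the next slide:

```python
import numpy as np

def weighted_sm(x, y, beta):
    """Weighted Sokal-Michener: D = 1 - (S11 + beta*S00) / N."""
    x, y = np.asarray(x), np.asarray(y)
    s11 = np.sum((x == 1) & (y == 1))
    s00 = np.sum((x == 0) & (y == 0))
    return 1 - (s11 + beta * s00) / len(x)

def weighted_rt(x, y, beta):
    """Weighted Rogers-Tanimoto: D = 2(N - S11 - b*S00) / (2N - S11 - b*S00)."""
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    s11 = np.sum((x == 1) & (y == 1))
    s00 = np.sum((x == 0) & (y == 0))
    m = s11 + beta * s00
    return 2 * (N - m) / (2 * N - m)

x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
print(weighted_sm(x, y, beta=0.35), weighted_rt(x, y, beta=0.35))
```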
k-nn digit recognition accuracy

• Sokal-Michener (β = 0.35): 99.38%
• Rogers-Tanimoto
• Correlation: 99.36%
• Jaccard-Needham
• Dice
• Yule: 99.28%
• Kulzinsky: 99.03%
• Russell-Rao: 97.56%
Speed Issues
k-nn Reduction Rules
Rule 1

If L + R_n < d(Z, H_n), then no member of T_n need be considered.

Rule 2

If L + d(X_i, H_n) < d(Z, H_n), then X_i cannot be the nearest neighbor.

where Z is the pattern to be classified, L is the distance from Z to the current nearest pattern, H_n is the nth cluster center, R_n is the radius of the nth cluster, and d(Z, H_n) is the distance from Z to the nth cluster center.
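A minimal sketch of Rule 1 on synthetic clusters (the cluster centers, radii, and test point Z are invented for illustration):

```python
import numpy as np

# Rule 1: if L + R_n < d(Z, H_n), the whole cluster T_n can be skipped.
# Names (Z, L, H_n, R_n) follow the slide.
rng = np.random.default_rng(0)
clusters = [rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)]
centers = [T.mean(axis=0) for T in clusters]
radii = [max(np.linalg.norm(x - H) for x in T)
         for T, H in zip(clusters, centers)]

Z = np.array([0.1, -0.2])            # pattern to be classified
L = np.inf                           # distance to current nearest pattern
for T, H, R in zip(clusters, centers, radii):
    if L + R < np.linalg.norm(Z - H):
        continue                     # Rule 1: skip every member of T_n
    for x in T:
        L = min(L, np.linalg.norm(Z - x))
print("nearest-neighbor distance:", L)
```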
Tri-Edge Inequality (TEI)

Let Ω be the N-dimensional binary space, Θ a subspace of Ω, and D(·, ·) a dissimilarity measure.

TEI for D(·, ·) over Θ means that the sum of any two dissimilarities (edges) is equal to or bigger than any single dissimilarity:

$$\forall\, X, Y, U, V, P, Q \in \Theta,\; X \neq Y,\; U \neq V,\; P \neq Q: \quad D(X, Y) + D(U, V) \geq D(P, Q)$$
Note: here the three edges D(X,Y), D(U,V) and D(P,Q) need not originate from a triangle, so the TEI is different from the triangle inequality.
Tri-Edge Inequality in 3-d Binary Space

For Euclidean distance, TEI holds (for all edges): min D(·,·) = 1 and max D(·,·) = √3, so

2 · min D(·,·) > max D(·,·), since 2 · 1 > √3

For Hamming distance, TEI holds for some pairs of edges (yellow) but not for others (blue), since

(1) + (1) < (3)

[Figure: the 3-d binary cube, with example edges for the Hamming and Euclidean cases.]
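A brute-force check of the TEI over the 3-d binary cube, confirming the two cases above:

```python
import numpy as np
from itertools import product, combinations

pts = [np.array(b) for b in product([0, 1], repeat=3)]  # 3-d binary cube
pairs = list(combinations(pts, 2))                      # all edges, X != Y

def tei_holds(D):
    """TEI: D(X,Y) + D(U,V) >= D(P,Q) for every choice of edges,
    which is equivalent to 2 * min >= max over all edge dissimilarities."""
    ds = [D(x, y) for x, y in pairs]
    return 2 * min(ds) >= max(ds)

hamming = lambda x, y: np.sum(x != y)
euclid = lambda x, y: np.linalg.norm(x - y)
print(tei_holds(hamming))  # False: 1 + 1 < 3
print(tei_holds(euclid))   # True:  2 * 1 > sqrt(3)
```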
TEI Property with Binary Vector Dissimilarity Measures

[Table, shown as an image in the original: which of the binary vector dissimilarity measures satisfy the TEI; per the summary below, it holds for three of them.]
Test for TEI

Definition:

$$\Omega = \{\, X \mid X \text{ is an } N\text{-dimensional binary column vector} \,\}, \qquad \Theta \subseteq \Omega$$

$$\Theta(\lambda, M) = \{\, X \mid X \in \Omega,\ \lambda \leq \|X\| \leq M \,\}$$

where ‖X‖ = XᵗX is the number of 1-bits in X. The test checks whether the TEI holds for a given dissimilarity measure over Θ(λ, M).
Summary of Binary Vector Measures

• TEI is a property of the dissimilarity measure used for binary vector classification
• It holds for three of the binary vector dissimilarity measures
• A test for TEI was defined and experimentally verified
• The reduction rules of fast k-nn algorithms are inapplicable to dissimilarity measures that violate TEI
Tangent Distance
Which is more similar to the test pattern, A or B?

• Euclidean distance: sum of squares of pixel-to-pixel differences
• According to Euclidean distance, prototype B is more similar
• Even though prototype A is more similar once it has been rotated and thickened
Effect of Translation on Euclidean Distance

• Pattern vector: 10 × 10 (d = 100) grey values

[Figure: distance vs. amount of shift. When the shift is large, the distance between two 5's exceeds the distance between a 5 and an 8.]

Euclidean distance is not translation invariant.
Transformations

• Suppose there are r possible transformations:
  • horizontal (x) translation
  • vertical (y) translation
  • shear
  • rotation
  • scale
  • thinning
• For each prototype x′ perform each transformation F_i(x′; α_i)
  • e.g., F_i(x′; α_i) represents x′ rotated by α_i = 15°
Transformation that depends on one parameter α (rotation angle)

If pattern P is transformed by a transformation s that depends on one parameter α, the set of all transformed patterns

$$S_P = \{\, x \mid \exists \alpha \text{ such that } x = s(\alpha, P) \,\}$$

is a one-dimensional curve in the vector space of the inputs.

[Figure: small rotations of an original digit 3, and the effect of the rotations in pixel space (if there were only 3 pixels).]
When there are n parameters α_i (rotation, scaling, etc.), S_P is a manifold of at most n dimensions.
Learning from a limited set of samples
[Figure: a curve fitted through samples x1 ... x4 with the only constraint that it pass through the samples: a poor fit. A curve fitted with the constraint that it pass through the samples and match their derivatives: a better fit.]
Tangent Vector
Small transformations of P can be obtained by adding to P a linear combination of vectors that span the tangent plane (tangent vectors).

Tangent vectors for a transformation s can easily be computed by finite differences, by evaluating ∂s(P, α)/∂α.
Transformation that depends on one parameter α (angle of rotation)

[Figure: small rotations of the same original digit 3 and their effect in pixel space (if there were only 3 pixels), alongside images obtained by moving along the tangent to the transformation curve, i.e., by adding various amounts α of the tangent vector TV to the same original digitized image P.]
Tangent Vector
• Construct a tangent vector TV_i for each transformation
• Assume x′ is represented by a d-dimensional vector
• For each prototype x′, construct an r × d matrix T consisting of the tangent vectors at x′
• If there are 2 transformations and a 256-element vector (16 × 16 pixels) is used, then there are two tangent-vector images
• The tangent vectors are computed at training time
Algorithm for computing tangent vectors
• t(α, u) denotes the input u rotated by α
• By definition, the tangent to the transformation is

$$T_u = \left. \frac{\partial t(\alpha, u)}{\partial \alpha} \right|_{\alpha = 0} \approx \frac{t(\epsilon, u) - t(0, u)}{\epsilon}$$

• This finite-difference approach is not very accurate; also, computing t(ε, u) for 2-d images requires a 2-d interpolation, which is expensive
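A sketch of the finite-difference tangent vector for rotation; scipy.ndimage.rotate stands in for the transformation t(α, u), and its interpolation is exactly the expense noted above:

```python
import numpy as np
from scipy.ndimage import rotate

def tangent_vector_fd(u, eps=1.0):
    """Finite-difference tangent for rotation:
    TV ~ (t(eps, u) - t(0, u)) / eps, with eps a small angle in degrees."""
    t_eps = rotate(u, angle=eps, reshape=False, order=1)  # 2-d interpolation
    return (t_eps - u) / eps

# Toy 16 x 16 "image" (a random blob stands in for a digit).
rng = np.random.default_rng(0)
u = rng.random((16, 16))
TV = tangent_vector_fd(u)
print(TV.shape)  # (16, 16): one tangent-vector image per transformation
```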
Tangent Vector for 2-D Images

• Let u(x, y) be an image
• Let t be a transformation that displaces the coordinates x and y of each pixel by D_x and D_y and smoothes the image by convolving it with a Gaussian g
• The tangent vector is then given by

$$t(u, \alpha) \approx g * u + \alpha D_x \frac{\partial (g * u)}{\partial x} + \alpha D_y \frac{\partial (g * u)}{\partial y} = S + \alpha D_x S_x + \alpha D_y S_y$$

where $S = g * u$, $S_x = \partial S / \partial x$ and $S_y = \partial S / \partial y$.
Expanding a Prototype Using Its Tangent Vectors

[Figure: a prototype subjected to rotation and line thinning yields two tangent vectors, TV1 (rotation) and TV2 (thinning), computed at training time. Sixteen images are obtained from linear combinations x′ + a1·TV1 + a2·TV2 of the prototype and the two tangent vectors, for various coefficients a1 and a2; gray values denote negative pixel values. Also shown is the Euclidean distance between the tangent approximation x′ + a1·TV1 + a2·TV2 and the image generated by the un-approximated transformation F_i(x′, α_i).]
Euclidean Distance, Tangent Distance, Manifold Distance
[Figure: the distance D(E, P) from a test point E to a prototype P compared for Euclidean, tangent, and manifold distance; the lines (tangent spaces) do not intersect in three dimensions.]
Tangent Distance from a Test Point x to a Prototype x′

• Given a matrix T consisting of the r tangent vectors at x′, the tangent distance from x to x′ is

$$D_{tan}(x', x) = \min_{a} \left\| (x' + T a) - x \right\|$$

• i.e., the Euclidean distance from x to the tangent space of x′
• This is the one-sided tangent distance, since only one pattern (x′) is transformed
• The two-sided tangent distance improves accuracy only slightly, at added computational burden
• During classification of x, we find its tangent distance to x′ by finding the optimizing value of a required by the above equation
• The minimization is simple, since the squared distance we want to minimize is a quadratic function of a
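A sketch of the one-sided tangent distance; because the objective is quadratic in a, the minimizing a can be found in closed form as a linear least-squares problem (here via np.linalg.lstsq rather than gradient descent):

```python
import numpy as np

def tangent_distance(x, x_prime, T):
    """One-sided tangent distance D_tan(x', x) = min_a ||(x' + T a) - x||.

    Here T is stored as a d x r matrix whose columns are the tangent
    vectors at x'; minimizing over a is linear least squares.
    """
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

# Toy example: d = 256 (16 x 16 image), r = 2 tangent vectors.
rng = np.random.default_rng(0)
x_prime = rng.random(256)
T = rng.normal(size=(256, 2))             # stand-ins for TV1 and TV2
x = x_prime + T @ np.array([0.5, -0.2])   # x lies on the tangent plane
print(tangent_distance(x, x_prime, T))    # ~0: tangent distance vanishes
```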
Minimizing Value of a for the Tangent Distance

• The stored prototype x′ falls on a manifold when subjected to transformations
• The tangent space at x′ is an r-dimensional Euclidean space spanned by the tangent vectors (e.g., TV1 and TV2)
• In Euclidean space, x1 is closer to x′ than x2; in tangent space, x2 is closer to x′ than x1
• The squared distance is a quadratic function (a paraboloid) of the parameter vector a; gradient descent can be used to calculate the tangent distance D_tan(x′, x2)

[Figure: the manifold of transformed prototypes, the tangent space at x′, test points x1 and x2, and the pink paraboloid over a.]