Notion of Distance

Metric Distance
Binary Vector Distances
Tangent Distance
Distance Measures

• Many pattern recognition/data mining techniques are based on similarity measures between objects
  • e.g., nearest-neighbor classification, cluster analysis, multi-dimensional scaling
• s(i,j): similarity, d(i,j): dissimilarity
• Possible transformations: d(i,j) = 1 - s(i,j) or d(i,j) = sqrt(2(1 - s(i,j)))
• Proximity is a general term to indicate either similarity or dissimilarity
• Distance is used to indicate dissimilarity
Euclidean distance
The nearest-neighbor classifier relies on a distance function between patterns.
The Euclidean formula for distance in d dimensions is given below; the notion of a metric is far more general.

[Figure: two points a and b in a d = 3 space with axes x1, x2, x3.]
Euclidean Distance between Vectors
$$d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}$$
• Euclidean distance assumes the variables are commensurate
  • e.g., each variable is a measure of length
  • If one variable were weight and another length, there would be no obvious choice of units
  • Altering the units would change which variables are important
[Figure: two vectors x = (x1, x2) and y = (y1, y2) whose component-wise differences enter the formula.]
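A minimal numpy sketch of the formula above (the weight/height values are made up), showing how a change of units changes which variable dominates the distance:

```python
import numpy as np

def euclidean(x, y):
    """d_E(x, y) = (sum_k (x_k - y_k)^2)^(1/2)."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

# Hypothetical data: (weight in kg, height in m) for two people.
a = np.array([70.0, 1.80])
b = np.array([72.0, 1.60])
print(euclidean(a, b))        # weight dominates: ~2.01

# Re-expressing height in cm changes which variable matters.
a_cm, b_cm = a * [1, 100], b * [1, 100]
print(euclidean(a_cm, b_cm))  # height dominates: ~20.1
```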
Definition of a Metric
• The notion of a metric is more general than Euclidean distance
• The properties of a metric (listed below) hold for all vectors a, b and c
• Euclidean distance is a metric

[Figure: three points a, b and c in a d = 3 space with axes x1, x2, x3.]
Metric Properties
A metric is a dissimilarity (distance) measure that satisfies the following properties for all i, j, k:

1. d(i,j) ≥ 0, with d(i,j) = 0 iff i = j   (positivity)
2. d(i,j) = d(j,i)   (symmetry/commutativity)
3. d(i,j) ≤ d(i,k) + d(k,j)   (triangle inequality)

[Figure: points i, j and k illustrating the triangle inequality.]
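The following sketch spot-checks the three properties for Euclidean distance on random points; it is a numerical check over a sample, not a proof:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))          # random points in 3-d

d = lambda a, b: np.linalg.norm(a - b)  # Euclidean distance

# Spot-check the three metric properties on all triples.
for i, j, k in product(range(len(pts)), repeat=3):
    a, b, c = pts[i], pts[j], pts[k]
    assert d(a, b) >= 0                          # positivity
    assert np.isclose(d(a, b), d(b, a))          # symmetry
    assert d(a, b) <= d(a, c) + d(c, b) + 1e-9   # triangle inequality
print("Euclidean distance passes all metric checks on this sample")
```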
Scale changes can affect nearest-neighbor classifiers based on Euclidean distance.

[Figure: original space vs. the space scaled by α = 1/3. Point x is closest to the black point in the original space, but closest to the red point in the scaled space.]

Rescaling the data to equalize ranges is equivalent to changing the metric in the original space.
Standardizing the Data (when variables are not commensurate)

• Divide each variable by its standard deviation
• The standard deviation for the kth variable is

$$\sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \mu_k \right)^2 \right)^{1/2}
\qquad \text{where} \qquad
\mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

• The updated value that removes the effect of scale is

$$x'_k = \frac{x_k}{\sigma_k}$$
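A small numpy sketch of this standardization (the data values are invented for illustration):

```python
import numpy as np

def standardize(X):
    """Divide each variable (column) by its standard deviation.

    X is an n x p data matrix; returns x'_k = x_k / sigma_k per column.
    """
    sigma = X.std(axis=0)  # population std, matching the 1/n definition
    return X / sigma

# Hypothetical data with wildly different scales per column.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
print(standardize(X).std(axis=0))  # -> [1. 1.]: scale effect removed
```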
Weighted Euclidean Distance
• If we know the relative importance w_k of each variable:

$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$
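A sketch of the weighted distance, with hypothetical weights w:

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """d_WE(x, y) = (sum_k w_k (x_k - y_k)^2)^(1/2)."""
    return np.sqrt(np.sum(w * (np.asarray(x) - np.asarray(y)) ** 2))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
w = np.array([0.9, 0.1])             # first variable deemed 9x as important
print(weighted_euclidean(x, y, w))   # sqrt(0.9*9 + 0.1*16) ~ 3.11
```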
Distance between Variables
• Similarities between cups
• Suppose we measure cup height 100 times and diameter only once
  • Height will dominate, although 99 of the height measurements contribute nothing new
  • They are very highly correlated
• To eliminate this redundancy we need a data-driven method
  • The approach is not only to standardize the data in each direction but also to use the covariance between variables
Sample Covariance between Variables X and Y

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$$

where $\bar{x}$ and $\bar{y}$ are the sample means.
• Covariance is a scalar value that measures how X and Y vary together
• It is obtained by multiplying, for each sample, its mean-centered value of x with its mean-centered value of y, and then adding over all samples
• Large positive value if large values of X tend to be associated with large values of Y, and small values of X with small values of Y
• Large negative value if large values of X tend to be associated with small values of Y
• With p variables we can construct a p × p matrix of covariances; such a covariance matrix is symmetric
Relationship between Covariance Matrix and Data Matrix
• Let X be an n × p data matrix
• The rows of X are the data vectors x(i)
• Definition of covariance between variables i and j (summing over the n samples, indexed by k):

$$\mathrm{Cov}(i, j) = \frac{1}{n} \sum_{k=1}^{n} \left( x_i(k) - \bar{x}_i \right) \left( x_j(k) - \bar{x}_j \right)$$

• If the values of X are mean-centered (i.e., the value of each variable is taken relative to the sample mean of that variable), then V = (1/n) XᵀX is the p × p covariance matrix
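A sketch verifying this relationship on random data (numpy's np.cov with bias=True applies the same 1/n factor):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # n=100 samples, p=3 variables

Xc = X - X.mean(axis=0)              # mean-center each variable
V = (Xc.T @ Xc) / len(X)             # p x p covariance matrix

# Agrees with numpy's built-in (bias=True uses the same 1/n factor).
assert np.allclose(V, np.cov(X, rowvar=False, bias=True))
print(V.shape, np.allclose(V, V.T))  # (3, 3) True: symmetric
```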
Correlation Coefficient

The value of the covariance depends on the ranges of X and Y. This dependency is removed by dividing the deviations of X and Y by their respective standard deviations:

$$\rho(X, Y) = \frac{\frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\sigma_x \, \sigma_y}$$
With p variables, can form a p x p correlation matrix
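A sketch constructing the correlation matrix from the covariance matrix, checked against numpy's np.corrcoef (the data are random, with correlation injected between the first two variables):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] += 2 * X[:, 0]              # make variables 0 and 1 correlated

Xc = X - X.mean(axis=0)
V = (Xc.T @ Xc) / len(X)            # covariance matrix
s = np.sqrt(np.diag(V))             # per-variable standard deviations
R = V / np.outer(s, s)              # p x p correlation matrix

assert np.allclose(R, np.corrcoef(X, rowvar=False))
print(np.round(R, 2))               # diagonal is 1; R[0,1] is near 0.9
```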
Correlation Matrix (housing-related variables across city suburbs)

[Figure: the correlation matrix displayed as an image, with a reference scale for -1, 0, +1.]

• Variables 3 and 4 are highly negatively correlated with Variable 2
• Variable 5 is positively correlated with Variable 11
• Variables 8 and 9 are highly correlated
Generalizing Euclidean Distance
The Minkowski or L_λ metric is

$$d_\lambda(i, j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$$

• λ = 2 gives the Euclidean metric
• λ = 1 gives the Manhattan or city-block metric

$$\sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|$$

• λ = ∞ yields

$$\max_{k} \left| x_k(i) - x_k(j) \right|$$
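A compact sketch of the L_λ metric covering the three special cases:

```python
import numpy as np

def minkowski(x, y, lam):
    """L_lambda metric; lam=np.inf gives max_k |x_k - y_k|."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(lam):
        return diff.max()
    return (diff ** lam).sum() ** (1.0 / lam)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))       # 7.0  Manhattan / city-block
print(minkowski(x, y, 2))       # 5.0  Euclidean
print(minkowski(x, y, np.inf))  # 4.0  max-coordinate distance
```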
Minkowski Metric

• The L_k norm
• The L1 norm is the Manhattan (city-block) distance
• The L2 norm is the Euclidean distance

[Figure: each colored surface consists of the points at distance 1.0 from the origin, for different values of k in the Minkowski metric; an inset shows a grid of Manhattan streets.]
Vector Space Representation of Documents
Document-Term Matrix
Terms: t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear

      t1   t2   t3   t4   t5   t6
D1    24   21    9    0    0    3
D2    32   10    5    0    3    0
D3    12   16    5    0    0    0
D4     6    7    2    0    0    0
D5    43   31   20    0    3    0
D6     2    0    0   18    7   16
D7     0    0    1   32   12    0
D8     3    0    0   22    4    2
D9     1    0    0   34   27   25
D10    6    0    0   17    4   23

t1–t3 are terms used in "database"-related documents; t4–t6 are "regression" terms.
Entry d_ij is the number of times term t_j appears in document D_i.
Cosine Distance between Document Vectors

For the document-term matrix above, the cosine distance between documents D_i and D_j is

$$d_c(D_i, D_j) = \frac{\sum_{k=1}^{T} d_{ik} \, d_{jk}}{\sqrt{\sum_{k=1}^{T} d_{ik}^2 \; \sum_{k=1}^{T} d_{jk}^2}}$$
• This is the cosine of the angle between the two vectors
• It is equivalent to their inner product after each has been normalized to unit length
• Higher values indicate more similar vectors
• It reflects similarity in terms of the relative distributions of components
• The cosine is not influenced by one document being small compared to the other (as Euclidean distance is)
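A sketch computing the cosine measure between rows of the document-term matrix above (D1 vs D3 within the database cluster, D1 vs D6 across clusters):

```python
import numpy as np

# The 10 x 6 document-term matrix from the slide.
D = np.array([[24, 21,  9,  0,  0,  3], [32, 10,  5,  0,  3,  0],
              [12, 16,  5,  0,  0,  0], [ 6,  7,  2,  0,  0,  0],
              [43, 31, 20,  0,  3,  0], [ 2,  0,  0, 18,  7, 16],
              [ 0,  0,  1, 32, 12,  0], [ 3,  0,  0, 22,  4,  2],
              [ 1,  0,  0, 34, 27, 25], [ 6,  0,  0, 17,  4, 23]],
             dtype=float)

def cosine(a, b):
    """Inner product of the two vectors after unit-length normalization."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(D[0], D[2]))  # D1 vs D3, both database docs: high (~0.98)
print(cosine(D[0], D[5]))  # D1 vs D6, database vs regression: low (~0.11)
```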
Euclidean vs Cosine Distance
[Figure: two 10 × 10 document-by-document distance matrices computed from the document-term matrix above. Left: Euclidean distance (white = 0, black = maximum distance). Right: cosine (white = larger cosine, i.e., smaller angle). The "database" and "regression" document groups are marked along the axes.]

Both show two clusters of light sub-blocks (the database documents and the regression documents).
Under Euclidean distance, documents 3 and 4 are closer to 6–9 than to 5, since 3, 4 and 6–9 are all closer to the origin than 5.
Cosine emphasizes the relative contributions of individual terms.
Binary-Valued Feature Vectors

• Binary vectors are widely used in pattern recognition and information retrieval
• Several distance measures have been defined for them, some of which are non-metric
Distance Measures for Binary Data
Let S_ab denote the number of bit positions in which the first vector has value a and the second has value b.

• The most obvious measure is the Hamming distance normalized by the number of bits:

$$\frac{S_{01} + S_{10}}{S_{00} + S_{01} + S_{10} + S_{11}}$$

• If we don't care about irrelevant properties possessed by neither object, we drop the 0-0 matches and obtain the Jaccard coefficient:

$$\frac{S_{11}}{S_{11} + S_{01} + S_{10}}$$

• The Dice coefficient extends this argument: if 0-0 matches are irrelevant, then 1-0 and 0-1 matches should have half relevance:

$$\frac{2\,S_{11}}{2\,S_{11} + S_{01} + S_{10}}$$
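A sketch computing the S_ab counts and the three measures for a pair of example binary vectors (the vectors are made up):

```python
import numpy as np

def counts(x, y):
    """S_ab = number of positions where x==a and y==b (a, b in {0,1})."""
    x, y = np.asarray(x), np.asarray(y)
    s11 = np.sum((x == 1) & (y == 1)); s00 = np.sum((x == 0) & (y == 0))
    s10 = np.sum((x == 1) & (y == 0)); s01 = np.sum((x == 0) & (y == 1))
    return s11, s00, s10, s01

x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
s11, s00, s10, s01 = counts(x, y)

print((s01 + s10) / (s11 + s00 + s01 + s10))  # normalized Hamming
print(s11 / (s11 + s01 + s10))                # Jaccard coefficient
print(2 * s11 / (2 * s11 + s01 + s10))        # Dice coefficient
```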
Tanimoto Metric
• Used in taxonomy
• n1 and n2 are the numbers of elements in sets S1 and S2, and n12 is the number of elements in both sets
• The Tanimoto metric is the fraction of elements that occur in exactly one of the two sets:

$$D_T(S_1, S_2) = \frac{n_1 + n_2 - 2\,n_{12}}{n_1 + n_2 - n_{12}}$$

• Useful when the features are binary valued (same or different)
  • e.g., 0000 1111 0000 vs 1000 1101 0000
Similarity/Dissimilarity Measures for Binary Vectors

[Table, shown as an image in the original: the standard binary-vector similarity/dissimilarity measures (Jaccard-Needham, Dice, Correlation, Yule, Russell-Rao, Sokal-Michener, Rogers-Tanimoto, Kulzinsky), expressed in terms of the S_ab counts defined above.]
Weighted Dissimilarity Measures for Binary Vectors
• Give unequal importance to '0' matches and '1' matches
• Multiply S_00 by a weight β ∈ [0, 1]
• Examples (the weighted Sokal-Michener and Rogers-Tanimoto dissimilarities, with N the total number of bit positions):

$$D_{sm}(X, Y) = 1 - \frac{S_{11} + \beta \cdot S_{00}}{N}$$

$$D_{rt}(X, Y) = \frac{2 \left( N - S_{11} - \beta \cdot S_{00} \right)}{2N - S_{11} - \beta \cdot S_{00}}$$
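A sketch of the two weighted measures, assuming the reconstructed formulas above; β = 0.35 follows the k-nn results on the next slide:

```python
import numpy as np

def weighted_sm(x, y, beta):
    """Weighted Sokal-Michener: D = 1 - (S11 + beta*S00) / N."""
    x, y = np.asarray(x), np.asarray(y)
    s11 = np.sum((x == 1) & (y == 1))
    s00 = np.sum((x == 0) & (y == 0))
    return 1 - (s11 + beta * s00) / len(x)

def weighted_rt(x, y, beta):
    """Weighted Rogers-Tanimoto: D = 2(N - S11 - b*S00) / (2N - S11 - b*S00)."""
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    s11 = np.sum((x == 1) & (y == 1))
    s00 = np.sum((x == 0) & (y == 0))
    m = s11 + beta * s00
    return 2 * (N - m) / (2 * N - m)

x = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
print(weighted_sm(x, y, beta=0.35), weighted_rt(x, y, beta=0.35))
```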
k-nn digit recognition accuracy

• Sokal-Michener (β = 0.35): 99.38%
• Rogers-Tanimoto
• Correlation: 99.36%
• Jaccard-Needham
• Dice
• Yule: 99.28%
• Kulzinsky: 99.03%
• Russell-Rao: 97.56%
Speed Issues
k-nn Reduction Rules
Rule 1

If L + R_n < d(Z, H_n), then no member of T_n need be considered.

Rule 2

If L + d(X_i, H_n) < d(Z, H_n), then X_i cannot be the nearest neighbor.

where Z is the pattern to be classified, L is the distance from Z to the current nearest pattern, H_n is the nth cluster center, R_n is the radius of the nth cluster, and d(Z, H_n) is the distance from Z to the nth cluster center.
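A minimal sketch of Rule 1 on synthetic clusters (the cluster centers, radii, and test point Z are invented for illustration):

```python
import numpy as np

# Rule 1: if L + R_n < d(Z, H_n), the whole cluster T_n can be skipped.
# Names (Z, L, H_n, R_n) follow the slide.
rng = np.random.default_rng(0)
clusters = [rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)]
centers = [T.mean(axis=0) for T in clusters]
radii = [max(np.linalg.norm(x - H) for x in T)
         for T, H in zip(clusters, centers)]

Z = np.array([0.1, -0.2])            # pattern to be classified
L = np.inf                           # distance to current nearest pattern
for T, H, R in zip(clusters, centers, radii):
    if L + R < np.linalg.norm(Z - H):
        continue                     # Rule 1: skip every member of T_n
    for x in T:
        L = min(L, np.linalg.norm(Z - x))
print("nearest-neighbor distance:", L)
```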
Tri-Edge Inequality (TEI)

Let Ω be the N-dimensional binary space, Θ a subspace of Ω, and D(·, ·) a dissimilarity measure.

TEI for D(·, ·) over Θ means that the sum of any two dissimilarities (edges) is equal to or bigger than any single dissimilarity:

$$\forall\, X, Y, U, V, P, Q \in \Theta,\; X \neq Y,\; U \neq V,\; P \neq Q: \quad D(X, Y) + D(U, V) \geq D(P, Q)$$
Note: here the three edges D(X,Y), D(U,V) and D(P,Q) need not originate from a triangle, so the TEI is different from the triangle inequality.
Tri-Edge Inequality in 3-d Binary Space

For Euclidean distance, TEI holds (for all edges): min D(·,·) = 1 and max D(·,·) = √3, so

2 · min D(·,·) > max D(·,·), since 2 · 1 > √3

For Hamming distance, TEI holds for some pairs of edges (yellow) but not for others (blue), since

(1) + (1) < (3)

[Figure: the 3-d binary cube, with example edges for the Hamming and Euclidean cases.]
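A brute-force check of the TEI over the 3-d binary cube, confirming the two cases above:

```python
import numpy as np
from itertools import product, combinations

pts = [np.array(b) for b in product([0, 1], repeat=3)]  # 3-d binary cube
pairs = list(combinations(pts, 2))                      # all edges, X != Y

def tei_holds(D):
    """TEI: D(X,Y) + D(U,V) >= D(P,Q) for every choice of edges,
    which is equivalent to 2 * min >= max over all edge dissimilarities."""
    ds = [D(x, y) for x, y in pairs]
    return 2 * min(ds) >= max(ds)

hamming = lambda x, y: np.sum(x != y)
euclid = lambda x, y: np.linalg.norm(x - y)
print(tei_holds(hamming))  # False: 1 + 1 < 3
print(tei_holds(euclid))   # True:  2 * 1 > sqrt(3)
```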
TEI Property with Binary Vector Dissimilarity Measures

[Table, shown as an image in the original: which of the binary vector dissimilarity measures satisfy the TEI; per the summary below, it holds for three of them.]
Test for TEI

Definition:

$$\Omega = \{\, X \mid X \text{ is an } N\text{-dimensional binary column vector} \,\}, \qquad \Theta \subseteq \Omega$$

$$\Theta(\lambda, M) = \{\, X \mid X \in \Omega,\ \lambda \leq \|X\| \leq M \,\}$$

where ‖X‖ = XᵗX is the number of 1-bits in X. The test checks whether the TEI holds for a given dissimilarity measure over Θ(λ, M).
Summary of Binary Vector Measures

• TEI is a property of the dissimilarity measure used for binary vector classification
• It holds for three of the binary vector dissimilarity measures
• A test for TEI was defined and experimentally verified
• The reduction rules of fast k-nn algorithms are inapplicable to dissimilarity measures that violate TEI
Tangent Distance
Which is more similar to the test pattern, A or B?

• Euclidean distance: sum of squares of pixel-to-pixel differences
• According to Euclidean distance, prototype B is more similar
• Even though prototype A is more similar once it has been rotated and thickened
Effect of Translation on Euclidean Distance

• Pattern vector: 10 × 10 (d = 100) grey values

[Figure: distance vs. amount of shift. When the shift is large, the distance between two 5's exceeds the distance between a 5 and an 8.]

Euclidean distance is not translation invariant.
Transformations

• Suppose there are r possible transformations:
  • horizontal (x) translation
  • vertical (y) translation
  • shear
  • rotation
  • scale
  • thinning
• For each prototype x′ perform each transformation F_i(x′; α_i)
  • e.g., F_i(x′; α_i) represents x′ rotated by α_i = 15°
Transformation that depends on one parameter α (rotation angle)

If pattern P is transformed by a transformation s that depends on one parameter α, the set of all transformed patterns

$$S_P = \{\, x \mid \exists \alpha \text{ such that } x = s(\alpha, P) \,\}$$

is a one-dimensional curve in the vector space of the inputs.

[Figure: small rotations of an original digit 3, and the effect of the rotations in pixel space (if there were only 3 pixels).]
When there are n parameters α_i (rotation, scaling, etc.), S_P is a manifold of at most n dimensions.
Learning from a limited set of samples
[Figure: a curve fitted through samples x1 ... x4 with the only constraint that it pass through the samples: a poor fit. A curve fitted with the constraint that it pass through the samples and match their derivatives: a better fit.]
Tangent Vector
Small transformations of P can be obtained by adding to P a linear combination of vectors that span the tangent plane (tangent vectors).

Tangent vectors for a transformation s can easily be computed by finite differences, by evaluating ∂s(P, α)/∂α.
Transformation that depends on one parameter α (angle of rotation)

[Figure: small rotations of the same original digit 3 and their effect in pixel space (if there were only 3 pixels), alongside images obtained by moving along the tangent to the transformation curve, i.e., by adding various amounts α of the tangent vector TV to the same original digitized image P.]
Tangent Vector
• Construct a tangent vector TV_i for each transformation
• Assume x′ is represented by a d-dimensional vector
• For each prototype x′, construct an r × d matrix T consisting of the tangent vectors at x′
• If there are 2 transformations and a 256-element vector (16 × 16 pixels) is used, then there are two tangent-vector images
• The tangent vectors are computed at training time
Algorithm for computing tangent vectors
• t(α, u) denotes the input u rotated by α
• By definition, the tangent to the transformation is

$$T_u = \left. \frac{\partial t(\alpha, u)}{\partial \alpha} \right|_{\alpha = 0} \approx \frac{t(\epsilon, u) - t(0, u)}{\epsilon}$$

• This finite-difference approach is not very accurate; also, computing t(ε, u) for 2-d images requires a 2-d interpolation, which is expensive
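A sketch of the finite-difference tangent vector for rotation; scipy.ndimage.rotate stands in for the transformation t(α, u), and its interpolation is exactly the expense noted above:

```python
import numpy as np
from scipy.ndimage import rotate

def tangent_vector_fd(u, eps=1.0):
    """Finite-difference tangent for rotation:
    TV ~ (t(eps, u) - t(0, u)) / eps, with eps a small angle in degrees."""
    t_eps = rotate(u, angle=eps, reshape=False, order=1)  # 2-d interpolation
    return (t_eps - u) / eps

# Toy 16 x 16 "image" (a random blob stands in for a digit).
rng = np.random.default_rng(0)
u = rng.random((16, 16))
TV = tangent_vector_fd(u)
print(TV.shape)  # (16, 16): one tangent-vector image per transformation
```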
Tangent Vector for 2-D Images

• Let u(x, y) be an image
• Let t be a transformation that displaces the coordinates x and y of each pixel by D_x and D_y and smoothes the image by convolving it with a Gaussian g
• The tangent vector is then given by

$$t(u, \alpha) \approx g * u + \alpha D_x \frac{\partial (g * u)}{\partial x} + \alpha D_y \frac{\partial (g * u)}{\partial y} = S + \alpha D_x S_x + \alpha D_y S_y$$

where $S = g * u$, $S_x = \partial S / \partial x$ and $S_y = \partial S / \partial y$.
Expanding a Prototype Using Its Tangent Vectors

[Figure: a prototype subjected to rotation and line thinning yields two tangent vectors, TV1 (rotation) and TV2 (thinning), computed at training time. Sixteen images are obtained from linear combinations x′ + a1·TV1 + a2·TV2 of the prototype and the two tangent vectors, for various coefficients a1 and a2; gray values denote negative pixel values. Also shown is the Euclidean distance between the tangent approximation x′ + a1·TV1 + a2·TV2 and the image generated by the un-approximated transformation F_i(x′, α_i).]
Euclidean Distance, Tangent Distance, Manifold Distance
[Figure: the distance D(E, P) from a test point E to a prototype P compared for Euclidean, tangent, and manifold distance; the lines (tangent spaces) do not intersect in three dimensions.]
Tangent Distance from a Test Point x to a Prototype x′

• Given a matrix T consisting of the r tangent vectors at x′, the tangent distance from x to x′ is

$$D_{tan}(x', x) = \min_{a} \left\| (x' + T a) - x \right\|$$

• i.e., the Euclidean distance from x to the tangent space of x′
• This is the one-sided tangent distance, since only one pattern (x′) is transformed
• The two-sided tangent distance improves accuracy only slightly, at added computational burden
• During classification of x, we find its tangent distance to x′ by finding the optimizing value of a required by the above equation
• The minimization is simple, since the squared distance we want to minimize is a quadratic function of a
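A sketch of the one-sided tangent distance; because the objective is quadratic in a, the minimizing a can be found in closed form as a linear least-squares problem (here via np.linalg.lstsq rather than gradient descent):

```python
import numpy as np

def tangent_distance(x, x_prime, T):
    """One-sided tangent distance D_tan(x', x) = min_a ||(x' + T a) - x||.

    Here T is stored as a d x r matrix whose columns are the tangent
    vectors at x'; minimizing over a is linear least squares.
    """
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

# Toy example: d = 256 (16 x 16 image), r = 2 tangent vectors.
rng = np.random.default_rng(0)
x_prime = rng.random(256)
T = rng.normal(size=(256, 2))             # stand-ins for TV1 and TV2
x = x_prime + T @ np.array([0.5, -0.2])   # x lies on the tangent plane
print(tangent_distance(x, x_prime, T))    # ~0: tangent distance vanishes
```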
Minimizing Value of a for the Tangent Distance

• The stored prototype x′ falls on a manifold when subjected to transformations
• The tangent space at x′ is an r-dimensional Euclidean space spanned by the tangent vectors (e.g., TV1 and TV2)
• In Euclidean space, x1 is closer to x′ than x2; in tangent space, x2 is closer to x′ than x1
• The squared distance is a quadratic function (a paraboloid) of the parameter vector a; gradient descent can be used to calculate the tangent distance D_tan(x′, x2)

[Figure: the manifold of transformed prototypes, the tangent space at x′, test points x1 and x2, and the pink paraboloid over a.]