Knowledge Discovery and Data Mining 1 (VO) (706.701)
Data Matrices and Vector Space Model
Denis Helic
ISDS, TU Graz
October 30, 2018
Denis Helic (ISDS, TU Graz) KDDM1 October 30, 2018 1 / 55
Big picture: KDDM
[Diagram: the Knowledge Discovery Process (Preprocessing, Transformation, Model) resting on Mathematical Tools (Linear Algebra, Information Theory, Statistical Inference, Probability Theory) and Infrastructure (Hardware & Programming)]
Outline
1 Recap
2 Data Representation
3 Data Matrix
4 Vector Space Model: Document-Term Matrix
5 Recommender Systems: Utility Matrix
Recap
Review of the preprocessing, feature extraction, and selection phases
Recap – Preprocessing
Initial phase of the Knowledge Discovery process:
Acquire the data to be analyzed, e.g. by crawling the data from the Web
Prepare the data, e.g. by cleaning it and removing outliers
Recap – Feature Extraction
Examples of features:
Images → colors, textures, contours, …
Signals → frequency, phase, samples, spectrum, …
Time series → ticks, trends, self-similarities, …
Biomed → DNA sequences, genes, …
Text → words, POS tags, grammatical dependencies, …
Features encode these properties in a way suitable for a chosen algorithm
Recap – Feature Selection
Approach: select a subset of all features without redundant or irrelevant features
Unsupervised, e.g. heuristics (black & white lists, scores for ranking, …)
Supervised, e.g. using a training data set (calculate a measure to assess how discriminative a feature is, e.g. information gain)
Penalty for more features, e.g. regularization
Recap – Text mining
Data Representation
Representing data
Once we have the features that we want: how can we represent them?
Two representations:
1 Mathematical (model, analyze, and reason about data and algorithms in a formal way)
2 Representation in computers (data structures, implementation, performance)
In this course we concentrate on mathematical models
KDDM2 is about the second representation
Representing data
Given: preprocessed data objects as sets of features, e.g. for text documents sets of words, bigrams, n-grams, …
Given: feature statistics for each data object, e.g. numbers of occurrences, magnitudes, ticks, …
Find: a mathematical model for calculations, e.g. similarity, distance, addition, subtraction, transformation, …
Representing data
Let us think (geometrically) about features as dimensions (coordinates) in an m-dimensional space
For two features we have a 2D space, for three features a 3D space, and so on
For numeric features the feature statistics will be numbers coming from e.g. ℝ
Then each data object is a point in the m-dimensional space
The value of a given feature is then the value of its coordinate in this m-dimensional space
Representing data: Example
Text documents. Suppose we have the following documents:

DocID   Document
d1      Chinese Chinese
d2      Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing
d3      Chinese Tokyo Tokyo Tokyo Beijing Beijing
d4      Tokyo Tokyo Tokyo
d5      Tokyo
Representing data: Example
Now, we take words as features
We take word occurrences as the feature values
Three features: Beijing, Chinese, Tokyo

Doc   Beijing   Chinese   Tokyo
d1    0         2         0
d2    3         2         2
d3    2         1         3
d4    0         0         3
d5    0         0         1
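The counting step above can be sketched in Python. This is an illustrative snippet, not from the slides; the `docs` dictionary mirrors the example documents:

```python
from collections import Counter

docs = {
    "d1": "Chinese Chinese",
    "d2": "Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing",
    "d3": "Chinese Tokyo Tokyo Tokyo Beijing Beijing",
    "d4": "Tokyo Tokyo Tokyo",
    "d5": "Tokyo",
}

features = ["Beijing", "Chinese", "Tokyo"]

# Term-frequency row per document: count of each feature word
# (Counter returns 0 for words that do not occur)
tf = {doc: [Counter(text.split())[w] for w in features]
      for doc, text in docs.items()}

print(tf["d2"])  # [3, 2, 2]
```

Each row reproduces one line of the table above.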
Representing data: Example
[Figure: geometric representation of the data. Points d1:(0,2,0), d2:(3,2,2), d3:(2,1,3), d4:(0,0,3), d5:(0,0,1) plotted in 3D space with axes x (Beijing), y (Chinese), z (Tokyo).]
Data Matrix
Representing data
Now, an intuitive representation of the data is a matrix
In the general case:
Columns correspond to features, i.e. dimensions or coordinates in an m-dimensional space
Rows correspond to data objects, i.e. data points
An element d_ij in the i-th row and the j-th column is the j-th coordinate of the i-th data point
The data is represented by a matrix D ∈ ℝ^(n×m), where n is the number of data points and m the number of features
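As a minimal sketch, the running example's data matrix can be held in a NumPy array (NumPy is assumed here; the slides do not prescribe an implementation):

```python
import numpy as np

# n = 5 data points (documents), m = 3 features (terms)
D = np.array([
    [0, 2, 0],
    [3, 2, 2],
    [2, 1, 3],
    [0, 0, 3],
    [0, 0, 1],
])

n, m = D.shape      # number of data points, number of features
print(n, m)         # 5 3
print(D[1, 0])      # 3, the first coordinate of the second data point
```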
Representing data
For text:
Columns correspond to terms, words, and so on
I.e. each word is a dimension or a coordinate in an m-dimensional space
Rows correspond to documents
An element d_ij in the i-th row and the j-th column is e.g. the number of occurrences of the j-th word in the i-th document
This matrix is called the document-term matrix
Representing data: example

    ⎛ 0 2 0 ⎞
    ⎜ 3 2 2 ⎟
D = ⎜ 2 1 3 ⎟
    ⎜ 0 0 3 ⎟
    ⎝ 0 0 1 ⎠
Vector Space Model: Document-Term Matrix
Vector interpretation
An alternative and very useful interpretation of data matrices is in terms of vectors
Each data point, i.e. each row in the data matrix, can be interpreted as a vector in an m-dimensional space
This allows us to work with:
1 the vector direction, i.e. the line connecting the origin and a given data point
2 the vector magnitude (length), i.e. the distance from the origin to a given data point → feature weighting
3 similarity, distance, and correlations between vectors
4 vector manipulation, and so on
This is the vector space model
Vector interpretation: Example
[Figure: geometric representation of the data. Vectors d1:(0,2,0), d2:(3,2,2), d3:(2,1,3), d4:(0,0,3), d5:(0,0,1) drawn from the origin, with axes x (Beijing), y (Chinese), z (Tokyo).]
Vector space model: feature weighting
What we did so far: we counted word occurrences and used these counts as the coordinates in a given dimension
This weighting scheme is referred to as term frequency (TF)
Three features: Beijing, Chinese, Tokyo

Doc   Beijing   Chinese   Tokyo
d1    0         2         0
d2    3         2         2
d3    2         1         3
d4    0         0         3
d5    0         0         1
Vector space model: TF
Please note the difference between the vector space model with TF weighting and e.g. the standard Boolean model in information retrieval
The standard Boolean model uses binary feature values:
If the feature j occurs in the document i (regardless of how many times), then d_ij = 1
Otherwise d_ij = 0
This allows very simple manipulation of the model, but does not capture the relative importance of the feature j for the document i
If a feature occurs more frequently in a document, then it is more important for that document, and TF captures this intuition
Vector space model: feature weighting
However, raw TF has a critical problem: all terms are considered equally important for assessing relevance
In fact, certain terms have no discriminative power at all with respect to relevance
For example, in a collection of documents on the auto industry, most probably every document will contain the term "auto"
We need to penalize terms which occur too often in the complete collection
We start with the notion of document frequency (DF) of a term
This is the number of documents that contain a given term
Vector space model: IDF
If a term is discriminative, then it will appear only in a small number of documents
I.e. its DF will be low, or its inverse document frequency (IDF) will be high
In practice we typically calculate IDF as:

IDF_j = log(N / DF_j),

where N is the number of documents and DF_j is the document frequency of the term j.
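A quick check of this formula in plain Python. The slides' numbers (0.3979 for Beijing, etc., later in the example) indicate a base-10 logarithm, which is assumed here:

```python
import math

N = 5                                        # total number of documents
DF = {"Beijing": 2, "Chinese": 3, "Tokyo": 4}

# IDF_j = log10(N / DF_j)
IDF = {term: math.log10(N / df) for term, df in DF.items()}

print(round(IDF["Beijing"], 4))  # 0.3979
print(round(IDF["Tokyo"], 4))    # 0.0969
```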
Vector space model: TFxIDF
Finally, we combine TF and IDF to produce a composite weight for a given feature:

TFxIDF_ij = TF_ij · log(N / DF_j)

Now, this quantity will be high if the feature j occurs only a couple of times in the whole collection, and most of these occurrences are in the document i
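The whole TFxIDF matrix for the running example can be computed in a few lines of NumPy (a sketch, assuming base-10 logs as above; the slides round IDF before multiplying, so the last digits may differ slightly):

```python
import numpy as np

# TF matrix (rows: d1..d5; columns: Beijing, Chinese, Tokyo)
TF = np.array([
    [0, 2, 0],
    [3, 2, 2],
    [2, 1, 3],
    [0, 0, 3],
    [0, 0, 1],
], dtype=float)

N = TF.shape[0]                       # number of documents
DF = np.count_nonzero(TF, axis=0)     # documents containing each term: [2, 3, 4]
IDF = np.log10(N / DF)

TFIDF = TF * IDF                      # broadcast IDF across the rows

# Row for d2, close to the slide values 1.1937, 0.4436, 0.1938
print(np.round(TFIDF[1], 4))
```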
TFxIDF: Example
Text documents. Suppose we have the following documents:

DocID   Document
d1      Chinese Chinese
d2      Chinese Chinese Tokyo Tokyo Beijing Beijing Beijing
d3      Chinese Tokyo Tokyo Tokyo Beijing Beijing
d4      Tokyo Tokyo Tokyo
d5      Tokyo
TFxIDF: Example
Three features: Beijing, Chinese, Tokyo
TF:

Doc   Beijing   Chinese   Tokyo
d1    0         2         0
d2    3         2         2
d3    2         1         3
d4    0         0         3
d5    0         0         1
TFxIDF: Example
Three features: Beijing, Chinese, Tokyo
DF:

DF_Beijing = 2
DF_Chinese = 3
DF_Tokyo = 4
TFxIDF: Example
Three features: Beijing, Chinese, Tokyo
IDF:

IDF_j = log(N / DF_j)

IDF_Beijing = 0.3979
IDF_Chinese = 0.2218
IDF_Tokyo = 0.0969
TFxIDF: Example
Three features: Beijing, Chinese, Tokyo
TFxIDF:

TFxIDF_ij = TF_ij · log(N / DF_j)

Doc   Beijing   Chinese   Tokyo
d1    0         0.4436    0
d2    1.1937    0.4436    0.1938
d3    0.7958    0.2218    0.2907
d4    0         0         0.2907
d5    0         0         0.0969
TFxIDF: Example
[Figure: geometric representation of the TFxIDF-weighted data. Points d1 to d5 plotted in 3D space with axes x (Beijing), y (Chinese), z (Tokyo), coordinates now on the TFxIDF scale.]
TFxIDF: Example
[Figure: the raw-TF representation for comparison. Points d1:(0,2,0), d2:(3,2,2), d3:(2,1,3), d4:(0,0,3), d5:(0,0,1) with axes x (Beijing), y (Chinese), z (Tokyo).]
Vector space model: TFxIDF
Further TF variants are possible:
Sub-linear TF scaling, e.g. √TF or 1 + log(TF)
It is unlikely that e.g. 20 occurrences of a single term carry 20 times the weight of a single occurrence
Also very common is TF normalization, e.g. dividing by the maximal TF in the collection
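A sketch of sub-linear TF scaling. The helper name and the convention of mapping TF = 0 to 0 are illustrative choices, not from the slides:

```python
import math

def sublinear_tf(tf: float) -> float:
    """Dampened term frequency: 1 + log10(TF) for TF > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

# 20 occurrences weigh roughly 2.3x a single occurrence, not 20x
print(sublinear_tf(1))              # 1.0
print(round(sublinear_tf(20), 3))   # 2.301
```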
Vector space model: vector manipulation
Vector addition: concatenate two documents
Vector subtraction: diff two documents
Transform the vector space into a new vector space: projection and feature selection
Similarity calculations
Distance calculations
Scaling for visualization
Vector space model: similarity and distance
Similarity and distance allow us to order documents
This is useful in many applications:
1 Relevance ranking (information retrieval)
2 Classification
3 Clustering
4 Transformation and projection
Properties of similarity and distance
Definition
Let D be the set of all data objects, e.g. all documents. Similarity and distance are functions sim : D × D → ℝ, dist : D × D → ℝ with the following properties:

Symmetry:
sim(d_i, d_j) = sim(d_j, d_i)
dist(d_i, d_j) = dist(d_j, d_i)

Self-similarity (self-distance):
sim(d_i, d_j) ≤ sim(d_i, d_i) = sim(d_j, d_j)
dist(d_i, d_j) ≥ dist(d_i, d_i) = dist(d_j, d_j) = 0
Properties of similarity and distance
Additional properties
Some additional properties of the sim function:

Normalization of similarity to the interval [0, 1]:
sim(d_i, d_j) ≥ 0
sim(d_i, d_i) = 1

Normalization of similarity to the interval [−1, 1]:
sim(d_i, d_j) ≥ −1
sim(d_i, d_i) = 1
Properties of similarity and distance
Additional properties
Some additional properties of the dist functions:

Triangle inequality:
dist(d_i, d_j) ≤ dist(d_i, d_k) + dist(d_k, d_j)

If the triangle inequality is satisfied, then the dist function defines a metric on the vector space
Euclidean distance
Norm
We can apply the Euclidean or ℓ2 norm to calculate a distance metric in the vector space model:

||x||_2 = sqrt( Σ_{i=1}^{n} x_i² )

||x||_2 = sqrt( xᵀ x )

The Euclidean distance of two vectors d_i and d_j is the length of their displacement vector x = d_i − d_j
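A minimal sketch of this definition with NumPy, using d2 and d3 from the running example:

```python
import numpy as np

d2 = np.array([3.0, 2.0, 2.0])   # d2 from the running example
d3 = np.array([2.0, 1.0, 3.0])   # d3

x = d2 - d3                      # displacement vector
dist = np.sqrt(x @ x)            # ||x||_2 = sqrt(x^T x)

print(round(float(dist), 8))     # 1.73205081, i.e. sqrt(3)
# Equivalent one-liner: np.linalg.norm(d2 - d3)
```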
Euclidean distance: Example
[Figure: geometric representation of the data. Points d1:(0,2,0), d2:(3,2,2), d3:(2,1,3), d4:(0,0,3), d5:(0,0,1) with axes x (Beijing), y (Chinese), z (Tokyo).]
Euclidean distance

    ⎛ d_1ᵀ ⎞
D = ⎜ d_2ᵀ ⎟
    ⎜  ⋮   ⎟
    ⎝ d_nᵀ ⎠
Euclidean distance

    ⎛ 0 2 0 ⎞
    ⎜ 3 2 2 ⎟
D = ⎜ 2 1 3 ⎟
    ⎜ 0 0 3 ⎟
    ⎝ 0 0 1 ⎠
Euclidean distance

       ⎛ 0.         3.60555128 3.74165739 3.60555128 2.23606798 ⎞
       ⎜ 3.60555128 0.         1.73205081 3.74165739 3.74165739 ⎟
dist = ⎜ 3.74165739 1.73205081 0.         2.23606798 3.         ⎟
       ⎜ 3.60555128 3.74165739 2.23606798 0.         2.         ⎟
       ⎝ 2.23606798 3.74165739 3.         2.         0.         ⎠
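The full pairwise distance matrix can be computed by broadcasting, a sketch with NumPy:

```python
import numpy as np

D = np.array([
    [0, 2, 0],
    [3, 2, 2],
    [2, 1, 3],
    [0, 0, 3],
    [0, 0, 1],
], dtype=float)

# diff[i, j] = D[i] - D[j]; dist[i, j] = ||D[i] - D[j]||_2
diff = D[:, None, :] - D[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

print(round(float(dist[1, 2]), 8))   # 1.73205081 (d2 vs. d3)
```

The result is symmetric with a zero diagonal, as the distance properties above require.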
Cosine similarity
Angle between vectors
We can think of the angle φ between two vectors d_i and d_j as a measure of their similarity.

Smaller angles mean that the vectors point in two directions that are close to each other
Thus, smaller angles mean that the vectors are more similar
Larger angles imply smaller similarity
To obtain a numerical value from the interval [0, 1] we can calculate cos φ
The cosine of an angle can also be negative! Why do we always get values from [0, 1] here?
Cosine similarity: Example
[Figure: geometric representation of the data. Vectors d1:(0,2,0), d2:(3,2,2), d3:(2,1,3), d4:(0,0,3), d5:(0,0,1), with the angles φ(1,5) between d1 and d5 and φ(2,3) between d2 and d3 highlighted.]
Cosine similarity
We can calculate the cosine similarity by making use of the dot product:

sim(d_i, d_j) = d_iᵀ d_j / ( ||d_i||_2 ||d_j||_2 )
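This formula can be sketched directly with NumPy; the helper name `cosine_sim` is an illustrative choice:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d2 = np.array([3.0, 2.0, 2.0])
d3 = np.array([2.0, 1.0, 3.0])

print(round(cosine_sim(d2, d3), 8))  # 0.90748521
```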
Cosine similarity

    ⎛ 0 2 0 ⎞
    ⎜ 3 2 2 ⎟
D = ⎜ 2 1 3 ⎟
    ⎜ 0 0 3 ⎟
    ⎝ 0 0 1 ⎠
Cosine similarity

      ⎛ 1.         0.48507125 0.26726124 0.         0.         ⎞
      ⎜ 0.48507125 1.         0.90748521 0.48507125 0.48507125 ⎟
sim = ⎜ 0.26726124 0.90748521 1.         0.80178373 0.80178373 ⎟
      ⎜ 0.         0.48507125 0.80178373 1.         1.         ⎟
      ⎝ 0.         0.48507125 0.80178373 1.         1.         ⎠
Recommender Systems: Utility Matrix
Representing data as matrices
There are many other sources of data that can be represented as large matrices
We also use matrices to represent social networks, measurement data, and much more
In recommender systems, we represent user ratings of items as a utility matrix
Recommender systems
Recommender systems predict user responses to options
E.g. recommend news articles based on predictions of user interests
E.g. recommend products based on predictions of what a user might like
Recommender systems are necessary whenever users interact with huge catalogs of items
E.g. thousands, even millions of products, movies, songs, and so on
Recommender systems
These huge numbers of items arise because online you can also interact with the "long tail" of items
These are not available in retail because of e.g. limited shelf space
Recommender systems try to support this interaction by suggesting certain items to the user
The suggestions or recommendations are based on what the system knows about the user and about the items
Formal model: the utility matrix
In a recommender system there are two classes of entities: users and items

Utility Function
Let us denote the set of all users with U and the set of all items with I. We define the utility function u : U × I → R, where R is a set of ratings and is a totally ordered set.
For example, R = {1, 2, 3, 4, 5} is a set of star ratings, or R = [0, 1] is the set of real numbers from that interval.
The Utility Matrix
The utility function maps pairs of users and items to numbers
These numbers can be represented by a utility matrix M ∈ ℝ^(n×m), where n is the number of users and m the number of items
The matrix gives a value for each user-item pair for which we know the preference of that user for that item
E.g. the values can come from an ordered set (1 to 5) and represent a rating that a user gave for an item
The Utility Matrix
We assume that the matrix is sparse
This means that most entries are unknown
The majority of user preferences for specific items are unknown
An unknown rating means that we do not have explicit information
It does not mean that the rating is low
Formally: the goal of a recommender system is to predict the blanks in the utility matrix
The Utility Matrix: example

User   HP1   HP2   HP3   Hobbit   SW1   SW2   SW3
A      4                 5        1
B      5     5     4
C                        2        4     5
D            3                                3
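A sparse utility matrix like the one above can be sketched in NumPy, using `np.nan` for the blanks (an illustrative encoding choice; the slides simply leave the cells empty):

```python
import numpy as np

# Rows: users A, B, C, D; columns: HP1, HP2, HP3, Hobbit, SW1, SW2, SW3
nan = np.nan
M = np.array([
    [4,   nan, nan, 5,   1,   nan, nan],  # user A
    [5,   5,   4,   nan, nan, nan, nan],  # user B
    [nan, nan, nan, 2,   4,   5,   nan],  # user C
    [nan, 3,   nan, nan, nan, nan, 3  ],  # user D
])

known = ~np.isnan(M)
print(int(known.sum()), "of", M.size, "ratings known")  # 11 of 28 ratings known
```

The recommender's task is then exactly to fill in the `nan` entries.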
Recommender systems: short summary
Three key problems in recommender systems:
Collecting data for the utility matrix
1 Explicit, e.g. by collecting ratings
2 Implicit, e.g. via interactions (a user buying a product implies a high rating)
Predicting the missing values in the utility matrix
1 Problems are sparsity, cold start, and so on
2 Content-based, collaborative filtering, matrix factorization
Evaluating predictions
1 Training dataset, test dataset, error