Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Hierarchical thematic model
visualising algorithm
Arsenty Kuzmin, Alexander Aduenko and Vadim Strijov
Moscow Institute of Physics and Technology
Department of Control and Applied Mathematics
High School of Economics26.11.2013
Problem statementAlgorithm of clusterization and visualisation
Experiment
Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference
EURO 2013 Thematic visualisation
We must offer a Decision Support System for thematic clustering.
The goals:
to construct a thematic model of the conference,
to reveal the inconsistencies between the constructed and theexpert model,
to visualise the expert model and the revealed inconsistencies.
Call for algorithmic model:
to be similar with the expert model,
to rank the inconsistencieses,
to have a plain representation of the conference structure.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 2 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference
EURO 2013: the conference structure
Create the expert thematic model
1 A group of experts is responsiable for an Area,
2 participants submit their Abstracts to the collection,
3 the experts distribute the Abstracts over the Streams,
4 the Abstracts are organised into the Sessions.A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 3 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference
Challenges
Сauses of problems
1 Great number of the experts (more than 200),
2 expert classification could be controversial,
3 there is no base thematic model.
The problems
1 To verify thematic consistency,
2 to detect inconsistencies in the hierarchical model,
3 to detect the unclaimed Streams and Sessions,
4 to assess quality of the expert hierarchical model.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 4 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference
Matrix “document/terms”
Let the theme of the document be determined by its terms.
W = {w1, . . . ,wn} is the dictionary of the conference.
Let the document be the bag of words.
The document d of the collection D is an unordered set of words ofthe dictionary W , d = {wj}, j ∈ {1, . . . , n}.
xs 7→xs
√
xT
s xs
, X =
x1,1 . . . x1,n
. . . . . . . . .
x|D|,1 . . . x|D|,n
.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 5 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference
Hierarchical representation of the thematic model
Each leaf (h, i) of the treecorresponds to the document di .
Each node (ℓ, i), ℓ 6= h
corresponds to the cluster cℓ,i ,which consists of correspondingdocuments.
Here ℓ is a conference level, h = 5 is the number levels and i is theindex of a node given level.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 6 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Similarity function
Define the similarity function s(·, ·) between documents xi and xjas:
s(xi , xj) =x
T
i xj
‖xi‖2‖xj‖2
= xT
i xj .
Define the similarity function S(·, ·) between clusters cℓ,i and cℓ,j asthe mean s(x, y) between their documents x ∈ cℓ,i , y ∈ cℓ,j
S(cℓ,i , cℓ,j) =1
|A|
∑
(x, y)∈A
s(x, y),
where A is the set of all document pairs from clusters cℓ,i and cℓ,j ,x ∈ cℓ,i , y ∈ cℓ,j , x 6= y.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 7 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
The clustering quality function
Suppose F0 is a mean intra-cluster similarity: F0 =1
kℓ
kℓ∑
i=1
S(cℓ,i , cℓ,i ),
and F1 is a mean inter-cluster similarity: F1 =2
kℓ(kℓ − 1)
∑
i<j
S(cℓ,i , cℓ,j)
Clustering quality criterion
F = F1F0
→ min.
The expert hierarchical model is the origin for the algorithmicthematic model.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 8 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Distance and similarity functions comparison
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
1.32
1.33
1.34
1.35
1.36
1.37
1.38
1.39
1.4
Euclidean distance
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
5.2
5.4
5.6
5.8
6
Hellinger distance
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 9 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Distance and similarity functions comparison
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
1.75
1.8
1.85
1.9
1.95
2
2.05
2.1
Jenson-Shannon distance
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
0.02
0.04
0.06
0.08
0.1
0.12
Proposed similarity function
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 10 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
The quality function of the hierarchical model
The linear combination of intra- and inter-cluster similarities:
Q(x̄1, . . . , x̄k) =
=
h−1∑
ℓ=2
1 − α
kℓ
kℓ∑
i=1
|cℓ,i |S(cℓ,i , cℓ,i)−2α
kℓ(kℓ − 1)
∑
i<j
S(cℓ,i , cℓ,j)
→ max
α ∈ [0, 1] is the weights coefficient; it determines the clusteringpriority.kℓ is the quantity of clusters on level ℓ.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 11 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Creating an algorithmic model, similar to the expert model
Penalty matrix.
PPPPPPPPP
FromTo
(+,+) (+,−) (−,−)
(+,+) δ11 δ12 δ13(+,−) δ21 δ22 δ23(−,−) δ31 δ32 δ33
Move a document from its expert cluster to an algorithmic cluster if
Q2 − Q1 ≥ δ
the algorithmic model quality drastically increases.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 12 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Nested visualisation
Visualisation requirements
1 to keep hierarchical structure of the expert model
2 to preserve the relative distances
µ(cℓ,i) is the coordinates ofthe cluster cℓ,i center
ρ(·, ·) is the distancebetween documents in R
|W |
ρ2(·, ·) is the distancebetween projections of thedocuments
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 13 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Similarity functionHierarchical clusteringThe nested visualisation
Nested model creation
Let the cluster cℓ,i with the plain radius R is already placed on theplain; C1, . . . , Cq are the clusters of the level ℓ+ 1, which are incℓ,i , µ(C1), . . . , µ(Cq) are their centers, r1, . . . , rq are theirradiuses.
1 Make a Sammon’s projection of the centers µ(C1), . . . , µ(Cq).
2 Find the plain radiuses r̂1, . . . , r̂q of the clusters C1, . . . , Cq
as:r̂j = min
i 6=j
rj
rj + riρ2
(
µ(Ci ), µ(Cj))
.
3 Find ρ̂ = maxj∈{1, ..., q}
ρ2
(
µ(Cj),µℓ,i
)
+ r̂j is the distance to the
border of the projection, taking into consideration sizes ofclusters.
4 Translate the homothety of the ratioR
ρ̂and center µ(cℓ,i)
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 14 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Documents collection
The purpose of the experiment
Visualisation of hierarchical thematic model of EURO 2013.
Documents quantity: |D| = 2313, Areas: 24, Streams: 137.
Size of the dictionary: |W | = 1261.
Penalties are determined by the parameter of nonconformancewith the expert model: γ ≥ 0, F = γF̃.
Penalties matrix F̃.PPPPPPPPP
FromTo
(+,+) (+,−) (−,−)
(+,+) 0 0.002 0.005
(+,−) -0.001 0 0.003
(−,−) -0.003 -0.002 0
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 15 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Clustering results with various penalties γ
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 16 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Area similarity
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
0.02
0.04
0.06
0.08
0.1
0.12
Expert clustering
Area number
Are
anum
ber
5 10 15 20
5
10
15
20
0.05
0.1
0.15
0.2
0.25
0.3
Algorithmic clustering
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 17 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Streams similarity
Stream number
Str
eam
num
ber
20 40 60 80 100 120
20
40
60
80
100
120
0.1
0.2
0.3
0.4
0.5
Expert clustering
Stream number
Str
eam
num
ber
20 40 60 80 100 120
20
40
60
80
100
120
0.1
0.2
0.3
0.4
0.5
0.6
Algorithmic clustering
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 18 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Inconsistencies between models, middling penalties
Expert model
Alg
ori
thm
icm
odel
5 10 15 20
5
10
15
20
0
10
20
30
40
50
60
70
80
Area level
Expert model
Alg
ori
thm
icm
odel
20 40 60 80 100 120
20
40
60
80
100
120
0
20
40
60
80
100
Stream level
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 19 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Degree of the cluster inconsistency
For each document the degree of inconsistency between the expertand the algorithmic model is determined by
1 the level number, where the models differ,
2 the distance between the expert and the algorithmic cluster onthis level.
Map the range of inconsistency to the color scale:
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 20 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Visualisation of inconsistencieses, γ = 1.25
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 21 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Visualisation of inconsistencieses, γ = 1.25
Ambiguous Networks
Game Theoretical Network Models
An approximation for the kth nearest distance and its application to location analysis
OR: Visualization and Arts
A column generation based heuristic for the solution of the single source capacitated multi−facility Weber problem
Location Analysis
Heuristics for a continuous multi−facility location al− location problem with demand regions
Supply Chain Planning
A Clustering Based Approach for Solving Planar Hub Location Problems
Hub Location
Using Structural Aspects to Find Susceptible Users in Social Networks
Computational Statistics
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 22 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Visualisation of inconsistencieses, γ = 1.25
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 23 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Visualisation of inconsistencieses, γ = 0.7
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 24 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Visualisation of inconsistencieses, γ = 0.5
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 25 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
The new goals
The conference structure varies slightly from year to year.
Collect information about previous conferences.
Make a global list of clusters and a global dictionary.
Predict a scruсture of the next conference without expertmodel.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 26 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Predict EURO 2012 model using structure of EURO 2012
0 2 4 6 8 1016
16.5
17
17.5
18
18.5
warea
Aver
age
ord
er
Extracted document
0 2 4 6 8 103
3.5
4
4.5
5
5.5
6
warea
Aver
age
ord
er
Not extracted document
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 27 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Predict EURO 2012 model using structure of EURO 2013
0 2 4 6 8 1021
21.5
22
22.5
23
23.5
24
24.5
warea
Aver
age
ord
er
Not extracted documentorder
Share
ofsa
mple
s
0 20 40 60 80 100 1200
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Not extracted document
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 28 / 29
Problem statementAlgorithm of clusterization and visualisation
Experiment
Hierarchical clusteringVisualisation of inconsistencieses
Conclusion
The new solutions:
a similarity function between clusters,
a method for thematic hierarchical model,
the way of visualising a large thematic model on the plain,
the way of visualising inconsistencieses between the expert andthe algorithmic model.
A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 29 / 29