29
Hierarchical thematic model visualising algorithm Arsenty Kuzmin, Alexander Aduenko and Vadim Strijov Moscow Institute of Physics and Technology Department of Control and Applied Mathematics High School of Economics 26.11.2013

Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Hierarchical thematic model

visualising algorithm

Arsenty Kuzmin, Alexander Aduenko and Vadim Strijov

Moscow Institute of Physics and Technology

Department of Control and Applied Mathematics

High School of Economics26.11.2013

Page 2: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference

EURO 2013 Thematic visualisation

We must offer a Decision Support System for thematic clustering.

The goals:

to construct a thematic model of the conference,

to reveal the inconsistencies between the constructed and theexpert model,

to visualise the expert model and the revealed inconsistencies.

Call for algorithmic model:

to be similar with the expert model,

to rank the inconsistencieses,

to have a plain representation of the conference structure.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 2 / 29

Page 3: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference

EURO 2013: the conference structure

Create the expert thematic model

1 A group of experts is responsiable for an Area,

2 participants submit their Abstracts to the collection,

3 the experts distribute the Abstracts over the Streams,

4 the Abstracts are organised into the Sessions.A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 3 / 29

Page 4: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference

Challenges

Сauses of problems

1 Great number of the experts (more than 200),

2 expert classification could be controversial,

3 there is no base thematic model.

The problems

1 To verify thematic consistency,

2 to detect inconsistencies in the hierarchical model,

3 to detect the unclaimed Streams and Sessions,

4 to assess quality of the expert hierarchical model.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 4 / 29

Page 5: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference

Matrix “document/terms”

Let the theme of the document be determined by its terms.

W = {w1, . . . ,wn} is the dictionary of the conference.

Let the document be the bag of words.

The document d of the collection D is an unordered set of words ofthe dictionary W , d = {wj}, j ∈ {1, . . . , n}.

xs 7→xs

xT

s xs

, X =

x1,1 . . . x1,n

. . . . . . . . .

x|D|,1 . . . x|D|,n

.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 5 / 29

Page 6: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Structure of a large conferenceMain problems and proposed solutionsMathematical description of the conference

Hierarchical representation of the thematic model

Each leaf (h, i) of the treecorresponds to the document di .

Each node (ℓ, i), ℓ 6= h

corresponds to the cluster cℓ,i ,which consists of correspondingdocuments.

Here ℓ is a conference level, h = 5 is the number levels and i is theindex of a node given level.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 6 / 29

Page 7: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Similarity function

Define the similarity function s(·, ·) between documents xi and xjas:

s(xi , xj) =x

T

i xj

‖xi‖2‖xj‖2

= xT

i xj .

Define the similarity function S(·, ·) between clusters cℓ,i and cℓ,j asthe mean s(x, y) between their documents x ∈ cℓ,i , y ∈ cℓ,j

S(cℓ,i , cℓ,j) =1

|A|

(x, y)∈A

s(x, y),

where A is the set of all document pairs from clusters cℓ,i and cℓ,j ,x ∈ cℓ,i , y ∈ cℓ,j , x 6= y.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 7 / 29

Page 8: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

The clustering quality function

Suppose F0 is a mean intra-cluster similarity: F0 =1

kℓ

kℓ∑

i=1

S(cℓ,i , cℓ,i ),

and F1 is a mean inter-cluster similarity: F1 =2

kℓ(kℓ − 1)

i<j

S(cℓ,i , cℓ,j)

Clustering quality criterion

F = F1F0

→ min.

The expert hierarchical model is the origin for the algorithmicthematic model.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 8 / 29

Page 9: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Distance and similarity functions comparison

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

1.32

1.33

1.34

1.35

1.36

1.37

1.38

1.39

1.4

Euclidean distance

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

5.2

5.4

5.6

5.8

6

Hellinger distance

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 9 / 29

Page 10: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Distance and similarity functions comparison

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

1.75

1.8

1.85

1.9

1.95

2

2.05

2.1

Jenson-Shannon distance

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

0.02

0.04

0.06

0.08

0.1

0.12

Proposed similarity function

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 10 / 29

Page 11: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

The quality function of the hierarchical model

The linear combination of intra- and inter-cluster similarities:

Q(x̄1, . . . , x̄k) =

=

h−1∑

ℓ=2

1 − α

kℓ

kℓ∑

i=1

|cℓ,i |S(cℓ,i , cℓ,i)−2α

kℓ(kℓ − 1)

i<j

S(cℓ,i , cℓ,j)

→ max

α ∈ [0, 1] is the weights coefficient; it determines the clusteringpriority.kℓ is the quantity of clusters on level ℓ.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 11 / 29

Page 12: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Creating an algorithmic model, similar to the expert model

Penalty matrix.

PPPPPPPPP

FromTo

(+,+) (+,−) (−,−)

(+,+) δ11 δ12 δ13(+,−) δ21 δ22 δ23(−,−) δ31 δ32 δ33

Move a document from its expert cluster to an algorithmic cluster if

Q2 − Q1 ≥ δ

the algorithmic model quality drastically increases.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 12 / 29

Page 13: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Nested visualisation

Visualisation requirements

1 to keep hierarchical structure of the expert model

2 to preserve the relative distances

µ(cℓ,i) is the coordinates ofthe cluster cℓ,i center

ρ(·, ·) is the distancebetween documents in R

|W |

ρ2(·, ·) is the distancebetween projections of thedocuments

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 13 / 29

Page 14: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Similarity functionHierarchical clusteringThe nested visualisation

Nested model creation

Let the cluster cℓ,i with the plain radius R is already placed on theplain; C1, . . . , Cq are the clusters of the level ℓ+ 1, which are incℓ,i , µ(C1), . . . , µ(Cq) are their centers, r1, . . . , rq are theirradiuses.

1 Make a Sammon’s projection of the centers µ(C1), . . . , µ(Cq).

2 Find the plain radiuses r̂1, . . . , r̂q of the clusters C1, . . . , Cq

as:r̂j = min

i 6=j

rj

rj + riρ2

(

µ(Ci ), µ(Cj))

.

3 Find ρ̂ = maxj∈{1, ..., q}

ρ2

(

µ(Cj),µℓ,i

)

+ r̂j is the distance to the

border of the projection, taking into consideration sizes ofclusters.

4 Translate the homothety of the ratioR

ρ̂and center µ(cℓ,i)

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 14 / 29

Page 15: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Documents collection

The purpose of the experiment

Visualisation of hierarchical thematic model of EURO 2013.

Documents quantity: |D| = 2313, Areas: 24, Streams: 137.

Size of the dictionary: |W | = 1261.

Penalties are determined by the parameter of nonconformancewith the expert model: γ ≥ 0, F = γF̃.

Penalties matrix F̃.PPPPPPPPP

FromTo

(+,+) (+,−) (−,−)

(+,+) 0 0.002 0.005

(+,−) -0.001 0 0.003

(−,−) -0.003 -0.002 0

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 15 / 29

Page 16: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Clustering results with various penalties γ

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 16 / 29

Page 17: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Area similarity

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

0.02

0.04

0.06

0.08

0.1

0.12

Expert clustering

Area number

Are

anum

ber

5 10 15 20

5

10

15

20

0.05

0.1

0.15

0.2

0.25

0.3

Algorithmic clustering

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 17 / 29

Page 18: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Streams similarity

Stream number

Str

eam

num

ber

20 40 60 80 100 120

20

40

60

80

100

120

0.1

0.2

0.3

0.4

0.5

Expert clustering

Stream number

Str

eam

num

ber

20 40 60 80 100 120

20

40

60

80

100

120

0.1

0.2

0.3

0.4

0.5

0.6

Algorithmic clustering

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 18 / 29

Page 19: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Inconsistencies between models, middling penalties

Expert model

Alg

ori

thm

icm

odel

5 10 15 20

5

10

15

20

0

10

20

30

40

50

60

70

80

Area level

Expert model

Alg

ori

thm

icm

odel

20 40 60 80 100 120

20

40

60

80

100

120

0

20

40

60

80

100

Stream level

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 19 / 29

Page 20: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Degree of the cluster inconsistency

For each document the degree of inconsistency between the expertand the algorithmic model is determined by

1 the level number, where the models differ,

2 the distance between the expert and the algorithmic cluster onthis level.

Map the range of inconsistency to the color scale:

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 20 / 29

Page 21: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Visualisation of inconsistencieses, γ = 1.25

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 21 / 29

Page 22: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Visualisation of inconsistencieses, γ = 1.25

Ambiguous Networks

Game Theoretical Network Models

An approximation for the kth nearest distance and its application to location analysis

OR: Visualization and Arts

A column generation based heuristic for the solution of the single source capacitated multi−facility Weber problem

Location Analysis

Heuristics for a continuous multi−facility location al− location problem with demand regions

Supply Chain Planning

A Clustering Based Approach for Solving Planar Hub Location Problems

Hub Location

Using Structural Aspects to Find Susceptible Users in Social Networks

Computational Statistics

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 22 / 29

Page 23: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Visualisation of inconsistencieses, γ = 1.25

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 23 / 29

Page 24: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Visualisation of inconsistencieses, γ = 0.7

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 24 / 29

Page 25: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Visualisation of inconsistencieses, γ = 0.5

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 25 / 29

Page 26: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

The new goals

The conference structure varies slightly from year to year.

Collect information about previous conferences.

Make a global list of clusters and a global dictionary.

Predict a scruсture of the next conference without expertmodel.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 26 / 29

Page 27: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Predict EURO 2012 model using structure of EURO 2012

0 2 4 6 8 1016

16.5

17

17.5

18

18.5

warea

Aver

age

ord

er

Extracted document

0 2 4 6 8 103

3.5

4

4.5

5

5.5

6

warea

Aver

age

ord

er

Not extracted document

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 27 / 29

Page 28: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Predict EURO 2012 model using structure of EURO 2013

0 2 4 6 8 1021

21.5

22

22.5

23

23.5

24

24.5

warea

Aver

age

ord

er

Not extracted documentorder

Share

ofsa

mple

s

0 20 40 60 80 100 1200

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Not extracted document

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 28 / 29

Page 29: Hierarchical thematic model visualising algorithm · 2013. 11. 27. · A.Kuzmin, A.Aduenko and V.Strijov Thematic model visualising 2 / 29. Problem statement Algorithm of clusterization

Problem statementAlgorithm of clusterization and visualisation

Experiment

Hierarchical clusteringVisualisation of inconsistencieses

Conclusion

The new solutions:

a similarity function between clusters,

a method for thematic hierarchical model,

the way of visualising a large thematic model on the plain,

the way of visualising inconsistencieses between the expert andthe algorithmic model.

A. Kuzmin, A. Aduenko and V. Strijov Thematic model visualising 29 / 29