UTOPIAN: User-Driven Topic Modeling Based on Interactive Nonnegative Matrix Factorization

Jaegul Choo1*, Changhyun Lee1, Chandan K. Reddy2, and Haesun Park1

1Georgia Institute of Technology, 2Wayne State University

*e-mail: jaegul.choo@cc.gatech.edu

Intro: Topic Modeling

Topic: a distribution over keywords

Document: a distribution over topics

Keywords: gene, dna, genetic, life, evolve, organism, brain, neuron, nerve

Topics: Topic 1, Topic 2, Topic 3

Documents: Document 1, Document 2, Document 3, Document 4

Latent Dirichlet Allocation (LDA) in Visual Analytics

• LDA has been widely used in visual analytics.
• TIARA [Wei et al. KDD10], iVisClustering [Lee et al. EuroVis12], ParallelTopics [Dou et al. VAST12], TopicViz [Eisenstein et al. CHI-WIP12], …

*Image courtesy of original papers.

Overview of Our Work

• Proposes nonnegative matrix factorization (NMF) for topic modeling.
• Highlights advantages of NMF over LDA in visual analytics.
• Presents UTOPIAN, an NMF-based interactive topic modeling system.

[Figure labels: topic merging, topic splitting, doc-induced topic creation, keyword-induced topic creation]

What is Nonnegative Matrix Factorization?

Nonnegative Matrix Factorization (NMF)

Lower-rank approximation with nonnegativity constraints

Why nonnegativity? Easy interpretation and semantically meaningful output

Algorithm: alternating nonnegativity-constrained least squares [Kim et al., 2008]

$A \approx WH$

$\min_{W \ge 0,\, H \ge 0} \| A - WH \|_F$
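The factorization above can be sketched in a few lines of NumPy. Note that this sketch swaps in Lee-Seung multiplicative updates for brevity; it is not the alternating nonnegativity-constrained least squares algorithm cited on the slide. The function name, the toy matrix, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=500, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix A (m x n) as W @ H with W, H >= 0.

    Multiplicative updates only ever multiply entries by nonnegative
    ratios, so W and H stay nonnegative throughout.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps   # random positive start
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Hypothetical 6 x 4 nonnegative matrix built from two known factors.
W0 = np.array([[1., 0.], [2., 0.], [1., 0.], [0., 1.], [0., 2.], [0., 1.]])
H0 = np.array([[1., 2., 0., 0.], [0., 0., 3., 1.]])
A = W0 @ H0

W, H = nmf_multiplicative(A, k=2)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative error: {rel_err:.2e}")
```

Because the toy matrix has an exact rank-2 nonnegative factorization, the relative error should become small; on real data the result depends on the initialization and on k.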

NMF as Topic Modeling

$A \approx WH$

Topic: a distribution over keywords (W)

Document: a distribution over topics (H)

Topics: Topic 1, Topic 2, Topic 3

Keywords: gene, dna, genetic, life, evolve, organism, brain, neuron, nerve

Documents: Document 1, Document 2, Document 3, Document 4
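As a rough illustration of how the two factors might be read as topics, the sketch below takes hand-made factors, lists the top keywords of each topic, and normalizes each document's topic weights. It assumes W is keywords × topics and H is topics × documents; the numeric values and the helper names are hypothetical and are not taken from the slides.

```python
import numpy as np

vocab = ["gene", "dna", "genetic", "life", "evolve",
         "organism", "brain", "neuron", "nerve"]

# Hypothetical factors: W is keywords x topics, H is topics x documents.
W = np.array([
    [0.9, 0.0, 0.0],
    [0.8, 0.1, 0.0],
    [0.7, 0.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.8, 0.0],
    [0.0, 0.7, 0.1],
    [0.0, 0.0, 0.9],
    [0.0, 0.0, 0.8],
    [0.0, 0.1, 0.7],
])
H = np.array([
    [2.0, 0.5, 0.0, 0.2],
    [0.5, 1.5, 0.3, 0.0],
    [0.0, 0.0, 1.2, 1.8],
])

def top_keywords(W, vocab, n=3):
    """Return the n highest-weighted keywords of each topic (column of W)."""
    order = np.argsort(-W, axis=0)[:n]          # n x k keyword indices
    return [[vocab[i] for i in order[:, t]] for t in range(W.shape[1])]

def document_topic_distribution(H):
    """Scale each document (column of H) so its topic weights sum to 1."""
    return H / H.sum(axis=0, keepdims=True)

topics = top_keywords(W, vocab)
doc_topics = document_topic_distribution(H)
dominant = doc_topics.argmax(axis=0)            # most prominent topic per document

for t, words in enumerate(topics, start=1):
    print(f"Topic {t}: {', '.join(words)}")
print("dominant topic per document:", (dominant + 1).tolist())
```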

Why NMF in Visual Analytics?

Advantages of NMF in Visual Analytics

• Reliable algorithmic behaviors
• Flexible support for user interactions

NMF vs. LDA: Consistency from Multiple Runs

[Figure: documents' topical membership changes among 10 runs. Panels: InfoVis/VAST paper data set; 20 Newsgroups data set]
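The slide does not say how membership changes were computed. One plausible measure, sketched below under that caveat, is the fraction of documents whose assigned topic differs between two runs after relabeling one run's topics to best match the other, since topic indices are arbitrary across runs. The function name and the toy label lists are hypothetical, and the brute-force search over permutations is only practical for a small number of topics.

```python
from itertools import permutations

def membership_change(labels_a, labels_b, k):
    """Fraction of documents whose topic differs between two runs,
    after relabeling run B's topics to best agree with run A."""
    n = len(labels_a)
    best_agree = 0
    for perm in permutations(range(k)):          # try every relabeling of B
        agree = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best_agree = max(best_agree, agree)
    return 1.0 - best_agree / n

# Hypothetical topic assignments for six documents in three runs.
run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]   # same grouping as run_a, topics renumbered
run_c = [2, 2, 0, 1, 1, 1]   # one document lands in a different group

same = membership_change(run_a, run_b, k=3)
moved = membership_change(run_a, run_c, k=3)
print(same, moved)
```

Averaging this quantity over all pairs of runs would give a single consistency score per method.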

NMF vs. LDA: Empirical Convergence

[Figure: documents' topical membership changes between iterations, InfoVis/VAST paper data set. LDA: 10 minutes; NMF: 48 seconds]

NMF vs. LDA: Topic Summary (Top Keywords)

InfoVis/VAST paper data set

         NMF Run #1             NMF Run #2             LDA Run #1               LDA Run #2
Topic 1  visualization, design  visualization, design  documents, similarities  documents, query
Topic 2  information, user      information, user      knowledge, edge          analysts, scatterplot
Topic 3  analysis, system       analysis, system       query, collaborative     spatial, collaborative
Topic 4  graph, layout          graph, layout          social, tree             text, documents
Topic 5  visual, analytics      visual, analytics      measures, multivariate   multidimensional, high
Topic 6  data, sets             data, sets             tree, animation          tree, aggregation
Topic 7  color, weaving         color, weaving         dimensions, treemap      dimensions, treemap

Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA.

Advantages of NMF in Visual Analytics

• Reliable algorithmic behaviors
• Flexible support for user interactions

Weakly Supervised NMF [Choo et al., DMKD, accepted with rev.]

$\min_{W \ge 0,\, H \ge 0} \| A - WH \|_F^2 + \alpha \| (W - W_r) M_W \|_F^2 + \beta \| M_H (H - D_H H_r) \|_F^2$

• $W_r$, $H_r$: reference matrices for W and H
• $M_W$, $M_H$: diagonal matrices for weighting/masking columns/rows of W and H

Provides flexible yet intuitive means for user interaction.

Maintains the same computational complexity as original NMF.
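To make the roles of the reference and masking matrices concrete, the sketch below simply evaluates the objective as written on the slide. The shapes are inferred from the formula (W is m × k, H is k × n, and $M_W$, $M_H$, $D_H$ are k × k diagonal matrices). The slide does not define $D_H$, so it is set to the identity here purely as an assumption. The function name and the toy values are hypothetical.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, M_W, H_r, M_H, D_H, alpha, beta):
    """Evaluate ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
    + beta ||M_H (H - D_H H_r)||_F^2."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_pen = np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_pen = np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + alpha * w_pen + beta * h_pen

rng = np.random.default_rng(0)
A = rng.random((4, 3))            # 4 keywords x 3 documents (toy data)
W = rng.random((4, 2))            # 2 topics
H = rng.random((2, 3))

# Hypothetical user input: topic 0 should be dominated by keyword 0.
W_r = W.copy()
W_r[:, 0] = [1.0, 0.0, 0.0, 0.0]
H_r = np.zeros_like(H)

M_W = np.diag([1.0, 0.0])         # penalize only topic 0's column of W
M_H = np.zeros((2, 2))            # no constraint on H in this example
D_H = np.eye(2)                   # assumption: identity scaling

no_mask = np.zeros((2, 2))
plain = ws_nmf_objective(A, W, H, W_r, no_mask, H_r, M_H, D_H, 1.0, 1.0)
steered = ws_nmf_objective(A, W, H, W_r, M_W, H_r, M_H, D_H, alpha=2.0, beta=1.0)
print(plain, steered)
```

With both masks set to zero the objective reduces to the plain NMF fit term, which is one way to see that unconstrained topics are left free while masked ones are pulled toward the user's reference.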

UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF

UTOPIAN Overview

[Figure: UTOPIAN interface, annotated with topic merging, topic splitting, doc-induced topic creation, and keyword-induced topic creation]

Supervised t-distributed stochastic neighbor embedding (t-SNE)

User interactions supported
• Keyword refinement
• Topic merging/splitting
• Keyword-/document-induced topic creation

Real-time interaction via PIVE (Per-Iteration Visualization Environment)

Supervised t-SNE

Original t-SNE
• Documents are often too noisy to work with.

Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic cluster.
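The distance rule above can be sketched as a small preprocessing step on a pairwise distance matrix. The slide does not state the base distance or the value of α; Euclidean distance and α = 0.5 are assumptions here, as are the function name and the toy points. The resulting matrix could then be passed to any t-SNE implementation that accepts precomputed distances.

```python
import numpy as np

def supervised_distances(X, labels, alpha=0.5):
    """Pairwise Euclidean distances, scaled by alpha for every pair of
    points that share a topic-cluster label."""
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))            # plain pairwise distances
    same = labels[:, None] == labels[None, :]        # True where clusters match
    return np.where(same, alpha * D, D)

# Three hypothetical documents in 2-D; the first two share a topic cluster.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
labels = [0, 0, 1]

D_sup = supervised_distances(X, labels, alpha=0.5)
print(D_sup)
```

With α below 1, documents in the same topic cluster are pulled closer together before embedding, which is presumably what tightens the clusters in the supervised view.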

PIVE (Per-Iteration Visualization Environment) for Real-time Interaction [Choo et al., under revision]

[Figure: standard approach vs. PIVE approach]
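The slides name PIVE but do not describe its internals. Reading only from the name, the contrast with a standard approach might look like the sketch below: instead of returning a result only after the final iteration, an iterative method hands back every intermediate state so a front end could redraw between iterations. The function names and the toy iteration are hypothetical and stand in for an actual NMF or t-SNE update.

```python
def run_to_completion(step, state, n_iter):
    """Standard approach: the caller sees only the final state."""
    for _ in range(n_iter):
        state = step(state)
    return state

def run_per_iteration(step, state, n_iter):
    """Per-iteration approach: yield each intermediate state so the
    caller can visualize it (or react to it) before the next step runs."""
    for i in range(n_iter):
        state = step(state)
        yield i + 1, state

# Toy iterative method: Babylonian iteration toward sqrt(2).
def step(x):
    return 0.5 * (x + 2.0 / x)

frames = []
for iteration, x in run_per_iteration(step, 1.0, 5):
    frames.append((iteration, x))      # a real UI would render here

final = run_to_completion(step, 1.0, 5)
print(frames[-1], final)
```

Both routines do the same computation; the only difference is when results become visible to the caller.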

Demo Video: http://tinyurl.com/UTOPIAN2013

Usage Scenario: Hyundai Genesis Review Data

[Figure panels: initial result; after interaction]

Summary

• Presented UTOPIAN, a user-driven topic modeling system based on interactive NMF.
• Highlighted the advantages of NMF over LDA in visual analytics.
  • Reliable algorithmic behaviors
    • Consistency from multiple runs
    • Early empirical convergence
  • Flexible support for user interactions
    • Keyword refinement
    • Topic merging/splitting
    • Keyword-/document-induced topic creation

More in the paper & On-going Work

• A general taxonomy of user interactions with computational methods
  • Keyword-based vs. document-based
  • Template-based vs. from-scratch-based
• Algorithmic details about supported user interactions
• Implementation details
• More usage scenarios

On-going Work
• Scaling up the system with parallel distributed NMF

Thank you!

http://tinyurl.com/UTOPIAN2013

For more details, please find me at ‘Meet the Candidate’, A601 + A602, 6 PM today.

Jaegul Choo
jaegul.choo@cc.gatech.edu
http://www.cc.gatech.edu/~joyfull/
