CS B :: Oct 2009 Text Visualization · 10/28/2009 · 1 CS448B:: 28 Oct 2009 Text Visualization Jeffrey Heer Stanford University Final Project Project: Visualizing a Century of Change

1

CS448B :: 28 Oct 2009

Text Visualization

Jeffrey Heer Stanford University

Final Project

Project: Visualizing a Century of Change in the American West and Generating Social Sense-Making through Media Partnerships

f d d f d hData: 100 years of Census data, data from iMapData.com, and other sources.

Description: Explore a century of change in the American West by visualizing 100 years of data on the some of most compelling, urgent, contemporary issues — economy, health and health care, environment, education, religion, communications, energy, and transportation. Stimulate public discussion by revealing patterns and processes and generate widespread public participation in social sense-making by publishing the results with our media partners, including The New York Times, Wall Street Journal, Washington Post and Marketplace on National Public Radio Work in a collaborativeWashington Post, and Marketplace on National Public Radio. Work in a collaborative setting with a ready-made team of scholars at the Bill Lane Center for the American West (Jon Christensen, David Kennedy, Michael DeAlessi), the Communication Department's Journalism Program (Ann Grimes, Fred Turner), Knight Journalism Fellowship (Krissy Clark), and Tooling Up for Digital Histories graduate student researchers (Yuankai Ge, Dan Chang). Study the results as a real-world experiment in data visualization, social sense-making, and human-computer interactions.

http://ruralwest.stanford.edu/cgi-bin/RuralWestPop/

2

Final ProjectDesign a new visualization technique or system

Implementation of new design or systemImplementation of new design or system8-10 page paper in conference paper format2 Project Presentations

ScheduleProject Proposal: Wednesday, November 4 (end of day)Initial Presentation: Monday, November 9 & Wednesday, November 11Poster Presentation Wednesday December 2 (tentative)Poster Presentation: Wednesday, December 2 (tentative)Final Papers: Monday, December 7 (by noon)

LogisticsGroups of up to 3 people, graded individuallyClearly report responsibilities of each member

Assign 3: Interactive Visualization

Create an interactive visualization application. Choose a data domain and select an appropriate visualization techniquedomain and select an appropriate visualization technique.

1. Choose a data set and storyboard your interface2. Implement the interface using tools of your choice3. Submit your application and produce a final write-up

You may work individually or in groups of 2.y y g pDue by end of day on Wednesday, October 28

Text Visualization Why visualize text?

3

Visualization PopularityService

Many‐Eyes SwivelTag Cloud

Bubble GraphWord TreeBar Chart

MapsNetwork Diagram

TreemapMatrix ChartLine GraphScatterplot

Stacked GraphPie Chart

Over 1/3 of Many-Eyes visualizations use free text

0.0 0.1 0.2 0.3 0.4 0.5

Percentage

0.0 0.1 0.2 0.3 0.4 0.5

Percentage

Pie ChartHistogram

What is text data?

DocumentsA ti l b k d lArticles, books and novelsComputer programsE-mails, web pages, blogsTags, comments

Collection of documentsMessages (e-mail, blogs, tags, comments)Social networks (personal profiles)Academic collaborations (publications)

Challenge: Visualize Dissertations

You have 20 years of university Ph.D. theses:Text, Year, Dept., Author, Advisor, Committee

What questions might you want to answer?What visualizations might help?

A Concrete Example

What would help you gauge…

The topics in the document?Whether or not you should read it?yIts relationship to other documents?

4

Tag Cloud: Word Counts Word Tree: Word Sequences

PhraseNet: “A the B”MIMIR – Jason Chuang

5

Topics

Text as DataVisualizing Document ContentEvolving DocumentsVisualizing ConversationDocument Collections

Text as Data

Words are (not) nominal?

High dimensional (10,000+)More than equality testsWords have meanings and relations

Correlations: Hong Kong, San Francisco, Bay Area

Order: April, February, January, June, March, May

M b hiMembership: Tennis, Running, Swimming, Hiking, Piano

Hierarchy, antonyms & synonyms, entities, …

Text Processing Pipeline

Tokenization: segment text into termsSpecial cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?

Stemming: one means of normalizing termsReduce terms to their “root”; Porter’s algorithm for Englishe g automate(s) automatic automation all map to automate.g., automate(s), automatic, automation all map to automatFor visualization, want to reverse stemming for labels

Simple solution: map from stem to the most frequent word

Result: ordered stream of terms

6

The Bag of Words Model

Ignore ordering relationships within the text

A document ≈ vector of term weightsEach dimension corresponds to a term (10,000+)Each value represents the relevance

For example, simple term counts

Aggregate into a document x term matrixDocument vector space model

Document x Term matrix

Each document is a vector of term weightsSimplest weighting is to just count occurrences

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

WordCount (Harris 2004)

http://wordcount.org

7

Weaknesses of Tag Clouds

Sub-optimal visual encoding (size vs. position)Inaccurate size encoding (long words are bigger)May not facilitate comparison (unstable layout)Term frequency may not be meaningfulDoes not show the structure of the text

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in dCan take log frequency: log(1 + tftd)Can normalize to show proportion: tftd / Σt tftd

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in d

TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs

8

Keyword Weighting

Term Frequencytf t(t) i dtftd = count(t) in d

TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs

G2: Probability of different word frequencyG2: Probability of different word frequencyE1 = |d| × (tftd + tft(C-d)) / |C|E2 = |C-d| × (tftd + tft(C-d)) / |C|G2 = 2 × (tftd log(tftd/E1) + tft(C-d) log(tft(C-d)/E2))

Doc. Similarity & Clustering

In vector model, compute distance among docsFor TF.IDF, typically cosine distanceSimilarity measure can be used to cluster

Topic modeling approachesAssume documents are a mixture of topicsTopics are (roughly) a set of co-occurring termsLatent Semantic Analysis (LSA): reduce term matrixLatent Dirichlet Allocation (LDA): statistical model

9

Visualizing Document Content

TileBars [Hearst]

10

Visual Thesaurus [ThinkMap]

Concordance

What is the common local context of a term?

11

WordTree (Wattenberg et al)

Filter infrequent runs Recurrent themes in speech

12

Glimpses of structure

Concordances show local, repeated structureBut what about other types of patterns?

For example Lexical: <A> at <B> Syntactic: <Noun> <Verb> <Object>Syntactic: <Noun> <Verb> <Object>

Phrase Nets [van Ham et al]

Look for specific linking patterns in the text:‘A and B’, ‘A at B’, ‘A of B’, etcCould be output of regexp or parser.

Visualize extracted patterns in a node-link viewOccurrences Node sizePattern position Edge direction

Portrait of the Artist as a Young ManX and Y

13

Node Grouping The BibleX begat Y

Pride & PrejudiceX at YLexical Parser, < 1sec running time

Pride & PrejudiceX at YSyntactic Parser, > 24 hours running time

14

18th & 19th Century NovelsX’s Y

X of Y

X of Y

Evolving Documents

15

Visualizing Revision History

How to depict contributions over time?

Example: Wikipedia history log

Animated Traces (Ben Fry)

http://benfry.com/traces/

16

Diff History Flow (Viégas et al)

Wikipedia History Flow (IBM)

17

Visualizing Conversation

Visualizing Conversation

Many dimensions to consider:Who (senders, receivers)What (the content of communication)When (temporal patterns)

Interesting cross-products:What x When Topic “Zeitgeist”Who x Who Social networkWho x Who x What x When Information flow

Usenet Visualization (Viégas & Smith)

Show correspondence patterns in text forumsInitiate vs. reply; size and duration of discussion

18

Newsgroup crowds / Authorlines

Mountain (Viégas)

Conversation by person over time (who x when).

Themail (Viégas et al)

One person over time, TF.IDF weighted terms

19

Enron E-Mail Corpus

Washington Lobbyist ?Washington Lobbyist ? Visualizing Document Collections

20

NewsMap: Google News Treemap (Marcos Weskamp)

10 x 10 News Map (Harris 2004)

Named Entity Recognition

Identify and classify named entities in text:John Smith PERSONSoviet Union COUNTRY353 Serra St ADDRESS(555) 721-4312 PHONE NUMBER

Entity relations: how do the entities relate?Simple approach: do they co-occur in a small window of text?

21

ThemeRiver (Havre et al 99)

0.003

0.0035

0.004

Statistical Machine Learning in Pubmed

supervised machine learning

probabilistic reasoning

mcmc

dimensionality / kernels

l i / i il i

0.001

0.0015

0.002

0.0025

clustering / similarity

bayesian learning

0

0.0005

0.001

1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008

Track topic strengths over time

Parallel Tag Clouds (Collins et al 09) Natural Science and Engineering(Ellingham 1948)

Manual analysis, ylayout, and labeling

TreemapCentral disciplinesSub-disciplinesTopics

H. J. T. Ellingham 1948Royal Society Scientific Information Conference

Topics

22

Geography of Science (Small et al. 1985)

Citation analysisHierarchy clusteringLevels of detailsEdge labeling

Henry Small, Eugene Garfield 1985The geography of science, Journal of Information Science

23

Scholarly Communication (Sandstrom 2001)

Hand-picked authors, citation analysisClustering by authorsManual cluster & dimension labeling

Pamela Sandstrom 2001Scholarly communication as a socio-ecological system, Scientometrics

Pamela Sandstrom 2001Scholarly communication as a socio-ecological system, Scientometrics

Melanoma Research (Boyack 2004)

Papers, genes, proteinsLandmark labeling

(manual)

Kevin Boyack, Ketan Mane, Katy Börner 2004Mapping medline papers, genes, and proteins related to

Melanoma research, IV2004 Conference

24

Scholarly Impact (Rosvall et al. 2007)

M. Rosvall, C. T. Bergstrom 2007Map of random walks on complex network

reveal community structure, PNAS

MIMIR – Jason Chuang