Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
1
CS448B :: 28 Oct 2009
Text Visualization
Jeffrey Heer Stanford University
Final Project
Project: Visualizing a Century of Change in the American West and Generating Social Sense-Making through Media Partnerships
f d d f d hData: 100 years of Census data, data from iMapData.com, and other sources.
Description: Explore a century of change in the American West by visualizing 100 years of data on the some of most compelling, urgent, contemporary issues — economy, health and health care, environment, education, religion, communications, energy, and transportation. Stimulate public discussion by revealing patterns and processes and generate widespread public participation in social sense-making by publishing the results with our media partners, including The New York Times, Wall Street Journal, Washington Post and Marketplace on National Public Radio Work in a collaborativeWashington Post, and Marketplace on National Public Radio. Work in a collaborative setting with a ready-made team of scholars at the Bill Lane Center for the American West (Jon Christensen, David Kennedy, Michael DeAlessi), the Communication Department's Journalism Program (Ann Grimes, Fred Turner), Knight Journalism Fellowship (Krissy Clark), and Tooling Up for Digital Histories graduate student researchers (Yuankai Ge, Dan Chang). Study the results as a real-world experiment in data visualization, social sense-making, and human-computer interactions.
http://ruralwest.stanford.edu/cgi-bin/RuralWestPop/
2
Final ProjectDesign a new visualization technique or system
Implementation of new design or systemImplementation of new design or system8-10 page paper in conference paper format2 Project Presentations
ScheduleProject Proposal: Wednesday, November 4 (end of day)Initial Presentation: Monday, November 9 & Wednesday, November 11Poster Presentation Wednesday December 2 (tentative)Poster Presentation: Wednesday, December 2 (tentative)Final Papers: Monday, December 7 (by noon)
LogisticsGroups of up to 3 people, graded individuallyClearly report responsibilities of each member
Assign 3: Interactive Visualization
Create an interactive visualization application. Choose a data domain and select an appropriate visualization techniquedomain and select an appropriate visualization technique.
1. Choose a data set and storyboard your interface2. Implement the interface using tools of your choice3. Submit your application and produce a final write-up
You may work individually or in groups of 2.y y g pDue by end of day on Wednesday, October 28
Text Visualization Why visualize text?
3
Visualization PopularityService
Many‐Eyes SwivelTag Cloud
Bubble GraphWord TreeBar Chart
MapsNetwork Diagram
TreemapMatrix ChartLine GraphScatterplot
Stacked GraphPie Chart
Over 1/3 of Many-Eyes visualizations use free text
0.0 0.1 0.2 0.3 0.4 0.5
Percentage
0.0 0.1 0.2 0.3 0.4 0.5
Percentage
Pie ChartHistogram
What is text data?
DocumentsA ti l b k d lArticles, books and novelsComputer programsE-mails, web pages, blogsTags, comments
Collection of documentsMessages (e-mail, blogs, tags, comments)Social networks (personal profiles)Academic collaborations (publications)
Challenge: Visualize Dissertations
You have 20 years of university Ph.D. theses:Text, Year, Dept., Author, Advisor, Committee
What questions might you want to answer?What visualizations might help?
A Concrete Example
What would help you gauge…
The topics in the document?Whether or not you should read it?yIts relationship to other documents?
4
Tag Cloud: Word Counts Word Tree: Word Sequences
PhraseNet: “A the B”MIMIR – Jason Chuang
5
Topics
Text as DataVisualizing Document ContentEvolving DocumentsVisualizing ConversationDocument Collections
Text as Data
Words are (not) nominal?
High dimensional (10,000+)More than equality testsWords have meanings and relations
Correlations: Hong Kong, San Francisco, Bay Area
Order: April, February, January, June, March, May
M b hiMembership: Tennis, Running, Swimming, Hiking, Piano
Hierarchy, antonyms & synonyms, entities, …
Text Processing Pipeline
Tokenization: segment text into termsSpecial cases? e.g., “San Francisco”, “L’ensemble”, “U.S.A.”Remove stop words? e.g., “a”, “an”, “the”, “to”, “be”?
Stemming: one means of normalizing termsReduce terms to their “root”; Porter’s algorithm for Englishe g automate(s) automatic automation all map to automate.g., automate(s), automatic, automation all map to automatFor visualization, want to reverse stemming for labels
Simple solution: map from stem to the most frequent word
Result: ordered stream of terms
6
The Bag of Words Model
Ignore ordering relationships within the text
A document ≈ vector of term weightsEach dimension corresponds to a term (10,000+)Each value represents the relevance
For example, simple term counts
Aggregate into a document x term matrixDocument vector space model
Document x Term matrix
Each document is a vector of term weightsSimplest weighting is to just count occurrences
Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
WordCount (Harris 2004)
http://wordcount.org
7
Weaknesses of Tag Clouds
Sub-optimal visual encoding (size vs. position)Inaccurate size encoding (long words are bigger)May not facilitate comparison (unstable layout)Term frequency may not be meaningfulDoes not show the structure of the text
Keyword Weighting
Term Frequencytf t(t) i dtftd = count(t) in dCan take log frequency: log(1 + tftd)Can normalize to show proportion: tftd / Σt tftd
Keyword Weighting
Term Frequencytf t(t) i dtftd = count(t) in d
TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs
8
Keyword Weighting
Term Frequencytf t(t) i dtftd = count(t) in d
TF.IDF: Term Freq by Inverse Document Freqtf.idftd = log(1 + tftd) × log(N/dft)dft = # docs containing t; N = # of docs
G2: Probability of different word frequencyG2: Probability of different word frequencyE1 = |d| × (tftd + tft(C-d)) / |C|E2 = |C-d| × (tftd + tft(C-d)) / |C|G2 = 2 × (tftd log(tftd/E1) + tft(C-d) log(tft(C-d)/E2))
Doc. Similarity & Clustering
In vector model, compute distance among docsFor TF.IDF, typically cosine distanceSimilarity measure can be used to cluster
Topic modeling approachesAssume documents are a mixture of topicsTopics are (roughly) a set of co-occurring termsLatent Semantic Analysis (LSA): reduce term matrixLatent Dirichlet Allocation (LDA): statistical model
9
Visualizing Document Content
TileBars [Hearst]
10
Visual Thesaurus [ThinkMap]
Concordance
What is the common local context of a term?
11
WordTree (Wattenberg et al)
Filter infrequent runs Recurrent themes in speech
12
Glimpses of structure
Concordances show local, repeated structureBut what about other types of patterns?
For example Lexical: <A> at <B> Syntactic: <Noun> <Verb> <Object>Syntactic: <Noun> <Verb> <Object>
Phrase Nets [van Ham et al]
Look for specific linking patterns in the text:‘A and B’, ‘A at B’, ‘A of B’, etcCould be output of regexp or parser.
Visualize extracted patterns in a node-link viewOccurrences Node sizePattern position Edge direction
Portrait of the Artist as a Young ManX and Y
13
Node Grouping The BibleX begat Y
Pride & PrejudiceX at YLexical Parser, < 1sec running time
Pride & PrejudiceX at YSyntactic Parser, > 24 hours running time
14
18th & 19th Century NovelsX’s Y
X of Y
X of Y
Evolving Documents
15
Visualizing Revision History
How to depict contributions over time?
Example: Wikipedia history log
Animated Traces (Ben Fry)
http://benfry.com/traces/
16
Diff History Flow (Viégas et al)
Wikipedia History Flow (IBM)
17
Visualizing Conversation
Visualizing Conversation
Many dimensions to consider:Who (senders, receivers)What (the content of communication)When (temporal patterns)
Interesting cross-products:What x When Topic “Zeitgeist”Who x Who Social networkWho x Who x What x When Information flow
Usenet Visualization (Viégas & Smith)
Show correspondence patterns in text forumsInitiate vs. reply; size and duration of discussion
18
Newsgroup crowds / Authorlines
Mountain (Viégas)
Conversation by person over time (who x when).
Themail (Viégas et al)
One person over time, TF.IDF weighted terms
19
Enron E-Mail Corpus
Washington Lobbyist ?Washington Lobbyist ? Visualizing Document Collections
20
NewsMap: Google News Treemap (Marcos Weskamp)
10 x 10 News Map (Harris 2004)
Named Entity Recognition
Identify and classify named entities in text:John Smith PERSONSoviet Union COUNTRY353 Serra St ADDRESS(555) 721-4312 PHONE NUMBER
Entity relations: how do the entities relate?Simple approach: do they co-occur in a small window of text?
21
ThemeRiver (Havre et al 99)
0.003
0.0035
0.004
Statistical Machine Learning in Pubmed
supervised machine learning
probabilistic reasoning
mcmc
dimensionality / kernels
l i / i il i
0.001
0.0015
0.002
0.0025
clustering / similarity
bayesian learning
0
0.0005
0.001
1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
Track topic strengths over time
Parallel Tag Clouds (Collins et al 09) Natural Science and Engineering(Ellingham 1948)
Manual analysis, ylayout, and labeling
TreemapCentral disciplinesSub-disciplinesTopics
H. J. T. Ellingham 1948Royal Society Scientific Information Conference
Topics
22
Geography of Science (Small et al. 1985)
Citation analysisHierarchy clusteringLevels of detailsEdge labeling
Henry Small, Eugene Garfield 1985The geography of science, Journal of Information Science
23
Scholarly Communication (Sandstrom 2001)
Hand-picked authors, citation analysisClustering by authorsManual cluster & dimension labeling
Pamela Sandstrom 2001Scholarly communication as a socio-ecological system, Scientometrics
Pamela Sandstrom 2001Scholarly communication as a socio-ecological system, Scientometrics
Melanoma Research (Boyack 2004)
Papers, genes, proteinsLandmark labeling
(manual)
Kevin Boyack, Ketan Mane, Katy Börner 2004Mapping medline papers, genes, and proteins related to
Melanoma research, IV2004 Conference
24
Scholarly Impact (Rosvall et al. 2007)
M. Rosvall, C. T. Bergstrom 2007Map of random walks on complex network
reveal community structure, PNAS
MIMIR – Jason Chuang