WHAT TO EXPECT WHEN YOU ARE VISUALIZING
Krist Wongsuphasawat / @kristw
Based on true stories
Forever querying
Never-ending cleaning
Hopelessly prototyping
Last minute coding
and many more…
Computer Engineer Bangkok, Thailand
Chulalongkorn University
Programming + Soccer
(P.S. These are actually not my robots, but our competitors’.)
PhD in Computer Science Information Visualization Univ. of Maryland
IBM, Microsoft
Data Visualization Scientist Twitter
#interactive visualizations
Open-source projects
Visual Analytics Tools
DATA + ME = VIS
Me
clients, data, requirements, etc.
WHAT TO EXPECT?
1. EXPECT POTENTIAL MISMATCHES
INPUT (DATA)
What clients think they have
What they usually have
YOU
What clients think you are
What they will get
OUTPUT (VIS)
What clients ask for
What they really need
COMMUNICATE
I need this. Take this.
I need this. Here you are.
I need this. Take this.
& COMPROMISE
2. EXPECT DIFFERENT REQUIREMENTS
DIFFERENT GOALS
Present: Communicate information effectively
Explore: Exploratory analysis, reusable tools for exploration
Explore + Present: Analyze data + tell story
Enjoy: More flexible
3. EXPECT TO CLEAN DATA
DATA SOURCES
Open data: Publicly available
Internal data: Private, owned by clients’ organization
Self-collected data: Manual, site scraping, etc.
Combine the above
MANY FORMS OF DATA
Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*
APIs: better quality with more overhead
Databases: doesn’t necessarily mean they are organized
Big data: bigger pain
HAVING ALL TWEETS
How people think I feel.
How I really feel.
CHALLENGES
Get relevant Tweets: hashtag: #oscars, keywords: “spotlight” (movie name)
Too big: Need to aggregate & reduce size
Slow: Long processing time (hours)
GETTING BIG DATA
Hadoop Cluster: Data Storage -> Tool: Pig / Scalding (slow) -> Smaller dataset
Your laptop: Smaller dataset -> Tool: node.js / python / excel (fast) -> Final dataset
CLEANING
Data come in different formats: tsv to json
Quality of data collection: null, missing data, typos, timestamp
Filter: Remove unnecessary data
Conversion:
  Change country code from 3-letter (USA) to 2-letter (US)
  Correct time of day based on users’ timezone
  Convert lat/lon to county
etc.
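A minimal sketch of the cleaning steps listed above (format conversion, filtering missing values, country code conversion). The column names `country` and `count`, the tiny lookup table, and the sample rows are assumptions made up for illustration; a real project would load a complete ISO 3166 mapping.

```javascript
// Hypothetical lookup table (3-letter -> 2-letter); deliberately incomplete.
const ALPHA3_TO_ALPHA2 = { USA: 'US', THA: 'TH', GBR: 'GB' };

// Format conversion: parse tab-separated text (first row = header) into objects.
function tsvToJson(text) {
  const lines = text.split('\n').filter(line => line.length > 0);
  const headers = lines[0].split('\t');
  return lines.slice(1).map(line => {
    const cells = line.split('\t');
    const row = {};
    headers.forEach((h, i) => { row[h] = cells[i] === undefined ? '' : cells[i]; });
    return row;
  });
}

// Filter out rows with null/missing values, then convert the country code.
function clean(rows) {
  return rows
    .filter(r => r.country && r.country !== 'null' && r.count !== '')
    .map(r => ({
      country: ALPHA3_TO_ALPHA2[r.country.toUpperCase()] || null,
      count: Number(r.count),
    }))
    .filter(r => r.country !== null && !Number.isNaN(r.count));
}

// Made-up sample: one lowercase code, one "null" country, one missing count.
const sampleTsv = 'country\tcount\nUSA\t10\ntha\t5\nnull\t3\nGBR\t\n';
console.log(clean(tsvToJson(sampleTsv)));
```

Keeping parsing and cleaning as separate small functions makes it easier to re-run only the step that changed.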
4. EXPECT TO CLEAN DATA A LOT
70-80% of time cleaning data
“DATA JANITOR”
WHY?
Definition of “clean” depends on the task, e.g. restaurant reviews
USER  RESTAURANT   RATING
========================
A     MCDONALD’S   3
B     MCDONALDS    3
C     MCDONALD     4
D     MCDONALDS    5
E     IHOP         4
F     SUBWAY       4
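To illustrate why “clean” depends on the task: counting reviews per user works on the table as-is, but an average rating per restaurant needs the three McDonald’s spellings merged first. The sketch below assumes a hand-built alias table (`ALIASES` is hypothetical) and is only one way to normalize names.

```javascript
// The rows from the restaurant review table.
const reviews = [
  { user: 'A', restaurant: 'MCDONALD’S', rating: 3 },
  { user: 'B', restaurant: 'MCDONALDS', rating: 3 },
  { user: 'C', restaurant: 'MCDONALD', rating: 4 },
  { user: 'D', restaurant: 'MCDONALDS', rating: 5 },
  { user: 'E', restaurant: 'IHOP', rating: 4 },
  { user: 'F', restaurant: 'SUBWAY', rating: 4 },
];

// Hypothetical alias table, written by hand after inspecting the data.
const ALIASES = { MCDONALD: 'MCDONALDS' };

// Uppercase, strip straight and curly apostrophes, then apply aliases.
function normalizeName(name) {
  const key = name.toUpperCase().replace(/[’']/g, '').trim();
  return ALIASES[key] || key;
}

// Average rating per normalized restaurant name.
function averageRatingByRestaurant(rows) {
  const totals = new Map();
  for (const { restaurant, rating } of rows) {
    const key = normalizeName(restaurant);
    const t = totals.get(key) || { sum: 0, n: 0 };
    t.sum += rating;
    t.n += 1;
    totals.set(key, t);
  }
  const result = {};
  for (const [key, t] of totals) result[key] = t.sum / t.n;
  return result;
}

console.log(averageRatingByRestaurant(reviews));
```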
Data issues can present themselves anytime in the project timeline.
RAMSAY & RAMSEY
It takes time to process data. Run. Wait… Oops! Re-run. Wait…
RECOMMENDATIONS
Always think that you will have to do it again: document the process, automate
Reusable scripts: break a gigantic do-it-all function into smaller ones
Reusable data: keep for future projects
5. EXPECT TO TRY AND BREAK THINGS
https://twitter.com/hashtag/d3brokeandmadeart
#D3BROKEANDMADEART
6. EXPECT TO ITERATE UNTIL IT WORKS
7. EXPECT DEADLINE
EXAMPLE PROJECTS
EXAMPLE 1: STORYTELLING
WHAT TO EXPECT
timely: Deadline is strict. There can also be unexpected events.
wide audience: easy to explain and understand, multi-device support
one-off projects
content screening
Reveal the talking points of every episode of Game of Thrones
from fans’ conversations
Problem is coming.
CHAPTER I
Problem
Want to know what the audience says about a TV show
from Tweets
HBO’s Game of Thrones
Based on a book series “A Song of Ice and Fire” Medieval Fantasy. Knights, magic and dragons.
Brief Story
A King dies.
A lot of contenders wage a war to reclaim the throne.
Minor characters with no claim to the throne set their own plans in action to gain power
when all the major characters end up killing each other.
Brave/Honest/Honorable characters die.
Intelligent but shady characters and characters who know nothing
continue to live.
While humans are busy killing each other, ice zombies “White walkers” are invading from the North.
The only group who seems to care about this is a neutral group called the Night’s Watch.
HBO’s Game of Thrones
Many characters. Anybody can die.
6 seasons (60 episodes) so far
Multiple storylines in each episode
Ideas
Common words: Too much noise
Characters: How often was each character mentioned?
I demand a trial by prototyping.
CHAPTER II
Prototyping
Pull sample data from Twitter API
Entity recognition and counting: naive approach
List of names
Daenerys Targaryen,Khaleesi
Jon Snow
Sansa Stark
Tyrion Lannister
Arya Stark
Cersei Lannister
Khal Drogo
Gregor Clegane,Mountain
Margaery Tyrell
Joffrey Baratheon
Bran Stark
Theon Greyjoy
Jaime Lannister
Brienne
Eddard Stark,Ned Stark
Ramsay Bolton
Sandor Clegane,Hound
Ygritte
Stannis Baratheon
Petyr Baelish,Little Finger
Robb Stark
Bronn
Varys
Catelyn Stark
Oberyn Martell
Daario Naharis
Davos Seaworth
Jorah Mormont
Melisandre
Myrcella Baratheon
Tywin Lannister
Tommen Baratheon
Grey Worm
Tyene Sand
Rickon Stark
Missandei
Roose Bolton
Robert Baratheon
Jojen Reed
Jeor Mormont
Tormund Giantsbane
Lysa Arryn
Yara Greyjoy,Asha Greyjoy
Samwell Tarly,Sam
Hodor
Victarion Greyjoy
High Sparrow
Dragon
Winter
Dothraki
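A minimal sketch of the naive entity recognition and counting, assuming each list entry is a canonical name followed by optional comma-separated aliases and that matching is a case-insensitive substring test. The function names and the sample tweets are made up for illustration and are not the original scripts.

```javascript
// A few entries in the same "Name,Alias" format as the list of names.
const NAME_LIST = [
  'Daenerys Targaryen,Khaleesi',
  'Jon Snow',
  'Eddard Stark,Ned Stark',
  'Hodor',
];

// Turn each entry into { name, patterns }, with lowercase patterns to search for.
function buildEntities(lines) {
  return lines.map(line => {
    const aliases = line.split(',').map(s => s.trim());
    return { name: aliases[0], patterns: aliases.map(a => a.toLowerCase()) };
  });
}

// For each entity, count the tweets that mention any of its aliases.
// A tweet counts once per entity even if the name appears several times.
function countMentions(tweets, entities) {
  const counts = {};
  entities.forEach(e => { counts[e.name] = 0; });
  for (const tweet of tweets) {
    const text = tweet.toLowerCase();
    for (const e of entities) {
      if (e.patterns.some(p => text.includes(p))) counts[e.name] += 1;
    }
  }
  return counts;
}

// Made-up tweets, not real data.
const sampleTweets = [
  'Hodor! Hodor!',
  'Jon Snow knows nothing',
  'Khaleesi and Jon Snow should meet',
  'ned stark deserved better',
];
console.log(countMentions(sampleTweets, buildEntities(NAME_LIST)));
```

Substring matching is what makes this naive: a tweet that only says “Daenerys” is missed unless that spelling is added as an alias, and a short alias such as “Sam” would also match inside other words.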
Sample Tweet
Sample data
Character    Count
==================
Hodor        10000
Jon Snow     5000
Daenerys     4000
Bran Stark   3000
…            …
*These numbers are made up for presentation, not real data.
When you play the game of vis, you iterate or you die.
CHAPTER III
Where to go from here?
+ episodes
The Guardian & Google Trends
http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
+ emotion
+ connections
Gain insights from a single episode emotion & connections
Sample data

INDIVIDUALS (+ top emojis)
Character    Count
==================
Hodor        10000
Jon Snow     5000
Daenerys     4000
…            …

CONNECTIONS (+ top emojis)
Characters        Count
=======================
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …
*These numbers are made up for presentation, not real data.
Graph

NODES (+ top emojis)
Character    Count
==================
Hodor        1000
Jon Snow     500
Daenerys     400
…            …

LINKS (+ top emojis)
Characters        Count
=======================
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …
*These numbers are made up for presentation, not real data.
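A sketch of how per-tweet mentions could be turned into the nodes and links above, assuming each tweet has already been reduced to the list of characters it mentions. The input shape and the `id` / `source` / `target` / `count` field names are assumptions for illustration.

```javascript
// mentionsPerTweet: one array of character names per tweet.
function buildGraph(mentionsPerTweet) {
  const nodeCounts = new Map();
  const linkCounts = new Map();
  for (const names of mentionsPerTweet) {
    // De-duplicate and sort so that A+B and B+A share one key.
    const unique = [...new Set(names)].sort();
    unique.forEach(n => nodeCounts.set(n, (nodeCounts.get(n) || 0) + 1));
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = unique[i] + '\t' + unique[j];
        linkCounts.set(key, (linkCounts.get(key) || 0) + 1);
      }
    }
  }
  const nodes = [...nodeCounts].map(([id, count]) => ({ id, count }));
  const links = [...linkCounts].map(([key, count]) => {
    const [source, target] = key.split('\t');
    return { source, target, count };
  });
  return { nodes, links };
}

// Made-up sample, not real data.
const sampleGraph = buildGraph([
  ['Jon Snow', 'Sansa'],
  ['Sansa', 'Jon Snow'],
  ['Hodor'],
  ['Bran Stark', 'Hodor'],
]);
console.log(sampleGraph.nodes, sampleGraph.links);
```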
Network Visualization
Node-link diagram
Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
Issue: Hairball
Why? Too many nodes & edges
nodes = nodes.filter(n => n.count > 100)
links = links.filter(l => l.count > 100)
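When nodes and links are filtered independently, a link can survive while one of its endpoints is dropped, for example if the two thresholds differ. The sketch below also removes such dangling links; it assumes links refer to nodes by an `id` string in `source` and `target`, which is an assumption and not necessarily how the original data is shaped.

```javascript
// Keep nodes/links above their thresholds, and drop links whose endpoints were removed.
function filterGraph(nodes, links, minNodeCount, minLinkCount) {
  const keptNodes = nodes.filter(n => n.count > minNodeCount);
  const keptIds = new Set(keptNodes.map(n => n.id));
  const keptLinks = links.filter(l =>
    l.count > minLinkCount && keptIds.has(l.source) && keptIds.has(l.target));
  return { nodes: keptNodes, links: keptLinks };
}

// Made-up sample, not real data.
const filtered = filterGraph(
  [{ id: 'a', count: 500 }, { id: 'b', count: 150 }, { id: 'c', count: 50 }],
  [
    { source: 'a', target: 'b', count: 120 },
    { source: 'a', target: 'c', count: 40 },
    { source: 'b', target: 'c', count: 130 },
  ],
  100,
  100
);
console.log(filtered);
```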
The force is (too) strong.
force
  .charge(…)
  .gravity(…)
  .linkDistance(…)
  .linkStrength(…)
Issue: Occlusions
Tried: Fixed positions
+ Collision Detection
http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
+ Collision Detection (with clusters)
https://bl.ocks.org/mbostock/7881887
Tormund + Brienne
Issue: Convex hull
http://bl.ocks.org/mbostock/4341699
d3.geom.hull(vertices)
x & y only, no radius
Example
Fix it
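The slides do not show how the hull was fixed. One possible approach, sketched here as an assumption, is to replace each circle with sample points on its circumference and take the convex hull of those points, so that the radius is accounted for. The hull routine below is a plain Andrew's monotone chain so the example runs without d3; `hullAroundCircles`, `padding`, and `samples` are hypothetical names.

```javascript
// Cross product of OA x OB; positive when O -> A -> B turns counter-clockwise.
function cross(o, a, b) {
  return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
}

// Convex hull of [x, y] pairs (Andrew's monotone chain).
function convexHull(points) {
  const pts = points.slice().sort((p, q) => p[0] - q[0] || p[1] - q[1]);
  if (pts.length < 3) return pts;
  const lower = [];
  for (const p of pts) {
    while (lower.length >= 2 &&
           cross(lower[lower.length - 2], lower[lower.length - 1], p) <= 0) lower.pop();
    lower.push(p);
  }
  const upper = [];
  for (let i = pts.length - 1; i >= 0; i--) {
    const p = pts[i];
    while (upper.length >= 2 &&
           cross(upper[upper.length - 2], upper[upper.length - 1], p) <= 0) upper.pop();
    upper.push(p);
  }
  // The last point of each half is the first point of the other half.
  lower.pop();
  upper.pop();
  return lower.concat(upper);
}

// Hull that encloses whole circles ({ x, y, r }), not just their centers.
function hullAroundCircles(circles, padding = 0, samples = 16) {
  const points = [];
  for (const { x, y, r } of circles) {
    for (let i = 0; i < samples; i++) {
      const angle = (2 * Math.PI * i) / samples;
      points.push([x + (r + padding) * Math.cos(angle), y + (r + padding) * Math.sin(angle)]);
    }
  }
  return convexHull(points);
}

console.log(hullAroundCircles([{ x: 0, y: 0, r: 10 }, { x: 100, y: 0, r: 5 }]).length);
```

More samples per circle give a smoother outline at the cost of a longer path.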
Let’s get other episodes.
Hadoop remembers.
CHAPTER IV
More data
Hadoop
Rewrite the scripts in Scalding to get archived data
How much data do we need?
Whole week?
5 days?
2 days?
A day?
etc.
Transitions
not so smooth
After switching episode
1. Store old positions for existing objects.
2. Assign new initial positions.*
3. Run simulation without updating <svg> for n rounds.
4. Animate objects from old to new positions.
5. Resume simulation and update <svg> every tick.

* Initial positions
  Default: random
  Better starting points: heuristics based on degree of nodes
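Steps 1 to 4 can be sketched independently of the rendering code. This is an illustration under stated assumptions, not the original implementation: `simulation` is any object with a `tick()` method that updates `node.x` / `node.y` in place (as d3 force layouts do), and `initialPosition` is a hypothetical callback that encodes the starting-point heuristic.

```javascript
function prepareTransition(oldNodes, newNodes, simulation, rounds, initialPosition) {
  // 1. Store old positions for objects that already exist.
  const oldPositions = new Map(oldNodes.map(n => [n.id, { x: n.x, y: n.y }]));

  // 2. Assign new initial positions (random, degree-based, ...).
  newNodes.forEach(n => {
    const start = initialPosition(n);
    n.x = start.x;
    n.y = start.y;
  });

  // 3. Run the simulation for n rounds without touching the <svg>.
  for (let i = 0; i < rounds; i++) simulation.tick();

  // 4. Return old -> new pairs; the caller animates between them.
  //    `from` is null for nodes that did not exist before.
  return newNodes.map(n => ({
    id: n.id,
    from: oldPositions.get(n.id) || null,
    to: { x: n.x, y: n.y },
  }));
}

// Usage with a stand-in simulation that nudges every node on each tick.
const episodeNodes = [{ id: 'Jon Snow' }, { id: 'Sansa' }];
const standInSimulation = {
  tick() { episodeNodes.forEach(n => { n.x += 1; n.y += 2; }); },
};
console.log(prepareTransition(
  [{ id: 'Jon Snow', x: 5, y: 5 }],
  episodeNodes,
  standInSimulation,
  3,
  () => ({ x: 10, y: 20 })
));
```

Step 5, resuming the simulation and updating the `<svg>` on every tick, stays in the rendering code once the animation has finished.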
Animate Nodes & Links
Remove
delay
Move & Change size/thickness
Add new
// Create selection
const selection = svg.selectAll('g.node')
  .data(nodes, d => d.entity.id);

// Fade “exit” nodes to opacity 0 and remove
selection.exit()
  .transition()
  .duration(1000)
  .style('opacity', 0)
  .remove();

// Add “enter” nodes with opacity 0
const sEnter = selection.enter().append('g')
  .classed('node', true)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 0)
  .call(force.drag);

sEnter.append('circle')
  .attr('r', d => d.r)
  .style('fill', d => options.colorScale(d.entity.group));

// After 1s delay, use transition to move nodes and fade in new nodes
const sTrans = selection.transition()
  .delay(1000)
  .duration(2000)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 1);

sTrans.select('circle')
  .attr('r', d => d.r);
Animate Communities
Remove
delay
Move & Change shape*
Add new
http://blockbuilder.org/kristw/f9ffe87dd8b4038b5867e853c27cebb7
Default
t=0 t=1
Smoother
t=0 t=1 t=0.5 t=0.51
Code
// original
path.attr('d', hull);
// with custom interpolation
path.attrTween('d', (d, i, currentAttr) => interpolateHull(d, currentAttr));
Colors
Default: d3.scale.category10(). Distinct but nothing about the context
Custom palette: Colors related to the groups/houses.
Black = Night’s Watch, Blue = North, Red = Daenerys, Gold = Lannister, …
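A custom palette can be as small as a lookup with a fallback. The sketch below only includes the four groups named above; the exact color values and the gray fallback are assumptions.

```javascript
// Groups and colors named on the slide; the color strings are illustrative.
const HOUSE_COLORS = {
  "Night's Watch": 'black',
  North: 'blue',
  Daenerys: 'red',
  Lannister: 'gold',
};

// Hypothetical fallback for groups without an assigned color.
const FALLBACK_COLOR = 'gray';

// Can be used wherever a color scale function is expected.
function colorForGroup(group) {
  return Object.prototype.hasOwnProperty.call(HOUSE_COLORS, group)
    ? HOUSE_COLORS[group]
    : FALLBACK_COLOR;
}

console.log(colorForGroup('Lannister'), colorForGroup('Unknown group'));
```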
Hold the vis.
CHAPTER V
The vis is not enough.
Legend
Navigation
Top 3
Adjust threshold
Recap
Filtered Recap
Tooltip
Demo
https://interactive.twitter.com/game-of-thrones
Mobile Support
A visualizer always evaluates his work.
CHAPTER VI
“Feedback is the breakfast of champions.”
— Ken Blanchard
Self & Peer
Does it solve the problem?
Google Analytics
Pageviews
Visitors
Actions
Referrals Sites/Social
Feedback
Feedback
EXAMPLE 2: VISUAL ANALYTICS TOOLS
Data sources -> get -> explore -> analyze -> present -> Output
* ad-hoc scripts
* tools for exploration
WHAT TO EXPECT
richer, more features: to support exploration of complex data
more technical audience: product managers, engineers, data scientists
accuracy
designed for dynamic input
long-term projects
USER ACTIVITY LOGS
Users: Use Twitter
Product Managers: Curious
Engineers: Write Twitter, Instrument
Log data in Hadoop
WHAT IS BEING LOGGED?
activities
tweet from home timeline on twitter.com
tweet from search page on iPhone
sign up
log in
retweet
etc.
ORGANIZE?
LOG EVENT A.K.A. “CLIENT EVENT”
client : page : section : component : element : action
web : home : timeline : tweet_box : button : tweet
1) User ID
2) Timestamp
3) Event name
4) Event detail
[Lee et al. 2012]
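A small sketch of splitting a client event name into its six levels. It assumes the name is a plain colon-separated string with exactly six parts, as in the example above; `parseEventName` is a hypothetical helper, not part of any Twitter library.

```javascript
const LEVELS = ['client', 'page', 'section', 'component', 'element', 'action'];

// 'web:home:timeline:tweet_box:button:tweet' -> { client: 'web', page: 'home', ... }
function parseEventName(name) {
  const parts = name.split(':').map(s => s.trim());
  if (parts.length !== LEVELS.length) {
    throw new Error(`Expected ${LEVELS.length} levels, got ${parts.length}: ${name}`);
  }
  const event = {};
  LEVELS.forEach((level, i) => { event[level] = parts[i]; });
  return event;
}

console.log(parseEventName('web:home:timeline:tweet_box:button:tweet'));
```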
LOG DATA
bigger than Tweet data
Users: Use
Engineers: Write, Instrument
Log data in Hadoop
Product Managers: Curious, Ask
Data Scientists: Find, Clean, Analyze
Monitor
Scribe Radar
Project / Find & Monitor client events
GOALS
Search for client events
Explore client event collection
Monitor changes
CLIENT EVENT HIERARCHY
iphone home -
- - impression
tweet tweet click
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
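The hierarchy can be built by splitting each event name on `:` and adding its count to every node along the path. This is a sketch under assumptions: the input shape (`name`, `count`) and the counts in the sample are made up for illustration.

```javascript
// Build a tree in which each node's count is the sum over the events beneath it.
function buildHierarchy(events) {
  const root = { name: 'root', count: 0, children: new Map() };
  for (const { name, count } of events) {
    let node = root;
    node.count += count;
    for (const part of name.split(':')) {
      if (!node.children.has(part)) {
        node.children.set(part, { name: part, count: 0, children: new Map() });
      }
      node = node.children.get(part);
      node.count += count;
    }
  }
  return root;
}

// Made-up counts, not real data.
const eventTree = buildHierarchy([
  { name: 'iphone:home:-:-:-:impression', count: 100 },
  { name: 'iphone:home:-:tweet:tweet:click', count: 20 },
]);
const homeSection = eventTree.children.get('iphone').children.get('home').children.get('-');
console.log(homeSection.count, [...homeSection.children.keys()]);
```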
DETECT CHANGES
TODAY compared to 7 DAYS AGO
iphone home -
- - impression
tweet tweet click
CALCULATE CHANGES
+5% +5% +5%
+10% +10% +10%
-5% -5% -5%
DIFF
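A sketch of the diff calculation: the percent change of each event's count today relative to the same event 7 days ago. The plain-object input format and the treatment of events that did not exist a week ago (reported as `Infinity`) are assumptions.

```javascript
// Percent change from weekAgo to today; Infinity when there is no baseline.
function percentChange(today, weekAgo) {
  if (weekAgo === 0) return today === 0 ? 0 : Infinity;
  return ((today - weekAgo) * 100) / weekAgo;
}

// Both arguments map event name -> count; events missing on one side count as 0.
function diffCounts(todayCounts, weekAgoCounts) {
  const names = new Set([...Object.keys(todayCounts), ...Object.keys(weekAgoCounts)]);
  const diff = {};
  for (const name of names) {
    diff[name] = percentChange(todayCounts[name] || 0, weekAgoCounts[name] || 0);
  }
  return diff;
}

// Made-up counts, not real data.
console.log(diffCounts(
  { 'iphone:home:-:-:-:impression': 110, 'iphone:home:-:tweet:tweet:click': 95 },
  { 'iphone:home:-:-:-:impression': 100, 'iphone:home:-:tweet:tweet:click': 100 }
));
```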
DISPLAY CHANGES
iphone home -
- - impression
tweet tweet click
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
DISPLAY CHANGES
home -
- - impression
tweet tweet click
iphone
Demo / Scribe Radar
Twitter for Banana
WORKFLOW
Requested / Identify needs
Design & Prototype: Make it work for sample dataset
Refine & Generalize
Productionize
Document & Release
Maintain & Support: Keep it running, feature requests & bug fixes
8. EXPECT TO REFINE AND POLISH
REFINE & POLISH
UX / UI
Color
Animation
Mobile support
Performance: Loading time, data file size
“The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/
9. EXPECT TO GET FEEDBACK
FEEDBACK
Logging
User study
Forum, User group
Office hours
10. EXPECT TO IMPROVE
HOW TO BE BETTER?
Time is limited.
Grow the team
Expand skills
Improve tooling: Solve a problem once and for all
Automate repetitive tasks
http://twitter.github.io/labella.js
Demo / Labella.js
https://github.com/twitter/d3kit
Demo / d3Kit
http://www.slideshare.net/kristw/d3kit
yeoman.io
Demo / Yeoman
SUMMARY
INPUT YOU OUTPUT
EXPECT
1) potential mismatches
2) different requirements
3) to clean data
4) to clean data a lot
5) to try and break things
6) to iterate until it works
7) deadline
8) to refine and polish
9) to get feedback
10) to improve
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
#VOTE
ACKNOWLEDGEMENT
Nicolas Garcia Belmonte, Robert Harris, Miguel Rios, Simon Rogers, Jimmy Lin, Linus Lee, Chuang Liu,
and many colleagues at Twitter.
RESOURCES
Images
Banana phone http://goo.gl/GmcMPq
Bar chart https://goo.gl/1G1GBg
Boss https://goo.gl/gcY8Kw
Champions League http://goo.gl/DjtNKE
Database http://goo.gl/5N7zZz
Fishing shark http://goo.gl/2fp4zW
Globe visualization http://goo.gl/UiGMMj
Harry Potter http://goo.gl/Q9Cy64
Holding phone http://goo.gl/It2TzH
Kiwi orange http://goo.gl/ejQ73y
Kiwi http://goo.gl/9yk7o5
Library https://goo.gl/HVeE6h
Library earthquake http://goo.gl/rBqBrs
Minion http://goo.gl/I19Ijg
NBA http://goo.gl/p7HBdG
NFL http://goo.gl/feQMZs
Orange & Apple http://goo.gl/NG6RIL
Pile of paper http://goo.gl/mGLQTx
Premier League http://goo.gl/AqIINO
Scrooge McDuck https://goo.gl/aKv8D7
The Sound of Music https://goo.gl/dqHlzj
Trash pile http://goo.gl/OsFfo3
Tyrion http://goo.gl/WaBonl
Watercolor Map by Stamen Design
THANK YOU
QUESTIONS?