Krist Wongsuphasawat / @kristw
6 THINGS TO EXPECT WHEN YOU ARE VISUALIZING
Computer Engineer, Bangkok, Thailand
Chulalongkorn University
Programming + Soccer
(P.S. These are actually not my robots, but our competitors’.)
PhD in Computer Science, Information Visualization, Univ. of Maryland
IBM, Microsoft
Data Visualization Scientist, Twitter
#interactive visualizations
Open-source projects
Visual Analytics Tools
ME + DATA = VIS
Data, I’m ready!
Here I come!
WHAT TO EXPECT?
1. EXPECT TO FIND THE REAL NEED
INPUT (DATA): What clients think they have vs. what they usually have
YOU: What clients think you are vs. what they will get
OUTPUT (VIS): What clients ask for vs. what they really need
COMMUNICATE
GOALS
Present data: communicate information effectively
Analyze data: exploratory data analysis
Tools to analyze data: reusable tools for exploration
Enjoy
Combination of the above
Who is the audience? What do you want to tell them?
What are the questions?
Who will use this? What would they use it for?
I need this. Take this.
I need this. Here you are.
& COMPROMISE
2. EXPECT TO CLEAN DATA A LOT
70-80% of time cleaning data
“DATA JANITOR”
Collect + Clean + Transform
DATA WRANGLING
WHY DOES IT TAKE SO MUCH TIME?
2.1 Many sources and data formats
DATA SOURCES
Open data: publicly available
Internal data: private, owned by the clients’ organization
Self-collected data: manual, site scraping, etc.
Combination of the above
DATA FORMATS
Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*
Databases: doesn’t necessarily mean they are organized
API: better quality with more overhead
Website
Big data*
NEED TO…
Change format, e.g. tsv => json
Combine data
Resolve multiple sources of truth
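A format change such as tsv => json can be a few lines of script. The sketch below is one possible way to do it with Python's standard library; the sample rows and the function name `tsv_to_json` are made up for illustration, and every value stays a string because a TSV file carries no type information.

```python
import csv
import io
import json

# Hypothetical sample; a real script would read the text from a file.
SAMPLE_TSV = "user\trestaurant\trating\nA\tIHOP\t4\nB\tSUBWAY\t4\n"


def tsv_to_json(tsv_text: str) -> str:
    """Parse tab-separated text (first row = header) into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [dict(row) for row in reader]
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(tsv_to_json(SAMPLE_TSV))
```

Converting numeric columns (e.g. rating) is left as a separate, explicit step so that bad values surface instead of being silently coerced.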
2.2 Data transformation is needed.
EXAMPLES
Convert latitude/longitude into zip code
Change country code from 3-letter (USA) to 2-letter (US)
Correct time of day based on users’ timezone
etc.
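Two of the transformations above can be sketched in a few lines. This is only an illustration under stated assumptions: the lookup table `ALPHA3_TO_ALPHA2` is a hypothetical three-entry excerpt rather than a full country list, and each user's UTC offset is assumed to be known already (a fixed offset ignores daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical excerpt; a real project would load a complete country-code table.
ALPHA3_TO_ALPHA2 = {"USA": "US", "THA": "TH", "GBR": "GB"}


def to_alpha2(code: str) -> str:
    """Map a 3-letter country code to its 2-letter form; KeyError on unknown codes."""
    return ALPHA3_TO_ALPHA2[code.strip().upper()]


def local_hour(utc_timestamp: datetime, utc_offset_hours: float) -> int:
    """Return the hour of day in the user's timezone.

    Assumes the timestamp is timezone-aware (UTC) and that the user's
    UTC offset in hours is known.
    """
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_timestamp.astimezone(user_tz).hour


if __name__ == "__main__":
    ts = datetime(2016, 4, 24, 1, 30, tzinfo=timezone.utc)
    print(to_alpha2("USA"), local_hour(ts, 7), local_hour(ts, -4))
```

Raising on unknown country codes, instead of passing them through, makes collection issues visible early.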
2.3 Data collection issues
EXAMPLES
Typos
Incorrect values
Incorrect timestamps
Missing data
2.4 Definition of “clean” data
IS THIS CLEAN?
USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
How many reviews are there? Clean.
How many restaurants are there? Not clean. McDonald, McDonald’s, McDonalds
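Whether the table is "clean" depends on the question, and a short script makes that concrete. The sketch below uses the six rows from the slide; the `ALIASES` table and the `normalize` helper are hypothetical and would need to be curated by hand for real data.

```python
import re

# The six rows from the slide: (user, restaurant, rating).
REVIEWS = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical alias table for variants that punctuation removal cannot fix.
ALIASES = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop apostrophes, then apply the manual alias table."""
    key = re.sub(r"[’']", "", name.strip().upper())
    return ALIASES.get(key, key)


review_count = len(REVIEWS)                                  # fine without cleaning
raw_restaurants = {name for _, name, _ in REVIEWS}           # overcounts
clean_restaurants = {normalize(name) for _, name, _ in REVIEWS}

if __name__ == "__main__":
    print(review_count, len(raw_restaurants), len(clean_restaurants))
```

Counting reviews needs no cleaning at all, while counting restaurants gives 5 on the raw names and 3 after normalization.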
2.5 Bigger data, bigger problems
HAVING ALL TWEETS
How people think I feel. How I really feel.
GETTING BIG DATA
Hadoop Cluster: Data Storage => Tool: Scalding (slow) => Smaller dataset
Your laptop: Smaller dataset => Tool: node.js / python / excel (fast) => Final dataset
CHALLENGES
Slow: long processing time (hours)
Get relevant Tweets: hashtag: #oscars; keywords: “moonlight” (movie name)
Too big: need to aggregate & reduce size
Harder to spot problems
RAMSAY & RAMSEY
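The "get relevant Tweets" and "aggregate & reduce size" steps can be sketched on a laptop-sized sample. The records, field names and the `is_relevant` / `count_by_hour` helpers below are hypothetical; on the cluster the same logic would be expressed in the cluster's own tool.

```python
import re
from collections import Counter

# Hypothetical sample records; real input would be far larger.
TWEETS = [
    {"hour": "2017-02-27T01", "text": "Moonlight wins best picture! #Oscars"},
    {"hour": "2017-02-27T01", "text": "What a night #oscars"},
    {"hour": "2017-02-27T02", "text": "Just had dinner"},
    {"hour": "2017-02-27T02", "text": "Still thinking about Moonlight"},
]
HASHTAGS = {"#oscars"}
KEYWORDS = {"moonlight"}


def is_relevant(text: str) -> bool:
    """Keep a Tweet if it has a tracked hashtag or contains a tracked keyword."""
    lowered = text.lower()
    hashtags = set(re.findall(r"#\w+", lowered))
    return bool(hashtags & HASHTAGS) or any(k in lowered for k in KEYWORDS)


def count_by_hour(tweets) -> Counter:
    """Aggregate relevant Tweets into one count per hour to shrink the dataset."""
    return Counter(t["hour"] for t in tweets if is_relevant(t["text"]))


if __name__ == "__main__":
    print(count_by_hour(TWEETS))
```

Keyword matching is noisy ("moonlight" is also an ordinary word), and once the data is aggregated such problems are harder to spot, so it helps to inspect a sample of matched Tweets before aggregating.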
2.6 New issues can show up any time.
RECOMMENDATIONS
Always think that you will have to do it again: document the process, automate
Reusable scripts: break a gigantic do-it-all function into smaller ones
Reusable data: keep for future projects
3. EXPECT TRIALS AND ERRORS
It’s gonna be legen-
Celebrate your trials
#D3BrokeAndMadeArt
When your vis starts working
“Necessity is the mother of invention.”
— English Proverb
DEADLINE
EXAMPLE PROJECTS
PROJECT 1:
GAME OF THRONES #INTERACTIVE
INTERACTIVE.TWITTER.COM
WHAT TO EXPECT
Timely: the deadline is strict, and there can also be unexpected events
Wide audience: easy to explain and understand, multi-device support
One-off project
Scope: analyze data to find stories and find the best way to present them
Reveal the talking points of every episode of Game of Thrones from fans’ conversations
CHAPTER I: Problem is coming.
Problem
Want to know what the audience says about a TV show, from Tweets
HBO’s Game of Thrones
Based on the book series “A Song of Ice and Fire”. Medieval fantasy: knights, magic and dragons.
Brief Story
A King dies.
A lot of contenders wage a war to reclaim the throne.
Minor characters with no claim to the throne set their own plans in action to gain power when all the major characters end up killing each other.
Brave/Honest/Honorable characters die.
Intelligent but shady characters and characters who know nothing continue to live.
While humans are busy killing each other, ice zombies, the “White Walkers”, are invading from the North.
The only group that seems to care about this is a neutral group called the Night’s Watch.
Many characters. Anybody can die.
6 seasons (60 episodes) so far
Multiple storylines in each episode
Ideas
Common words: too much noise
Characters: how often was each character mentioned?
CHAPTER II: I demand a trial by prototyping.
Prototyping
Pull sample data from Twitter API
Entity recognition and counting: naive approach
List of names
Daenerys Targaryen,Khaleesi
Jon Snow
Sansa Stark
Tyrion Lannister
Arya Stark
Cersei Lannister
Khal Drogo
Gregor Clegane,Mountain
Margaery Tyrell
Joffrey Baratheon
Bran Stark
Theon Greyjoy
Jaime Lannister
Brienne
Eddard Stark,Ned Stark
Ramsay Bolton
Sandor Clegane,Hound
Ygritte
Stannis Baratheon
Petyr Baelish,Little Finger
Robb Stark
Bronn
Varys
Catelyn Stark
Oberyn Martell
Daario Naharis
Davos Seaworth
Jorah Mormont
Melisandre
Myrcella Baratheon
Tywin Lannister
Tommen Baratheon
Grey Worm
Tyene Sand
Rickon Stark
Missandei
Roose Bolton
Robert Baratheon
Jojen Reed
Jeor Mormont
Tormund Giantsbane
Lysa Arryn
Yara Greyjoy,Asha Greyjoy
Samwell Tarly,Sam
Hodor
Victarion Greyjoy
High Sparrow
Dragon
Winter
Dothraki
Sample Tweet
Sample data
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
Bran Stark 3000
… …
*These numbers are made up for presentation, not real data.
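The naive approach can be as simple as substring matching against the list of names. The sketch below is an assumption about what such a prototype might look like, not the script used in the project: `NAME_LIST` is a short excerpt in the slide's "name,alias" format, the Tweets are invented, and each character is counted at most once per Tweet.

```python
from collections import Counter

# A few entries in the slide's "canonical name,alias" format.
NAME_LIST = """Daenerys Targaryen,Khaleesi
Jon Snow
Eddard Stark,Ned Stark
Samwell Tarly,Sam
Hodor"""

# Hypothetical Tweets for illustration only.
SAMPLE_TWEETS = [
    "Hodor! Hold the door :(",
    "Khaleesi and Jon Snow should meet",
    "Ned Stark deserved better",
    "hodor hodor hodor",
]


def parse_names(text: str) -> dict:
    """Map every lowercase alias to the canonical name (first entry on its line)."""
    alias_to_name = {}
    for line in text.splitlines():
        aliases = [a.strip() for a in line.split(",") if a.strip()]
        for alias in aliases:
            alias_to_name[alias.lower()] = aliases[0]
    return alias_to_name


def count_mentions(tweets, alias_to_name) -> Counter:
    """Count, per character, the number of Tweets that contain any of its aliases."""
    counts = Counter()
    for tweet in tweets:
        lowered = tweet.lower()
        mentioned = {name for alias, name in alias_to_name.items() if alias in lowered}
        counts.update(mentioned)
    return counts


if __name__ == "__main__":
    print(count_mentions(SAMPLE_TWEETS, parse_names(NAME_LIST)).most_common())
```

Plain substring matching is what makes this naive: a short alias such as "Sam" also matches "same", so word-boundary matching would be a likely next iteration.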
CHAPTER III: When you play the game of vis, you iterate or you die.
Where to go from here?
+ episodes
The Guardian & Google Trends
http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
+ emotion
+ connections
Gain insights from a single episode: emotion & connections
Sample data
INDIVIDUALS (+ top emojis)
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
… …
CONNECTIONS (+ top emojis)
Character Count
Jon Snow+Sansa 1000
Tormund+Brienne 500
Bran Stark+Hodor 300
… …
*These numbers are made up for presentation, not real data.
Graph
NODES = individuals (+ top emojis)
LINKS = connections (+ top emojis)
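Individual counts become nodes and pairwise counts become links. A minimal sketch of that step is shown below, assuming the entity-recognition step has already produced the set of characters mentioned in each Tweet; `MENTIONS_PER_TWEET` and `build_graph` are hypothetical names and the data is invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of characters recognised in each Tweet.
MENTIONS_PER_TWEET = [
    {"Jon Snow", "Sansa Stark"},
    {"Jon Snow", "Sansa Stark", "Ramsay Bolton"},
    {"Hodor"},
    {"Bran Stark", "Hodor"},
]


def build_graph(mentions_per_tweet):
    """Return (nodes, links): per-character counts and per-pair co-mention counts."""
    nodes = Counter()
    links = Counter()
    for mentioned in mentions_per_tweet:
        nodes.update(mentioned)
        # Sort so that (A, B) and (B, A) land on the same key.
        for pair in combinations(sorted(mentioned), 2):
            links[pair] += 1
    return nodes, links


if __name__ == "__main__":
    nodes, links = build_graph(MENTIONS_PER_TWEET)
    print(nodes.most_common())
    print(links.most_common())
```

The two counters can then be serialized as the node and link arrays that a node-link diagram expects.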
Network Visualization
Node-link diagram
Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384
Issue: Hairball
Issue: Occlusions
Tried: Fixed positions
+ Collision Detection
http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
+ Collision Detection (with clusters)
https://bl.ocks.org/mbostock/7881887
Tormund + Brienne
x & y only, no radius
Example
Fix it
Let’s get other episodes.
CHAPTER IV: Hadoop remembers.
More data
Hadoop
Rewrite the scripts in Scalding to get archived data
How much data do we need?
Whole week?
5 days?
2 days?
A day?
etc.
Transitions
Changing episode
After switching episode
1. Store old positions for existing characters.
2. Assign positions for new characters.
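The two steps can be sketched as a single function that carries positions over between episodes. How new characters are placed is not stated on the slide; the random placement below is an assumption, and `next_positions` is a hypothetical name.

```python
import random


def next_positions(old_positions, characters, width, height, rng=random):
    """Keep (x, y) for characters seen before; place new characters at random.

    Random placement for new characters is an assumption for this sketch.
    """
    positions = {}
    for name in characters:
        if name in old_positions:
            positions[name] = old_positions[name]          # step 1: reuse
        else:
            positions[name] = (rng.uniform(0, width),      # step 2: assign
                               rng.uniform(0, height))
    return positions


if __name__ == "__main__":
    old = {"Jon Snow": (10.0, 20.0), "Hodor": (30.0, 40.0)}
    print(next_positions(old, ["Jon Snow", "Sansa Stark"], 100, 100))
```

Reusing old positions keeps returning characters visually stable, so the layout only has to settle the newcomers.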
Community transition
t=0 t=1
Smoother
t=0 t=0.5 t=0.51 t=1
Colors
Default: D3 category10. Distinct, but says nothing about the context.
Custom palette: colors related to the groups/houses.
Black = Night’s Watch, Blue = North, Red = Daenerys, Gold = Lannister, …
CHAPTER V: Hold the vis.
The vis is not enough.
Legend
Navigation
Top 3
Adjust threshold
Recap
Filtered Recap
Tooltip
Demo: https://interactive.twitter.com/game-of-thrones
Mobile Support
CHAPTER VI: A visualizer always evaluates his work.
Self & Peer
Does it solve the problem?
Google Analytics
Pageviews
Visitors
Actions
Referrals: Sites/Social
Feedback
PROJECT 2:
VISUAL ANALYTICS TOOLS FOR LOGGING
WHAT TO EXPECT
Richer: more features to support exploration of complex data
More technical audience: product managers, engineers, data scientists
Accuracy
Designed for dynamic input
Long-term projects
Data sources => get => explore => analyze => present => Output
* ad-hoc scripts
* tools for exploration
USER ACTIVITY LOGS
[Diagram: Users use Twitter. Engineers write Twitter and instrument it, so log data ends up in Hadoop. Product Managers are curious.]
WHAT IS BEING LOGGED?
Activities
tweet from home timeline on twitter.com
tweet from search page on iPhone
sign up
log in
retweet
etc.
ORGANIZE?
LOG EVENT A.K.A. “CLIENT EVENT” [Lee et al. 2012]
1) User ID 2) Timestamp 3) Event name 4) Event detail
Event name: client : page : section : component : element : action
Example: web : home : timeline : tweet_box : button : tweet
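The six-part event name is easy to handle as a small typed record. The sketch below parses the slide's example; the `ClientEvent` class and `parse_event_name` function are illustrative names, not part of any logging library.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    """The six levels of a client event name, as listed on the slide."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name: str) -> ClientEvent:
    """Split 'client:page:section:component:element:action' into its six parts."""
    parts = [p.strip() for p in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 parts, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


if __name__ == "__main__":
    print(parse_event_name("web : home : timeline : tweet_box : button : tweet"))
```

Rejecting names that do not have exactly six parts is a cheap way to catch malformed events before they reach a chart.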
LOG DATA
bigger than Tweet data
[Diagram, built up over several slides. Actors: Users, Engineers, Product Managers (curious), Data Scientists. Actions: Use, Write, Instrument, Ask, Monitor, Find, Clean, Analyze. Log data is stored in Hadoop.]
Event: client : page : section : component : element : action
50,000+ event types
one graph / event x 50,000
DESIGN
Engineers & Data Scientists: see the client event collection
Interactions: search box => filter (narrow down)
HOW TO VISUALIZE?
client : page : section : component : element : action
CLIENT EVENT HIERARCHY
iphone > home > - > ( - > - > impression | tweet > tweet > click )
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
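Because event names share prefixes, a collection of them folds naturally into a tree. The sketch below builds such a tree from the two example names, summing counts at every level; the counts and the nested-dict layout are assumptions made for illustration.

```python
def build_tree(event_counts):
    """Fold 'a:b:c:...' event names into a nested tree with summed counts.

    Each node is {"count": int, "children": {part: node}}.
    """
    root = {"count": 0, "children": {}}
    for name, count in event_counts.items():
        node = root
        node["count"] += count
        for part in name.split(":"):
            node = node["children"].setdefault(part, {"count": 0, "children": {}})
            node["count"] += count
    return root


if __name__ == "__main__":
    # Hypothetical counts for the two event names on the slide.
    counts = {
        "iphone:home:-:-:-:impression": 100,
        "iphone:home:-:tweet:tweet:click": 20,
    }
    tree = build_tree(counts)
    print(tree["children"]["iphone"]["count"])
```

Both events share the iphone > home > - prefix, so they branch only at the component level.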
DETECT CHANGES
TODAY compared to 7 DAYS AGO (same event hierarchy for both)
CALCULATE CHANGES
DIFF: percentage change for each node, e.g. +5%, +10%, -5%
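The comparison itself is a percentage change per event between the two snapshots. The sketch below is a minimal version under stated assumptions: the counts are invented, `percent_change` and `diff_counts` are hypothetical names, and an event with no count 7 days ago is reported as `None` rather than as an infinite increase.

```python
def percent_change(today, week_ago):
    """Percentage change from week_ago to today; None when there is no baseline."""
    if week_ago == 0:
        return None
    return (today - week_ago) * 100.0 / week_ago


def diff_counts(today_counts, week_ago_counts):
    """Compute the percentage change for every event present in either snapshot."""
    names = set(today_counts) | set(week_ago_counts)
    return {
        name: percent_change(today_counts.get(name, 0), week_ago_counts.get(name, 0))
        for name in names
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    today = {"iphone:home:-:-:-:impression": 105, "iphone:home:-:tweet:tweet:click": 95}
    week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 100}
    print(diff_counts(today, week_ago))
```

Comparing against the same weekday one week earlier avoids mistaking the normal weekly rhythm for a change.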
DISPLAY CHANGES
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
Demo / Scribe Radar
Twitter for Banana
PROJECT 3:
VISUAL ANALYTICS TOOLS FOR EXPERIMENTATION
A/B TESTING
RUN AN EXPERIMENT
Develop feature
Track metrics: 1. No. of Tweets read 2. No. of Tweets sent 3. No. of Users 4. …
Set bucket size: how many users?
RETROSPECTIVE ANALYSIS
Data scientist analyzed 100+ past experiments.
Many useful insights.
- We could move metric A by X% on average.
- Experiment 18 moved metric A the most.
- Which team was able to move metric A successfully?
- etc.
Amount of knowledge transfer = slide deck + wiki page.
Reproduce for recent experiments? Manually.
Make results more accessible and convenient to use.
Automatic
Metric Mover: I like to move it, move it
Krist Wongsuphasawat, Joseph Liu, Matthew Schreiner, Andy Schlaikjer, Lucile Lu and Busheng Lou
Process
Set OKRs => Implement a feature => Setup experiment => Run experiment => Interpret results (e.g. # of posts: +1.0%)
Challenges
How easy/hard is it to move this metric? How much change to aim for?
How much to expect from one experiment? What were the successful features? Who had experience with this?
How good is this?
Past experiments => Metric Mover
[Chart: Metric: No. of Posts. Rows: Exp. 1 to Exp. 4 with their buckets. Labels: Control buckets, Insignificant buckets. Axes: % change (-2% to 2%) and |scaled impact| (100; 10,000; 1,000,000; 100,000,000). Example groups: Users who watch cat GIFs, Users who like cat GIFs, Users who post cat GIFs.]
**These are fake data.**
WORKFLOW
Identify needs
Design and prototype: make it work for sample dataset
Refine, generalize and productionize: make it work for other cases
Document and release
Maintain and support: keep it running, feature requests & bug fixes
What separates good and great work
4. EXPECT TIME FOR REFINEMENT
REFINE & POLISH
UX / UI + Mobile Support
Color
Animation / Transition
Performance: loading time, data file size
“The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/
“The first 90% of the code accounts for the first 90% of the development time.
The remaining 10% of the code accounts for the other 90% of the development time.”
— Tom Cargill, Bell Labs
or find ways to get some
5. EXPECT FEEDBACK
“Feedback is the breakfast of champions.”
— Ken Blanchard
FEEDBACK
During development: feedback sessions with clients/potential users
After release: logging, user study, forum, user group, office hours
6. EXPECT TO IMPROVE
HOW TO BE BETTER?
Time is limited.
Learn from the past
Expand skills
Get help / Grow the team
Improve tooling: solve a problem once and for all
Automate repetitive tasks
https://github.com/twitter/d3kit
Demo / d3Kit: http://www.slideshare.net/kristw/d3kit
SUMMARY
EXPECT…
1. to find the real need
2. to clean data a lot
3. trials and errors
4. time for refinement
5. feedback
6. to improve
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
THANK YOU
QUESTIONS?
ACKNOWLEDGEMENT
My colleagues at Twitter for their collaboration and support in these projects;
and my wife for taking care of the baby while I make these slides.
RESOURCES
Images
Banana phone http://goo.gl/GmcMPq
Bar chart https://goo.gl/1G1GBg
Boss https://goo.gl/gcY8Kw
Champions League http://goo.gl/DjtNKE
Database http://goo.gl/5N7zZz
Fishing shark http://goo.gl/2fp4zW
Frustrated programmer https://goo.gl/ZLDNny
Globe visualization http://goo.gl/UiGMMj
Harry Potter http://goo.gl/Q9Cy64
Holding phone http://goo.gl/It2TzH
Jon Snow https://goo.gl/CACWxE
Jon Snow lightsaber https://goo.gl/CJt1Tn
Kiwi orange http://goo.gl/ejQ73y
Kiwi http://goo.gl/9yk7o5
Library https://goo.gl/HVeE6h
Library earthquake http://goo.gl/rBqBrs
Minion http://goo.gl/I19Ijg
Nemo https://goo.gl/m0pmzC
Orange & Apple http://goo.gl/NG6RIL
Pile of paper http://goo.gl/mGLQTx
Scrooge McDuck https://goo.gl/aKv8D7
Trash pile http://goo.gl/OsFfo3
Watercolor Map by Stamen Design
Yes GIF https://goo.gl/agvlAE