Information Visualization in Data Mining
S.T. Balke
Department of Chemical Engineering and Applied Chemistry
University of Toronto
Motivation
Data visualization
– relies primarily on human cognition for value discovery;
– permits direct incorporation of human ingenuity and analytic capabilities into data mining;
– can very effectively deal with very large quantities of data;
– powerfully combines with machine-based discovery techniques.
Uses
Explorative Analysis
– Data cleaning
– Provide hypotheses
Confirmative Analysis
– Confirm or reject hypotheses
Presentation
– Communicate your work
http://www.alz.washington.edu/DATA2001/GERALD1/sld011.htm
Calculated Properties of the Anscombe Data Sets
mean of the x values = 9.0
mean of the y values = 7.5
equation of the least-squares regression line: y = 3 + 0.5x
sum of squared deviations of x about its mean = 110.0
Calculated Properties of the Anscombe Data Sets
regression sums of squared errors (variance accounted for by x) = 27.5
residual sums of squared errors (about the regression line) = 13.75
correlation coefficient = 0.82
coefficient of determination = 0.67
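The properties listed above can be checked with a short Python sketch. The data values are the standard published Anscombe quartet, which the slides themselves do not reproduce; the `ANSCOMBE` dictionary and the `summarize` helper are illustrative names, not part of the original material.

```python
from math import sqrt

# Standard published Anscombe quartet (not reproduced on the slides).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics listed on the slide for one data set."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)          # x about its mean
    ss_y = sum((yi - mean_y) ** 2 for yi in y)          # y about its mean
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / ss_x
    intercept = mean_y - slope * mean_x
    ss_regression = slope * s_xy                        # accounted for by x
    ss_residual = ss_y - ss_regression                  # about the fitted line
    r = s_xy / sqrt(ss_x * ss_y)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "slope": slope,
        "intercept": intercept,
        "ss_x": ss_x,
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "r": r,
        "r_squared": r * r,
    }


for name, (x, y) in ANSCOMBE.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four data sets give nearly the same summary numbers, which is why plotting them matters: the statistics alone do not distinguish them.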
The Anscombe Data
Marey, 1885
Snow’s Cholera Map, 1855
http://pupgg.princeton.edu/disk20/anonymous/groth/lick/licknorth.gif
Graphical Excellence
Graphical displays should:
– show the data
– induce the viewer to think about the substance, not the methodology
– avoid distorting what the data says
– present many numbers in a small space
– make large data sets coherent
– encourage the eye to compare different pieces of data
– reveal the data at several levels of detail (broad overview to fine structure)
– serve a reasonably clear purpose: description, exploration, tabulation, or decoration
– be closely integrated with the statistical and verbal descriptions of the data set.
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Graphical Excellence
– Gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
– Nearly always multivariate.
– Requires telling the truth about the data.
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Lie Factor=14.8
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
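Lie Factor
\[
\text{Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}
\]
\[
\text{Lie Factor} = \frac{\dfrac{5.3 - 0.6}{0.6} \times 100}{\dfrac{27.5 - 18.0}{18} \times 100} = 14.8
\]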
Require: 0.95<Lie Factor<1.05
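As a rough check of the arithmetic, the lie factor can be computed in a few lines of Python. The function names are illustrative only. The inputs are the numbers from the slide's example: 0.6 and 5.3 are the sizes drawn in the graphic, and 18.0 and 27.5 are the corresponding data values.

```python
def percent_change(first, second):
    """Relative change from first to second, in percent."""
    return (second - first) / first * 100.0


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    return (percent_change(graphic_first, graphic_second)
            / percent_change(data_first, data_second))


def is_acceptable(factor, low=0.95, high=1.05):
    """Apply the slide's requirement 0.95 < Lie Factor < 1.05."""
    return low < factor < high


factor = lie_factor(0.6, 5.3, 18.0, 27.5)
print(round(factor, 1))        # 14.8
print(is_acceptable(factor))   # False
```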
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Using Area for One-Dimensional Data
Lie Factor=2.8
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
More guidelines:
– The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
– No legends: use labels on the graph.
– Graphics must not quote data out of context.
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Data Ink Ratio
\[
\text{Data ink ratio} = \frac{\text{data ink}}{\text{total ink used to print the graphic}}
\]
Data ink ratio = proportion of a graphic’s ink devoted to the non-redundant display of data-information.
Data ink ratio = 1.0 - (proportion of a graphic that can be erased without loss of data-information)
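The two equivalent definitions of the data-ink ratio can be sketched in Python as follows. The ink amounts in the example are hypothetical, chosen only to show that both forms agree. Any consistent unit works, for example pixel counts.

```python
def data_ink_ratio(data_ink, total_ink):
    """Share of the graphic's ink that displays non-redundant data-information."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    return data_ink / total_ink


def data_ink_ratio_from_erasable(erasable_proportion):
    """Same ratio, from the proportion that can be erased without losing data."""
    return 1.0 - erasable_proportion


# Hypothetical graphic: 30 units of ink carry data, out of 120 units in total.
print(data_ink_ratio(30, 120))                 # 0.25
print(data_ink_ratio_from_erasable(90 / 120))  # 0.25
```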
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Maximize Data Density
\[
\text{data density of a graphic} = \frac{\text{number of entries in the data matrix}}{\text{area of data graphic}}
\]
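A minimal Python sketch of the data density calculation follows. The matrix shape and graphic dimensions are hypothetical, and the area unit (square inches here) is an assumption; the slide does not specify one.

```python
def data_density(rows, columns, width, height):
    """Entries in the data matrix per unit area of the data graphic."""
    area = width * height
    if area <= 0:
        raise ValueError("graphic area must be positive")
    return (rows * columns) / area


# Hypothetical example: 11 observations of 2 variables in a 4 x 3 inch plot.
density = data_density(rows=11, columns=2, width=4.0, height=3.0)
print(f"{density:.2f} entries per square inch")
```

Shrinking the graphic or plotting more of the data matrix both raise the density.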
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
Beware Chartjunk
NO:
“Isn’t it remarkable that the computer can be programmed to draw like that.”
YES:
“My, what interesting data!”
(E.R. Tufte, “The Visual Display of Quantitative Information”, 2nd edition)
How to Say Nothing with Information Visualization
http://www.crs4.it/~zip/13ways.html
– Never include a color legend.
– Avoid annotation.
– Never mention error characteristics of the visualization method.
– When in doubt, smooth.
– Don’t say how long it required to plot.
– Never compare your results with other data visualization techniques.
– Never cite references for the data.
– Claim generality but show results from a single data set.
– Use viewing angle to hide blemishes in 3D objects.
An Overview of Information Visualization Methods
http://www.informatik.uni-halle.de/~keim/tutorials.html
Methods of Interest
– Scatterplot Matrices
– Parallel Coordinates
– Pixel Oriented Methods
– Icon based Methods
– Dimensional Stacking
– Treemap
Assignment 1: see handout
Some websites of interest:
http://dmoz.org/Computers/Software/Databases/Data_Mining/Public_Domain_Software/
http://www.cs.man.ac.uk/~ngg/InfoViz/Projects_and_Products/Visualization/
Try a search at google.com using the following keywords together:
name_of_method download software