Data Visualization, class 5
Vivian Zhang | Scott Kostyshak
CTO @Supstat Inc | Data Scientist @Supstat Inc
Data Visualization http://nycdatascience.com/part4_en/
1 of 98 2/4/14, 7:31 AM

Data visualization

DESCRIPTION

I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, and more.

Data visualization

We will study the application of primary and advanced drawing functions in R, with a focus on understanding methods of data exploration through visualization.

Case study and exercise: Analyzing the NBA data with graphics

· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information

Why use visualization?

Data visualization

A figure is worth a thousand words.

    data <- read.table('data/anscombe.txt', header = TRUE)
    data <- data[, -1]
    head(data)

      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04

Data visualization

Try calculating some summary statistics. First calculate the mean of each column, and then calculate the correlation coefficient for each of the four x-y pairs.

    colMeans(data)

     x1  x2  x3  x4  y1  y2  y3  y4
    9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5

    sapply(1:4, function(x) cor(data[, x], data[, x + 4]))

    [1] 0.816 0.816 0.816 0.817
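The means and correlations are nearly identical across the four groups, which is exactly why a plot is needed to tell them apart. The following is a minimal sketch, assuming R's built-in `anscombe` data frame (used here in place of the file loaded above so the example is self-contained); it draws the four x-y pairs side by side, each with its least-squares line.

```r
# Assumption: the built-in `anscombe` dataset stands in for data/anscombe.txt.
data(anscombe)

# Correlation of each x-y pair (x1~y1, ..., x4~y4).
cors <- sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i + 4]))

# One scatter plot per pair, each with its least-squares line.
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Group", i),
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)
  abline(lm(y ~ x), col = "red4")
}
par(old_par)
```

The fitted lines look almost the same in every panel, while the point patterns differ clearly.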

Some basic principles

1. Determine the target of visualization from the beginning
   · Exploratory visualization
   · Explanatory visualization
2. Understand the characteristics of the data and the audience
   · Which variables are important and interesting
   · Consider the role and background of the audience
   · Select a proper mapping
3. Keep it concise but give enough information

Mapping elements of a graph:

1. Coordinate position
2. Line
3. Size
4. Color
5. Shape
6. Text
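As an illustration of the six mapping elements, here is a small sketch in base R using the built-in `iris` data. The choice of which variable goes to which element, the colors, and the label text are assumptions made for the example, not something the slides prescribe.

```r
# Position: x and y coordinates. Color and shape: species. Size: petal length.
sp    <- iris$Species
cols  <- c("steelblue", "red4", "darkgreen")[as.integer(sp)]
shps  <- c(16, 17, 15)[as.integer(sp)]
sizes <- 0.5 + iris$Petal.Length / max(iris$Petal.Length)

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = cols, pch = shps, cex = sizes,
     xlab = "Sepal length", ylab = "Sepal width")

# Line: a smoothed trend through the points.
lines(lowess(iris$Sepal.Length, iris$Sepal.Width), lty = 2)

# Text: label the flower with the longest petal.
i <- which.max(iris$Petal.Length)
text(iris$Sepal.Length[i], iris$Sepal.Width[i],
     labels = "longest petal", pos = 1)

legend("topright", legend = levels(sp),
       col = c("steelblue", "red4", "darkgreen"), pch = c(16, 17, 15))
```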

Visualization functions in R

Visualization functions in R

· base graphics
· lattice
· ggplot2

Elementary graphing functions

    plot(cars$dist ~ cars$speed)

Elementary graphing functions

    plot(cars$dist, type='l')

Elementary graphing functions

    plot(cars$dist, type='h')

Elementary graphing functions

    hist(cars$dist)

lattice package

    library(lattice)
    num <- sample(1:3, size=50, replace=TRUE)
    barchart(table(num))

lattice package

    qqmath(rnorm(100))

lattice package

    stripplot(~ Sepal.Length | Species, data = iris, layout=c(1,3))

lattice package

    densityplot(~ Sepal.Length, groups=Species, data = iris, plot.points=FALSE)

lattice package

    bwplot(Species ~ Sepal.Length, data = iris)

lattice package

    xyplot(Sepal.Width ~ Sepal.Length, groups=Species, data = iris)

lattice package

    splom(iris[1:4])

lattice package

    histogram(~ Sepal.Length | Species, data = iris, layout=c(1,3))

Three-dimensional graphs in the lattice package

    library(plyr)
    func3d <- function(x, y) {
      sin(x^2/2 - y^2/4) * cos(2*x - exp(y))
    }
    vec1 <- vec2 <- seq(0, 2, length=30)
    para <- expand.grid(x=vec1, y=vec2)
    result6 <- mdply(.data=para, .fun=func3d)

Three-dimensional graphs in the lattice package

    library(lattice)
    wireframe(V1 ~ x*y, data=result6, scales = list(arrows = FALSE),
              drape = TRUE, colorkey = FALSE)

ggplot package

Data, Mapping and Geom

    library(ggplot2)
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point()
    print(p)

ggplot package

Observe the internal structure

    summary(p)

    data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
    mapping: x = cty, y = hwy
    faceting: facet_null()
    -----------------------------------
    geom_point: na.rm = FALSE
    stat_identity:
    position_identity: (width = NULL, height = NULL)

ggplot package

Add other data mappings

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy, colour=factor(year)))
    p <- p + geom_point()
    print(p)

ggplot package

Add a statistical transformation such as a smooth

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy, colour=factor(year)))
    p <- p + geom_smooth()
    print(p)

ggplot package

Add points and smooth lines to the plot as layers

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth()

ggplot package

Scale control

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth() +
      scale_color_manual(values=c('blue2','red4'))

ggplot package

Facet control

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth() +
      scale_color_manual(values=c('blue2','red4')) +
      facet_wrap(~ year, ncol=1)

ggplot package

Polishing your plots for publication

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=class, size=displ), alpha=0.5, position = "jitter") +
      geom_smooth() +
      scale_size_continuous(range = c(4, 10)) +
      facet_wrap(~ year, ncol=1) +
      ggtitle('Vehicle model and fuel consumption') +
      labs(y='Highway miles per gallon', x='Urban miles per gallon',
           size='Displacement', colour = 'Model')

ggplot exercise I

Change the coordinate system, for example with coord_flip(), coord_polar(), or coord_cartesian()

    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year), size=displ), alpha=0.5, position = "jitter") +
      stat_smooth() +
      scale_color_manual(values = c('steelblue','red4')) +
      scale_size_continuous(range = c(4, 10))

The properties of a single variable

Histogram

    library(ggplot2)
    p <- ggplot(data=iris, aes(x=Sepal.Length)) +
      geom_histogram()
    print(p)

Histogram

We can customize the histogram as follows:

    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(binwidth=0.1,    # Set the bin width
                     fill='skyblue',  # Set the fill color
                     colour='black')  # Set the border color

Histograms plus density curve

The main role of the histogram is to show counts by group and the characteristics of the distribution. The distribution of a sample is of great importance in traditional statistics. There is another way to show the distribution of data, namely the kernel density estimate. We can estimate a density curve that represents the distribution from the data, and display the histogram and the density curve at the same time.

    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(aes(y=..density..), fill='skyblue', color='black') +
      geom_density(color='black', linetype=2, adjust=2)
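Why the histogram has to be put on the density scale before the curve is overlaid can be checked with base R. This is a sketch under the assumption that base `hist()` and `density()` are an acceptable stand-in for the ggplot2 layers above; the variable names are illustrative.

```r
x <- iris$Sepal.Length

# freq = FALSE rescales bar heights so that the bar areas sum to 1,
# which puts the histogram on the same scale as the density curve.
h <- hist(x, freq = FALSE, col = "skyblue", border = "black",
          main = "Sepal.Length", xlab = "Sepal.Length")
lines(density(x, adjust = 2), lty = 2)

# Total area of the bars.
bar_area <- sum(h$density * diff(h$breaks))
```

With raw counts on the y-axis the density curve would be squeezed flat against the bottom of the plot, since it integrates to 1 while the counts sum to the sample size.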

Density curve

Like the bin width of a histogram, the adjust parameter controls the smoothness of the density curve. We try different values to draw multiple density curves. The smaller the value, the more volatile and sensitive the curve.

    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(aes(y=..density..),  # Note: map y to density
                     fill='gray60', color='gray') +
      geom_density(color='black', linetype=1, adjust=0.5) +
      geom_density(color='black', linetype=2, adjust=1) +
      geom_density(color='black', linetype=3, adjust=2)
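What `adjust` does numerically can be seen with base R's `density()`, which multiplies the default bandwidth by `adjust`. The sketch below assumes the default bandwidth rule (`bw.nrd0`); treating it as equivalent to what `geom_density()` does is an assumption based on that function's documented `adjust` argument.

```r
x <- iris$Sepal.Length

# The default bandwidth rule ("nrd0"), then the same rule scaled by adjust.
bw_default <- bw.nrd0(x)
adjusts <- c(0.5, 1, 2)
fits <- lapply(adjusts, function(a) density(x, adjust = a))
bws <- sapply(fits, function(d) d$bw)

# A smaller bandwidth gives a wigglier curve, a larger one a smoother curve.
plot(fits[[1]], lty = 1, main = "Effect of adjust on the density curve")
lines(fits[[2]], lty = 2)
lines(fits[[3]], lty = 3)
legend("topright", legend = paste("adjust =", adjusts), lty = 1:3)
```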

Density curve

Density curves are also convenient for comparing different groups of data. For example, we can compare the Sepal.Length distributions of the three iris species like this:

    p <- ggplot(iris, aes(x=Sepal.Length, fill=Species)) +
      geom_density(alpha=0.5, color='gray')
    print(p)

Boxplot

In addition to histograms and density plots, we can also use boxplots to show the distribution of one-dimensional data. Boxplots are also convenient for comparing different groups of data.

    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_boxplot()
    print(p)
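To see which numbers a boxplot actually draws, the statistics can be computed without plotting. This is a minimal sketch using base R's `boxplot()` with `plot = FALSE`; the row labels are added here for readability and are not part of the returned object.

```r
# plot = FALSE returns the statistics without drawing.
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker.
stats <- b$stats
colnames(stats) <- b$names
rownames(stats) <- c("lower whisker", "lower hinge", "median",
                     "upper hinge", "upper whisker")
print(stats)

# Points beyond the whiskers are reported separately as outliers.
print(b$out)
```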

Violin plot

A violin plot contains more information than a boxplot about the (sub-)distributions of the data:

    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_violin()
    print(p)

Violin plot plus points

    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_violin(fill='gray', alpha=0.5) +
      geom_dotplot(binaxis = "y", stackdir = "center")
    print(p)

Displaying compositions

Bar chart

The proportion of each vehicle class in the mpg dataset, and these proportions grouped by year

    p <- ggplot(mpg, aes(x=class)) +
      geom_bar()
    print(p)

Stacked bar chart

The proportion of each vehicle class in the mpg dataset, and these proportions grouped by year

    mpg$year <- factor(mpg$year)
    p <- ggplot(mpg, aes(x=class, fill=year)) +
      geom_bar(color='black')

Dodged bar chart

With position_dodge(), the bars for each year are placed side by side instead of stacked.

    p <- ggplot(mpg, aes(x=class, fill=year)) +
      geom_bar(color='black', position=position_dodge())

Pie chart

    p <- ggplot(mpg, aes(x = factor(1), fill = factor(class))) +
      geom_bar(width = 1) +
      coord_polar(theta = "y")

Rose diagram

The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed and direction at a specific place.

    set.seed(1)
    # Randomly generate 100 wind directions, and divide them into 16 intervals.
    dir <- cut_interval(runif(100, 0, 360), n=16)
    # Randomly generate 100 wind speeds, and divide them into 4 intensities.
    mag <- cut_interval(rgamma(100, 15), 4)
    sample <- data.frame(dir=dir, mag=mag)
    # Map wind direction to the x-axis, frequency to the y-axis, and speed to the fill color,
    # then switch to polar coordinates.
    p <- ggplot(sample, aes(x=dir, fill=mag)) +
      geom_bar() +
      coord_polar()

Mosaic plot

Divide the data according to different variables, and then use rectangles of different sizes to represent the different groups of data. Let's look at the gender breakdown of survivors:
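The code for this slide did not survive, so the following is only a sketch of one way to draw such a plot. It assumes R's built-in `Titanic` table as the data and base R's `mosaicplot()` as the drawing function; neither is confirmed by the slides.

```r
# Assumption: the built-in Titanic table stands in for the slide's data.
# Collapse the 4-way table (Class, Sex, Age, Survived) to Sex x Survived.
tab <- margin.table(Titanic, c(2, 4))
print(tab)

# Tile widths follow the sex split; tile heights follow survival within each sex.
mosaicplot(tab, main = "Survival by sex on the Titanic",
           color = c("red4", "steelblue"))

# Share of survivors within each sex.
surv_rate <- prop.table(tab, margin = 1)[, "Yes"]
print(round(surv_rate, 2))
```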

The proportion structure of continuous data

    data <- read.csv('data/soft_impact.csv', header=TRUE)
    library(reshape2)
    data.melt <- melt(data, id='Year')
    p <- ggplot(data.melt, aes(x=Year, y=value, group=variable, fill=variable)) +
      geom_area(color='black', size=0.3, position=position_fill()) +
      scale_fill_brewer()

The relationship between variables

Scatter diagram

Show the relationship between two variables with a scatter diagram.

    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point()
    print(p)

Scatter plot of multidimensional data

    mpg$year <- factor(mpg$year)
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year))
    print(p)

Scatter plot of multidimensional data

Represent different years with different shapes

    mpg$year <- factor(mpg$year)
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year, shape=year))
    print(p)

Scatter plot of multidimensional data

With large data sets, points in a scatter plot may obscure each other (overplotting). We can add a small random disturbance (jitter) to the points to address this.

    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year), alpha=0.5, position = "jitter")
    print(p)
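The effect of jittering can be checked by counting how many points share exactly the same position before and after. The sketch below uses base R's `jitter()` on the built-in `iris` data rather than `mpg`, an assumption made so that it runs without ggplot2.

```r
set.seed(1)
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Measurements are recorded to one decimal place,
# so several flowers share a position.
n_overlap <- sum(duplicated(data.frame(x, y)))

# jitter() adds a small amount of uniform noise to each value.
xj <- jitter(x)
yj <- jitter(y)
n_overlap_jittered <- sum(duplicated(data.frame(xj, yj)))

old_par <- par(mfrow = c(1, 2))
plot(x, y, pch = 19, col = rgb(0, 0, 0, 0.5), main = "Original")
plot(xj, yj, pch = 19, col = rgb(0, 0, 0, 0.5), main = "Jittered")
par(old_par)
```

Semi-transparent points (the alpha channel in `rgb()`) play the same role as `alpha=0.5` in the ggplot2 call: overlapping points show up darker.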

Scatter plot of multidimensional data

To show the trend in the scatter plot, we can add a regression line.

    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year), alpha=0.5, position = "jitter") +
      geom_smooth(method='lm')
    print(p)
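The line drawn by a linear smoother is an ordinary least-squares fit, which can be computed directly with `lm()`. This sketch uses the built-in `cars` data from the elementary graphing slides; the choice of dataset and the speed used for the prediction are assumptions for illustration.

```r
# Least-squares fit of stopping distance on speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))

plot(dist ~ speed, data = cars, pch = 19, col = "steelblue")
abline(fit, col = "red4", lwd = 2)

# Predicted stopping distance at a speed of 15 mph.
pred15 <- predict(fit, newdata = data.frame(speed = 15))
print(pred15)
```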

Scatter plot of multidimensional data

In addition to color, we can also use the size of the points to reflect another variable, such as engine displacement. Some refer to plots like this as "bubble charts".

    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year, size=displ), alpha=0.5, position = "jitter") +
      geom_smooth(method='lm') +
      scale_size_continuous(range = c(4, 10))

Scatter plot of multidimensional data

Although we can show all the variables in one picture, we can also split it into multiple pictures to show the characteristics of different variables. This method is called grouping, conditioning, or faceting.

    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(colour=class, size=displ), alpha=0.5, position = "jitter") +
      geom_smooth() +
      scale_size_continuous(range = c(4, 10)) +
      facet_wrap(~ year, ncol=1)
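Faceting amounts to splitting the data by a grouping variable and drawing one panel per group on shared axes. A rough base R equivalent is sketched below, assuming the built-in `iris` data and `Species` as the conditioning variable; it only illustrates the idea and is not how ggplot2 implements it.

```r
# One panel per species, with shared axis limits so panels are comparable.
groups <- split(iris, iris$Species)
xlim <- range(iris$Sepal.Length)
ylim <- range(iris$Sepal.Width)

old_par <- par(mfrow = c(length(groups), 1), mar = c(4, 4, 2, 1))
for (name in names(groups)) {
  g <- groups[[name]]
  plot(g$Sepal.Length, g$Sepal.Width, xlim = xlim, ylim = ylim,
       main = name, xlab = "Sepal.Length", ylab = "Sepal.Width", pch = 19)
}
par(old_par)
```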

ggplot exercise II

· Make a scatter plot for the diamonds data
· Use transparency and small points; look into the size and alpha options in geom_point()
· Use a binned chart to observe the density of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))

Scatter plot of multidimensional data

The typical scatter plot shows the relationship between two variables. When you want to look at many bivariate relationships at once, you can use a scatter plot matrix.

Scatter plot of multidimensional data

When there are many numerical variables, they can be displayed together in one condensed figure.
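The code for the scatter plot matrix is not preserved in these slides. As a hedged sketch, base R's `pairs()` on the four numeric columns of `iris` produces such a matrix (the lattice `splom()` call shown earlier is an alternative); the colors are arbitrary choices.

```r
# All pairwise scatter plots of the four measurements, colored by species.
vars <- iris[, 1:4]
pairs(vars,
      col = c("steelblue", "red4", "darkgreen")[as.integer(iris$Species)],
      pch = 19)

# The matching numeric summary: pairwise correlation coefficients.
cm <- round(cor(vars), 2)
print(cm)
```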

Change over time

Change over time

For time series data, the first step is to look at how the variable changes over time. For example, let's visualize U.S. unemployment data.

    fillcolor <- ifelse(economics$unemploy[440:470] < 8000, 'steelblue', 'red4')
    p <- ggplot(economics[440:470,], aes(x=date, y=unemploy)) +
      geom_bar(stat='identity', fill=fillcolor)

Change over time

For a time series with a small number of observations, we can use a bar chart, and at the same time use different colors to distinguish positive and negative values. For a large time series the bars become crowded, so lines and points can be used in place of bars.

    p <- ggplot(economics[300:470,], aes(x=date, ymax=psavert, ymin=0)) +
      geom_linerange(color='grey20', size=0.5) +
      geom_point(aes(y=psavert), color='red4') +
      theme_bw()

Change over time

When the data are denser, we can use a line graph or an area chart to show the trend. Important time points or intervals can also be marked in the time series graph, such as highlighting the 1980s as a key period.

    fill.color <- ifelse(economics$date > '1980-01-01' & economics$date < '1990-01-01',
                         'steelblue', 'red4')
    p <- ggplot(economics, aes(x=date, ymax=psavert, ymin=0)) +
      geom_linerange(color=fill.color, size=0.9) +
      # annotate() draws the label once instead of once per row
      annotate('text', x=as.Date('1985-01-01'), y=13, label="1980s") +
      theme_bw()

Geographic information visualization

Map

Two ways of drawing a map:

· Download the geographic information data, then draw the geographical boundaries and identify areas and locations as needed
· Download bitmap data from Google Maps, then mark location and path information on the map

Map

World map

    library(ggplot2)
    world <- map_data("world")
    worldmap <- ggplot(world, aes(x=long, y=lat, group=group)) +
      geom_path(color='gray10', size=0.3) +
      geom_point(x=114, y=30, size=10, shape='*') +
      scale_y_continuous(breaks=(-2:2) * 30) +
      scale_x_continuous(breaks=(-4:4) * 45) +
      coord_map("ortho", orientation=c(30, 120, 0)) +
      theme(panel.grid.major = element_line(colour = "gray50"),
            panel.background = element_rect(fill = "white"),
            axis.text=element_blank(),
            axis.ticks=element_blank(),
            axis.title=element_blank())

Map of the U.S.

    map <- map_data('state')
    arrests <- USArrests
    names(arrests) <- tolower(names(arrests))
    arrests$region <- tolower(rownames(USArrests))

    usmap <- ggplot(data=arrests) +
      geom_map(map=map, aes(map_id=region, fill=murder), color='gray40') +
      expand_limits(x = map$long, y = map$lat) +
      scale_fill_continuous(high='red2', low='white') +
      theme_bw() +
      theme(panel.grid.major = element_blank(),
            panel.background = element_blank(),
            axis.text=element_blank(),
            axis.ticks=element_blank(),
            axis.title=element_blank(),
            legend.position = c(0.95, 0.28),
            legend.background=element_rect(fill="white", colour="white")) +
      coord_map('mercator')

Drawing a map of China based on a bitmap

Another way to draw a map of China is to download bitmap data from Google or OpenStreetMap, and then overlay point and line elements on it with ggplot2. The downloaded file is just a simple bitmap with no latitude and longitude information, which makes for fast mapping.

    library(ggmap)
    library(XML)
    webpage <- 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
    tables <- readHTMLTable(webpage, stringsAsFactors = FALSE)
    raw <- tables[[6]]
    data <- raw[, c(1,3,4)]
    names(data) <- c('date','lan','lon')
    data$lan <- as.numeric(data$lan)
    data$lon <- as.numeric(data$lon)
    data$date <- as.Date(data$date, "%Y-%m-%d")
    # Read the map data from Google with the ggmap package, and mark the earthquake data on the map.
    earthquake <- ggmap(get_googlemap(center = 'china', zoom=4, maptype='terrain'), extent='device') +
      geom_point(data=data, aes(x=lon, y=lan), colour = 'red', alpha=0.7) +
      theme(legend.position = "none")

R and interactive visualization

googleVis is an R package providing an interface between R and the Google Visualization API. It allows the user to use the Google Visualization API for data visualization without the need to upload data.

We want to compare the development trajectories of a group of 20 countries over the past several years. To obtain the data, we selected three variables from the World Bank database, which reflect the change in GDP, CO2 emissions, and life expectancy between 2001 and 2009.

    library(googleVis)
    library(WDI)
    DF <- WDI(country=c("CN","RU","BR","ZA","IN",'DE','AU','CA','FR','IT','JP','MX','GB','US'
    M <- gvisMotionChart(DF, idvar="country", timevar="year",
                         xvar='EN.ATM.CO2E.KT', yvar='NY.GDP.MKTP.CD')
    plot(M)

Case study and exercise

Exercise III: Analyzing NBA data

· Calculate the seasonal winning rate, and draw a bar chart
· Calculate the seasonal winning rate at home and on the road, and draw a bar chart
· Using the home side's scores by season, draw a set of four histograms
· Using the home side's scores by season, draw boxplots for five seasons
· Draw boxplots of the scores in all games for the home side and the opposing side
· Calculate the average and winning percentage for each opponent, and make a scatterplot to find the strong and the weak teams
