10 things statistics taught us about big data


Talk at DC Business Intelligentsia.


Research Blogging Teaching

jtleek.com

simplystatistics.org

jhudatascience.org

from: jtleek@gmail.com Roger let me know you gave him a ballpark figure for the number of students registered for his course, "Computing for Data Analysis". Could you give me an idea of how many have registered for my course, "Data Analysis"?

from: pangwei@coursera.org Hi Jeff, 7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)  

from: rdpeng@gmail.com You are f**ed. -roger  

[Plot: enrollment over time]

9 classes, each 1 month long, offered every month

1,000,000+ Enrolled

http://goo.gl/vQK0RH

http://goo.gl/xWAlPi

10 statistics things

http://goo.gl/wTAuvR

1. Problem first, not solution backward
2. Define a metric for success first
3. Analyze interactively
4. Plot your data first and always
5. Know your real sample size
6. Watch out for confounders
7. Correct for multiple testing
8. Average many predictors
9. Smooth over time and space
10. Have others check your work

Problem first, not solution backward

http://goo.gl/3vA1OB

http://hyperboleandahalf.blogspot.com/

http://cran.r-project.org/

http://bioconductor.org/

Define a metric for success before you start

http://www.agendia.com/managed-care/breast-cancer/mammaprint/

89% sensitivity 42% specificity 65% accuracy
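A rough sketch of how these three numbers relate: overall accuracy is a prevalence-weighted mix of sensitivity and specificity. The prevalence values below are assumptions chosen for illustration, not figures from the slides, and `accuracy_from_rates` is a hypothetical helper name.

```python
def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Overall accuracy implied by sensitivity, specificity and the
    fraction of truly positive cases (prevalence)."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Assumed prevalences, for illustration only (not from the slides).
for prevalence in (0.10, 0.49, 0.90):
    acc = accuracy_from_rates(0.89, 0.42, prevalence)
    print(f"prevalence={prevalence:.2f} -> accuracy={acc:.3f}")
```

With sensitivity and specificity held fixed, accuracy still moves with the mix of cases, which is one reason to choose the success metric before starting.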

http://www.biomedcentral.com/1471-2164/14/336/figure/F3

Analyze Interactively

http://had.co.nz/

https://twitter.com/EllieMcDonagh/status/469184554549248000

http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

Plot your data first and always

http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

Know your real sample size

Watch out for confounders

http://xkcd.com/552/

shoe size & literacy
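A small simulation of the shoe size and literacy example, assuming age is the confounder that drives both. All numbers (age range, slopes, noise levels) are made up for illustration, and `pearson` and `partial_corr` are hypothetical helper names.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after adjusting both for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated children: age drives both shoe size and literacy score.
age = [rng.uniform(5, 15) for _ in range(2000)]
shoe_size = [0.5 * a + rng.gauss(0, 1) for a in age]
literacy = [5.0 * a + rng.gauss(0, 10) for a in age]

raw = pearson(shoe_size, literacy)
adjusted = partial_corr(shoe_size, literacy, age)
print(f"raw correlation:      {raw:.2f}")
print(f"adjusted for age:     {adjusted:.2f}")
```

In this simulation shoe size and literacy are strongly correlated, but the relationship largely disappears once age is accounted for.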

Correct for multiple testing

http://xkcd.com/882/
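A minimal sketch of why correction matters and two common ways to do it: Bonferroni and Benjamini-Hochberg. The p-values are made up for illustration, the function names are hypothetical, and the family-wise error calculation assumes the tests are independent.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for i in order[:largest_rank]:
        rejected[i] = True
    return rejected


# 20 tests at the 0.05 level, as in the jelly bean comic.
print(f"P(at least one false positive) = {familywise_error_rate(20):.3f}")

p_values = [0.001, 0.01, 0.02, 0.03, 0.04, 0.5]  # made-up example values
print("Bonferroni rejections:        ", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false discoveries and usually rejects more.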


Average many predictors

5 independent, 70% accurate classifiers

Probability that a majority vote (at least 3 of 5) is correct:

10(0.7^3)(0.3^2) + 5(0.7^4)(0.3) + (0.7^5) = 0.837

83.7% accuracy

http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/EnsembleMethods.pdf Adapted from Todd Halloway

101 independent, 70% accurate classifiers

99.9% accuracy
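The majority-vote numbers can be checked with a binomial sum. This is a sketch that assumes an odd number of classifiers that err independently, each with the same accuracy; `majority_vote_accuracy` is a hypothetical helper name.

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers,
    each correct with probability p_correct, give the right answer.
    Assumes n_classifiers is odd so there are no ties."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1.0 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


for n in (1, 5, 101):
    print(f"{n:>3} classifiers -> {majority_vote_accuracy(n, 0.7):.4f}")
```

The gain depends on the independence assumption; classifiers that make correlated errors improve less when averaged.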


Smooth (average) over time and space

http://simplystatistics.org/2014/02/13/loess-explained-in-a-gif/

http://fivethirtyeight.com/
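A minimal sketch of smoothing by local averaging. This is a plain centered moving average, a simpler stand-in for loess rather than loess itself; the series and the window size are made up for illustration, and `moving_average` is a hypothetical helper name.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges so the
    output has the same length as the input. window must be odd."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


noisy = [3, 9, 4, 10, 5, 11, 6, 12]  # made-up series with a slow upward trend
print(moving_average(noisy, window=3))
```

Averaging neighbors trades a little bias for a large drop in noise; loess follows the same idea but fits a weighted local regression at each point instead of a flat mean.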

Have others check your work

10 statistics things

http://goo.gl/wTAuvR

1. Problem first, not solution backward
2. Define a metric for success first
3. Analyze interactively
4. Plot your data first and always
5. Know your real sample size
6. Watch out for confounders
7. Correct for multiple testing
8. Average many predictors
9. Smooth over time and space
10. Have others check your work

jtleek.com/talks
