Quick overview of basic concepts in Statistics
(and a few bad jokes)
Motivation
Statistical concepts can be subtle and difficult to explain
I thought it would be useful to review some central ideas in Statistics
Overview
Descriptive statistics
• Distributions, sampling
• Descriptive statistics and plots
• Sampling distributions
Inferential statistics
• Point and interval estimates
• Confidence intervals and the bootstrap
• Hypothesis tests
If there is interest: neural data analysis
Where do data come from?
Distributions!
Truth → Data
How do we get data?
Simple random sample: each member in the population is equally likely to be in the sample
• Random selection
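A minimal sketch of simple random sampling, using only Python's standard library. The population of participant IDs and the function name `simple_random_sample` are made up for illustration; they are not from the slides.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: participant IDs 0..999 (illustration only)
population = list(range(1000))
sample = simple_random_sample(population, n=50, seed=1)
print(sorted(sample)[:10])
```

Because selection does not depend on any property of the members, the sample should not systematically over- or under-represent any part of the population.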
Soup analogy!
Q: Why is this good?
An Example Dataset (flight delays)
[Figure: data table with rows labeled Cases and columns labeled Variables, showing a Categorical Variable and a Quantitative Variable]
What is a good first step when analyzing data?
What is a useful plot for categorical data?
What is a useful plot for quantitative data?
Box and violin plots
Joy plots
Neat fact: Some statisticians find violin plots ugly
• Why do they find them ugly?
X-mas ornaments
Dynamite plots
Neat fact: Many statisticians hate dynamite plots
• Why do they hate them?
What is a statistic?
A: A single death is a tragedy; a million deaths is a statistic.
- Joseph Stalin
Descriptive statistics describe the data you have collected
• Usually denoted with roman characters
• A single categorical variable: proportion
• A single quantitative variable: the mean
• A pair of quantitative variables: the correlation
Example statistic: correlation coefficient
The correlation coefficient r is a statistic that measures the linear association between two variables X and Y
r is an estimate of ρ
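As a rough sketch, the sample correlation r can be computed directly from its definition (covariation of X and Y divided by the product of their spreads). The helper name `pearson_r` and the toy numbers are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (made up): a perfectly linear increasing relationship
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(pearson_r(xs, ys))
```

r always lies between -1 and 1; values near 0 indicate little linear association, which does not rule out a nonlinear one.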
Side note: correlation ≠ causation
The correlation coefficient r is a statistic that measures the linear association between two variables X and Y
Descriptive statistics are usually denoted with roman characters
• A single quantitative variable: the mean
• A single categorical variable: proportion
• A pair of quantitative variables: the correlation
A variety of statistics that can be used to describe data
p_social, r
Population/process parameters are usually denoted with Greek characters
• A single quantitative variable: the mean
• A single categorical variable: proportion
• A pair of quantitative variables: the correlation
Statistics have corresponding parameters
π_social, ρ
(infinite data)
Statistical inference
We use sample statistics to make judgements about population parameters
Population parameters: π, μ, σ, ρ
Sample statistics: p, x̄, s, r
Statistical inference
Estimation
Point estimation:
x̄ is a point estimate for μ
Regression
[Figure: scatter plot; axes: shark attacks, ice cream sales]
In linear regression, we try to fit a line to our data
Truth: y = β0 + β1·x
Fit: ŷ = b0 + b1·x
The b_i's are usually chosen to minimize the MSE: MSE = (1/n) · Σ (y_i − ŷ_i)²
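The fit can be sketched in a few lines, assuming the standard closed-form least-squares solution for a single predictor (b1 is the ratio of co-variation to x-variation, and b0 makes the line pass through the point of means). The function names and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing the mean squared error of y_hat = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def mse(xs, ys, b0, b1):
    """Mean squared error of the line b0 + b1*x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data, roughly linear with some noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
print(f"b0={b0:.3f}, b1={b1:.3f}, MSE={mse(xs, ys, b0, b1):.4f}")
```

Any other choice of intercept and slope gives an MSE at least as large on the same data, which is what "chosen to minimize the MSE" means in practice.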
Regression caution #1
Avoid trying to apply the regression line to predict values far from those that were used to create the line
Statistics have distributions too!
A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population
Sampling distributions are often approximately Normal
• Due to the central limit theorem: sums of many random variates lead to a Normal distribution
x̄1, x̄2, x̄3, …, x̄k
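A small simulation can make the idea concrete. This sketch assumes a skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration) and draws k samples of size n, recording the mean of each.

```python
import math
import random
import statistics

def sampling_distribution_of_mean(draw, n, k, seed=0):
    """Compute the mean of k independent samples, each of size n."""
    rng = random.Random(seed)
    return [statistics.mean([draw(rng) for _ in range(n)]) for _ in range(k)]

# Assumed population: exponential with mean 1 (clearly not Normal)
def draw_exponential(rng):
    return rng.expovariate(1.0)

n, k = 50, 2000
means = sampling_distribution_of_mean(draw_exponential, n, k)

print("mean of sample means:", round(statistics.mean(means), 3))
print("SD of sample means:  ", round(statistics.stdev(means), 3))
print("theory, sigma/sqrt(n):", round(1.0 / math.sqrt(n), 3))
```

Even though the population is skewed, a histogram of `means` should look roughly bell-shaped, centered near the population mean, with spread close to σ/√n.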
Estimation
Point estimation: x̄ is a point estimate for μ
Interval estimate: point estimate + margin of error
The population mean is somewhere in here
Confidence Intervals
A confidence interval for a parameter is an interval computed by a method that will capture the parameter a specified proportion of times
• (if the sampling process were repeated many times)
The success rate (proportion of all samples whose intervals contain the parameter) is known as the confidence level
Think ring toss…
Parameter exists in the ideal world
We toss intervals at it
95% of those intervals capture the parameter
(for a 95% CI)
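The ring toss picture can be checked by simulation. This sketch assumes a Normal population with known μ (so we can see whether each interval captured it) and uses the common approximation interval = x̄ ± 2·SE, with SE estimated as s/√n. All numeric settings are illustrative choices.

```python
import math
import random
import statistics

def coverage_rate(mu=10.0, sigma=2.0, n=30, n_repeats=2000, seed=0):
    """Fraction of repeated samples whose interval x_bar +/- 2*SE contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
        if x_bar - 2 * se <= mu <= x_bar + 2 * se:
            hits += 1  # this "ring" landed on the parameter
    return hits / n_repeats

rate = coverage_rate()
print(f"proportion of intervals capturing mu: {rate:.3f}")
```

The proportion should come out close to 0.95. Note that any single interval either contains μ or does not; the 95% describes the method across repeated samples.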
Wits and Wagers…
The bootstrap to estimate standard errors
The plug-in principle
95% Confidence Intervals
If the bootstrap distribution is approximately symmetric, we can use percentiles to calculate a 95% confidence interval
Formulas also exist for calculating CIs
• SE of the mean can be estimated as: s/√n
• CI is: Statistic ± 2·SE
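Both routes to a 95% interval can be sketched side by side. The data here are simulated purely for illustration, and the helper name `bootstrap_means` is an assumption; the bootstrap resamples the observed data with replacement (the plug-in principle: treat the sample as a stand-in for the population).

```python
import math
import random
import statistics

def bootstrap_means(data, n_boot=2000, seed=0):
    """Means of n_boot resamples drawn with replacement from the data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot)]

# Made-up sample of 40 measurements (illustration only)
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(40)]
x_bar = statistics.mean(data)

# Bootstrap: the SE is the spread of the bootstrap distribution
boot = bootstrap_means(data)
boot_se = statistics.stdev(boot)
cuts = statistics.quantiles(boot, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
percentile_ci = (cuts[0], cuts[-1])       # 2.5th and 97.5th percentiles

# Formula: SE = s / sqrt(n), CI = statistic +/- 2*SE
formula_se = statistics.stdev(data) / math.sqrt(len(data))
formula_ci = (x_bar - 2 * formula_se, x_bar + 2 * formula_se)

print("bootstrap SE:", round(boot_se, 3), " formula SE:", round(formula_se, 3))
print("percentile CI:", tuple(round(v, 2) for v in percentile_ci))
print("formula CI:   ", tuple(round(v, 2) for v in formula_ci))
```

For the mean, the two intervals should be similar. The bootstrap becomes more useful for statistics that have no simple SE formula.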
Question: what is a p-value?
A p-value is the probability of getting a statistic as or more extreme than the observed statistic from a null model (null distribution)
Hypothesis tests in 2 steps
1. Create a distribution showing what the statistic would look like if there was no effect (H0 is true)
• Null distribution (distribution if no effect)
2. Show that the statistic you observed is unlikely to come from this null distribution
• P-value is: Pr(S ≥ obs_stat | H0)
• P-value is not: Pr(H0 | data)
Trial metaphor of hypothesis testing
1. Assume innocence: H0 is true
• State H0 and HA
2. Gather evidence
• Calculate the observed statistic
3. Create a distribution of what evidence would look like if H0 is true
• Null distribution
4. Assess the probability that the observed evidence would come from the null distribution
• p-value
5. Make a judgement
• Assess whether the results are statistically significant
Types of hypothesis tests
Permutation test: Create a null distribution by simulating many random assignments
• 1. Shuffle the condition labels
• 2. Compute the statistic on the shuffled labels
• 3. Repeat many times to get a null distribution
Parametric tests: based on the fact that the distribution of the statistic is known
• t-tests, ANOVAs, chi-squared tests, etc...
• Less robust (based on Normal approximations)
Visual hypothesis tests
“Visual hypothesis tests” create visual plots of randomly rearranged points
• i.e., a set of ‘innocent’ plots where there is no pattern
If you can tell the real data from the plots of randomly generated data this is equivalent to rejecting the null hypothesis
Which plot shows the real data?
See Wickham, et al, 2010
Permutation test example: Hypothesis tests for comparing two means
Question: Is this pill effective?
Experimental design
Take a group of participants and randomly assign:
• Half to a treatment group where they get the pill
• Half in a control group where they get a fake pill (placebo)
• See if there is more improvement in the treatment group compared to the control group
[Figure: participant pool split by random assignment into a control group and a treatment group]
Hypothesis tests for differences in two group means
1. State the null and alternative hypotheses
• H0: μ_Treatment = μ_Control, or μ_Treatment − μ_Control = 0
• HA: μ_Treatment > μ_Control, or μ_Treatment − μ_Control > 0
2. Calculate the statistic of interest
• x̄_Treatment − x̄_Control
3. What should we do next…?
3. Create the null distribution!
Reconstructed participant pool data under H0
Shuffle data for random assignment consistent with H0
[Figure: pooled data shuffled into a shuffled ‘treatment group’ and a shuffled ‘control group’]
One null distribution statistic: x̄_Shuff_Treatment − x̄_Shuff_Control
Repeat 10,000 times
p-value = .064
Step 4?
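The shuffle-and-repeat procedure can be sketched as follows. The improvement scores are made up for illustration (they are not the data behind the slides, so the p-value will not match the one shown), and `permutation_test` is a hypothetical helper name.

```python
import random
import statistics

def permutation_test(treatment, control, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(treatment) - mean(control) > 0."""
    rng = random.Random(seed)
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(treatment) + list(control)  # under H0 the labels don't matter
    n_treat = len(treatment)
    null_stats = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random re-assignment consistent with H0
        shuffled_diff = (statistics.mean(pooled[:n_treat])
                         - statistics.mean(pooled[n_treat:]))
        null_stats.append(shuffled_diff)
    # p-value: proportion of null statistics at least as large as the observed one
    p_value = sum(s >= observed for s in null_stats) / n_permutations
    return observed, p_value

# Hypothetical improvement scores (illustration only)
treatment = [12, 15, 9, 14, 11, 16, 13, 10]
control = [10, 8, 12, 9, 11, 7, 13, 9]
observed, p_value = permutation_test(treatment, control)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.4f}")
```

The test is one-sided because HA states that the treatment mean is larger; a two-sided version would compare absolute differences instead.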
Fisher vs. Neyman
Question: should you report exact p-values?
• p < .05
• p = .021
Null-hypothesis significance testing (NHST) is a hybrid of two theories:
• 1. Significance testing of Ronald Fisher
• 2. Hypothesis testing of Jerzy Neyman and Egon Pearson
Fisher (1890-1962) Neyman (1894-1981)
Neat fact: they hated each other
Type I and Type II Errors
              Reject H0                            Do not reject H0
H0 is true    Type I error (α) (false positive)    No error
H0 is false   No error                             Type II error (β) (false negative)
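Both error rates can be estimated by simulating many experiments. This sketch swaps in a simple one-sided z-test with known σ (not a test taken from the slides) so the p-value can be computed with the standard library alone; the sample size, effect size, and α are illustrative assumptions.

```python
import math
import random

def one_sided_z_p_value(sample, mu0=0.0, sigma=1.0):
    """p-value for H0: mu = mu0 against HA: mu > mu0, with sigma known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # Pr(Z >= z) for standard Normal Z

def rejection_rate(true_mu, n=25, alpha=0.05, n_experiments=4000, seed=0):
    """Fraction of simulated experiments in which H0 (mu = 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mu, 1.0) for _ in range(n)]
        if one_sided_z_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_experiments

# H0 is true (mu = 0): every rejection is a Type I error
type_1_rate = rejection_rate(true_mu=0.0)
# H0 is false (assumed true mu = 0.5): every non-rejection is a Type II error
type_2_rate = 1.0 - rejection_rate(true_mu=0.5, seed=1)

print(f"estimated Type I error rate:  {type_1_rate:.3f}")
print(f"estimated Type II error rate: {type_2_rate:.3f}")
```

The Type I rate should land near the chosen α, while the Type II rate depends on the effect size and sample size assumed here.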
Do reproducible research!
Tools exist that allow you to embed your analyses into your written reports:
• R: RMarkdown
• Python/Julia/R: Jupyter
• MATLAB: Live Scripts
John Tukey
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question”
- The future of data analysis. Annals of Mathematical Statistics 33(1), (1962), page 13.
What is Data Science?
See Donoho, 2017
Questions
Interactive single neuron analysis tutorial…
https://neuraldata.net/spike/
Created by Brooke Fitzgerald
Are we feeling ok about hypothesis tests?
American Statistical Association’s Statement on p-values