Pennsylvania Predictive Model Set: Realigning Old Expectations with New Techniques in the Creation of a Statewide Archaeological Sensitivity Model
Matthew D. Harris, AECOM
Grace H. Ziesing, AECOM
SAA 2015, San Francisco, CA

Society for American Archaeology - 2015


Page 1: Society for American Archaeology - 2015

Pennsylvania Predictive Model Set: Realigning Old Expectations with New Techniques in the Creation of a Statewide Archaeological Sensitivity Model

Matthew D. Harris, AECOM
Grace H. Ziesing, AECOM
SAA 2015, San Francisco, CA

Page 2: Society for American Archaeology - 2015

[Timeline figure, 1950s to 1980s: Settlement Studies, Cultural Ecology, Optimal Foraging, Systems Theory, "New" Archaeology, and Predictive Modeling, as traced through American Antiquity]

Page 3: Society for American Archaeology - 2015

Machine Learning Approach

Many good models, many more bad ones…

• Clear goals and model intentions
• Iterative learning algorithms for pattern detection
• Empirical error estimate through resampling
• Letting the data speak for itself…

Conceptual Model: Y = F(X) + ε
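The "empirical error estimate through resampling" idea can be sketched as a k-fold cross-validation loop. This is an illustrative toy only: the data, the 100 m elevation cutoff, and the threshold classifier are all invented for the example.

```python
import random

def kfold_error(xs, ys, fit, predict, k=5, seed=42):
    """Estimate out-of-sample error empirically by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(predict(model, xs[i]) != ys[i] for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / k

# Toy pattern: "sites" occur below an elevation of 100
xs = list(range(200))
ys = [1 if x < 100 else 0 for x in xs]

fit = lambda X, Y: max(x for x, y in zip(X, Y) if y == 1)  # highest site elevation seen
predict = lambda m, x: 1 if x <= m else 0

cv_err = kfold_error(xs, ys, fit, predict)
```

The point is that the error is measured on held-out folds the model never saw, not on the training data that produced it.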

Page 4: Society for American Archaeology - 2015

[Map: count of Native American sites per 1-km cell (legend: 0 to 86)]

Page 5: Society for American Archaeology - 2015

Sample Generating Process:
• Non-systematic
• Subjective
• Extensive measurement error
• Non-representative
• Spatially biased

Population Generating Process:
• Highly dynamic
• Non-mechanistic
• Non-stationary
• Cultural and agency-driven
• Highly dynamic environment
• Changing parameters
• Subjectively defined expression
• Clustered

n ≈ 18,200 known sites, roughly 1% of the site population N (n ≈ 0.01 N)

Page 6: Society for American Archaeology - 2015

Project:
• Scalable
• Primary constraint is time; secondary, computing resources
• Raster output
• Expectations are broad and undefined

Dataset:
• Very low prevalence; highly imbalanced
• High false-negative cost vs. low false-positive cost

Covariates:
• Primarily environmental
• Co-correlated
• Unrepresentative
• Limited class separation

Academic Domain:
• Scant theoretical framework
• General lack of validation
• No agreed-upon benchmarks or methods

p ≈ 93 covariates; t ≈ 18 months; P(A|B)
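The asymmetric-cost point above can be made concrete: when missing a site (a false negative) costs far more than over-flagging empty ground, the class cutoff that minimizes expected cost drops, and more area gets classed as sensitive. A minimal sketch; the scores, labels, and cost ratios are invented for illustration.

```python
def best_threshold(scores, labels, c_fn, c_fp):
    """Choose the probability cutoff that minimizes total misclassification cost."""
    def cost(t):
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)   # missed sites
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)  # false alarms
        return c_fn * fn + c_fp * fp
    return min(sorted(set(scores)) + [1.01], key=cost)

scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]   # modeled site probabilities
labels = [0,   1,   0,   0,   0,   1  ]   # known present/absent

t_costly_fn = best_threshold(scores, labels, c_fn=10, c_fp=1)  # missing a site is costly
t_equal     = best_threshold(scores, labels, c_fn=1,  c_fp=1)
# The costly-false-negative cutoff is lower: more cells classed "site likely".
```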

Page 7: Society for American Archaeology - 2015

[Figure: underfit vs. overfit as model complexity (d.f., parameters, variables) increases]

Page 8: Society for American Archaeology - 2015

Key Takeaways – if you hear nothing after this point:

• Not a black box: measure twice, cut once
• Randomize, Resample, Retest
• Understand model complexity and the bias vs. variance trade-off
• Know your metrics (AUC, Kvamme Gain, AIC/BIC, Accuracy)
• BALANCE in all ways; no one right answer
• Class thresholds are critical and not arbitrary
• Cloud-based, backed up; practice #openscience
• Learn to code (excessive ArcGIS will give you hairy palms)
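Of the metrics named above, AUC is the one most often quoted without being understood: it is simply the probability that a randomly chosen site-present cell outscores a randomly chosen site-absent cell. A minimal rank-based sketch with invented scores:

```python
def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney form): P(random positive outscores
    a random negative), counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that ranks sites above non-sites in 5 of the 6 possible pairs:
value = auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0])
```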

Page 9: Society for American Archaeology - 2015

y = f(X) = β0 + Σ(m=1..M) βm hm(X)

[Figure: covariate space (X1, X2) recursively partitioned at knots t1, t2, t3, with basis functions h1, h2]

Backwards Stepwise Logistic Regression
• Generalized linear model with binomial link
• Lower complexity = high bias vs. low variance
• Traditional in archaeology
• Parameters: model coefficients estimated by MLE
• Variable selection: backwards stepwise on AIC
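A binomial-GLM-style fit can be sketched in a few lines. Here plain gradient descent on the log-loss stands in for the usual maximum-likelihood (IRLS) estimation, and the single covariate, a standardized distance to water, is invented for the example.

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fit by gradient descent on the log-loss."""
    b0, b = 0.0, [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b0 + sum(w * v for w, v in zip(b, xi))
            p = 1 / (1 + math.exp(-z))
            g = p - yi                        # d(log-loss)/dz
            b0 -= lr * g
            b = [w - lr * g * v for w, v in zip(b, xi)]
    return b0, b

def predict_prob(model, xi):
    b0, b = model
    z = b0 + sum(w * v for w, v in zip(b, xi))
    return 1 / (1 + math.exp(-z))

# Invented covariate: standardized distance to water; sites cluster near water
X = [[-1.5], [-1.0], [-0.8], [0.2], [0.9], [1.4]]
y = [1, 1, 1, 0, 0, 0]
model = fit_logistic(X, y)
```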

Multivariate Adaptive Regression Splines (MARS)
• Special case of the generalized linear model
• Moderate complexity = variable bias and variance
• Largely unknown in archaeology; used in ecology
• Parameters: nprune (recursive pruning of terms)
• Variable selection: generalized cross-validation (GCV)
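MARS builds its fit y = β0 + Σ βm hm(X) from hinge ("hockey-stick") basis functions with knots t. A hand-built illustration; the knot locations and coefficients here are invented, not fitted:

```python
def hinge(x, t, sign=+1):
    """MARS basis function: max(0, x - t) for sign=+1, max(0, t - x) for sign=-1."""
    return max(0.0, sign * (x - t))

def mars_like(x):
    """A hand-specified piecewise-linear model: flat below the first knot,
    with the slope changing at each knot (t1=2, t2=5)."""
    return 1.0 + 0.5 * hinge(x, 2) - 0.3 * hinge(x, 5)
```

A real MARS fit chooses the knots and terms itself and prunes them against generalized cross-validation; in R this is the `earth` package.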

Models

Page 10: Society for American Archaeology - 2015

[Figure: bagged trees tree1, tree2, …, treeb each classify x (ŷ1 = Cat, ŷ2 = Cat, ŷb = Dog); the majority vote gives ŷ = Cat]

Single Binary Classification Tree
• Node-branch structure
• Node split function based on the Gini index
• Binary classification or probability output
• Generally high variance

Random Forest
• Bootstrap aggregating ("bagging")
• Out-of-bag samples give an unbiased error estimate
• Variable randomization via the mtry parameter
• Incorporates weights and leaf-node size
• Sparse examples in archaeology, common elsewhere
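The bagging-plus-vote mechanism can be sketched with one-feature decision stumps standing in for full trees. Everything here is invented for the example, including the raw misclassification split criterion used in place of the Gini index:

```python
import random
from collections import Counter

def stump_fit(X, y):
    """Depth-1 'tree': best single threshold on the lone feature,
    minimizing misclassifications (toy stand-in for a Gini split)."""
    best = None
    for t in sorted(set(X)):
        for lo, hi in ((0, 1), (1, 0)):
            err = sum((lo if x < t else hi) != yi for x, yi in zip(X, y))
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    return best[1:]

def bagged_predict(X, y, x_new, b=25, seed=7):
    """Bagging: fit each stump on a bootstrap resample, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(b):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        t, lo, hi = stump_fit([X[i] for i in idx], [y[i] for i in idx])
        votes.append(lo if x_new < t else hi)
    return Counter(votes).most_common(1)[0][0]

X = [1, 2, 3, 10, 11, 12]
y = [1, 1, 1, 0, 0, 0]
```

A random forest adds one more ingredient this sketch omits: at each split only a random subset of covariates (mtry) is considered, which decorrelates the trees.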

Page 11: Society for American Archaeology - 2015

[Figure: gradient boosting as a sequence of stump models M1, M2, …, Mi over covariates x1, x2, x3, …, xi; each stage is fit via the loss ψ(yi, hp(x)) applied to the previous stage's residuals, and the stages sum to the final model FM(X)]

Gradient Boosting
• Weak learners combined into a strong learner
• Decision-tree stumps
• Each iteration fit to the residuals
• Adjustable learning rate
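The fit-to-residuals loop is short enough to sketch end to end. This toy uses squared loss in place of the general loss ψ, and the data are invented:

```python
def stump_predict(x, t, left, right):
    return left if x < t else right

def fit_residual_stump(X, r):
    """One regression stump fit to the current residuals by least squares."""
    best = None
    for t in sorted(set(X))[1:]:
        L = [ri for xi, ri in zip(X, r) if xi < t]
        R = [ri for xi, ri in zip(X, r) if xi >= t]
        lm, rm = sum(L) / len(L), sum(R) / len(R)
        sse = sum((ri - lm) ** 2 for ri in L) + sum((ri - rm) ** 2 for ri in R)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def boost(X, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each round fits a weak stump to
    the residuals and adds it, damped by a small learning rate."""
    pred = [sum(y) / len(y)] * len(X)
    for _ in range(rounds):
        r = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_residual_stump(X, r)
        pred = [pi + lr * stump_predict(xi, t, lm, rm) for pi, xi in zip(pred, X)]
    return pred

X = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
pred = boost(X, y)
```

With lr = 0.1 each round removes only 10% of the remaining residual, which is exactly the "adjustable learning rate" trade-off: slower fitting, better generalization.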

Page 12: Society for American Archaeology - 2015

Conceptual Model
• Linear Regression
• Logistic Regression
• Random Forest
• Gradient Boosting

Page 13: Society for American Archaeology - 2015

[Map legend: High / Mod. / Low sensitivity]

Page 14: Society for American Archaeology - 2015

[Map legend: High / Mod. / Low sensitivity]

Page 15: Society for American Archaeology - 2015

Model prediction vs. known sites (10 x 10 m cells):

                      Known Present    Known Absent          Total
  Predicted Present       1,992,770     309,213,157    311,205,927
  Predicted Absent           31,472     747,684,746    747,716,218
  Total                   2,024,242   1,056,897,903  1,058,922,145

Sensitivity / TPR = 98.4%
Specificity / TNR = 70.7%
Prevalence = 0.0019
Kvamme Gain (Kg) = 0.701
Accuracy = 70.8%
Positive Prediction Gain (PPG) = 3.350
Negative Prediction Gain (NPG) = 0.022
Detection Prevalence = 0.294
Mean RMSE of hold-out sample = 0.181

Project totals:
• Total model area (sq. mi): 45,293
• Individual models: 528
• Total model cells (10 x 10 m): 1 billion
• Environmental variables: 93
• Site-present cells: 2 million
• Processed cells: 102 billion
• Archaeological sites: 18,226
• Data: ~12 TB
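Every summary statistic on this slide follows directly from the four confusion-matrix cells, and recomputing them is a useful sanity check (the cell counts are copied from the table above):

```python
# Confusion-matrix cells (10 x 10 m raster cells) from the statewide model
tp, fp = 1_992_770, 309_213_157      # predicted present: known sites / empty
fn, tn = 31_472, 747_684_746         # predicted absent:  known sites / empty
total = tp + fp + fn + tn

sensitivity = tp / (tp + fn)              # TPR: 0.984
specificity = tn / (tn + fp)              # TNR: 0.707
prevalence = (tp + fn) / total            # 0.0019
detection_prevalence = (tp + fp) / total  # share of area flagged present: 0.294
accuracy = (tp + tn) / total              # 0.708
# Kvamme Gain: 1 - (% of area predicted present / % of sites captured there)
kvamme_gain = 1 - detection_prevalence / sensitivity   # 0.701
ppg = (tp / (tp + fp)) / prevalence       # positive prediction gain: 3.350
npg = (fn / (fn + tn)) / prevalence       # negative prediction gain: 0.022
```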

Page 16: Society for American Archaeology - 2015
Page 17: Society for American Archaeology - 2015

[Result maps: MARS, RF, LogReg, GBM]

Page 18: Society for American Archaeology - 2015
Page 19: Society for American Archaeology - 2015

[Result maps: MARS, RF, LogReg, GBM]

Page 20: Society for American Archaeology - 2015
Page 21: Society for American Archaeology - 2015
Page 22: Society for American Archaeology - 2015

Improvement of "Winning" Models by each Metric:

                  Prediction Err   Fraction   Gain/Balance    KG
  All Models          26.8%           65%        0.519       0.59
  Winning Models      18.3%           74%        0.737       0.63

Range of "Winning" Models by each Metric:

                  Prediction Err   Fraction   Gain/Balance
  Best Model           3.1%          84.8%       0.963
  Worst Model         46.0%          58.5%       0.268

Count of "Winning" Models by each Metric:

                  MARS   GBM   RF   LogReg
  Prediction Err     0    19    6        5
  Fraction           0    16   12        2
  Gain/Balance       0    19    8        3
  Total              0    54   26       10

Page 23: Society for American Archaeology - 2015

• In these samples, sites are not distributed randomly relative to environment; a pattern exists.
• … therefore, predictive modeling "works" …
• … if the biased site sample contains a real pattern.
• Data cleaning and preparation are as important as the models.
• Iterative learning identifies patterns at varying levels of bias and variance; it is critical to know your balance.
• Parameterize within cross-validation to find candidate models that work.
• Repeated resampling approximates the probability distribution of each sample. (Bayesian discussion/rant goes here.)
• Learn R (or Python), practice #openscience, use RStudio and RStudio Server.

Thank You!
@md_harris