Species Habitat Modeling using randomForest in the R environment for statistical analysis

NatureServe Biodiversity Without Boundaries Conference
Emilie Henderson and Tim Howard
April 24, 2011, 1 – 4 PM, Pacific Time



Introduction


What are Element Distribution Models?... Species Distribution Models?

• A prediction of the most suitable habitats for a species or natural community

• Based on known locations for that species AND known environmental conditions.

• Plot Data:
  – Forest Inventory and Analysis
  – Bureau of Land Management
  – USFS Current Vegetation Survey
  – USFS Ecology Plots
  – Element Occurrence Records (for rare types)
  – Supplementary points from airphoto interpretation

• Spatial Data:
  – LANDSAT (bands and transformations)
  – Climate
  – Soil parent material
  – Elevation (derivatives)
  – Location

[Diagram: plot data and spatial data feed a statistical model, which produces a predictive map of species habitats.]


Which Algorithm?

• BIOCLIM
• DOMAIN
• MAXENT
• GARP
• Logistic Regression
• Mahalanobis Distance
• Classification and Regression Trees (CART)
• Multivariate Adaptive Regression Splines (MARS)
• Regression Tree Analysis
• Bagging Trees
• Random Forests **
• Generalized Additive Models (GAM)
• Neural Networks

See Elith et al. 2006, Prasad et al. 2006, Guisan and Zimmermann 2000


Random Forests

• Compare presence points with background points to create classification trees (CART) based on environmental layers.

• Randomly use a subset of the environmental layers and a subset of the points each time.

• Build hundreds of trees, compare results among trees.

Prasad et al. 2006

Breiman 2001
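The two sources of randomness on this slide can be sketched in a few lines of base R. This is a toy illustration, not the randomForest internals: the variable names and counts are made up, and in Breiman's algorithm the variable subset is redrawn at every split rather than once per tree.

```r
set.seed(1)

n_points <- 100
# Hypothetical environmental layer names, for illustration only
env_vars <- c("elev", "precip", "solrad", "gdd", "caco3", "om")
# Default number of variables tried for classification: floor(sqrt(p))
mtry <- floor(sqrt(length(env_vars)))

n_trees <- 5
draws <- lapply(seq_len(n_trees), function(i) {
  list(
    rows = sample(n_points, n_points, replace = TRUE),  # bootstrap sample of points
    vars = sample(env_vars, mtry)                       # random subset of layers
  )
})

# Points left out of a bootstrap sample are "out-of-bag" for that tree
oob_fraction <- sapply(draws, function(d) 1 - length(unique(d$rows)) / n_points)
print(round(oob_fraction, 2))
```

The out-of-bag points are what randomForest uses for its internal accuracy estimates.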


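Indiana Bat, 300th RF tree

[Figure: a single classification tree, read from the "Start Here" node at the top. Splits use variables including GDD ann, CaCO3, precip 7, Topoindex, solrad, ann min T, elev, % OM, and precip 5; each leaf is labeled P or A.]

P = present, A = absent. This is a CART tree.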


forest of trees

- using random subsets of both environmental variables and attributed points


Predict on the forest of trees

[Figure: one GIS cell run down three trees; two trees end at P and one at A.] P = present, A = absent

• Run each GIS 'cell' down all trees (complete for each of 380M cells)
• 2 of 3 trees predict 'present'
• Cell value = 67%
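The cell value is just the share of trees voting 'present'. A minimal sketch with made-up votes, first for the single cell on the slide and then for several cells at once:

```r
# One cell, three trees: each tree votes present ("P") or absent ("A")
tree_votes <- c("P", "P", "A")
cell_value <- mean(tree_votes == "P")
print(round(100 * cell_value))  # percent of trees voting "present"

# Several cells at once: rows are cells, columns are trees (made-up votes)
votes <- matrix(c("P", "P", "A",
                  "A", "A", "A",
                  "P", "P", "P"),
                ncol = 3, byrow = TRUE)
cell_values <- rowMeans(votes == "P")
print(cell_values)
```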


Interpretation of output

• Final output reports the probability that the environmental conditions match the conditions where the target element is known to occur.

• Both internal and external model validation results can be extracted, including confusion matrices, kappa and other metrics, ROC plots, and AUC.

Random Forest may find many solutions


Random Forest: What it can’t do.

• Build a perfect map.

• The methods we’re using today aren’t tuned for community-compositional analysis.


Random Forest: What it can do well

• Highlight locations that are worth visiting to look for species X

• Highlight locations where re-introduction of Species X might be worthwhile


Why R?

Inputs: environmental layers, attributed points, database tables

In R:
• Random Forest package
• RODBC package (data input/output)
• SDMap draft package (accuracy assessment and mapping)

Output: .tif file of predictions


Why R?

• Things R can do:
  – Interact with databases
  – Pull information from gridded spatial data layers
  – Build random forest models
  – Characterize accuracy of random forest models
  – Build graphics to describe model accuracy
  – Build spatial grids of model predictions
  – Automate workflow
  – Save figures and build reports


Why doesn’t everyone use R?

[Figure: learning curve plotted against time, with one line each for Emilie and Tim, annotated "AGH!!!".]


Goals for today:

– Pre-R data management
– Efficient data import and export to and from R
  • shapefile attribute tables, Access databases, .csv files
– Basic R navigation
  • manipulating vectors and data frames
– Build a randomForest model for presence-absence data
– Assess model accuracy
– Build a map from that model
– Script a loop for building many models and maps
– Automate accuracy assessment to go with mapping


What do you know about R?


Data management


A plot dataset (LANDFIRE Plot Reference Database)


Tables


PLOT


ENV


SPP

For other types of data (e.g., Element Occurrence Records), a different structure may be needed. Tim’s Input Here?


Queries


Tim – Element Occurrence Records – database structure

• First third of Tim’s script


Spatial Data

• Questions to ask before building spatial data for modeling
  – What spatial resolution and extent are needed?
  – What types of information are likely to be useful?

• Types of spatial data that I use for modeling plants
  – Soil
  – Climate
  – Topography
  – Imagery

• Things to pay attention to when building spatial data
  – Consistency
    • Grids MUST be of identical spatial extent, and snapped to the same upper-left coordinate
  – Projection
    • Projections must be identical among grids
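The consistency rules above are easy to check before modeling. A minimal sketch, assuming each grid's header has already been summarized as a small list; the layer names, field names, and values here are made up, and one grid is deliberately shifted by half a cell:

```r
# Hypothetical header summaries for three grids (all values are made up)
grids <- list(
  elev   = list(nrow = 1000, ncol = 1200, xmin = 400000, ymax = 5100000,
                cell = 30, proj = "UTM zone 10N"),
  precip = list(nrow = 1000, ncol = 1200, xmin = 400000, ymax = 5100000,
                cell = 30, proj = "UTM zone 10N"),
  band5  = list(nrow = 1000, ncol = 1200, xmin = 400015, ymax = 5100000,
                cell = 30, proj = "UTM zone 10N")  # not snapped to the others
)

# Compare every grid against the first one, used here as the template
same_as_template <- sapply(grids, function(g) identical(g, grids[[1]]))
mismatched <- names(grids)[!same_as_template]
print(mismatched)
```

For real rasters, the raster package's compareRaster() is one existing way to make the same comparison.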


Organization of spatial data is key for efficient mapping workflow


Plot-spatial data intersection

– Sampling grain, modeling grain, and the intersection.

[Figure: a plot footprint with numbered points (1 to 6) overlaid on a stack of grids: elevation, slope, aspect, temperature, precipitation, and Landsat spectral reflectance (band 5).]


• I’ve used Matt Gregory’s freely available ‘footprint.exe’ program for extracting averages of multi-pixel plot footprints.

This is a non-trivial endeavor.


Footprint.exe

R6 5 5 30.0
0 0 1 0 0
0 1 1 1 0
1 1 2 1 1
0 1 1 1 0
0 0 1 0 0

R6_ECO 3 3 30.0
1 1 1
1 2 1
1 1 1

Single 3 3 30.0
0 0 0
0 2 0
0 0 0

• a handy little command-line tool for multi-pixel plot-grid intersections.

• Handles multiple plot formats
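To give a feel for what a multi-pixel intersection involves, here is a base R sketch that averages grid values under a footprint mask. It is not footprint.exe and does not read its file format. It assumes that any nonzero entry in the pattern marks a pixel inside the footprint (the difference between 1 and 2 is ignored), and it does no edge handling; the grid values are made up.

```r
# Made-up 7 x 7 grid of values (e.g., elevation)
grid <- matrix(1:49, nrow = 7, ncol = 7)

# The 5 x 5 pattern from the slide; nonzero = inside the footprint (assumption)
pattern <- matrix(c(0, 0, 1, 0, 0,
                    0, 1, 1, 1, 0,
                    1, 1, 2, 1, 1,
                    0, 1, 1, 1, 0,
                    0, 0, 1, 0, 0),
                  nrow = 5, byrow = TRUE)

# Hypothetical helper: mean of the grid cells under the pattern,
# centered on (row, col). Assumes the window fits inside the grid.
footprint_mean <- function(grid, pattern, row, col) {
  half_r <- (nrow(pattern) - 1) / 2
  half_c <- (ncol(pattern) - 1) / 2
  window <- grid[(row - half_r):(row + half_r), (col - half_c):(col + half_c)]
  mean(window[pattern > 0])
}

plot_mean <- footprint_mean(grid, pattern, row = 4, col = 4)
print(plot_mean)
```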


Question: Polygon to point?

• Equal # of points per polygon: bias toward small polygons

• Equal # of points per area: bias toward large polygons

New York Natural Heritage Program


Decision: logistic sampling

[Figure: Logistic Sampling Scheme. x-axis: # pixels in polygon (0 to 1400); y-axis: # points to sample (0 to 500). Annotations mark the 1-to-1 sampling line and the asymptote, with an equation giving Y in terms of A, k, and x.]

New York Natural Heritage Program
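One curve with the two properties labeled on the plot (close to 1-to-1 sampling for small polygons, leveling off at an asymptote for large ones) is sketched below. The functional form, the choice k = 2/A, and the asymptote of 500 are assumptions made for illustration; they are not taken from the slide's equation.

```r
# Hypothetical logistic-style sampling curve:
#   Y = A * (2 / (1 + exp(-k * x)) - 1), with k = 2 / A
# so that the slope at x = 0 is 1 (1-to-1 sampling) and Y approaches A.
points_to_sample <- function(n_pixels, asymptote = 500) {
  k <- 2 / asymptote
  asymptote * (2 / (1 + exp(-k * n_pixels)) - 1)
}

n_pix <- c(0, 10, 200, 1400)
print(round(points_to_sample(n_pix)))
```

Small polygons are sampled almost pixel for pixel, while very large polygons are capped near the asymptote, which tempers both biases from the previous slide.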


Getting Started: R navigation and data import


Objects in R

• Some basic object types
  – Vector
    • Data types
      – Numeric
      – Character
      – Factor: efficiently stores repetitive character vectors as numbers. Useful, but can produce bugs if handled improperly.
  – Data frame
    • Holds multiple vectors together; the vectors can be of different types
  – List
    • Holds multiple objects of any type together.
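A quick tour of these object types in base R, including the classic factor bug. All values are made up.

```r
elev  <- c(120.5, 980, 1430)             # numeric vector
veg   <- c("forest", "shrub", "forest")  # character vector
veg_f <- factor(veg)                     # factor: integer codes plus labels
print(levels(veg_f))
print(as.integer(veg_f))                 # the underlying codes

# Classic factor bug: numbers that were read in as a factor
plot_id <- factor(c("10", "20", "10"))
wrong <- as.numeric(plot_id)                # returns the codes, not the values
right <- as.numeric(as.character(plot_id))  # returns the values

# Data frame: vectors of different types held side by side
plots <- data.frame(elev = elev, veg = veg_f)

# List: any objects held together
bundle <- list(plots = plots, note = "anything goes", n = nrow(plots))
str(bundle)
```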


Getting Data into R

• Basic:
  – read.table
  – read.csv
  – read.dbf
    • (included in the ‘foreign’ library, and also the ‘shapefiles’ library; has slightly different formats for each)

• More useful:
  – RODBC can pull data from
    • Access
    • Excel
    • SQL Server
    • ... and others
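A minimal sketch of both routes. The .csv part runs as-is on a temporary file with made-up contents. The database part is only defined, not called, because it needs the RODBC package and an Access driver on Windows; the helper name and its arguments are hypothetical.

```r
# Basic route: write a tiny made-up .csv, then read it back
csv_path <- tempfile(fileext = ".csv")
writeLines(c("plot,elev,ok",
             "p1,120,TRUE",
             "p2,980,FALSE"), csv_path)
env <- read.csv(csv_path, stringsAsFactors = FALSE)
str(env)

# Database route (not run here): hypothetical helper around RODBC calls
read_access_table <- function(db_path, table_name) {
  channel <- RODBC::odbcConnectAccess(db_path)  # open the connection
  on.exit(RODBC::odbcClose(channel))            # always close it again
  RODBC::sqlFetch(channel, table_name)          # table or saved query -> data frame
}
```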


Within-R manipulations of data

• Subsetting
  – Based on indexes (e.g., how to pull out column 4)
  – Based on true-false vectors (e.g., how to pull out plots that have been screened as ‘OK’)

• Xtabs
  – Taking a long species table and making it into a crosstab format
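Both manipulations in base R, on small made-up tables. The column names and species codes are assumptions for illustration.

```r
env <- data.frame(plot  = c("p1", "p2", "p3"),
                  elev  = c(120, 980, 1430),
                  slope = c(5, 20, 12),
                  ok    = c(TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)

col4    <- env[, 4]       # subsetting by index: column 4
env_ok  <- env[env$ok, ]  # subsetting by true-false vector: screened plots only

# Long species table: one row per plot-species record
spp_long <- data.frame(plot    = c("p1", "p1", "p2", "p3", "p3"),
                       species = c("ABAM", "PSME", "PSME", "ABAM", "TSHE"),
                       cover   = c(10, 40, 25, 5, 60),
                       stringsAsFactors = FALSE)

# Crosstab: plots as rows, species as columns, cover summed in the cells
spp_wide <- xtabs(cover ~ plot + species, data = spp_long)
print(spp_wide)

# True-false presence vector for one species, ready to use as a model response
has_psme <- spp_wide[, "PSME"] > 0
print(has_psme)
```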


Packages in R

• Extensions are abundant, and generally well-tested by the time they go public.

• New and better stuff comes together all the time.
  – Last November: raster package improved handling of raster maps in R. Should lead to easier programming in the future.
  – MaxEnt has an implementation in R as well. Handy for comparison if you want.

• Don’t rest under the illusion that the package I shared is the latest and greatest.
  – Teaching from our own code is simplest.
  – It’s good enough to be of use, especially to help you over the first bit of the learning curve.
  – Better code may be out there already.


Workshop – getting the data in to R – 30 minutes

• Time to: Set up your own dataset, or work from our sample dataset
  – Database setup
    • Make sure you can find the ‘spp’, ‘env’, and ‘plot’ tables within your database.
    • Create (or locate) queries for spp and env that contain only plots within your area of interest.
  – Read spp (long form) and env database tables into R
    • Set up a database connection using ‘odbcConnectAccess’
    • Use ‘sqlFetch’ to read queries from the database.
  – Manipulate information into the proper format for modeling.
    • Use ‘xtabs’ to re-arrange species data into a crosstab table.
    • Create a true-false vector for a species of your choice.


BREAK


Model Building

randomForest package: randomForest R Documentation

Classification and Regression with Random Forest

Description:

    ‘randomForest’ implements Breiman's random forest algorithm (based
    on Breiman and Cutler's original Fortran code) for classification
    and regression. It can also be used in unsupervised mode for
    assessing proximities among data points.

Usage:

    ## S3 method for class 'formula'
    randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
    ## Default S3 method:
    randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
                 mtry=if (!is.null(y) && !is.factor(y))
                 max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
                 replace=TRUE, classwt=NULL, cutoff, strata,
                 sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
                 nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                 maxnodes = NULL,
                 importance=FALSE, localImp=FALSE, nPerm=1,
                 proximity, oob.prox=proximity,
                 norm.votes=TRUE, do.trace=FALSE,
                 keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
                 keep.inbag=FALSE, ...)
    ## S3 method for class 'randomForest'
    print(x, ...)
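A minimal sketch of the default (x, y) interface on synthetic presence-absence data. It assumes the randomForest package is installed; the environmental variables and the rule generating "presence" are made up so that the example is self-contained.

```r
library(randomForest)
set.seed(42)

# Made-up environmental data for 200 plots
n <- 200
env <- data.frame(elev   = runif(n, 0, 2000),
                  precip = runif(n, 200, 1500),
                  solrad = runif(n))

# Made-up response: a factor response makes randomForest do classification
present <- factor(env$elev > 1000 & env$precip > 600, levels = c(FALSE, TRUE))

rf <- randomForest(x = env, y = present, ntree = 300, importance = TRUE)
print(rf)              # includes the out-of-bag confusion matrix
print(importance(rf))  # variable importance, one row per predictor

# Out-of-bag share of trees voting "present" for each plot
oob_prob <- rf$votes[, "TRUE"]
```

Passing a true-false vector wrapped in factor() is one way to get the "TRUE" class label that the mapping slides refer to later.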


Why select variables?

• Occam’s razor
  – Simplest model possible = ‘good science’.

• Processing time
  – More variables mean a more complicated model, which means a longer run-time.

• However:
  – randomForest is relatively robust to collinearity, so this step isn’t as crucial as it is for other techniques.


Accuracy Assessment – 10 minutes

• Explain in abstract
• Illustrate objects
• vloo
• Output example

• Tim – Section #2


Types of errors

– False positives
– False negatives
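Both error types fall out of a confusion matrix. A base R sketch with made-up observations and predictions, which also computes Cohen's kappa by hand:

```r
# Made-up observed and predicted presence for 10 plots
obs  <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
pred <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

cm <- table(observed  = factor(obs,  levels = c(FALSE, TRUE)),
            predicted = factor(pred, levels = c(FALSE, TRUE)))
print(cm)

false_pos <- cm["FALSE", "TRUE"]  # predicted present, observed absent
false_neg <- cm["TRUE", "FALSE"]  # predicted absent, observed present

# Cohen's kappa: agreement beyond what the marginal totals would give by chance
n_total  <- sum(cm)
p_obs    <- sum(diag(cm)) / n_total
p_chance <- sum(rowSums(cm) * colSums(cm)) / n_total^2
kappa    <- (p_obs - p_chance) / (1 - p_chance)
print(kappa)
```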


Consequences of types of errors for rare species maps.

– False positives: wasted time surveying for possible occurrences, lost credibility

– False negatives: missed occurrences. If the map informs development plans and indicates no need for a survey, then individuals may be at risk.


Choosing how to balance errors

– AUC analysis
– Alpha and ROCR package
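A base R sketch of the two ideas, using made-up probabilities. AUC is computed from ranks (the Mann-Whitney form), and the cutoff is chosen by maximizing sensitivity + specificity - 1. That cutoff rule is an assumption for illustration; it is not necessarily what GetBestCutoff or the alpha weighting in the workshop code does, and the ROCR package offers its own functions for the same quantities.

```r
# Made-up predicted probabilities and observed presence
prob <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
obs  <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# AUC: chance that a random presence outranks a random absence
n_pres <- sum(obs)
n_abs  <- sum(!obs)
auc <- (sum(rank(prob)[obs]) - n_pres * (n_pres + 1) / 2) / (n_pres * n_abs)
print(auc)

# Try each observed probability as a cutoff ("present" if prob >= cutoff)
cutoffs <- sort(unique(prob))
sens <- sapply(cutoffs, function(ct) mean(prob[obs] >= ct))   # presences caught
spec <- sapply(cutoffs, function(ct) mean(prob[!obs] < ct))   # absences caught
best_cutoff <- cutoffs[which.max(sens + spec - 1)]
print(best_cutoff)
```

Weighting sensitivity more heavily than specificity would push the cutoff down, trading more false positives for fewer false negatives.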


R functions in SDMap

– GetBestCutoff
– rocplot
– PlotTherm
– cv.rf
– vloo


Workshop: Model building and accuracy assessment


Mapping

• R is not a GIS
  – although ESRI is now integrating more closely with R.

• R logistical constraints
  – R holds everything in active memory (RAM limited).
  – This limits how many cells you can process at once.
  – The new raster package, and mine, process row by row to get around this.
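The row-by-row idea, sketched in base R. Small in-memory matrices stand in for grids on disk, and a glm stands in for the random forest so the example needs no extra packages; the same loop shape applies with predict() on a randomForest object. All data are made up.

```r
set.seed(1)

# Two tiny made-up "grids" with identical dimensions
n_row <- 4
n_col <- 5
elev   <- matrix(runif(n_row * n_col, 0, 2000),   n_row, n_col)
precip <- matrix(runif(n_row * n_col, 200, 1500), n_row, n_col)

# Stand-in model fitted to made-up training points
train <- data.frame(elev = runif(100, 0, 2000), precip = runif(100, 200, 1500))
train$present <- as.integer(train$elev + rnorm(100, sd = 300) > 1000)
model <- glm(present ~ elev + precip, data = train, family = binomial)

# Predict one row at a time, so only one row of covariates is "in memory"
pred_map <- matrix(NA_real_, n_row, n_col)
for (i in seq_len(n_row)) {
  row_df <- data.frame(elev = elev[i, ], precip = precip[i, ])
  pred_map[i, ] <- predict(model, newdata = row_df, type = "response")
}
print(round(pred_map, 2))
```

With real grids, each pass through the loop would read one row from every input file and write one row to the output file.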


The Map_RF function

Map_rf <- function(rfobj, template_path, filepaths,
                   method = c("response", "class.prob"),
                   map.category = c(1), mname = "rfmap",
                   startrow = 0, endrow = NA, restart = F,
                   save.cycle = 60*60*6)


template_path

• Text file pointer to a masking file... Serves as the ‘template’ for the grids that you’ll create.

• Determines where you’re going to map.


filepaths

• A table in R that tells the mapping function where to find the grids that correspond to your variables


method = c("response","class.prob")

• In our case, we want probability predictions, so we’ll set method = “class.prob”

• The function automatically uses the first option listed, even if there isn’t a second one.
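A common R idiom that produces this "first option wins" behavior is match.arg(). Whether Map_rf uses it internally is an assumption; the toy function below is hypothetical and only demonstrates the idiom.

```r
# Hypothetical function with the same style of default as Map_rf's 'method'
demo_method <- function(method = c("response", "class.prob")) {
  method <- match.arg(method)  # no value supplied -> first listed option
  method
}

print(demo_method())              # falls back to the first option
print(demo_method("class.prob"))  # explicit choice
```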


map.category = c(1)

• Since we’re mapping one ‘probability’ layer, we’ll want to tell the function which class to map.

• For a binary RF model, this will either be “TRUE”, or it could be 1, depending on how you structured your input data.

• It’s possible to map multiple class probabilities too, if you’re mapping many categories, but that’s another mapping exercise.
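Class probabilities from a classification model come back as one column per class, named after the class labels, which is why the category has to match how the response was coded. A sketch with a made-up probability matrix; the variable names are illustrative, not Map_rf internals.

```r
# Made-up class probabilities for three cells; columns are named by class label
probs <- matrix(c(0.8, 0.2,
                  0.3, 0.7,
                  0.5, 0.5),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("FALSE", "TRUE")))

map_category <- "TRUE"  # would be "1" if the response had been coded 0/1
presence_prob <- probs[, map_category]
print(presence_prob)
```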


Workshop: Building a map


Reporting – 10 minutes

• Automated Accuracy Report generation – Tim – Code Section #3

• [ Pretty complex: uses Sweave and LaTeX, but now that it is set up, it is pretty slick. ]


Production work – 5 minutes talking, 5 minutes workshop

• Looping in R – looping through a spreadsheet
• Using logical flags
• Saving plots along the way
• Making directories automatically to help with organization
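A minimal sketch that ties these four points together. The run table stands in for a spreadsheet, the species names are made up, and the plot is a placeholder for a real accuracy figure; output goes under R's temporary directory.

```r
# Stand-in for a spreadsheet of model runs, with a logical flag per row
runs <- data.frame(species   = c("sp_a", "sp_b", "sp_c"),
                   run_model = c(TRUE, FALSE, TRUE),
                   stringsAsFactors = FALSE)

out_root <- file.path(tempdir(), "sdm_runs")

for (i in seq_len(nrow(runs))) {
  if (!runs$run_model[i]) next  # logical flag: skip rows not marked to run

  # One directory per species, created automatically
  sp_dir <- file.path(out_root, runs$species[i])
  dir.create(sp_dir, recursive = TRUE, showWarnings = FALSE)

  # Save a plot along the way (placeholder data)
  pdf(file.path(sp_dir, "accuracy.pdf"))
  plot(runif(10), main = runs$species[i])
  dev.off()
}

print(list.files(out_root, recursive = TRUE))
```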


Workshop

• Script and save a very basic report.