Species Habitat Modeling using randomForest in the R environment for statistical analysis

NatureServe Biodiversity Without Boundaries Conference
Emilie Henderson and Tim Howard

April 24, 2011, 1–4 PM Pacific Time

Introduction

What are Element Distribution Models?... Species Distribution Models?

• A prediction of the most suitable habitats for a species or natural community

• Based on known locations for that species AND known environmental conditions.

• Plot Data:
  – Forest Inventory and Analysis
  – Bureau of Land Management
  – USFS Current Vegetation Survey
  – USFS Ecology Plots
  – Element Occurrence Records (for rare types)
  – supplementary points from airphoto interpretation

• Spatial Data:
  – LANDSAT (bands and transformations)
  – Climate
  – Soil parent material
  – Elevation (derivatives)
  – Location

Statistical Model → Predictive Map of Species Habitats

Which Algorithm?

BIOCLIM
DOMAIN
MAXENT
GARP
Logistic Regression
Mahalanobis Distance
Classification and Regression Trees (CART)
Multivariate Adaptive Regression Splines (MARS)
Regression Tree Analysis
Bagging Trees
Random Forests **
Generalized Additive Models (GAM)
Neural Networks

See Elith et al. 2006, Prasad et al. 2006, Guisan and Zimmerman 2000

Random Forests

• Compare presence points with background points to create classification trees (CART) based on environmental layers.

• Randomly use a subset of the environmental layers and a subset of the points each time.

• Build hundreds of trees, compare results among trees.
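The three bullets above map directly onto randomForest's arguments. A minimal sketch, where 'env' (a data frame of environmental values at the points) and 'pres' (a presence/background factor) are hypothetical stand-ins:

```r
# Sketch: each split considers a random subset of predictors (mtry), each
# tree a random (bootstrap) subset of points, and hundreds of trees are
# grown. 'env' and 'pres' are hypothetical example objects.
library(randomForest)

rf.mod <- randomForest(x = env, y = pres,
                       ntree = 500,                    # hundreds of trees
                       mtry  = floor(sqrt(ncol(env))), # variables tried per split
                       replace = TRUE)                 # bootstrap the points
```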

Prasad et al. 2006

Breiman 2001

[Figure: the 300th RF tree for the Indiana bat. A single CART tree that starts at the root and splits on environmental variables (GDD ann, CaCO3, precip 7, Topoindex, solrad, ann min T, elev, % OM, precip 5, ...), ending in leaves labeled P = present or A = absent.]

A forest of trees

– using random subsets of both environmental variables and attributed points

Predict on the forest of trees

• Run each GIS 'cell' down all trees (repeated for each of 380 million cells).
• Example: if 2 of 3 trees predict 'present', the cell value = 67%.

P = present, A = absent

Interpretation of output

• Final output reports the probability that the environmental conditions match the conditions where the target element is known to occur.

• Both internal and external model validation metrics can be extracted, including confusion matrices, kappa and other statistics, ROC plots, and AUC.
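As a minimal sketch, the internal (out-of-bag) pieces can be pulled straight from a fitted model; 'rf.mod' is a hypothetical fitted classification forest:

```r
# Out-of-bag validation pieces from a fitted randomForest model.
library(randomForest)

rf.mod$confusion              # OOB confusion matrix, with per-class error
oob.pred <- predict(rf.mod)   # predict() with no new data returns OOB classes
mean(oob.pred == rf.mod$y, na.rm = TRUE)   # overall OOB accuracy
```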

Random Forest may find many solutions

Random Forest: What it can’t do.

• Build a perfect map.
• The methods we're using today aren't tuned for community-compositional analysis.

Random Forest: What it can do well

• Highlight locations that are worth visiting to look for species X

• Highlight locations where re-introduction of Species X might be worthwhile

Why R?

[Workflow diagram: database tables and attributed points, plus environmental layers, flow into R (randomForest package; RODBC package for data input/output; draft SDMap package for accuracy assessment and mapping), which produces a .tif file of predictions.]

Why R?

• Things R can do:
  – Interact with databases
  – Pull information from gridded spatial data layers
  – Build random forest models
  – Characterize accuracy of random forest models
  – Build graphics to describe model accuracy
  – Build spatial grids of model predictions
  – Automate workflow
  – Save figures and build reports

Why doesn't everyone use R?

[Figure: a steep learning curve plotted against time, annotated "AGH!!!", with Emilie and Tim marked at points along the curve.]

Goals for today:

– Pre-R data management
– Efficient data import and export to and from R
  • shapefile attribute tables, Access databases, .csv files
– Basic R navigation
  • manipulating vectors and data frames
– Build a randomForest model for presence-absence data
– Assess model accuracy
– Build a map from that model
– Script a loop for building many models and maps
– Automate accuracy assessment to go with mapping

What do you know about R?

Data management

A plot dataset (LANDFIRE Plot Reference Database)

Tables: PLOT, ENV, SPP

For other types of data (e.g., Element Occurrence Records), a different structure may be needed.

Queries

Tim – Element Occurrence Records – database Structure

• First third of Tim's script

Spatial Data

• Questions to ask before building spatial data for modeling:
  – What spatial resolution and extent are needed?
  – What types of information are likely to be useful?

• Types of spatial data that I use for modeling plants:
  • Soil
  • Climate
  • Topography
  • Imagery

• Things to pay attention to when building spatial data:
  – Consistency
    • Grids MUST be of identical spatial extent, and snapped to the same upper-left coordinate.
  – Projection
    • Projections must be identical among grids.

Organization of spatial data is key for efficient mapping workflow

Plot-spatial data intersection

– Sampling grain, modeling grain, and the intersection.

[Figure: numbered plot locations (1–6) overlaid on stacked spatial data layers: elevation, slope, aspect, temperature, precipitation, and Landsat spectral reflectance (band 5).]

• I’ve used Matt Gregory’s ‘footprint.exe’ program for extracting averages of multi-pixel plot footprints. (freely available)

This is a non-trivial endeavor

Footprint.exe

Example footprint definitions (a name, window dimensions and cell size, then a weight matrix; the 2 appears to mark the plot center):

R6 5 5 30.0
0 0 1 0 0
0 1 1 1 0
1 1 2 1 1
0 1 1 1 0
0 0 1 0 0

R6_ECO 3 3 30.0
1 1 1
1 2 1
1 1 1

Single 3 3 30.0
0 0 0
0 2 0
0 0 0

• a handy little command-line tool for multi-pixel plot-grid intersections.

• Handles multiple plot formats

Question: Polygon to point?

Equal # per poly-bias for small

Equal # per area-bias for large

New York Natural Heritage Program

Decision: logistic sampling

[Figure: logistic sampling scheme. The number of points to sample is a logistic function of the number of pixels in the polygon (x-axis 0–1400 pixels, y-axis 0–500 points): it follows the 1-to-1 line for small polygons and levels off at an asymptote for large ones.]
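The exact curve parameters are not legible from the figure, so this is only an illustrative sketch of a logistic sampling rule with an assumed asymptote and rate:

```r
# Illustrative logistic sampling rule: points to sample as a function of
# polygon size, leveling off at an asymptote. A (asymptote) and k (rate)
# are assumed values, not the ones used in the talk.
n.points <- function(n.pixels, A = 500, k = 0.01) {
  round(A / (1 + A * exp(-k * n.pixels)))
}

n.points(c(50, 400, 1400))   # few points for small polygons, near A for large
```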


Getting Started:R navigation, and data import

Objects in R

• Some basic object types:
  – Vector
    • Data types:
      – Numeric
      – Character
      – Factor: efficiently stores repetitive character vectors as numbers. Useful, but can produce bugs if handled improperly.
  – Data frame
    • Holds multiple vectors together; they can be of different formats.
  – List
    • Holds multiple objects of any type together.
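The factor pitfall mentioned above, in two lines (a standard base-R example):

```r
f <- factor(c("10", "20", "20", "30"))

as.numeric(f)                # 1 2 2 3  -- the internal codes, not the numbers
as.numeric(as.character(f))  # 10 20 20 30 -- convert via character first
```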

Getting Data into R

• Basic:
  – read.table
  – read.csv
  – read.dbf
    • (included in the 'foreign' library and also the 'shapefiles' library; each has a slightly different format)

• More useful:
  – RODBC can pull data from:
    • Access
    • Excel
    • SQL Server
    • ... and others
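A minimal sketch of the RODBC route into an Access database; the file name and the table/query names are placeholders:

```r
# Pull tables (or saved queries) from an Access database by name.
library(RODBC)

con <- odbcConnectAccess("plots.mdb")  # use odbcConnectAccess2007 for .accdb
spp <- sqlFetch(con, "spp")            # each fetch returns a data frame
env <- sqlFetch(con, "env")
odbcClose(con)
```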

Within-R manipulations of data

• Subsetting
  – based on indexes (i.e., how to pull out column 4)
  – based on true-false vectors (i.e., how to pull out plots that have been screened as 'OK')

• xtabs
  – taking a long species table and making it into a crosstab format
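Both manipulations, sketched with hypothetical column names ('screened', 'plot_id', 'species'):

```r
# Index-based subsetting: pull out column 4.
env[, 4]

# True-false subsetting: keep only plots screened as 'OK'.
env.ok <- env[env$screened == "OK", ]

# xtabs: long species table (one row per plot-species record) to a
# crosstab (plots as rows, species as columns, cells = counts).
spp.wide <- xtabs(~ plot_id + species, data = spp)
```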

Packages in R

• Extensions are abundant, and generally well-tested by the time they go public.
• New and better stuff comes together all the time.
  – Last November: the raster package improved handling of raster maps in R. Should lead to easier programming in the future.
  – MaxEnt has an implementation in R as well. Handy for comparison if you want.
• Don't rest under the illusion that the package I shared is the latest and greatest.
  – Teaching from our own code is simplest.
  – It's good enough to be of use, especially to help you over the first bit of the learning curve.
  – Better code may be out there already.

Workshop – getting the data into R – 30 minutes

• Time to set up your own dataset, or work from our sample dataset
  – Database setup
    • Make sure you can find the 'spp', 'env', and 'plot' tables within your database.
    • Create (or locate) queries for spp and env that contain only plots within your area of interest.
  – Read the spp (long form) and env database tables into R
    • Set up a database connection using 'odbcConnectAccess'.
    • Use 'sqlFetch' to read queries from the database.
  – Manipulate the information into the proper format for modeling
    • Use 'xtabs' to re-arrange species data into a crosstab table.
    • Create a true-false vector for a species of your choice.

BREAK

Model Building

• randomForest package documentation:

randomForest                                            R Documentation

Classification and Regression with Random Forest

Description:

     'randomForest' implements Breiman's random forest algorithm (based
     on Breiman and Cutler's original Fortran code) for classification
     and regression. It can also be used in unsupervised mode for
     assessing proximities among data points.

Usage:

     ## S3 method for class 'formula'
     randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
     ## Default S3 method:
     randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
                  mtry=if (!is.null(y) && !is.factor(y))
                      max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
                  replace=TRUE, classwt=NULL, cutoff, strata,
                  sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
                  nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                  maxnodes = NULL,
                  importance=FALSE, localImp=FALSE, nPerm=1,
                  proximity, oob.prox=proximity,
                  norm.votes=TRUE, do.trace=FALSE,
                  keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
                  keep.inbag=FALSE, ...)
     ## S3 method for class 'randomForest'
     print(x, ...)
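Putting the usage above to work for presence-absence data. 'dat' is a hypothetical data frame whose 'pres' column is a presence/absence factor and whose remaining columns are environmental variables:

```r
library(randomForest)

rf.mod <- randomForest(pres ~ ., data = dat,
                       ntree = 1000,        # more trees than the default 500
                       importance = TRUE,   # needed later for variable selection
                       na.action = na.omit) # the default na.fail stops on NAs
print(rf.mod)  # out-of-bag error estimate and confusion matrix
```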

Why select variables?

• Occam's razor
  – The simplest model possible = 'good science'.
• Processing time
  – More variables mean a more complicated model, which means a longer run-time.
• However:
  – randomForest is relatively robust to collinearity, so this step isn't as crucial as it is for other techniques.
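One common, if informal, approach is to fit a full model and keep only the top-ranked variables by importance; a sketch assuming 'env' and 'pres' objects as before:

```r
library(randomForest)

rf.full <- randomForest(x = env, y = pres, ntree = 500, importance = TRUE)
varImpPlot(rf.full)                    # visual ranking of the variables

imp  <- importance(rf.full, type = 1)  # mean decrease in accuracy
keep <- rownames(imp)[order(imp, decreasing = TRUE)][1:10]  # e.g., top 10

rf.small <- randomForest(x = env[, keep], y = pres, ntree = 500)
```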

Accuracy Assessment – 10 minutes

• Explain in the abstract
• Illustrate objects
• vloo
• Output example

• Tim – Section #2

Types of errors

– False positives
– False negatives

Consequences of types of errors for rare species maps.

– False positives: wasted time surveying for possible occurrences, lost credibility

– False negatives: missed occurrences; if the map informs development plans and indicates no need for a survey, then individuals may be at risk.

Choosing how to balance errors

– AUC analysis
– Alpha and the ROCR package
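A sketch with the ROCR package, using out-of-bag probabilities from a hypothetical fitted two-class model 'rf.mod':

```r
library(randomForest)
library(ROCR)

oob.p <- predict(rf.mod, type = "prob")[, 2]  # OOB probability of 'present'
pred  <- prediction(oob.p, rf.mod$y)

performance(pred, "auc")@y.values[[1]]        # area under the ROC curve

# One way to balance the two error types: the cutoff that maximizes
# sensitivity + specificity.
ss  <- performance(pred, "sens", "spec")
cut <- ss@alpha.values[[1]][which.max(ss@x.values[[1]] + ss@y.values[[1]])]
```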

R functions in SDMap

– GetBestCutoff
– rocplot
– PlotTherm
– cv.rf
– vloo

Workshop:Model building and accuracy assessment

Mapping

• R is not a GIS
  – although ESRI is now integrating more closely with R.
• R's logistical constraints
  – R holds everything in active memory (RAM-limited), which limits how many cells you can process at once.
  – The new raster package, and my code, process row by row to get around this.
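The row-by-row idea, sketched with the raster package mentioned above (NA handling is omitted, and the grid names must match the model's variable names; 'rf.mod' and 'layer.files' are hypothetical):

```r
library(raster)
library(randomForest)

s   <- stack(layer.files)   # 'layer.files': paths to the predictor grids
out <- raster(s)            # empty grid with the same extent and resolution
out <- writeStart(out, "rfmap.tif", overwrite = TRUE)

for (r in 1:nrow(s)) {
  vals <- as.data.frame(getValues(s, row = r))       # one row of every layer
  p    <- predict(rf.mod, vals, type = "prob")[, 2]  # probability of presence
  out  <- writeValues(out, p, r)                     # write that row to disk
}
out <- writeStop(out)       # the full map is never held in RAM at once
```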

The Map_RF function

Map_rf <- function(rfobj, template_path, filepaths,
                   method = c("response", "class.prob"),
                   map.category = c(1), mname = "rfmap",
                   startrow = 0, endrow = NA, restart = F,
                   save.cycle = 60*60*6)

template_path

• Text file pointer to a masking file... Serves as the ‘template’ for the grids that you’ll create.

• Determines where you’re going to map.

filepaths

• A table in R that tells the mapping function where to find the grids that correspond to your variables

method = c("response","class.prob")

• In our case, we want probability predictions, so we'll set method = "class.prob".
• If you don't specify, the function automatically uses the first option listed.

map.category = c(1)

• Since we’re mapping one ‘probability’ layer, we’ll want to tell the function which class to map.

• For a binary RF model, this will either be “TRUE”, or it could be 1, depending on how you structured your input data.

• It’s possible to map multiple class probabilities too, if you’re mapping many categories, but that’s another mapping exercise.
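Pulling the arguments together, an illustrative call to Map_rf (the object names and file paths are placeholders):

```r
rf.map <- Map_rf(rfobj         = rf.mod,         # fitted randomForest object
                 template_path = "template.txt", # pointer to the masking grid
                 filepaths     = layer.table,    # where each variable's grid lives
                 method        = "class.prob",   # probability predictions
                 map.category  = "TRUE",         # the class to map
                 mname         = "sppX_rfmap")   # output name
```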

Workshop: Building a map

Reporting – 10 minutes

• Automated Accuracy Report generation – Tim – Code Section #3

• [Pretty complex: it uses Sweave and LaTeX, but now that it is set up, it is pretty slick.]

Production work – 5 minutes talking, 5 minutes workshop

• Looping in R: looping through a spreadsheet
• Using logical flags
• Saving plots along the way
• Making directories automatically to help with organization
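A sketch of that production loop; the spreadsheet column names ('run', 'species') are hypothetical:

```r
jobs <- read.csv("species_list.csv")

for (i in seq_len(nrow(jobs))) {
  if (!jobs$run[i]) next    # logical flag: skip species not marked to run
  sp  <- jobs$species[i]
  dir <- file.path("output", sp)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)

  # ... build the model and map for species 'sp' here ...

  png(file.path(dir, "accuracy.png"))  # save plots along the way
  # plot(...)
  dev.off()
}
```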

Workshop

• Script and save a very basic report.