Hands-on Soil Infrared Spectroscopy Training Course
Getting the best out of light
11 – 15 November 2013
R package “randomForest”
Erick Towett
Welcome
Outline
• Introduction
• Usage
  • Total element composition of African soils using total X-ray fluorescence (TXRF).
  • Combining MIR and TXRF for the prediction of soil properties.
  • MIRS random forest prediction models for soil properties.
• Demo: application of RF to MIRS calibration.
Introduction I
• “randomForest” (RF) implements Breiman’s random forest algorithm for classification and regression, based on a forest of trees using random inputs.
• Version: 4.6-7
• Depends: R (>= 2.5.0)
• URL: http://stat-www.berkeley.edu/users/breiman/RandomForests
• Reference: Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
Features of Random Forests
• RF is fast and easy to implement, and produces highly accurate predictions.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion, and is resistant to overfitting as more trees are added.
• It gives estimates of variable importance in the classification.
• RF handles complex data types well.
• It obviates the need to transform predictors to approximate normal distributions.
Features of RF
What are the challenges of RF?
X There are many possible alternative nodes;
X reseeding will give different models.
How does RF work?
• The out-of-bag (OOB) error estimate
  In RF, each tree is constructed using a different bootstrap sample from the original data.
  About 1/3 of the cases are left out of the bootstrap sample and are not used in the construction of the kth tree.
  These left-out (OOB) cases are used to get a running unbiased estimate of the classification error as trees are added to the forest.
  They are also used to get estimates of variable importance.
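The "about 1/3 left out" figure can be checked with a short simulation. This is a minimal sketch in plain Python, not the randomForest implementation; the function name `oob_fraction` and the sizes used are illustrative assumptions.

```python
import random

def oob_fraction(n_cases, n_trees, seed=0):
    """Average share of cases missing from a bootstrap sample of size n_cases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # One bootstrap sample per tree: n_cases draws with replacement.
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees

# Hypothetical sizes chosen only for the illustration.
fraction = oob_fraction(n_cases=1000, n_trees=200)
# Probability that a given case is never drawn: (1 - 1/n)^n, close to exp(-1).
expected = (1 - 1 / 1000) ** 1000
print(f"simulated out-of-bag fraction: {fraction:.3f}, expected: {expected:.3f}")
```

The simulated share settles near 0.37, which is the "roughly one third" of cases available to each tree as an independent check.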
How does RF work?
• RF can output a list of the predictor variables that are important in predicting the outcome.
• The randomForest package in R has two measures of importance.
  One is the "total decrease in node impurities from splitting on the variable, averaged over all trees."
  The other is based on a permutation test.
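The permutation idea can be sketched in a few lines: shuffle one predictor and see how much the prediction error grows. The sketch below is plain Python and assumes a hand-written stand-in model instead of a fitted forest; randomForest applies the same idea to the out-of-bag cases of each tree. All names and data here are hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of a model over a list of predictor rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in MSE after shuffling the values of one predictor column."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - mse(model, rows, targets)

# Synthetic data: the outcome depends on column 0 only; column 1 is noise.
rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(500)]
targets = [3.0 * r[0] + rng.gauss(0, 0.1) for r in rows]

def model(row):
    # Stand-in for a fitted model (assumption): it uses column 0 and ignores column 1.
    return 3.0 * row[0]

importance = [permutation_importance(model, rows, targets, c) for c in range(2)]
print(importance)
```

Shuffling the informative predictor raises the error, while shuffling the ignored one leaves it unchanged, which is how the measure ranks variables.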
Usage
Study 1: Variability and patterns in total element composition of sub-Saharan Africa (SSA) soils using TXRF.
The objectives were to:
1. quantify the variability in total element composition of a diverse set of soils from across SSA using TXRF, and
2. explore the patterns in total element composition of the soils analysed.
Materials and Methods
• Soils from 34 randomly located 100-km2 sentinel sites across Africa.
[Figure labels: consistent field protocol; soil spectroscopy; sentinel sites; randomized sampling schemes]
Land degradation surveillance framework (LDSF)
• LDSF = a hierarchical, spatially stratified random sampling scheme with ten 100 m2 plots nested within sixteen 1 km2 clusters, nested within 100 km2 sites.
Materials and Methods
• Soil samples were collected at two depths, 0-20 and 20-50 cm.
• A total of 1074 samples (16 samples per cluster x 2 soil depths x 34 sentinel sites) were used for exploring spectral (TXRF) patterns.
• Total element concentrations were obtained for 17 elements:
  Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta, and Pb.
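As a quick arithmetic check, the three design factors quoted on the slide can be multiplied out. Taken at face value they give 1088, slightly more than the 1074 samples reported. The slides do not explain the gap, so treating it as samples that were unavailable is an assumption made only for this sketch.

```python
# Design factors exactly as quoted on the slide.
samples_factor = 16   # "16 samples per cluster" as written on the slide
soil_depths = 2       # 0-20 cm and 20-50 cm
sentinel_sites = 34

nominal_samples = samples_factor * soil_depths * sentinel_sites
reported_samples = 1074  # total stated on the slide

# Difference between the nominal design and the reported total
# (cause not stated in the source; assumed to be unavailable samples).
shortfall = nominal_samples - reported_samples
print(nominal_samples, reported_samples, shortfall)
```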
Materials and Methods
• PCA on the TXRF data.
• RF regression of the first 5 PCs of the TXRF element concentrations against site factors:
  • to confirm whether site or soil-forming factors (e.g., mineralogy, climate, topography and vegetation) are important drivers of total element concentrations in the soil;
  • to view the importance of the predictor variables.
• Site factors were extracted for each site from the LDSF database, Worldclim data, and mineralogy data from XRD analysis:
  • raw semi-quantitative mineralogy data and dominant mineralogy grouping.
Results
• Total element concentrations were within the range reported globally for soil Cr, Mn, Zn, Ni, V, Sr, and Y, and in the high range for Al, Cu, Ta, Pb, and Ga.
Table: values compiled from this study (mean, range) and reported means and ranges of background contents of elements in crust and worldwide soils (all in mg kg-1).
Element | Mean | Range | Worldwide range | Crustal average | Worldwide mean | Median, Ghana soil
Al | 33927 | 94 - 89068 | 10000-40000 | - | - | -
P | 143 | 25 - 2358 | - | - | - | -
K | 10893 | 291 - 77898 | - | - | - | -
Ca | 9780 | 82 - 426431 | - | - | - | -
Ti | 4264 | 2.6 - 25611 | 200-24000 | 4400 | - | -
V | 37 | 0.7 - 393 | 5.0-500 | 135 | 60 | -
Cr | 64 | 0.7 - 598 | 1-1500 | 100 | 42 | 72
Mn | 466 | 1.6 - 6575 | <7->9000 | 900 | 418 | -
Fe | 27954 | 20 - 181691 | 1000-550000 | - | - | -
Ni | 19 | 0.3 - 364 | 0.2-500 | 20 | 18 | 39
Cu | 17 | 0.3 - 114 | 1.0-250 | 55 | 14 | 17-29
Zn | 29 | 0.3 - 138 | 10-602 | 70 | 62 | 45-47
Ga | 8 | 0.2 - 31 | 0.4-70 | 15 | 1.2 | -
Sr | 118 | 1.2 - 1985 | 32->1000 | 375 | 147 | -
Y | 13 | 0.2 - 109 | 16-33 | 33 | 12 | -
Ta | 3 | 0.1 - 16 | 0.8-5.3 | 2.0 | 1.1 | -
Pb | 37 | 0.3 - 638 | 2.0-16338 | 14 | 25 | 18-22
Results
• Significant variations (P < 0.05) in total element composition within and between the sites for the 17 elements analysed.
• The greatest proportion of total variance and the largest number of significant variance components occurred at the site level (55-88%), followed by the cluster nested within site level (10-40%).
Table: variance component estimates (Est.) and percentage of total variance (%Tot var) by element.
Element | n | Site Est. | Site %Tot var | Site*Cluster Est. | Site*Cluster %Tot var | Site*Depth Est. | Site*Depth %Tot var | Depth Est. | Depth %Tot var | Residual Est. | Residual %Tot var
Al | 1068 | 0.966 | 88 | 0.112 | 10 | 0.004 | 0.4 | 0.005 | 0.45 | 0.016 | 1.4
P | 1059 | 0.718 | 76 | 0.198 | 21 | 0.002 | 0.2 | 1.4 x 10^-21 | <0.01 | 0.025 | 2.6
K | 1065 | 0.913 | 71 | 0.354 | 28 | 0.003 | 0.2 | 6.8 x 10^-21 | <0.01 | 0.010 | 0.8
Ca | 1068 | 2.186 | 79 | 0.480 | 17 | 0.034 | 1.2 | 0.017 | 0.60 | 0.051 | 1.8
Ti | 1067 | 1.398 | 87 | 0.199 | 12 | 0.001 | 0.1 | 0.001 | 0.04 | 0.014 | 0.9
V | 1067 | 1.463 | 77 | 0.379 | 20 | 0.009 | 0.5 | 0.008 | 0.39 | 0.053 | 2.8
Cr | 1068 | 0.808 | 65 | 0.384 | 31 | 0.005 | 0.4 | 0.006 | 0.46 | 0.039 | 3.2
Mn | 1067 | 1.007 | 68 | 0.393 | 27 | 0.023 | 1.6 | 0.008 | 0.51 | 0.040 | 2.7
Fe | 1066 | 1.459 | 80 | 0.335 | 18 | 0.005 | 0.3 | 0.009 | 0.47 | 0.026 | 1.4
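The "%Tot var" figures appear to be each variance component divided by the sum of all components for that element. A minimal sketch, using the Al estimates from the table and assuming that definition (the slides do not state the formula):

```python
# Variance component estimates for Al, taken from the table.
al_components = {
    "Site": 0.966,
    "Site*Cluster": 0.112,
    "Site*Depth": 0.004,
    "Depth": 0.005,
    "Residual": 0.016,
}

# Assumed definition: share of each component in the summed variance.
total_variance = sum(al_components.values())
percent_of_total = {
    level: 100 * estimate / total_variance
    for level, estimate in al_components.items()
}

for level, share in percent_of_total.items():
    print(f"{level}: {share:.2f}%")
```

Under this assumption the site share for Al rounds to the 88% shown in the table; small differences in the other rows are consistent with the estimates having been rounded before tabulation.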
Results
• PCA revealed that patterns in total element concentrations between sites appeared to relate to differences in mineralogical ‘functional groups’.
• The pattern of clustering of the individual minerals and sorting of heavy minerals (V, Pb, Ni, Cr, Cu, Ti, and Fe) along the positive Dim 1 axis is apparent.
Figure: Biplots (arrow sizes are proportional to the “initial” variability in the elements present) of principal components Dim 1 vs Dim 2 and Dim 1 vs Dim 3, based on the log-transformed soil total element concentrations from all sites analysed.
Results
• The strong within-site and between-site variations observed in many elements can serve to diagnose soil fertility potential.
• Elements clustered differently in the sample sets from different sentinel sites, indicating a wide variation in associations.
• Some elements are poorly represented (short arrows in the PCA).
Figure: Biplots based on PCA of element concentrations for 4 sentinel sites.
Results
• RF model performances were acceptable, with R2 ≥ 0.75.
• The most important variables were cluster, topography, land use, precipitation and temperature.
• The importance of cluster is explained by spatial correlation at distances of < 1 km.
Figure: Variable importance plots showing the model accuracies and mean decrease in accuracy (%IncMSE) of the Random Forests regression of TXRF element concentrations against mineralogy + site/soil-forming factors, (a) including cluster and (b) excluding cluster.
(a) Dim 1 (R2=0.92, rmsep=0.47); Dim 2 (R2=0.84, rmsep=0.40); Dim 3 (R2=0.79, rmsep=0.37)
(b) Dim 1 (R2=0.90, rmsep=0.52); Dim 2 (R2=0.80, rmsep=0.51); Dim 3 (R2=0.75, rmsep=0.41)
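For readers unfamiliar with the two accuracy statistics quoted for each panel, the sketch below computes a root mean squared error of prediction and an R2 from observed and predicted values. The slides do not say which R2 definition was used; the coefficient-of-determination form here is an assumption, and the numbers are made up for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observed and out-of-bag predicted scores.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

model_rmse = rmse(observed, predicted)
model_r2 = r_squared(observed, predicted)
print(f"R2={model_r2:.2f}, rmsep={model_rmse:.2f}")
```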
Usage
Study 2: Potential of combining MIR and TXRF spectroscopy for the prediction of soil properties
Objective: to evaluate whether TXRF can complement MIR for predicting soil test values, especially for tests that are poorly predicted by MIR (e.g. extractable P and K, and some micronutrients).
Materials and Methods
• Georeferenced soil samples associated with the AfSIS Project.
• A total of 700 soil samples from 44 random 100-km2 sentinel sites distributed across SSA, stratified according to Köppen-Geiger climatic zones.
• Samples were analysed using an MIR spectrometer.
Materials and Methods
Fourier-Transform MIR spectrometer
• Infrared absorbance spectra were recorded at 4 cm-1 intervals in the range of 400 to 4000 cm-1.
• The average of the spectra for 4 replicates was taken.
TXRF spectrometer
• TXRF methodology was used for total element concentrations in each soil sample.
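The recording settings above imply a fixed wavenumber grid and a point-by-point mean over replicates. A minimal sketch, assuming an idealised grid that starts exactly at 400 cm-1 (a real instrument's grid may be offset) and using made-up absorbance values:

```python
# Idealised grid: 400 to 4000 cm-1 in steps of 4 cm-1 (assumption).
wavenumbers = list(range(400, 4001, 4))

def average_replicates(replicates):
    """Point-by-point mean of several spectra recorded on the same grid."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]

# Four hypothetical replicate spectra with flat absorbance 0.10 to 0.13.
replicates = [[0.10 + 0.01 * r] * len(wavenumbers) for r in range(4)]
mean_spectrum = average_replicates(replicates)

print(len(wavenumbers), mean_spectrum[0])
```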
Materials and Methods
• RF-OOB calibration models were developed (n = 700) to predict the reference properties from the TXRF total element composition, using the raw total element concentration data as ‘spectra’.
• Raw TXRF spectra were used in conjunction with 1st derivative MIR spectra to predict the reference soil properties.
• RF was used to calibrate the residuals of the predictions from the MIR spectral data against the raw TXRF total element data, as mixing different data types in the predictor variables might affect the variable importance weights in the fitted models.
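The residual approach in the last bullet can be sketched as two stages: fit the soil property on the MIR data, then fit what is left over on the TXRF data. To keep the sketch self-contained, a one-variable least-squares line stands in for the random forest at both stages, and the data are synthetic; neither is the study's method.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares line, used here as a stand-in for an RF model."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda value: intercept + slope * value

def rmse(observed, predicted):
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5

# Synthetic data: one "MIR" feature, one "TXRF" feature, one soil property.
rng = random.Random(0)
mir = [rng.uniform(0, 1) for _ in range(200)]
txrf = [rng.uniform(0, 1) for _ in range(200)]
soil = [2.0 * m + 1.5 * t + rng.gauss(0, 0.05) for m, t in zip(mir, txrf)]

# Stage 1: predict the soil property from MIR alone.
stage1 = fit_line(mir, soil)
pred_mir = [stage1(m) for m in mir]

# Stage 2: model the stage-1 residuals with the TXRF feature.
residuals = [y - p for y, p in zip(soil, pred_mir)]
stage2 = fit_line(txrf, residuals)
pred_combined = [p + stage2(t) for p, t in zip(pred_mir, txrf)]

rmse_mir = rmse(soil, pred_mir)
rmse_combined = rmse(soil, pred_combined)
reduction_percent = 100 * (1 - rmse_combined / rmse_mir)
print(f"rmse MIR only: {rmse_mir:.3f}, MIR + TXRF residual model: {rmse_combined:.3f}")
```

Keeping the two data types in separate stages means the importance weights of each model refer to one kind of predictor only, and `reduction_percent` is the kind of "reduction in rmse" figure reported in the results.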
Results
• MIR spectra resulted in very good prediction models using RF out-of-bag validation (R2 > 0.80) for organic C and N, total C and N, exchangeable Ca, Mehlich-3 Al and pH.
• Also predicted well (R2 > 0.60) were Ca/Mg ratio, exchangeable bases, exchangeable Mg, phosphorus sorption index (PSI), and the sand, clay and silt contents of water- and calgon-dispersed particles analysed by laser diffraction.
Results
• Calibration models were not satisfactory (R2 < 0.60) for Mehlich-3 extractable K, Mn, Fe, Cu, B, Zn, P, S, and Na, exchangeable acidity, electrical conductivity (ECd), exchangeable sodium percentage (ESP), exchangeable sodium ratio (ESR), and the silt, clay and sand contents of air-dispersed particles.
Results
• RF was able to improve prediction accuracies when the raw TXRF spectra were added to the MIR data,
  e.g. ECd (63% reduction in rmse), Mehlich-3 S (54%), exchangeable Na (53%), ESP (50%), ESR (50%), total C (29%), Mehlich-3 B (28%), Mehlich-3 Mn (26%), exchangeable Mg (17%), Mehlich-3 Cu (15%), Mehlich-3 Fe (11%), organic C (10%), Mehlich-3 Zn (6%), and silt content (8-50 microns) of air-dispersed particles by laser diffraction (4%).
• The improvement in the predictions was mostly due to TXRF detecting a few outlier samples that were different from the rest of the samples.
• TXRF data used as a predictor did not add value to MIR beyond identifying outlying samples; these could not be detected as MIR spectral outliers, hence TXRF may be used as an outlier detector.
Usage
Study 3: Analysis of MIRS random forest prediction models for soil properties.
• This ongoing study attempts to offer an in-depth analysis of random forest models for the prediction of a number of soil properties using MIR spectroscopy.
Materials and Methods
• 1907 soil samples were scanned with an MIR spectrometer at a resolution of 4 cm-1.
• The 1st derivative of the spectral range 601.7-4001.6 cm-1 was calculated with a smoothing interval of 21 data points using the soil.spec package in R.
• RF-OOB models were built to predict the reference properties from the MIRS 1st derivative spectra using the entire data set.
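The pre-processing step can be pictured as smoothing followed by differentiation. The sketch below uses a 21-point moving average and a central difference in plain Python; this is a stand-in chosen for clarity, not the algorithm used by soil.spec, whose smoothing method the slides do not describe. The test spectrum is hypothetical.

```python
def moving_average(values, window=21):
    """Smooth with a centred moving average; windows are truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def first_derivative(wavenumbers, absorbance):
    """Central-difference derivative at every interior point of the spectrum."""
    return [
        (absorbance[i + 1] - absorbance[i - 1]) / (wavenumbers[i + 1] - wavenumbers[i - 1])
        for i in range(1, len(absorbance) - 1)
    ]

# Hypothetical spectrum: absorbance rising linearly with wavenumber.
wavenumbers = [600 + 4 * i for i in range(100)]
absorbance = [0.001 * w for w in wavenumbers]

smoothed = moving_average(absorbance, window=21)
derivative = first_derivative(wavenumbers, smoothed)
print(derivative[49])
```

Away from the edges, where the smoothing window is complete, the derivative of this linear test spectrum equals its slope, which gives a simple sanity check before applying the same steps to real spectra.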
Preliminary Results
Demo: R package “randomForest”
Thank you for your attention