iNEXT: an R package for interpolation and extrapolation species diversity

Rarefaction and prediction methods for the number of species

謝宗震 (Johnson)

About me

· Institute of Statistics, National Tsing Hua University
· Taiwan R User Group Officer (https://www.facebook.com/Tw.R.User)
· Data Science Program organizing committee member (http://datasci.co/)
· Research areas: statistics, ecology and genetics
· R-related work:
  - R packages: CARE1 [lead author] (http://cran.r-project.org/web/packages/CARE1/index.html), iNEXT [lead author] (http://johnsonhsieh.github.io/iNEXT/), ChaoEntropy (http://johnsonhsieh.github.io/ChaoEntropy), ChaoSpecies (http://johnsonhsieh.github.io/ChaoSpecies)
  - Shiny apps: iNEXT-Online [lead author] (http://glimmer.rstudio.com/tchsieh/inext/), LoL Champion (http://glimmer.rstudio.com/wush978/LOLChampion/)

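On the number of species in a collection

1. A botanist wants to know how many tree species a forest contains

2. A programmer wants to know how many kinds of bugs a piece of software contains

3. A public health agency wants to know how many kinds of epidemics have occurred in a region

4. A literary scholar wants to know how many distinct words a master of classical literature knew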

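Example 1: How many words did Cao Xueqin know?

Take Cao Xueqin, author of the classic novel Dream of the Red Chamber (紅樓夢), as an example. Using the vocabulary data from the first 80 chapters of the novel (李蕙帆 2008), how many distinct words did Cao Xueqin know?

i      1    2    3    4    5    6    7    8    9   10   >10
f_i  743  394  245  190  144  127  115  104   90   81  1182

Sampling yields the frequency counts $f_i$, $i = 1, 2, \ldots, n$ (the number of species observed exactly $i$ times in the sample) and the observed number of species $S_{obs}$, which satisfy

$$S = \sum_{i=1}^{n} f_i + f_0 = S_{obs} + f_0,$$

where $f_0$ is the number of species not seen in the sample. Professor Anne Chao (Chao 1984) derived the estimator

$$\hat{S} = S_{obs} + \hat{f}_0 = S_{obs} + \frac{f_1^2}{2 f_2}.$$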
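As a minimal sketch of how frequency counts are obtained in practice, the following R code tabulates $f_i$ from raw observations. The `tokens` vector is a hypothetical toy sample invented for illustration; it is not the Dream of the Red Chamber data.

```r
# Hypothetical toy sample: each element is one observed individual (e.g. one word token)
tokens <- c("a", "b", "a", "c", "d", "a", "b", "e", "f", "f", "f", "g")

# Abundance of each observed species
x <- table(tokens)

# Frequency counts: f[i] is the number of species observed exactly i times
f <- tabulate(as.vector(x), nbins = max(x))

Sobs <- length(x)   # observed number of species
n <- sum(x)         # sample size

# Sobs equals the sum of the f_i, and n equals the sum of i * f_i
stopifnot(Sobs == sum(f), n == sum(seq_along(f) * f))

f
```

Only `f[1]` (singletons) and `f[2]` (doubletons) are needed for the Chao (1984) estimate.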


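Example 1: How many words did Cao Xueqin know? (continued)

# Frequency counts: column 1 is i (row 11 pools all i > 10), column 2 is f_i
tab <- cbind(1:11, c(743, 394, 245, 190, 144, 127, 115, 104, 90, 81, 1182))
# Observed number of species
Sobs <- sum(tab[, 2])
# Singletons and doubletons
f1 <- tab[1, 2]
f2 <- tab[2, 2]
# Chao (1984) estimate of the number of unseen species
f0.hat <- f1^2/2/f2
round(cbind(f0.hat, f1, f2, Sobs, Shat = Sobs + f0.hat))
##      f0.hat  f1  f2 Sobs Shat
## [1,]    701 743 394 3415 4116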


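Relationship between sampling and the number of species

a. The number of species observed depends on the sample size

b. Does additional sampling effort yield a corresponding return?

c. How large a sample is needed to roughly represent the population?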
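Point (a) can be illustrated with a small simulation. This is only a sketch under assumed settings: the community of 200 species, its uneven relative abundances, the seed and the sample sizes are all hypothetical.

```r
set.seed(1)

# Hypothetical community: 200 species with strongly uneven relative abundances
S <- 200
p <- 0.97^(1:S)
p <- p / sum(p)

# One sample of 2000 individuals; each element is the species label of one individual
draws <- sample(S, size = 2000, replace = TRUE, prob = p)

# Number of distinct species seen among the first m individuals
m <- c(50, 100, 200, 500, 1000, 2000)
S_seen <- sapply(m, function(k) length(unique(draws[1:k])))

cbind(m, S_seen)
```

Because each prefix of the sample contains the previous one, `S_seen` can never decrease as `m` grows. How quickly it levels off is what questions (b) and (c) ask about.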


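Statistical method: rarefaction and prediction of the number of species

A rarefaction and extrapolation curve of the number of species against sample size describes $S(m)$, the number of species that appear in the data when the sample size is $m$.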

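Statistical method: rarefaction and prediction of the number of species (continued)

Expected value of the rarefaction and prediction function (Good 1953):

$$S(m) = \sum_{i=1}^{S} \left[ 1 - (1 - p_i)^m \right]$$

Statisticians (Smith and Grassle 1977, Shen et al. 2003) obtained the estimators

$$\hat{S}(m) = S_{obs} - \sum_{x_i > 0} \frac{\binom{n - x_i}{m}}{\binom{n}{m}} \quad \text{if } m \le n$$

$$\hat{S}(m) = S_{obs} + \hat{f}_0 \left[ 1 - \left( 1 - \frac{2 f_2}{n f_1} \right)^{m - n} \right] \quad \text{if } m > n$$

Here $p_i$ is the relative abundance of species $i$ in the population, $x_i$ is the number of individuals of species $i$ in the sample, and $n$ is the size of the reference sample.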
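The two estimators can be sketched in a few lines of base R. This is an illustrative implementation under stated assumptions, not the code used inside iNEXT: the function name `S_hat` and the abundance vector `x` are hypothetical, the extrapolation branch uses exponent `m - n`, and it assumes `f1 > 0` and `f2 > 0`.

```r
# Hypothetical abundance data: x[i] is the number of individuals of species i
x <- c(50, 30, 10, 5, 2, 2, 1, 1, 1)

S_hat <- function(m, x) {
  x <- x[x > 0]
  n <- sum(x)            # reference sample size
  Sobs <- length(x)      # observed number of species
  f1 <- sum(x == 1)      # singletons
  f2 <- sum(x == 2)      # doubletons
  if (m <= n) {
    # Rarefaction: choose(n - x_i, m) / choose(n, m), computed on the log scale
    Sobs - sum(exp(lchoose(n - x, m) - lchoose(n, m)))
  } else {
    # Extrapolation based on the Chao (1984) estimate of unseen species
    f0.hat <- f1^2 / (2 * f2)
    Sobs + f0.hat * (1 - (1 - 2 * f2 / (n * f1))^(m - n))
  }
}

# Curve evaluated at a few sample sizes (the reference sample has n = 102)
sapply(c(1, 50, 102, 200, 1000), S_hat, x = x)
```

At `m = n` the rarefaction branch returns the observed number of species, and for very large `m` the extrapolation branch approaches `Sobs + f0.hat`. Working with `lchoose` avoids overflow in the binomial coefficients when `n` is large.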


The R package iNEXT

R package iNEXT = method of iNTerpolation and EXTrapolation curve

# Install the development version from GitHub
install.packages("devtools")
library(devtools)
install_github("JohnsonHsieh/iNEXT")
library(iNEXT)

https://github.com/JohnsonHsieh/iNEXT

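Example 2: Infectious disease surveillance data

This example uses the 2013 infectious disease surveillance data provided by the Centers for Disease Control, Ministry of Health and Welfare (http://www.cdc.gov.tw/). The cumulative confirmed cases of notifiable infectious diseases in Taiwan for weeks 1-20 serve as the reference sample for a rarefaction and prediction analysis of the number of infectious diseases.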

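Example 2: Infectious disease surveillance data (continued)

library(iNEXT)
# Cumulative confirmed cases, one column per week (Big5-encoded CSV)
dat <- read.csv(url("http://dl.dropboxusercontent.com/u/26949459/exmaple2.csv",
                    encoding = "big5"), row.names = 1)
# Observed number of diseases and total confirmed cases for each week
Sobs <- colSums(dat > 0)
n <- colSums(dat)
# Rarefaction/extrapolation from the week-20 reference sample up to 15000 cases
out <- iNEXT(dat$week20, datatype = "abundance", end = 15000)
par(lwd = 2, pch = 19, cex = 1.3, family = "STHeiti")
# ylab: "number of infectious diseases"; xlab: "number of confirmed cases"
plot.iNEXT(out, main = "Rarefaction/extrapolation at week20", ylab = "傳染病數目",
           xlab = "確認病患數目")
# Overlay the observed (n, Sobs) pair of each week as red crosses
points(n, Sobs, col = 2, pch = 4, cex = 1.5, lwd = 2)
text(n, Sobs, colnames(dat), col = 2, pos = 1, cex = 1)
out$summary
##     n S.obs S.hat  C.hat f1 f2 f3 f4 f5 f6 f7 f8 f9 f10
##  6210    35  35.9 0.9995  3  5  1  2  1  2  4  0  0   0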
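As a quick check, the `S.hat` value in the summary agrees with the Chao (1984) formula applied to the `S.obs`, `f1` and `f2` columns. The numbers below are copied from the output; that iNEXT computes `S.hat` exactly this way internally is an assumption.

```r
# Values copied from the out$summary output
S.obs <- 35
f1 <- 3
f2 <- 5

# Chao (1984): Sobs + f1^2 / (2 * f2)
S.hat <- S.obs + f1^2 / (2 * f2)
S.hat
# 35.9
```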


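Example 2: Infectious disease surveillance data (continued)

· The rarefaction and prediction curve based on the weeks 1-20 reference sample (black line) follows a trend similar to the actually observed data (red crosses)

· The curve accurately predicts the result for week 30 and slightly underestimates week 44

· When the sample size increases from about 6000 to 15000, the predicted number of additional infectious diseases is only 0.89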
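The 0.89 figure can be roughly reproduced by hand. This sketch assumes the Chao (1984) based extrapolation formula for m > n, with exponent m - n, and plugs in the week-20 values n = 6210, f1 = 3 and f2 = 5 reported by `out$summary`; it does not call iNEXT.

```r
# Week-20 reference sample values, as reported by out$summary
n <- 6210
f1 <- 3
f2 <- 5
m <- 15000

# Estimated number of undetected diseases (Chao 1984)
f0.hat <- f1^2 / (2 * f2)

# Expected number of additional diseases when the sample grows from n to m
increase <- f0.hat * (1 - (1 - 2 * f2 / (n * f1))^(m - n))
round(increase, 2)
# 0.89
```

The increase is bounded above by `f0.hat = 0.9`, so the reference sample already contains nearly all the diseases the model expects to find.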

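Example 3: League of Legends match data

· Data come from the online database 英雄聯盟戰績網 (http://loltw.gamebase.com.tw/): records of the champions that summoners used in the games they won

· Packaged with the R package iNEXT (https://github.com/JohnsonHsieh/iNEXT) and Shiny (http://www.rstudio.com/shiny/) as an online application: League of Legends champion pool depth analysis (英雄聯盟口袋深度分析, http://glimmer.rstudio.com/wush978/LOLChampion/)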

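Conclusion

a. Data analyses from three different fields illustrate the importance of analyzing the number of species

b. The estimators have a simple mathematical form, so they remain efficient when applied to Big Data

c. Derivation details omitted from the statistical methods can be found in the references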

References

1. Chao, A. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11:265-270.

2. Colwell, R. K., A. Chao, N. J. Gotelli, S. Y. Lin, C. X. Mao, R. L. Chazdon, and J. T. Longino. 2012. Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. Journal of Plant Ecology 5:3-21.

3. Hsieh, T. C., K. H. Ma, and A. Chao. 2013. iNEXT online: interpolation and extrapolation (Version 1.3.0) [Software]. Available from http://chao.stat.nthu.edu.tw/blog/software-download/.

4. Hsieh, T. C., K. H. Ma, and A. Chao. 2013. iNEXT: an R package for interpolation and extrapolation species diversity. http://johnsonhsieh.github.io/iNEXT/

5. Ramnath V. 2012. slidify: Generate reproducible html5 slides from R markdown. http://ramnathv.github.com/slidify/

6. Taiwan R User Group. 2013. R topic - estimation and prediction of richness. Programmer magazine 12:48-53. http://programmermagazine.github.io/201312/htm/article6.html
