20130325 MLDM Monday spideR

20130325 MLDM Monday

An Arsenal for Writing spideRs in R

by c3h3

TW useR Group & MLDM Monday

● http://www.meetup.com/Taiwan-useR-Group/
● http://www.facebook.com/TaiwanUseRGroup/
● http://www.youtube.com/user/TWuseRGroup/
● http://tw.use-r.net/

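About the Speaker

● Chia-Chi Chang (c3h3)
● Chief Data Scientist at InnovoTECH
● Co-founder of TW useR Group / MLDM Monday
● Avid user of R, Python, and Maple
● Enjoys analyzing all kinds of data and trading financial products; also enjoys reading about mathematical theories, models, and their applications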

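Outline

● Prerequisites for writing a spideR
● A few small spideR examples
● The architecture of a spideR
● The workflow for writing a spideR
● A few spideR tips

This talk is aimed at beginners, so experts, please bear with us!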

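Prerequisites

Prerequisites for writing a spideR

● What is a website?
● How is a website structured?
● The secrets of URLs
● What kinds of data does a website serve?
● Tools for analysis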

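What is a website?

A website as ordinary users see it

A website as designers see it

A website as engineers see it

So... what does a website look like to a spideR?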

How is a website structured?

Website structure (classifications)

● Front end vs. back end
● Model + View + Controller (MVC)
● Static vs. dynamic (Ajax)

The MVC structure

Static vs. dynamic (Ajax)

● Examples:
● [Ajax] http://shop.myer.com.au/shop/mystore/973607510
● [Static] http://tw.stock.yahoo.com/d/s/major_2451.html

The secrets of URLs?

The secrets of URLs

● URL?var_1=val_1&var_2=val_2...
  ○ It is really just like calling a function
  ○ The parameter names and values can be found in the page's form or in its JS code
  ○ http://finance.yahoo.com/q/hp?s=%5ETWII&a=06&b=2&c=1997&d=02&e=24&f=2013&g=d
  ○ http://www.taifex.com.tw/eng/eng3/eng3_2dl.asp?COMMODITY_ID=all&DATA_DATE=2012/11/01&DATA_DATE1=2012/11/15
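The "URL as a function call" idea can be sketched in base R. This is a minimal illustration, not code from the talk: `build_url` is a hypothetical helper, and the parameter names are simply copied from the Yahoo Finance URL on this slide without assuming what each one means.

```r
# Build a URL from a base address and a named list of query parameters,
# much like passing named arguments to a function.
# build_url is a hypothetical helper written for this illustration.
build_url <- function(base, params) {
  # URLencode() escapes characters such as "^" (which becomes %5E)
  values <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(base, "?", query)
}

url <- build_url(
  "http://finance.yahoo.com/q/hp",
  list(s = "^TWII", a = "06", b = 2, c = 1997, d = "02", e = 24, f = 2013, g = "d")
)
print(url)
```

Changing one entry of the list, for example `s`, yields the URL for a different query without touching the rest of the code.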

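The secrets of URLs

● Some URLs follow a pattern
  ○ Some sites encode information in the URL path itself
  ○ The back-end URL dispatcher then parses it
● An example of a URL pattern:
  ○ http://tw.stock.yahoo.com/d/s/major_2451.html
  ○ URL pattern: major_StockID.html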
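Once the pattern is known, a page URL can be generated for every stock ID. A minimal sketch, assuming the pattern holds for other IDs; `major_url` is a hypothetical helper and "2330" is only an illustrative second ID.

```r
# Fill the StockID slot of the URL pattern major_StockID.html.
# major_url is a hypothetical helper; "2330" is an illustrative ID.
major_url <- function(stock_id) {
  sprintf("http://tw.stock.yahoo.com/d/s/major_%s.html", stock_id)
}

stock_ids <- c("2451", "2330")
urls <- major_url(stock_ids)  # sprintf is vectorised over the IDs
print(urls)
```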

What kinds of data does a website serve?

● Page (HTML)
● Data (JSON/XML...)
● File

Website data: Data (JSON/XML...)

Website data: File

Commonly used tools

● Google Chrome
  ○ Developer Tools
● Firefox
  ○ Firebug
  ○ Hackbar
  ○ Cookie Manager+
● cURL
● Wireshark

A Few Small Examples

[Example1] Fetching stock IDs:

Techniques used

● Example1_Extract_TWSE_Stock_IDs.R
● R
  ○ XML::htmlParse
  ○ XML::readHTMLTable
  ○ charToRaw
  ○ gsub
● Reference:
  ○ [Shared-notes blog] How to remove " "
  ○ Lecture notes on regular expressions in R
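The four functions listed above might fit together roughly as follows. This sketch does not reproduce Example1_Extract_TWSE_Stock_IDs.R: the inline HTML is a made-up stand-in for a downloaded page, and it assumes the stray character to be removed is a non-breaking space (`&nbsp;`), which the slide does not state.

```r
library(XML)

# Made-up stand-in for a downloaded page; the real TWSE page differs.
html <- '<html><body><table>
<tr><th>ID</th><th>Name</th></tr>
<tr><td>2451&nbsp;</td><td>Foo</td></tr>
<tr><td>2330&nbsp;</td><td>Bar</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
ids_raw <- tables[[1]]$ID

# charToRaw shows the underlying bytes, which helps spot invisible characters
print(charToRaw(ids_raw[1]))

# Assumption: the unwanted character is U+00A0 (non-breaking space)
ids <- gsub("\u00a0", "", ids_raw, fixed = TRUE)
ids <- trimws(ids)
print(ids)
```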

[Example2] Fetching major investors' buy/sell data:

Techniques used

● Example2_Extract_Stock_Major_Data_Fom_Kimo.R
● R
  ○ XML::htmlParse
  ○ XML::readHTMLTable

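Homework:

● Combine the previous two examples:
  ○ Fetch the IDs for all stock codes
  ○ Fetch the OTC data
    ■ Hint: OTC_IDs
  ○ Give the data table for each ID its own name
    ■ Hint 1: have the function return the data table
    ■ Hint 2: you can also use the assign function
  ○ Add a new column to each data table to store the ID ===> build a master table
  ○ Add a new column to each data table to store the date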
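One possible shape for the naming and master-table steps is sketched below. It is not a solution from the talk: `fetch_major` is a hypothetical stub that returns fixed data in place of a real scraper, and the column names are invented.

```r
# Hypothetical stub standing in for the scraper from Example2;
# it returns the same made-up table for every ID.
fetch_major <- function(stock_id) {
  data.frame(broker = c("A", "B"), net = c(10, -5), stringsAsFactors = FALSE)
}

stock_ids <- c("2451", "2330")
fetch_date <- as.Date("2013-03-25")

for (id in stock_ids) {
  tbl <- fetch_major(id)             # Hint 1: the function returns a table
  tbl$id <- id                       # new column storing the ID
  tbl$date <- fetch_date             # new column storing the date
  assign(paste0("major_", id), tbl)  # Hint 2: one variable name per ID
}

# Look each table up by name with get() and stack them into a master table
all_major <- do.call(
  rbind,
  lapply(stock_ids, function(id) get(paste0("major_", id)))
)
print(all_major)
```

Because every row carries its own ID and date, the master table can be filtered or appended to later without relying on variable names.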

[Example3] Fetching the 0050 IDs:

Techniques used

● Example3_Extract_0050_IDs.R
● R
  ○ XML::htmlParse
  ○ XPath parser in XML
● Reference:
  ○ http://www.w3.org/TR/xpath/
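XPath lets the spideR pick out exactly the nodes that hold the IDs. A minimal sketch on made-up HTML; the table id, class name, and links are invented for illustration and do not describe the real page.

```r
library(XML)

# Made-up page fragment; ids, classes and hrefs are illustrative only.
html <- '<html><body>
<table id="constituents">
<tr><td class="id"><a href="/q?s=2330">2330</a></td><td>Foo</td></tr>
<tr><td class="id"><a href="/q?s=2451">2451</a></td><td>Bar</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Text of every <a> inside an id cell of the chosen table
ids <- xpathSApply(doc, "//table[@id='constituents']//td[@class='id']/a", xmlValue)

# The href attribute of the same links
links <- xpathSApply(doc, "//td[@class='id']/a", xmlGetAttr, "href")

print(ids)
print(links)
```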

[Example4] Using the IDs with quantmod:

Techniques used

● Example4_Get_Stock_Data_From_YahooFinance.R
● R
  ○ quantmod::getSymbols
  ○ quantmod::chartSeries
  ○ get
  ○ assign
● Reference:
  ○ Quantmod Web
  ○ Quantmod Slide
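A rough sketch of how scraped IDs could be passed to quantmod. It assumes that Yahoo Finance symbols for Taiwan-listed stocks are the stock ID followed by ".TW"; `to_symbol` and `fetch_prices` are hypothetical helpers, and `fetch_prices` needs network access, so it is only defined here, not run.

```r
# Assumption: Yahoo Finance symbol = stock ID + ".TW"
to_symbol <- function(stock_id) paste0(stock_id, ".TW")

# Hypothetical helper: download each series and store it under its own name.
# Requires the quantmod package and network access when called.
fetch_prices <- function(stock_ids, from = "2013-01-01", envir = globalenv()) {
  for (id in stock_ids) {
    prices <- quantmod::getSymbols(
      to_symbol(id), src = "yahoo", from = from, auto.assign = FALSE
    )
    assign(paste0("stock_", id), prices, envir = envir)
  }
  invisible(paste0("stock_", stock_ids))
}

print(to_symbol(c("2451", "0050")))
```

After calling `fetch_prices("2451")`, the series could be plotted with `quantmod::chartSeries(get("stock_2451"))`.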

[Example5] What if you find the back-end JSON?

Homework:

● Try using the rjson package in R to process the page above.
● Reference:
  ○ rjson: http://cran.r-project.org/web/packages/rjson/rjson.pdf
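As a starting point for the homework, the sketch below parses a JSON string with `rjson::fromJSON` and flattens it into a data frame. The JSON text is invented; a real back end will use different field names.

```r
# Invented JSON standing in for a back-end response
json_text <- '{"items": [{"id": "2451", "price": 30.5}, {"id": "2330", "price": 100.0}]}'

parsed <- rjson::fromJSON(json_text)  # JSON objects become named lists

# Turn each item into a one-row data frame, then stack the rows
rows <- lapply(parsed$items, function(x) {
  data.frame(id = x$id, price = x$price, stringsAsFactors = FALSE)
})
items <- do.call(rbind, rows)
print(items)
```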

[Example6] When you need to download a file

Techniques used

● Example6_Download_CSV_File_From_TWSE.R
● R
  ○ RCurl::getURL
  ○ file
    ■ writeLines
    ■ readLines
  ○ textConnection
  ○ read.table
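The file-handling functions above can be combined as in this offline sketch. `csv_text` is made-up content standing in for the string that `RCurl::getURL(url)` would return; the real TWSE file has a different layout.

```r
# Made-up CSV text standing in for the result of RCurl::getURL(url)
csv_text <- "id,name,close\n2451,Foo,30.5\n2330,Bar,100.0"

# Save the raw text to disk through a file connection
csv_path <- tempfile(fileext = ".csv")
out <- file(csv_path, open = "w")
writeLines(csv_text, out)
close(out)

# Read it back line by line
csv_lines <- readLines(csv_path)

# Feed the lines to read.table through a text connection
tc <- textConnection(csv_lines)
prices <- read.table(
  tc, header = TRUE, sep = ",",
  colClasses = c("character", "character", "numeric")
)
close(tc)
print(prices)
```

Reading the ID column as character keeps leading zeros in IDs such as 0050.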

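Homework:

● Continuing from the example above......
  ○ Use apply to parse each line into fields
  ○ Use the field count (length) to drop unwanted rows
  ○ Combine the remaining rows with do.call(rbind, data_list)
  ○ Then turn the result into a data frame and save it to an RData file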
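The homework steps could look roughly like this. The sketch uses `lapply` and `vapply` from the apply family, and the input lines, column names, and expected field count of 3 are all invented for illustration.

```r
# Invented raw lines: a title row and an empty row surround the real data
raw_lines <- c("Report title", "2451,Foo,30.5", "2330,Bar,100.0", "")

# Parse every line into its fields
data_list <- lapply(raw_lines, function(line) strsplit(line, ",", fixed = TRUE)[[1]])

# Keep only lines with the expected number of fields (3 is assumed here)
keep <- vapply(data_list, length, integer(1)) == 3

# Stack the surviving rows into a character matrix
data_mat <- do.call(rbind, data_list[keep])

# Build a data frame with proper column types
prices <- data.frame(
  id = data_mat[, 1],
  name = data_mat[, 2],
  close = as.numeric(data_mat[, 3]),
  stringsAsFactors = FALSE
)

# Save to an RData file
rdata_path <- file.path(tempdir(), "prices.RData")
save(prices, file = rdata_path)
print(prices)
```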

[Example7] Learning to write code by reading code

[Example7] Downloading a zip file

Techniques used

● Example7_Download_ZIP_File_From_Taifex.R
● R
  ○ download.file
  ○ unzip
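The two functions above are typically chained as below. `fetch_and_unzip` is a hypothetical helper, and since the Taifex zip URL is not given on the slide, it is defined but not called.

```r
# Hypothetical helper: download a zip archive and extract it.
# Needs network access when called, so it is only defined here.
fetch_and_unzip <- function(zip_url, dest_dir = tempdir()) {
  zip_path <- file.path(dest_dir, basename(zip_url))
  # mode = "wb" keeps the binary archive intact on Windows
  download.file(zip_url, destfile = zip_path, mode = "wb")
  # unzip() returns the paths of the extracted files
  unzip(zip_path, exdir = dest_dir)
}
```

A call such as `files <- fetch_and_unzip(zip_url)` would then give the extracted file paths, ready for `readLines`.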

[Example8] When a cookie is required

Techniques used

● Example8_Download_CSV_File_From_Taifex_With_Cookie.R
● R
  ○ RCurl::getCurlHandle
  ○ RCurl::getURL(url, curl = curlHandle)
  ○ XML::htmlParse
  ○ XML::xmlAttrs
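A sketch of how these pieces may fit together: one curl handle is reused so that cookies set by the first request are sent with the second, and `xmlAttrs` reads form fields from the page. `fetch_with_cookies` is a hypothetical helper that is defined but not run, and the inline form is invented, not copied from Taifex.

```r
library(XML)

# Hypothetical helper; needs the RCurl package and network access when called.
fetch_with_cookies <- function(landing_url, data_url) {
  # cookiefile = "" switches on curl's in-memory cookie engine
  handle <- RCurl::getCurlHandle(cookiefile = "", followlocation = TRUE)
  RCurl::getURL(landing_url, curl = handle)  # the server sets its cookies here
  RCurl::getURL(data_url, curl = handle)     # the same handle sends them back
}

# Invented form standing in for the landing page
page <- '<html><body><form action="/download" method="post">
<input type="hidden" name="token" value="abc123"/>
<input type="text" name="date" value=""/>
</form></body></html>'

doc <- htmlParse(page, asText = TRUE)
inputs <- getNodeSet(doc, "//form//input")
attrs <- lapply(inputs, xmlAttrs)  # one named character vector per <input>
hidden <- Filter(function(a) a[["type"]] == "hidden", attrs)
print(hidden)
```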

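Homework:

● Continuing from the example above......
  ○ Practice using readLines to read the rpt file extracted by unzip
  ○ Then convert the rpt data into the xts format that quantmod can analyze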

The architecture of a spideR

● Web Connector
  ○ RCurl
● Data Parser (Cleaner)
  ○ XML
● Data Center
  ○ RData file
  ○ DB (SQLite, MySQL, PostgreSQL, MongoDB, Redis, ........)
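The three layers can be sketched as three small functions. This is only one possible arrangement, not the speaker's implementation: `connect`, `parse_page`, and `store` are hypothetical names, the `offline_html` argument exists only so the sketch runs without a network, and an RData file serves as the Data Center.

```r
library(XML)

# Web Connector: returns raw HTML text.
# offline_html is a hypothetical shortcut that skips the network.
connect <- function(url, offline_html = NULL) {
  if (!is.null(offline_html)) return(offline_html)
  RCurl::getURL(url)
}

# Data Parser (Cleaner): turns HTML into a data frame
parse_page <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]
}

# Data Center: here simply an RData file
store <- function(data, path) {
  save(data, file = path)
  invisible(path)
}

# Made-up page content for the demonstration
sample_html <- "<table><tr><th>id</th><th>close</th></tr><tr><td>2451</td><td>30.5</td></tr></table>"

page <- connect("http://example.com/", offline_html = sample_html)
prices <- parse_page(page)
rdata_path <- store(prices, file.path(tempdir(), "spider_demo.RData"))
print(prices)
```

Keeping the layers separate means a database back end can replace `store` without changing the connector or the parser.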

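The workflow for writing a spideR

● Define the goal
● Inspect the web pages
● Classify the pages
● Implement a Connector for each class of page
● Implement a Parser for each class of page
● Compare against the database, then store and retrieve the data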

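A Few Tips

Tip 1 for finding the "back end": monitoring

Tip 2 for finding the "back end": look for the form

Tip 1 for finding the "data": reveal hidden elements

Tip 2 for finding the "data": use jQuery

Tip 3 for finding the "data": use the JS debugger;

Tip 4 for finding the "data": disable JS (before disabling)

Tip 4 for finding the "data": disable JS (after disabling: the recommended products disappear)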

Thank you, everyone
