
Using R in Big Data Analysis

Jinseog Kim, Dongguk University


2017.10.19



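What Is Big Data?

- A collection of large volumes of structured and unstructured data
- The technologies used to store such data and to extract and analyze valuable information from it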



Big Data Problems

Collecting big data
- Web logs
- Web scraping
- Sensors, ...

Storing big data
- Databases (Oracle, MySQL, ...)
- Hadoop (HDFS)

Processing and analyzing big data
- MapReduce (Hadoop)
- Spark
- TensorFlow (deep learning)



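Limitations of R for Big Data Processing

- R loads all data into memory, so large datasets are hard to process.
- R computes serially, so even data small enough to process is handled slowly.

⇒

- Use R's parallel processing modules (parallel, doParallel, ...).
- Connect R to systems built for efficient big data storage and processing: Hadoop, Spark, TensorFlow, etc.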
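A minimal sketch of the first remedy, using the parallel package that ships with R. The toy task, the input sizes, and the worker count of 2 are assumptions chosen only for illustration; they are not from the lecture.

```r
library(parallel)

# Toy CPU-bound task (illustrative only): sum of squares from 1 to n.
sum_squares <- function(n) sum(as.numeric(seq_len(n))^2)

inputs <- c(1e4, 2e4, 3e4, 4e4)

# Serial baseline.
serial <- lapply(inputs, sum_squares)

# Parallel version: start 2 worker processes (an assumed count; detectCores()
# reports what the machine offers), distribute the inputs across them, and
# always release the workers afterwards.
cl <- makeCluster(2)
par_result <- tryCatch(
  parLapply(cl, inputs, sum_squares),
  finally = stopCluster(cl)
)

# Both approaches return the same values.
identical(serial, par_result)
```

For a task this small the cost of starting workers outweighs the gain; the pattern pays off when each call is expensive.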



Apache Hadoop

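- 2005: Doug Cutting and Mike Cafarella (Yahoo!; the Nutch web crawler project)
- 2008: Promoted to a top-level project of the Apache Software Foundation

- HDFS (Hadoop Distributed File System): distributed storage across a cluster
- MapReduce: distributed processing (computation)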



Hadoop - HDFS

Derived from the Google File System (2003)



Hadoop - MapReduce

- MapReduce: Simplified Data Processing on Large Clusters (2004)
- Map + Reduce
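The Map + Reduce idea can be sketched on a single machine with a word count. This is only an illustration of the programming model in plain R, not Hadoop code; the input chunks and the helper names map_step and reduce_step are made up for the example.

```r
# Input split into chunks, as if each chunk lived on a different node.
chunks <- list(
  "big data big cluster",
  "data cluster data"
)

# Map: each chunk independently produces its own per-word counts.
map_step <- function(chunk) {
  words <- strsplit(chunk, " ", fixed = TRUE)[[1]]
  table(words)
}

# Reduce: merge two partial count tables by summing the counts per word.
reduce_step <- function(a, b) {
  keys <- union(names(a), names(b))
  merged <- setNames(numeric(length(keys)), keys)
  merged[names(a)] <- merged[names(a)] + as.numeric(a)
  merged[names(b)] <- merged[names(b)] + as.numeric(b)
  merged
}

mapped <- lapply(chunks, map_step)
word_counts <- Reduce(reduce_step, mapped)
word_counts[order(names(word_counts))]
```

Because each map call touches only its own chunk, the map phase can run on many nodes at once; only the reduce phase needs to bring partial results together.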



Hadoop 2



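Problems with Hadoop MapReduce

- Heavy load and slow execution: data is shared between processing steps through the hard disk.
- MapReduce programs are hard to write, and the code is complex.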



In-memory distributed processing

Hard disk ⇒ memory

Distributed processing: RDD (Resilient Distributed Dataset)
- Immutable
- Distributed storage



Apache Spark

- Apache Spark (UC Berkeley AMPLab, 2009)
- Up to about 100 times faster than MapReduce (for in-memory workloads)
- Supports SQL, machine learning, and graph processing
- Supports Java, Python, Scala, and R
- Environments: standalone, Hadoop, ...
- Data sources: HDFS, Cassandra, HBase, etc.



History of Apache Spark

- 2009: UC Berkeley AMPLab
- 2011: High-level components developed
- 2013/2014: Transferred to Apache, then promoted to a top-level project
- 2016: Version 2.0



Spark Libraries

- Interfaces for Java, Scala, Python, and R
- Spark SQL: SQL and structured data processing
- MLlib: machine learning
- GraphX: graph processing
- Spark Streaming: real-time data processing



Spark Data format

- RDD (Resilient Distributed Dataset): the general-purpose data format
- DataFrame: a structured, table-like data format built on RDDs
  - Operations can be optimized
  - Supports SQL operations



Companies & Organizations

- Amazon, eBay Inc., IBM Almaden
- Baidu, Tencent
- Samsung Research America, SK Telecom, Kakao, SK C&C
- See https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark




SparkR: The R API for Spark

Spark operations can be run from R code in a distributed computing environment.



Running SparkR: Command Line

Run SparkR from the command line:

$ cd $SPARK_INSTALL_HOME
$ ./bin/sparkR



SparkR in RStudio

# Reset environment for local
Sys.unsetenv("SPARKR_SUBMIT_ARGS")

# Load the SparkR library
SPARK_HOME <- "/usr/local/spark"
library(SparkR, lib.loc = file.path(SPARK_HOME, "R", "lib"))

# Spark connect (using all cores)
sc <- sparkR.init(master = "local[*]", sparkHome = SPARK_HOME,
                  sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")

# Create the Spark SQL context
sqlContext <- sparkRSQL.init(sc)
# ...

# stop Spark
sparkR.stop()



The sparklyr Package

An R package from RStudio (rstudio.com) that is easier to use than SparkR: an R API for Spark plus dplyr.



dplyr: a set of functions for handling R data.frames

Function          Description                               Similar function
copy_to()         Convert a data.frame to another source
filter()          Extract rows matching a given condition   subset()
select()          Extract columns                           data[, c("V1", "V2")]
mutate()          Add columns                               transform()
arrange()         Sort                                      order(), sort()
summarise()       Aggregate                                 aggregate()
group_by()        Aggregate by group
tbl_df()          Convert a data.frame to a tbl
as_data_frame()   Convert to a data.frame                   as.data.frame()
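To make the "similar function" column concrete, here is a small sketch that runs the base R counterparts on the built-in mtcars data. The dataset and the derived column kpl are choices made for this illustration only.

```r
cars <- mtcars

# filter()  ~ subset(): keep rows matching a condition
four_cyl <- subset(cars, cyl == 4)

# select()  ~ column indexing: keep named columns
mpg_wt <- cars[, c("mpg", "wt")]

# mutate()  ~ transform(): add a derived column (approximate km per liter)
cars2 <- transform(cars, kpl = mpg * 0.425)

# arrange() ~ order(): sort rows by a column
by_mpg <- cars[order(cars$mpg), ]

# group_by() + summarise() ~ aggregate(): mean mpg per cylinder count
mean_mpg <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
mean_mpg
```

The dplyr verbs express the same steps but can be chained with %>% and, through sparklyr, applied to Spark DataFrames as well as local data.frames.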



sparklyr

Task                    sparklyr function
Start Spark             spark_connect
Stop Spark              spark_disconnect
Convert to DataFrame    dplyr::copy_to
Load files              spark_read_csv
                        spark_read_json
Save files              spark_write_csv
                        spark_write_json



Getting Started with sparklyr

# Create the Spark context (using all cores)
# Sys.setenv(HADOOP_CONF_DIR = "/usr/local/hadoop/etc/hadoop")
Sys.setenv(SPARK_HOME = "/usr/local/spark")
library(sparklyr, lib.loc = "/usr/lib/R/library/")
system.time({
  sc <- spark_connect(master = "local[*]", version = "1.6.1")
})
##    user  system elapsed
##   0.101   0.080  43.272

# spark on yarn-client
# sc <- spark_connect(master = "yarn-client", version = "1.6.1")

# stop Spark
# spark_disconnect(sc)



Create a DataFrame

library(dplyr)
library(nycflights13)
flights_tbl <- copy_to(sc, flights, overwrite = TRUE)  # from data.frame
src_tbls(sc)  # show all tables
## [1] "flights" "iris"

# from JSON or CSV file
json.path <- file.path(Sys.getenv("SPARK_HOME"),
                       "examples/src/main/resources/people.json")
peopleDF <- spark_read_json(sc, "people", path = json.path, overwrite = TRUE)



Example Data: New York Flights Data

- A data.frame of departure and arrival records for flights leaving New York (USA) in 2013
- Records: 336,776; variables: 19

Variable                          Description
year, month, day                  Date of departure
dep_time, arr_time                Actual departure and arrival times
sched_dep_time, sched_arr_time    Scheduled departure and arrival times
dep_delay, arr_delay              Departure and arrival delays, in minutes
carrier                           Carrier (airline) abbreviation
tailnum                           Plane tail number
flight                            Flight number
origin, dest                      Origin and destination
air_time                          Amount of time spent in the air
distance                          Distance flown
time_hour                         Scheduled date and hour of the flight



Basic operations (dplyr)

library(nycflights13)
distinct(flights, origin, dest)      # data.frame
distinct(flights_tbl, origin, dest)  # DataFrame

# Basic statistics for grouped data by piping (%>%)
f.tmp <- flights_tbl %>% dplyr::filter(dep_delay == 2)
head(f.tmp, 3)
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin  dest air_time distance
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl>   <chr>   <chr>  <int>  <chr> <chr>    <dbl>    <dbl>
## 1  2013     1     1      517         2      830        11      UA  N14228   1545    EWR   IAH      227     1400
## 2  2013     1     1      542         2      923        33      AA  N619AA   1141    JFK   MIA      160     1089
## 3  2013     1     1      702         2     1058        44      B6  N779JB    671    JFK   LAX      381     2475



Basic operations (dplyr)

delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance),
            delay = mean(arr_delay)) %>%
  dplyr::filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

head(delay, 3)
##   tailnum count      dist     delay
##     <chr> <dbl>     <dbl>     <dbl>
## 1  N53442   108 1643.0370 0.5740741
## 2  N789SW    30  967.3667 9.9655172
## 3  N947DL    76  917.1974 4.0000000



Run Spark-MLlib using sparklyr

Spark MLlib                   sparklyr function
Decision Trees                ml_decision_tree
Gradient-Boosted Trees        ml_gradient_boosted_trees
Linear Regression             ml_linear_regression
Logistic Regression           ml_logistic_regression
Multilayer Perceptron         ml_multilayer_perceptron
Naive-Bayes                   ml_naive_bayes
PCA                           ml_pca
Random Forests                ml_random_forest
Matrix factorization          ml_als_factorization
K-Means Clustering            ml_kmeans
Latent Dirichlet Allocation   ml_lda
Survival Regression           ml_survival_regression



Regression analysis (sparklyr) - (1)

# train/test set
partitions <- iris_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 12345)

# lasso penalty
fit <- partitions$training %>%
  ml_linear_regression(
    response = "Sepal_Length",
    features = c("Sepal_Width", "Petal_Length", "Petal_Width"),
    lambda = 1
  )




# Model summary
summary(fit)
## Deviance Residuals::
##     Min      1Q  Median      3Q     Max
## -0.9212 -0.2935 -0.0428  0.2331  1.3318

## Coefficients:
##                Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   5.0813694  0.2741410 18.5356 < 2.2e-16 ***
## Sepal_Width  -0.0061736  0.0786830 -0.0785    0.9376
## Petal_Length  0.1217865  0.0200973  6.0598 2.878e-08 ***
## Petal_Width   0.2440564  0.0453206  5.3851 5.417e-07 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## R-Squared: 0.6361
## Root Mean Squared Error: 0.4683




# Prediction
pred <- sdf_predict(fit, partitions$test)
head(pred)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species prediction
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>      <dbl>
## 1          4.5         2.3          1.3         0.3     setosa   5.298710
## 2          4.6         3.1          1.5         0.2     setosa   5.293722
## 3          4.6         3.4          1.4         0.3     setosa   5.304097
## 4          4.6         3.6          1.0         0.2     setosa   5.229742
## 5          4.8         3.4          1.9         0.2     setosa   5.340585
## 6          4.9         2.4          3.3         1.0 versicolor   5.712505
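For comparison, the same partition, fit, and predict flow can be tried without a Spark connection on the local iris data. This is a rough base R analogue under stated assumptions, not the sparklyr pipeline: it uses an exact 70/30 split and an unpenalized lm() fit (no lambda), and the local column names use dots rather than underscores, so its numbers will not match the Spark output above.

```r
set.seed(12345)

# 70/30 random split of the local iris data (same weights as the Spark example).
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
training <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Ordinary least squares fit; no penalty term is applied here.
fit_local <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                data = training)

# Predict on the held-out rows and compute the test RMSE.
pred_local <- predict(fit_local, newdata = test)
rmse <- sqrt(mean((test$Sepal.Length - pred_local)^2))
rmse
```

Running a local version first is a cheap way to check the model formula before sending the same steps to a cluster.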



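Case Study 1: Credit Card Company Web Log and ARS History Analysis

- Goal: reduce call center agent consultation costs
- Consultation cost = call center agent labor cost + telecommunication cost

Figure 1: Customer consultation paths in the call center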




Big Data System

Figure 2: Architecture of the data storage and analysis system



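Analysis Data

1. Web log data
   - Customer ID
   - Access time
   - web page 1 -> web page 2 -> ...
2. ARS data
   - Customer ID, access time, ARS tag 1 -> ARS tag 2 -> ...
3. Agent connection data
   - Customer ID, access time, agent memo text
4. Customer information data
   - Customer ID, gender, age, postal code, foreigner status, occupation/position
   - Subscribed product information, membership duration, grade
   - Card service usage information (lump-sum payment, ...)
   - Credit / payment / delinquency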




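Customer segment                      Members (thousands)
All customers                         17,033
Customers in good standing            6,275
Internet customers                    4,805
Web-using customers                   991
Web-using customers over 2 months     1,314
Sessions over 2 months                5,300
Page views (PV)                       35,000
Sessions per customer                 4
PVs per session                       6.6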
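The last two rows are ratios of the counts listed before them. The short check below reproduces them, assuming (as the table layout suggests but does not state) that the page-view count covers the same two-month window as the session count.

```r
# Counts from the table above (in thousands).
web_customers_2m <- 1314
sessions_2m      <- 5300
page_views       <- 35000

sessions_per_customer <- sessions_2m / web_customers_2m  # about 4.03
pv_per_session        <- page_views / sessions_2m        # about 6.60

round(sessions_per_customer)
round(pv_per_session, 1)
```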




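1. Use web log analysis to identify how web pages (services) are used, and in what patterns, by customer characteristics
   - Number of web-using customers and their share of all customers
   - Page views (number of customer visits per page)
   - Distribution of visits (sessions) per web-using customer
   - Distribution of session length (number of pages visited in one session)
   - Distribution of total session time (time elapsed from login to session end)
   - Exposure time per page (time spent on a specific page)
2. Use ARS history analysis to identify ARS usage by customer characteristics (or type)
3. Analyze the association between web services and inbound ARS calls
   - Analysis of ARS service types
   - Analysis of ARS usage patterns by the characteristics of web-using customers
4. Analyze the association between web services and agent connections
   - Identify the consultation (service) types with many agent connections
   - Classify consultation types using text mining
   - Analyze ARS and agent consultation types/patterns by customer characteristics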
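Several of the web log measures in item 1 can be computed with a few lines of R once the log has one row per page view. The sketch below uses a tiny made-up log; the column names (customer_id, session_id, page, time_sec) and all values are hypothetical and are not taken from the card company's data.

```r
# Made-up log rows: one row per page view.
weblog <- data.frame(
  customer_id = c("A", "A", "A", "A", "B", "B"),
  session_id  = c(1, 1, 1, 2, 3, 3),
  page        = c("home", "bill", "limit", "home", "home", "points"),
  time_sec    = c(0, 30, 90, 0, 0, 45)  # seconds since the session started
)

# Session length: number of pages visited in each session.
session_length <- tapply(weblog$page, weblog$session_id, length)

# Total session time: last timestamp minus first timestamp within a session.
session_time <- tapply(weblog$time_sec, weblog$session_id,
                       function(t) max(t) - min(t))

# Visits (sessions) per customer.
sessions_per_customer <- tapply(weblog$session_id, weblog$customer_id,
                                function(s) length(unique(s)))

# Page views per page.
page_views <- table(weblog$page)

session_length
session_time
sessions_per_customer
page_views
```

On the real data these grouped summaries would be run on the cluster (for example with group_by() and summarise() on a Spark DataFrame) rather than on a local data.frame.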



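Case Study 2: Producing Statistical Tables for a Social Survey

- Data: three years of local government social survey data (structured data)
- Output: statistical tables for every item, by basic local government and by five classification variables
- Number of result tables: ≈ 45,000

Questions?
- How can the output be produced quickly?
- How should the output be edited?

My solution is:
- How can the output be produced quickly? Parallel computing
- How should the output be edited? Automatic output generation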




Figure 3: Analysis process



http://datamining.dongguk.ac.kr/ftp/gyeongbuk2017/





- R program: generates the Rmd files
- Rmd compile: rmarkdown::render()
- Batch execution

R CMD BATCH --no-save --no-restore \
  '--args YEAR=2014 rmd_version="5.0"' generate_pdf.R errors_pdf.txt &
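The contents of generate_pdf.R are not shown, so the following is only a sketch of one way a batch script could read NAME=value pairs passed after --args. The helper parse_kv_args is hypothetical; commandArgs(trailingOnly = TRUE) is the base R call that returns those arguments inside the script.

```r
# Parse "NAME=value" arguments into a named list (hypothetical helper).
parse_kv_args <- function(args) {
  parts <- strsplit(args, "=", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  # Re-join anything after the first "=" and strip surrounding double quotes.
  vals <- vapply(parts, function(p) paste(p[-1], collapse = "="), character(1))
  vals <- gsub('^"|"$', "", vals)
  as.list(setNames(vals, keys))
}

# Inside the batch script this would be:
#   args <- commandArgs(trailingOnly = TRUE)
# Here the same values are supplied directly for illustration.
args <- c("YEAR=2014", 'rmd_version="5.0"')
params <- parse_kv_args(args)

year <- as.integer(params$YEAR)
rmd_version <- params$rmd_version
```

Parsing the values as strings avoids evaluating command-line text as R code, which is the other common way such arguments are handled.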




- Analysis design and programming: 3 months
- Program run time: 20 to 30 minutes per survey year (on a 40-core server)



Case Study 3: Web Scraping & Text Mining

US Patent and Trademark Office

Figure 4: US Patent and Trademark Office



library(rvest)
url <- 'https://search.uspto.gov/search?affiliate=web-sdmg-uspto.gov&op=Search&page=1&query=energy'
webpage <- read_html(url)
content <- html_nodes(webpage, xpath = '//*[@class="description"]')
desc <- html_text(content)




head(desc)

## [1] "\nClass 250 RADIANT ENERGY Class definitions may be accessed by clicking on the class title, above. Subclass definitions may be accessed by clicking on ...\n"
## [2] "\nRADIANT ENERGY GENERATION AND SOURCES : 494.1 : Plural radiation sources : 495.1 : Including an infrared source : 496.1 :\n"
## [3] "\nSECTION I - CLASS DEFINITION. This is the residual class for methods and apparatus involving radiant energy. SCOPE OF THE CLASS. This class provides ...\n"
## [4] "\n11 months ago - \nJun 29, 2016 01:00 PM ET Washington, DC\n"
## [5] "\nThis is the residual class for methods and apparatus involving radiant energy. SCOPE OF THE CLASS This class provides for all methods and apparatus ...\n"
## [6] "\nadvice. Accelerated Review of Green Technology Patent Applications. On December 8, 2009, the USPTO launched a pilot program to accelerate the review ...\n"




library(stringr)
desc <- str_replace_all(desc, "[[:punct:]]", "")  # remove punctuation
desc <- str_replace_all(desc, "PDF|\n", "")       # remove unneeded words
desc <- str_replace_all(desc, "\\[|\\]", "")
desc <- str_trim(desc, side = "both")             # remove leading and trailing whitespace
head(desc)

## [1] "Class 250 RADIANT ENERGY Class definitions may be accessed by clicking on the class title above Subclass definitions may be accessed by clicking on"
## [2] "RADIANT ENERGY GENERATION AND SOURCES 4941 Plural radiation sources 4951 Including an infrared source 4961"
## [3] "SECTION I CLASS DEFINITION This is the residual class for methods and apparatus involving radiant energy SCOPE OF THE CLASS This class provides"
## [4] "11 months ago Jun 29 2016 0100 PM ET Washington DC"
## [5] "This is the residual class for methods and apparatus involving radiant energy SCOPE OF THE CLASS This class provides for all methods and apparatus"
## [6] "advice Accelerated Review of Green Technology Patent Applications On December 8 2009 the USPTO launched a pilot program to accelerate the review"
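The same cleaning steps can be tried offline with base R string functions. The sketch below applies them to one of the scraped strings shown earlier; the helper name clean_text is made up, and the whitespace-collapsing step is an extra that is not in the stringr code above.

```r
raw <- "\nRADIANT ENERGY GENERATION AND SOURCES : 494.1 : Plural radiation sources : 495.1 : Including an infrared source : 496.1 :\n"

clean_text <- function(x) {
  x <- gsub("[[:punct:]]", "", x)   # remove punctuation
  x <- gsub("PDF|\n", "", x)        # remove the literal "PDF" and newlines
  x <- gsub("[[:space:]]+", " ", x) # collapse repeated spaces (extra step)
  trimws(x, which = "both")         # remove leading and trailing whitespace
}

clean_text(raw)
```

Note that removing all punctuation also strips the decimal points, which is why 494.1 becomes 4941 in the cleaned text.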




[Word cloud figure: frequent two-word terms, including solar cells, conversion efficiency, dye sensitized, thin film, open circuit, short circuit, band gap, current density, fill factor, bulk heterojunction, quantum dots, perovskite solar, crystalline silicon, ...]
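A word cloud of two-word terms like the one above is typically drawn from bigram frequencies. The sketch below counts bigrams in base R; the helper name bigram_counts and the three sample strings are made up for illustration, and the lecture's actual text mining code is not shown in the slides.

```r
# Count adjacent word pairs (bigrams) across a set of cleaned documents.
bigram_counts <- function(docs) {
  bigrams <- unlist(lapply(docs, function(d) {
    w <- strsplit(tolower(d), "[[:space:]]+")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))  # pair each word with its successor
  }))
  sort(table(bigrams), decreasing = TRUE)
}

# Made-up sample documents.
docs <- c("thin film solar cells",
          "organic solar cells",
          "solar cells efficiency")

counts <- bigram_counts(docs)
head(counts, 3)
```

The resulting frequency table is the kind of input a word cloud plotting function takes: one term per entry, sized by its count.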






