R STUDY -web crawling-
E-GOV, KOOKMIN BIT
Boo, Hyun-Kyung

R study 01


CONTENTS
1. What is a web crawler & web crawling?
2. How to crawl using R?

1. What is a web crawler?

A Web crawler, sometimes called a spider, is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering). [1]

A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes. [2]

Figure 1. Architecture of a Web crawler [1]

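From the Korean Wikipedia article on web crawlers: a web crawler is also known as ants, automatic indexers, bots, worms, a web spider, or a web robot. The work a crawler does is called web crawling or spidering. Crawlers are also used for site maintenance tasks such as validating HTML. A crawler starts from a list of URLs called seeds, updates its URL list with the links it finds, and then visits the updated URL list in turn.
- https://ko.wikipedia.org/wiki/%EC%9B%B9_%ED%81%AC%EB%A1%A4%EB%9F%AC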
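The seed-based process described above can be sketched in a few lines of R. This is only a toy illustration, not how a production crawler is built: `toy_web` is a hypothetical in-memory stand-in for real pages, and `crawl` is a name chosen for this example.

```r
# Toy "web": each page name maps to the links found on that page.
# This data is hypothetical; a real crawler would download each page instead.
toy_web <- list(
  "a.html" = c("b.html", "c.html"),
  "b.html" = c("c.html", "d.html"),
  "c.html" = c("a.html"),
  "d.html" = character(0)
)

crawl <- function(seeds, web) {
  frontier <- seeds          # URLs waiting to be visited (starts as the seed list)
  visited  <- character(0)   # URLs already visited, in visit order
  while (length(frontier) > 0) {
    url <- frontier[1]       # take the next URL from the front of the list
    frontier <- frontier[-1]
    if (url %in% visited) next
    visited <- c(visited, url)
    links <- web[[url]]      # "download" the page and read its links
    # add only links that are neither visited nor already queued
    frontier <- c(frontier, setdiff(links, c(visited, frontier)))
  }
  visited
}

visited <- crawl("a.html", toy_web)
print(visited)
```

Starting from the single seed `a.html`, every reachable page is visited exactly once, even though `c.html` links back to `a.html`.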

1. What is web crawling?

Figure 2. Web crawling [3]

Web crawling can be a very complicated and technical subject to understand. Every web page on the internet is different from the next, which means every web crawler is different (at least in some way) from the next. [4]

Figure 3. How a crawler works [6]

* Web crawler & Web crawling: a program that explores the WEB in an organized, automated way, and the act of doing so [5]

* Also known as: ants, automatic indexers, bots, worms, web spider, web robot

* Scraping & Crawling

Table 1. The difference between scraping and crawling [9]

2. How to crawl using R?

Install R / RStudio
Install package: stringr [10,11]
Install: install.packages("stringr") / Load: library(stringr)
HTML: site for studying HTML tags: https://www.w3schools.com/html/default.asp
Comments start with # (same as // in Java); run the current line with Ctrl + R
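As a small illustration of why both stringr and some HTML knowledge are useful here, the sketch below strips the tags from one line of HTML. The sample line and the variable names are hypothetical. The code uses stringr when it is installed and falls back to base R `gsub` otherwise, so it is a sketch of the idea rather than the code from the slides.

```r
# A single line of HTML such as one might find in a page source (hypothetical sample).
html_line <- '  <td class="title">Great movie</td>  '

tag_pattern <- "<[^>]+>"   # matches anything that looks like an HTML tag

if (requireNamespace("stringr", quietly = TRUE)) {
  # stringr version: remove tags, then trim surrounding whitespace
  text <- stringr::str_trim(stringr::str_replace_all(html_line, tag_pattern, ""))
} else {
  # base R equivalent, used when stringr is not installed
  text <- trimws(gsub(tag_pattern, "", html_line))
}

print(text)
```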

* Crawling can be done in several languages (Java, Python, R, ...)

2. How to crawl using R? - Thought process

Select the web site to crawl.
Crawl everything on the selected page (encoding type: check the site).
Analyze the URL of the page.
Extract the lines you need (HTML tags etc. have to be removed).
Create a table / save the data in a format such as csv or txt.

LET'S START!
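The process above can be sketched end to end in base R. To keep the example runnable without network access, a hypothetical `sample_html` vector stands in for the downloaded page; with a real site the `textConnection` would be replaced by the page URL passed to `readLines`. The class name `review` and all variable names are assumptions made for this sketch.

```r
# Stand-in for a downloaded page (hypothetical HTML, one element per line).
sample_html <- c(
  "<html><body>",
  '<p class="review">Great movie, would watch again</p>',
  '<p class="other">Unrelated text</p>',
  '<p class="review">Too long and too slow</p>',
  "</body></html>"
)

# Step: read the whole page line by line, stating the encoding explicitly.
con <- textConnection(sample_html)
lines <- readLines(con, encoding = "UTF-8")
close(con)

# Step: keep only the lines that are needed.
review_lines <- lines[grepl('class="review"', lines, fixed = TRUE)]

# Step: remove the HTML tags and surrounding whitespace.
reviews <- trimws(gsub("<[^>]+>", "", review_lines))

# Step: build a table and save it as csv.
result <- data.frame(id = seq_along(reviews), review = reviews,
                     stringsAsFactors = FALSE)
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)

print(result)
```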

* Comparison of URLs

http://movie.naver.com/movie/bi/mi/point.nhn?code=134963


http://movie.daum.net/moviedb/grade?movieId=95306&type=netizen&page=1

view-source:http://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=134963&type=after&isActualPointWriteExecute=false&isMileageSubscriptionAlready=false&isMileageSubscriptionReject=false&page=1
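Two of the URLs above end in `page=1`, which suggests that other pages can be reached by changing that number. Under that assumption, the page URLs can be generated with `paste0`. The sketch below only builds the strings and does not contact the site; `base_url` and `pages` are names chosen for this example.

```r
# Everything up to the page number, taken from the Daum URL above.
base_url <- "http://movie.daum.net/moviedb/grade?movieId=95306&type=netizen&page="

pages <- 1:3                      # page numbers to visit (an arbitrary small range)
urls <- paste0(base_url, pages)   # one URL per page

print(urls)
```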

2. How to crawl using R?


For()
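A `for` loop is a natural way to repeat the same steps for every page. The sketch below is an assumption about how such a loop might look, not the code from the slides. `fetch_page` is a hypothetical stand-in that returns fake HTML so the example runs offline; with a real site it would be replaced by a call such as `readLines(url, encoding = "UTF-8")`.

```r
# Hypothetical stand-in for downloading a page: returns fake HTML lines
# that mention the page number found at the end of the URL.
fetch_page <- function(url) {
  page_no <- sub(".*page=", "", url)
  c("<ul>", paste0("<li>review from page ", page_no, "</li>"), "</ul>")
}

base_url <- "http://movie.daum.net/moviedb/grade?movieId=95306&type=netizen&page="
all_reviews <- character(0)

for (page in 1:3) {
  url <- paste0(base_url, page)                       # URL of this page
  lines <- fetch_page(url)                            # get the page source
  items <- lines[grepl("<li>", lines, fixed = TRUE)]  # keep the needed lines
  all_reviews <- c(all_reviews, gsub("<[^>]+>", "", items))  # strip tags, collect
  # Sys.sleep(1)  # with a real site, pause between requests
}

print(all_reviews)
```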


Reference
[1] https://en.wikipedia.org/wiki/Web_crawler
[2] http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
[3] http://computer.howstuffworks.com/internet/basics/search-engine1.htm
[4] https://blog.datafiniti.co/what-is-web-crawling-9184d019e094#.sa258aja8
[5] https://ko.wikipedia.org/wiki/%EC%9B%B9_%ED%81%AC%EB%A1%A4%EB%9F%AC
[6] https://www.import.io/post/how-to-crawl-a-website-the-right-way/
[7] https://www.woorank.com/en/blog/how-a-crawler-works-back-to-the-basics
[8] http://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website/
[9] https://www.promptcloud.com/data-scraping-vs-data-crawling/
[10] https://cran.r-project.org/web/packages/stringr/stringr.pdf
[11] http://www.datamarket.kr/xe/board_BoGi29/12682 - stringr package
[13] http://www.endmemo.com/program/R/gsub.php - stringr package
[14] https://stat.ethz.ch/R-manual/R-devel/library/base/html/readLines.html - readLines
[15] http://rfunction.com/archives/2354 - sub/gsub
[16] http://asheesh.org/pub/scrapy-talk/#44

Code share
PPT: https://www.slideshare.net/HyunKyungBoo
Source Code: https://github.com/boohk/R

Next time: I will do something using text data, such as analysis & visualization.

Thank you

https://boohk.github.io/