A Web Crawler Design for Data Mining Mike Thelwall University of Wolverhampton, Wolverhampton, UK...

A Web Crawler Design for Data Mining

Mike Thelwall, University of Wolverhampton, Wolverhampton, UK
Journal of Information Science, 2001

27 April 2011
Presentation @ IDB Lab Seminar

Presented by Jee-bum Park

Outline

– Introduction
– Architecture
– Implementation
– System Testing
– Conclusion

Introduction

- Motive

The importance of the web has guaranteed academic interest in it, not only for affiliated technologies, but also for its content

Introduction

- Motive

Information scientists and others wish to perform data mining on large numbers of web pages

They will require the services of a web crawler:
– To extract patterns from the web
– To extract meaning from the link structure of the web

The necessity of an effective paradigm for a web mining crawler

Introduction

- Web Crawler

A web crawler, robot or spider

A program that is capable of iteratively and automatically:
– Downloading web pages
– Extracting URLs from their HTML
– Fetching them
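The download, extract, fetch loop can be sketched in a few lines. This is a minimal illustration, not the crawler described in the paper: the `crawl` and `extract_urls` names, the `max_pages` limit, and the in-memory `SITE` dictionary (standing in for real HTTP downloads) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return absolute URLs found in html, resolved against base_url."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments so each page has one URL.
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: download a page, extract its URLs, fetch them in turn.

    fetch is a caller-supplied function mapping a URL to HTML text (or None).
    """
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_urls(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP downloads.
SITE = {
    "http://example.test/": '<a href="/board/">board</a> <a href="login.php">login</a>',
    "http://example.test/board/": '<a href="/">home</a> <a href="index.php?id=2">post</a>',
    "http://example.test/login.php": "<p>login</p>",
    "http://example.test/board/index.php?id=2": "<p>post 2</p>",
}

crawled = crawl("http://example.test/", SITE.get)
```

Passing `fetch` as a parameter keeps the loop independent of the download mechanism, so a real HTTP client could be substituted without touching the traversal logic.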

Introduction

- Web Crawler: Workflow

Figure: a web crawler traversing the example site http://idb.snu.ac.kr/

/
• index.html
• login.php

/images/
• logo.gif
• menu.jpg
• bg.png

/board/
• index.php
• index.php?id=2
• Index.php?

/board/files/
• a.jpg
• b.txt
• c.zip

Introduction

- Web Crawler: Architecture

Introduction

- Web Crawler: Roles

A sophisticated web crawler may also be responsible for:
– Identifying pages judged relevant to the crawl
– Rejecting pages as duplicates of ones previously visited
– Supporting the action of search engines, for example by constructing the searchable index

Introduction

- Web Crawler: Issue

In the normal course of operation, a simple crawler will spend most of its time awaiting data:
– Requesting a web page
– Receiving a web page

For this reason, crawlers are normally multi-threaded

If the crawling task requires more complex processing, the speed of the crawler will be reduced

A distributed approach for crawlers is needed
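Because most of a crawler's time goes to waiting on the network, overlapping the waits with threads is the usual first step before distributing the work. The sketch below uses Python's standard `ThreadPoolExecutor`; the `slow_fetch` function is a hypothetical stand-in that sleeps instead of performing a real download, and the worker count is an arbitrary choice for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(url):
    """Hypothetical stand-in for an HTTP download: waits, then returns text."""
    time.sleep(0.05)  # simulated network latency
    return f"<html>{url}</html>"


def fetch_all(urls, fetch, workers=8):
    """Fetch many URLs concurrently; results are keyed in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


urls = [f"http://example.test/page{i}" for i in range(8)]
pages = fetch_all(urls, slow_fetch)
```

Threads help only while the crawler is waiting on input and output. Once per-page analysis becomes the bottleneck, extra threads on one machine stop paying off, which is the motivation for spreading the work over several computers.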

Introduction

- Distributed Systems

Using idle computers connected to the internet:
– To gain extra processing power
– To distribute processing power

For personal site-specific crawlers, a single personal computer solution may be fast enough

An alternative is a distributed model:
– A central control unit
– Many crawlers operating on individual personal computers

Architecture

– The crawler/analyzer units
– The control unit

Four constraints
1. Almost all processing should be conducted on idle computers
2. The distributed architecture should not increase network traffic
3. The system must be able to operate through a firewall
4. The components must be easy to install and remove

Architecture

Figure: one control unit connected to six crawler units

Control unit
• Crawler: idb.snu.ac.kr
• Crawler: brahma.snu.ac.kr
• Crawler: sugang.snu.ac.kr
• Crawler: etl.snu.ac.kr
• Crawler: my.snu.ac.kr
• Crawler: siva.snu.ac.kr

Architecture

- The Crawler/Analyzer Units

The program will:
– Crawl a site or set of sites
– Analyze the pages
– Report its results

It can execute on the type of computers on which there will be spare time, normally personal computers

Architecture

- The Crawler/Analyzer Units: Data Management

Accessing permanent storage space to save the web pages:
– Linking to a database
– Using the normal file storage system

Pages must be saved on each host computer, in order to minimize network traffic

If the system is capable of handling enough data, a large-scale server-based database can be used

It must provide a facility for the user to delete all saved data
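A file-based store with a "delete all saved data" facility might look like the sketch below. The `LocalPageStore` class, its method names, and the choice of SHA-1 for file naming are assumptions made for illustration; the slides say only that pages are kept on the host computer and that the user must be able to remove them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class LocalPageStore:
    """Saves downloaded pages under one directory on the host computer."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return self.root / f"{name}.html"

    def save(self, url, html):
        self._path(url).write_text(html, encoding="utf-8")

    def load(self, url):
        path = self._path(url)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def count(self):
        return sum(1 for _ in self.root.glob("*.html"))

    def clear(self):
        """Delete all saved data, leaving an empty store directory."""
        shutil.rmtree(self.root)
        self.root.mkdir(parents=True)


# Usage in a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    store = LocalPageStore(Path(tmp) / "pages")
    store.save("http://example.test/", "<p>hi</p>")
    loaded = store.load("http://example.test/")
    saved = store.count()
    store.clear()
    remaining = store.count()
```

Keeping everything under a single root directory makes the removal step a one-line operation, which fits the requirement that the components be easy to install and remove.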

Architecture

- The Crawler/Analyzer Units: Interface

Immediate stop

Clear all data from the computer

Architecture

- The Control Unit

The control unit will live on a web server

It will be triggered when a crawler unit requests a job or sends some data

It will need to store the commands the owner wishes to be executed, each with a status:
– Completed
– In progress
– Unallocated
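The job bookkeeping described above can be sketched as a small in-memory table. The `ControlUnit`, `Job`, and `Status` names and the method signatures are hypothetical; the actual control unit ran on a web server and would persist this state rather than hold it in a Python list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UNALLOCATED = "unallocated"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


@dataclass
class Job:
    command: str  # e.g. a site to crawl and an analysis to run
    status: Status = Status.UNALLOCATED
    crawler: Optional[str] = None
    result: Optional[str] = None


class ControlUnit:
    """Stores the owner's commands and tracks which crawler holds each one."""

    def __init__(self, commands):
        self.jobs = [Job(command) for command in commands]

    def request_job(self, crawler):
        """Hand the next unallocated job to a crawler, or None if none remain."""
        for job in self.jobs:
            if job.status is Status.UNALLOCATED:
                job.status = Status.IN_PROGRESS
                job.crawler = crawler
                return job
        return None

    def submit_result(self, job, result):
        """Record the data a crawler sends back and mark the job completed."""
        job.result = result
        job.status = Status.COMPLETED

    def summary(self):
        """Count jobs per status, as a reporting facility might."""
        counts = {status: 0 for status in Status}
        for job in self.jobs:
            counts[job.status] += 1
        return counts


control = ControlUnit(["crawl site-a", "crawl site-b", "crawl site-c"])
first = control.request_job("crawler-1")
second = control.request_job("crawler-2")
control.submit_result(first, "120 pages")
```

In this sketch the control unit never contacts a crawler; it only answers requests. A pull model of this kind is one way to satisfy the firewall constraint, since every connection is opened from the crawler side.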

Architecture

Figure: one control unit connected to six crawler units

Control unit
• Crawler: idb.snu.ac.kr
• Crawler: brahma.snu.ac.kr
• Crawler: sugang.snu.ac.kr
• Crawler: etl.snu.ac.kr
• Crawler: my.snu.ac.kr
• Crawler: siva.snu.ac.kr

Implementation

The architecture was employed to create a system for analyzing the link structure of university web sites

Implementation

Previous system
– Ran a single crawler/analyzer program

Issues
– Did not run quickly enough
– Had to be individually set up and run on a number of computers
– Inefficient in terms of both human time and processor use!

New system
– The existing stand-alone crawler was used as the basis
– Communication and easy installation features were added
– Buttons were added to instantly close the program and remove any saved data
– The program was processed by a compressor for easy distribution

Implementation

Choice of the types of checking for duplicate pages
– No page checking
– HTML page checking
– Weak HTML page checking

Comparing methods
– Comparing each page against all of the others
– Various numbers were calculated from the text of each page, for example the length of the page, an MD5 or SHA-1 hash, etc.
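The second comparison method, reducing each page to a few numbers, avoids comparing every page against all of the others. The sketch below is one possible reading of it, assuming a fingerprint made of the page length plus a SHA-1 digest; the `fingerprint` and `DuplicateFilter` names are hypothetical, and the slides do not specify what the weak variant of the check ignores, so only exact matching is shown.

```python
import hashlib


def fingerprint(html):
    """Summarize a page as (length in bytes, SHA-1 digest) of its text."""
    data = html.encode("utf-8")
    return len(data), hashlib.sha1(data).hexdigest()


class DuplicateFilter:
    """Remembers fingerprints so a new page needs one lookup, not N comparisons."""

    def __init__(self):
        self.seen = {}  # fingerprint -> first URL that produced it

    def is_duplicate(self, url, html):
        key = fingerprint(html)
        if key in self.seen:
            return True
        self.seen[key] = url
        return False


dedupe = DuplicateFilter()
results = [
    dedupe.is_duplicate("http://example.test/a", "<p>same</p>"),
    dedupe.is_duplicate("http://example.test/b", "<p>same</p>"),
    dedupe.is_duplicate("http://example.test/c", "<p>other</p>"),
]
```

An exact fingerprint treats two pages as different if a single byte differs, so pages that embed a timestamp or visit counter would slip through this filter.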

Implementation

- The Control Unit

Entirely new!

It was given a reporting facility:
– Statistics
– To deliver a summary of crawlers

System Testing

In June and July of 2000

– A set of sites or web pages to download
– An analysis to perform on the downloaded sites

System Testing

- Result

The total number of crawler units
– Peaked at just over 100 with three rooms of computers

9112 tasks completed by the system

Over 100,000 pages downloaded

Each crawler used approximately 1 GB of hard disk space

The system had become a virtual computer with over 100 GB of disk space and over 100 processors

System Testing

- Limitations

The system was not able to run fully automatically

The problem was randomly generated web pages
– For example, a huge set of web pages containing usage statistics for electronic equipment, with one page per device per day

The solution was
– To manually check the root cause of the problem
– To add their URLs to a banned list operated by the control unit

There is the alternative of designing a heuristic to avoid problems
– For example, a maximum crawl depth
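Both remedies, the banned list and a depth limit, amount to a filter applied before a URL is queued. The following is a rough sketch under stated assumptions: `should_crawl`, the prefix-matching rule, and the example URLs are hypothetical, and `depth` is taken to mean the number of links followed from the start page.

```python
def should_crawl(url, depth, banned_prefixes, max_depth):
    """Return True if url passes both the banned list and the depth limit.

    depth is the number of links followed from the start page to reach url.
    """
    if depth > max_depth:
        return False
    return not any(url.startswith(prefix) for prefix in banned_prefixes)


# Hypothetical banned list, as the control unit might hand to each crawler.
BANNED = ["http://example.test/stats/"]

checks = [
    should_crawl("http://example.test/index.html", 1, BANNED, max_depth=5),
    should_crawl("http://example.test/stats/dev1/2000-06-01.html", 2, BANNED, max_depth=5),
    should_crawl("http://example.test/deep/page.html", 6, BANNED, max_depth=5),
]
```

A depth limit needs no manual intervention but can also cut off legitimate deep content, whereas a banned list is precise but only helps after someone has identified the offending pages.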

Conclusion

The distributed architecture has shown itself
– Capable of crawling a large collection of web sites
– By using idle processing power and disk space

The testing of the system has shown that
– It cannot operate fully automatically without an effective heuristic for identifying duplicate pages

Conclusion

The architecture is particularly suited to situations
– Where a task can be decomposed into a collection of crawling-based tasks

It would be unsuitable if
– The crawls had to cross-reference each other
– The data mining had to be performed in an integrated way

The architecture is an effective way to use idle computing resources in order to perform large-scale web data mining tasks

Thank You!

Any Questions or Comments?
