RMLL 2013 : Build Your Personal Search Engine using Crawlzilla


Jazz Wang (Yao-Tsung Wang)

jazz@nchc.org.tw

Build Your Personal Search Engine using Crawlzilla

2/50

WHO AM I ? JAZZ ?

• Speaker :

– Jazz Yao-Tsung Wang / ECE Master

– Associate Researcher, NCHC, NARL, Taiwan

– Co-Founder of Hadoop.TW

– jazz@nchc.org.tw

• My slides are available at http://trac.nchc.org.tw/cloud and http://www.slideshare.net/jazzwang

FLOSS User : Debian/Ubuntu, Access Grid, Motion/VLC, Red5, Debian Router, DRBL/Clonezilla, Hadoop

FLOSS Evangelist : DRBL/Clonezilla, Partclone/Tuxboot, Hadoop Ecosystem

FLOSS Developer : DRBL/Clonezilla, Hadoop Ecosystem

3/50

WHAT is Big Data?

4/50

Data Explosion!!

Source : The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, March 2007, An IDC White Paper sponsored by EMC, http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf

5/50

Source : Extracting Value from Chaos, June 2011, An IDC White Paper sponsored by EMC, http://www.emc.com/collateral/about/news/idc-emc-digital-universe-2011-infographic.pdf

According to IDC reports:

2006: 161 EB
2007: 281 EB
2008: 487 EB
2009: 800 EB (0.8 ZB)
2010: 988 EB (predicted) → 1200 EB (1.2 ZB) actual
2011: 1773 EB (predicted) → 1800 EB (1.8 ZB) actual

Data expanded 1.6x each year !!

6/50

Commonly used software tools can't capture, manage, and process it.

'Big Data' = a few dozen terabytes to petabytes in a single data set.

What is Big Data?!

Multiple files, 20 TB in total

Single database, 20 TB in total

Single file, 20 TB in total

Source : http://en.wikipedia.org/wiki/Big_data

7/50

Gartner Big Data Model ?

The challenge of 'Big Data' is to manage Volume, Variety and Velocity.

Volume (amount of data): TB → PB → EB

Velocity (speed of data in/out): Batch Job → Realtime

Variety (data types, sources): Structured → Semi-structured → Unstructured

Source :[1] Laney, Douglas. "3D Data Management: Controlling Data Volume, Velocity and Variety" (6 February 2001)[2] Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data, June 2011

8/50

12D of Information Management?

Source: Gartner (March 2011), 'Big Data' Is Only the Beginning of Extreme Information Management, 7 April 2011, http://www.gartner.com/id=1622715

Big Data is just the beginning of Extreme Information Management

9/50

Possible Applications of Big Data?

Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
Source: http://lib.stanford.edu/files/see_pasig_dic.pdf

Top 1 : Human Genomics – 7000 PB / Year
Top 2 : Digital Photos – 1000+ PB / Year
Top 3 : E-mail (no Spam) – 300+ PB / Year

10/50

HOW to deal with Big Data?

11/50

The SMAQ stack for big data

Source : The SMAQ stack for big data, Edd Dumbill, 22 September 2010, http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html

12/50

The SMAQ stack for big data

13/50

The SMAQ stack for big data

14/50

Three Core Technologies of Google ....

• Google shared their design of a web-search engine:

– SOSP 2003 : “The Google File System”
  http://labs.google.com/papers/gfs.html

– OSDI 2004 : “MapReduce: Simplified Data Processing on Large Clusters”
  http://labs.google.com/papers/mapreduce.html

– OSDI 2006 : “Bigtable: A Distributed Storage System for Structured Data”
  http://labs.google.com/papers/bigtable-osdi06.pdf
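The programming model from the MapReduce paper can be sketched in a few lines of plain Python. This is an illustrative in-memory simulation, not Hadoop's actual API: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reduce: aggregate all values emitted for one key.
    return word, sum(counts)

docs = {"d1": "big data needs big tools", "d2": "data tools"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'needs': 1, 'tools': 2}
```

In a real Hadoop job the shuffle runs across the cluster and map/reduce tasks run on many nodes in parallel, but the per-key logic is exactly this shape.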

15/50

Open Source Mapping of Google Core Technologies

Google's Stack → Open Source Projects

• S = Storage : Google File System (to store petabytes of data)
  → Hadoop Distributed File System (HDFS), Sector Distributed File System

• M = MapReduce (to process data in parallel)
  → Hadoop MapReduce API, Sphere MapReduce API, ...

• Q = Query : BigTable (a huge key-value datastore)
  → HBase, Hypertable, Cassandra, ...

16/50

Hadoop

• http://hadoop.apache.org
• Apache Top Level Project
• Major sponsor is Yahoo!
• Developed by Doug Cutting, referencing the Google File System design
• Written in Java; it provides HDFS and a MapReduce API
• Used at Yahoo! since 2006
• Has been deployed to 4000+ nodes at Yahoo!
• Designed to process petabyte-scale datasets
• Facebook, Last.fm and Joost are also powered by Hadoop

17/50

WHO will need Hadoop?

Let's take 'Search Engine' as an example

18/50

Search is everywhere in our daily life !!

File

Mail / Pidgin

Database

Web Pages

19/50

To speed up search, we need an “Index”

Keyword → Page Number

20/50

History of Hadoop … 2001~2005

• Lucene
– http://lucene.apache.org/
– A high-performance, full-featured text search engine library written entirely in Java.
– Lucene creates an inverted index of every word in different documents, which improves the performance of text searching.
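The inverted index at the heart of Lucene can be shown with a toy sketch in plain Python (this is only the underlying data structure, not Lucene's API): each term maps to the set of documents containing it, so a query only touches the postings lists for its terms instead of scanning every document.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: return documents containing every term of the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Hadoop grew out of Nutch",
    2: "Nutch builds on Lucene",
    3: "Lucene is a text search library",
}
index = build_inverted_index(docs)
print(search(index, "nutch"))         # {1, 2}
print(search(index, "lucene nutch"))  # {2}
```

Real Lucene adds tokenization, stemming, ranking and on-disk index segments on top of this same idea.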

21/50

History of Hadoop … 2005~2006

• Nutch
– http://nutch.apache.org/
– Nutch is open source web-search software.
– It builds on Lucene and Solr, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.
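The "web-specifics" a crawler adds can be sketched with the Python standard library alone. This is a toy breadth-first crawl over an in-memory "web" of pages, not Nutch's implementation; a real crawler fetches over HTTP, respects robots.txt, and persists its link graph.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(pages, seed, depth):
    """Breadth-first crawl of an in-memory site up to the given link depth."""
    seen, queue, order = {seed}, deque([(seed, 0)]), []
    while queue:
        url, d = queue.popleft()
        order.append(url)
        if d == depth:
            continue  # depth limit reached: fetch but do not follow links
        parser = LinkExtractor()
        parser.feed(pages.get(url, ""))
        for link in parser.links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append((link, d + 1))
    return order

# A tiny in-memory "web" standing in for real fetched pages.
pages = {
    "/index": '<a href="/about">about</a> <a href="/blog">blog</a>',
    "/about": '<a href="/index">home</a>',
    "/blog":  '<a href="/post1">post</a>',
    "/post1": "no links here",
}
print(crawl(pages, "/index", depth=1))  # ['/index', '/about', '/blog']
```

The depth parameter is the same notion of crawl depth that Crawlzilla asks for when setting up a new search pool.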

22/50

History of Hadoop … 2006 ~ Now

• Nutch encountered performance issues.
• Referenced Google's papers.
• Added DFS & MapReduce implementations to Nutch.
• According to user feedback on the Nutch mailing list ....
• Hadoop became a separate project as of Nutch 0.8.
• Nutch DFS → Hadoop Distributed File System (HDFS).
• Yahoo! hired Doug Cutting in 2006 to build a web search engine team.
– Only 14 team members (engineers, clusters, users, etc.)
• Doug Cutting joined Cloudera in 2009.

23/50

Do you like to write notes?

24/50

Tools that I used to write notes : Oddmuse Wiki
http://www.oddmuse.org/

2005~2008

25/50

Tools that I used to write notes : PmWiki

http://www.pmwiki.org/

26/50

Tools that I used to write notes : ScrapBook

https://addons.mozilla.org/zh-TW/firefox/addon/scrapbook/

2005~NOW

27/50

Tools that I used to write notes : Trac = Wiki + Version Control (SVN or GIT)

http://trac.edgewall.org/

2006~NOW

28/50

Tools that I used to write notes : ReadItLater

http://readitlaterlist.com/

2010~NOW

29/50

It's painful to search all my notes!

30/50

How do I search all my notes from different websites?

31/50

Features of Crawlzilla

● Cluster-based

● Chinese Word Segmentation

● Multiple Users with Multiple Index Pools

● Re-Crawl

● Schedule / Crontab

● Display Index Database (Top 50 sites, keywords)

● Support UTF-8 Chinese Search Results

● Web-based User Interface

32/50

System Architecture of Crawlzilla

• Web UI (Crawlzilla Website + Search Engine) : JSP + Servlet + JavaBean on Tomcat
• Crawlzilla System Management
• Nutch + Lucene (crawling and indexing)
• Hadoop, running across the cluster nodes (PC1, PC2, PC3)

33/50

Crawlzilla Web UI : Users and their Index DBs

34/50

Comparison with other projects

                      Spidr          Larbin          Jcrawl          Nutch           Crawlzilla
Install               Ruby package   gmake compile   Java compile    Deploy config   Auto
                      install        and install     and install     files           installation
Crawl website pages   O              O               O               O               O
Parse content         X              X               X               O               O
Cluster computing     X              X               X               O               O
Interface             Command        Command         Command         Command         Web-UI
Chinese segmentation  X              X               X               X               O

35/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Step 1 : Registration

http://demo.crawlzilla.info (1)

(2)

(3)

36/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Step 2 : Acceptance Notification

Wait for notification from Administrator !

(1)(2)

37/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Step 3 : Login with your account

Login http://demo.crawlzilla.info (1)

(2)

38/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Step 4 : Setup Name, URLs and depth

Setup new Search Pool(1)

(2)

(3)

(4)

(5)

39/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Step 5 : Wait for Crawlzilla to generate the Index DB

Wait for Crawlzilla to generate Index for you

(1)

40/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ You can see how long it took to generate the Index DB

Index Pools Management

(1)

(2)

41/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Manually Re-Crawl or Delete Index Database

You can manually 'Re-Crawl' to generate a new Index DB

(1) (2)

42/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

Crawlzilla 1.0 Multi-user Cloud Service (8)

▲ In Index Pools Management you can see the elapsed crawl time

Re-crawls can be scheduled under 'System Schedule' (schedule)

(1)

(2)

(3)(4)

43/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Display the basic information of Index Database

Display the Index Database Basic Information

(1)(2)

44/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

▲ Add this syntax to your website

HTML Syntax for Your Personal Search Engine

(1) (2)

45/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

Total Web pages that have been indexed, and Top 50 URLs

(1)

(2)

46/50

Multi-user Web Search Cloud Service : Crawlzilla 1.0

It also shows indexed Document Types and Top 50 Keywords

(1) (2)

47/50

Crawlzilla Web UI : User, Index DB, and external sources (ftp:// , file:// , SkyDrive)

48/50

Start from Here!

● Crawlzilla Demo Cloud Service● http://demo.crawlzilla.info

● Crawlzilla @ Google Code Project Hosting● http://code.google.com/p/crawlzilla/

● Crawlzilla @ SourceForge (Tutorial in English)● http://sourceforge.net/p/crawlzilla/home/

● Crawlzilla User Group @ Google● http://groups.google.com/group/crawlzilla-user

● NCHC Cloud Computing Research Group● http://trac.nchc.org.tw/cloud

49/50

Authors of Crawlzilla

Waue Chen (Left) waue0920@gmail.com

Rock Kuo (Center)goldjay1231@gmail.com

Shunfa Yang (Right)shunfa@nchc.org.tw

shunfa@gmail.com

50/50

Questions?
Slides - http://trac.nchc.org.tw/cloud

Jazz Wang (Yao-Tsung Wang)

jazz@nchc.org.tw