An introduction to Storm Crawler

Julien Nioche – [email protected]
@digitalpebble
Java Meetup – Bristol 10/03/2015
2 / 31
About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
Strong focus on Open Source & Apache ecosystem
User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth
3 / 31
Storm-Crawler : what is it?

Collection of resources (SDK) for building web crawlers on Apache Storm

https://github.com/DigitalPebble/storm-crawler
– Artefacts available from Maven Central
– Apache License v2
– Active and growing fast

=> Scalable
=> Low latency
=> Easily extensible
=> Polite
4 / 31
What it is not
A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later

e.g. no PageRank or explicit ranking of pages
– Build your own

No fancy UI, dashboards, etc.
– Build your own or integrate in your existing ones
Compare with Apache Nutch later
5 / 31
Apache Storm
Distributed real time computation system
http://storm.apache.org/
Scalable, fault-tolerant, polyglot and fun!
Implemented in Clojure + Java
“Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
6 / 31
Storm Cluster Layout
http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf
7 / 31
Topology = spouts + bolts + tuples + streams
8 / 31
Basic Crawl Topology
[Diagram: Some Spout → URLPartitioner → Fetcher → Parser → Indexer, connected over the default stream]

(1) Some Spout pulls URLs from an external source
(2) URLPartitioner groups them per host / domain / IP
(3) Fetcher fetches the URLs
(4) Parser parses the content
(5) Indexer indexes text and metadata
9 / 31
Basic Topology : Tuple perspective
[Diagram: the same topology, seen from the tuples exchanged on the default stream]

(1) Some Spout emits <String url, Metadata m>
(2) URLPartitioner emits <String url, String key, Metadata m>
(3) Fetcher emits <String url, byte[] content, Metadata m>
(4) Parser emits <String url, byte[] content, Metadata m, String text>
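A minimal sketch of a bolt producing tuple (2) above, i.e. adding a partitioning key next to the URL. This is an illustration of the tuple layout only, not the real URLPartitionerBolt: it assumes Storm's 0.9.x backtype.storm packages (current at the time) and the location of the Metadata class, and the field names may differ from the real ones.

import java.net.URL;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import com.digitalpebble.storm.crawler.Metadata; // assumed package

// Illustration only: emits <url, key, metadata> where the key is the hostname
public class HostPartitionerBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String url = input.getStringByField("url");
        Metadata metadata = (Metadata) input.getValueByField("metadata");
        String key = "unknown";
        try {
            key = new URL(url).getHost(); // could equally be the domain or the IP
        } catch (Exception e) {
            // malformed URL: keep the default key
        }
        // anchor on the input tuple so Storm can track failures downstream
        collector.emit(input, new Values(url, key, metadata));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "key", "metadata")); // field names assumed
    }
}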
10 / 31
Basic scenario
Spout : simple queue, e.g. RabbitMQ, AWS SQS, etc.
Indexer : any form of storage, but often an index, e.g. SOLR or ElasticSearch
Components in grey : default ones from SC
Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
“Great! What about recursive crawls and / or failure handling?”
11 / 31
Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful → needs control
– Requires content parsing and link extraction

[Diagram: the crawl frontier growing outwards from the seed over iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
12 / 31
Recursive Topology
[Diagram: the basic topology (Some Spout → URLPartitioner → Fetcher → Parser → Indexer over the default stream) extended with a Status stream: Fetcher and Parser also emit to a Status Updater, which writes to a DB; the DB feeds the Spout (via a Switch), closing the loop for newly discovered URLs]
13 / 31
Status stream
Used for handling errors and updating info about URLs
Newly discovered URLs from the parser

public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}

StatusUpdater : writes to storage
Can / should extend AbstractStatusUpdaterBolt
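A StatusUpdater sketched as a plain bolt, as a hedged illustration of the idea only: the field names are assumptions and a real implementation can / should extend AbstractStatusUpdaterBolt instead of starting from BaseRichBolt.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Illustration only: a bare-bones status updater that just prints what it receives
public class PrintingStatusUpdaterBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // field names are assumptions for the sake of the example
        String url = input.getStringByField("url");
        Object status = input.getValueByField("status");     // DISCOVERED, FETCHED, FETCH_ERROR...
        Object metadata = input.getValueByField("metadata");
        // a real updater would write to its storage here (queue, DB, index...)
        System.out.println(status + "\t" + url + "\t" + metadata);
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}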
14 / 31
Error vs failure in SC
Failure : structural issue, e.g. dead node
– Storm's mechanism, based on time-outs
– Spout methods ack() / fail() get the message ID
  • can decide whether to replay the tuple or not (see the spout sketch after this list)
  • message ID only => limited info about what went wrong

Error : data issue
– Some temporary, e.g. target server temporarily unavailable
– Some fatal, e.g. document not parsable
– Handled via the Status stream
– StatusUpdater can decide what to do with them
  • Change status? Change scheduling?
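A minimal sketch (not a storm-crawler class) of a spout using these callbacks: each URL is emitted with a message ID, ack() forgets it and fail() decides whether to replay it. It assumes the backtype.storm packages of the time and the location of the Metadata class; a real spout would read its URLs from an external source and populate the Metadata accordingly.

import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import com.digitalpebble.storm.crawler.Metadata; // assumed package

// Illustration only: uses the URL itself as message ID and replays failed tuples a few times
public class ReplayingURLSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private final Queue<String> toFetch = new LinkedList<String>();
    private final Map<String, Integer> retries = new HashMap<String, Integer>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        toFetch.add("http://storm.apache.org/"); // a real spout would read from a queue / DB
    }

    @Override
    public void nextTuple() {
        String url = toFetch.poll();
        if (url == null) {
            return; // nothing to emit right now
        }
        // the URL doubles as the message ID so ack() / fail() can identify it
        collector.emit(new Values(url, new Metadata()), url);
    }

    @Override
    public void ack(Object msgId) {
        retries.remove(msgId); // fully processed: nothing else to do
    }

    @Override
    public void fail(Object msgId) {
        // we only know *which* tuple failed (time-out or failed downstream), not why
        Integer count = retries.get(msgId);
        int soFar = (count == null) ? 0 : count.intValue();
        if (soFar < 3) {
            retries.put((String) msgId, soFar + 1);
            toFetch.add((String) msgId); // replay it
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}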
15 / 31
Topology in code
Grouping!
Politeness and data locality
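Roughly what the wiring looks like, as a simplified, hedged sketch rather than the real com.digitalpebble.storm.crawler.CrawlTopology; the storm-crawler class names and constructors are assumptions to be checked against the source. The fieldsGrouping on the "key" field emitted by the partitioner is the "Grouping!" part: it routes all URLs of a given host / domain / IP to the same FetcherBolt instance, which is what gives politeness and data locality.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// Class names and constructors below are assumptions; the canonical wiring is
// in com.digitalpebble.storm.crawler.CrawlTopology.
import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
import com.digitalpebble.storm.crawler.bolt.ParserBolt;
import com.digitalpebble.storm.crawler.bolt.URLPartitionerBolt;

public class SketchCrawlTopology {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new ReplayingURLSpout()); // sketch spout from the previous slide

        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");

        // Grouping! fieldsGrouping on "key" sends all URLs of a given
        // host / domain / IP to the same FetcherBolt instance:
        // politeness (one internal queue per key) and data locality.
        builder.setBolt("fetch", new FetcherBolt(), 4)
               .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("parse", new ParserBolt())
               .localOrShuffleGrouping("fetch");

        // ... plus an indexer bolt and, for recursive crawls, a status updater

        Config conf = new Config();
        // use StormSubmitter.submitTopology(...) instead to run on a real cluster
        new LocalCluster().submitTopology("crawl", conf, builder.createTopology());
    }
}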
16 / 31
Launching the Storm Crawler
Modify resources in src/main/resources
– URLFilters, ParseFilters, etc.

Build the uber-jar with Maven
– mvn clean package

Set the configuration
– YAML file (e.g. crawler-conf.yaml)

Call Storm
– storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
17 / 31
18 / 31
Comparison with Nutch
19 / 31
Comparison with Nutch
Nutch is batch-driven : little control over when URLs are fetched
– Potential issue for use cases that need sessions
– latency++

Fetching is only one of the steps in Nutch
– S/C : 'always be fetching' (Ken Krugler); better use of resources

S/C makes it even more flexible
– Typical case : a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components

Not as ready-to-use as Nutch : it's an SDK

Would not have existed without Nutch
– Borrowed code and concepts
20 / 31
Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
21 / 31
Resources
Separated into core vs external

Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …

External
– Metrics-related : incl. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics

Also user-maintained external resources
22 / 31
FetcherBolt
Multi-threaded

Polite (the queueing idea is sketched below)
– Puts incoming tuples into internal queues based on IP / domain / hostname
– Sets a delay between requests from the same queue
– Respects robots.txt

Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
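Not the FetcherBolt's actual code, but a minimal sketch of the politeness idea it implements: one internal queue per host (it could equally be per domain or IP) and a minimum delay between two requests taken from the same queue. A real fetcher also checks robots.txt before fetching.

import java.net.URL;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Simplified illustration of politeness queues (not the FetcherBolt itself)
public class PolitenessQueues {

    private final long minDelayMs;
    private final Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
    private final Map<String, Long> lastRequest = new HashMap<String, Long>();

    public PolitenessQueues(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    public void add(String url) throws Exception {
        String host = new URL(url).getHost();
        Queue<String> queue = queues.get(host);
        if (queue == null) {
            queue = new LinkedList<String>();
            queues.put(host, queue);
        }
        queue.add(url);
    }

    /** Returns a URL that may be fetched now, or null if every queue must still wait. */
    public String poll() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Queue<String>> entry : queues.entrySet()) {
            Long last = lastRequest.get(entry.getKey());
            boolean allowed = (last == null) || (now - last >= minDelayMs);
            if (allowed && !entry.getValue().isEmpty()) {
                lastRequest.put(entry.getKey(), now);
                return entry.getValue().poll();
            }
        }
        return null;
    }
}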
23 / 31
ParserBolt
Based on Apache Tika

Supports most commonly used document formats
– HTML, PDF, DOC, etc.

Calls ParseFilters on the document
– e.g. scrape info with XpathFilter

Calls URLFilters on the outlinks
– e.g. normalize and / or blacklist URLs based on RegExps
24 / 31
Other resources
URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output :
  • String URL
  • String key
  • String metadata
– Useful for fieldsGrouping

URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in a JSON file
– Interface :
  • String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
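A hedged sketch of a filter with the method shape above: it blacklists outlinks matching a regular expression, where returning null removes the URL and returning a (possibly rewritten) string keeps it. A real filter implements the project's URLFilter interface and is declared in the JSON configuration file; the Metadata package and the pattern below are assumptions for illustration.

import java.net.URL;
import java.util.regex.Pattern;

import com.digitalpebble.storm.crawler.Metadata; // assumed package

// Sketch only: drops outlinks pointing to static resources
public class ImageBlacklistFilter {

    private final Pattern blacklist = Pattern.compile("(?i)\\.(gif|jpe?g|png|css|js)$");

    /** Returns the URL to keep (possibly rewritten), or null to remove it. */
    public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
        if (blacklist.matcher(urlToFilter).find()) {
            return null; // remove it
        }
        return urlToFilter; // no rewriting in this example
    }
}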
25 / 31
Other resources
ParseFilters
– Called from the ParserBolt
– Extract information from the document
– Configured in a JSON file
– Interface :
  • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
  • XPath expressions
  • Extracted info is stored in the metadata
  • Used later for indexing
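A hedged sketch in the spirit of XpathFilter: a class with the method shape above that evaluates an XPath expression on the parsed DOM and keeps the result in the metadata so the indexer can use it later. The Metadata method name, the metadata key and the XPath expression are illustrative assumptions (the exact expression depends on how the parser builds the DOM).

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.DocumentFragment;

import com.digitalpebble.storm.crawler.Metadata; // assumed package

// Sketch only: extracts the page title and stores it in the metadata
public class TitleExtractionFilter {

    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public void filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata) {
        try {
            String title = xpath.evaluate("//TITLE", doc); // expression depends on the DOM produced
            if (title != null && !title.isEmpty()) {
                metadata.setValue("parse.title", title); // method name and key assumed
            }
        } catch (Exception e) {
            // a real filter would log this; extraction problems should not fail the tuple
        }
    }
}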
26 / 31
Integrate it!
Write the Spout for your use case
– Existing resources will work fine as long as it emits <URL, Metadata>

Typical scenario
– Group the URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending (see the config sketch below)
– So that politeness can be enforced without tuples timing out → fail
– Parse and extract
– Send new URLs to the queues

Can use various forms of persistence for the URLs
– ElasticSearch, DynamoDB, HBase, etc.
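A small hedged sketch of the throttling mentioned in the scenario above: cap the number of pending (un-acked) tuples per spout and give tuples enough time to be fetched politely before Storm times them out and fails them. The values are purely illustrative.

import backtype.storm.Config;

// Illustration: settings typically tuned so that politeness delays in the fetcher
// do not cause tuples to time out and fail (values are examples only)
public class ThrottlingConf {

    public static Config build() {
        Config conf = new Config();
        conf.setMaxSpoutPending(100);    // topology.max.spout.pending: cap un-acked tuples per spout
        conf.setMessageTimeoutSecs(300); // topology.message.timeout.secs: leave time for slow, polite fetches
        return conf;
    }
}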
27 / 31
Some use cases (prototype stage)
Processing of streams of URLs (natural fit for Storm)
– http://www.weborama.com

Monitoring of a finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing

One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing

Recursive crawler
– http://www.shopstyle.com : discovery of product pages
28 / 31
What's next?
0.5 released soon?
All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
Additional Parse/URLFilters
Facilitate writing IndexingBolts
More tests and documentation
A nice logo or better name?
29 / 31
Resources
https://storm.apache.org/
“Storm Applied” – http://www.manning.com/sallen/
https://github.com/DigitalPebble/storm-crawler
http://nutch.apache.org/
30 / 31
Questions?
31 / 31