DESCRIPTION
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims to provide easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk covers the decisions made during the design and development of the platform, from both a technical and a functional perspective. It introduces the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices are illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
A Big Data Refinery Built on HBase
Stanislav Barton, Internet Memory Research
A Content-oriented platform
• Mignify = a content-oriented platform which
– Continuously (almost) ingests Web documents
– Stores and preserves these documents as such AND
– Produces structured content extracted (from single documents) and aggregated (from groups of documents)
– Organizes raw documents, extracted information and aggregated information in a consistent information space
=> Original documents and extracted information are uniformly stored in HBase
A Service-oriented platform
• Mignify = a service-oriented platform
– Physical storage layer on "custom hardware"
– Crawl on-demand, with sophisticated navigation options
– Able to host third-party extractors, aggregators and classifiers
– Run new algorithms on existing collections as new data arrives
– Supports search, navigation, on-demand query and extraction

=> A "Web Observatory" built around Hadoop/HBase.
Customers/Users
• Web Archivists – store and organize, live search
• Search engineers – organize and refine, big throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
Talk Outline
• Architecture Overview
• Use Cases/Scenarios
• Data Model
• Queries / Query Language / Query Processing
• Usage Examples
• Alternative HW Platform
Overview of Mignify
Typical scenarios
• Full text indexers
– Collect documents, with metadata and graph information
• Wrapper extraction
– Get structured information from web sites
• Entity annotation
– Annotate documents with entity references
• Classification
– Aggregate subsets (e.g., domains); assign them to topics
Data in HBase
• HBase as a first-choice data store:
– Inherent versioning (timestamped values)
– Real-time access (cache, index, key ordering)
– Column-oriented storage
– Seamless integration with Hadoop
– Big community
– Production-ready/mature implementation
Data model
• Data stored in one big table along with metadata and extraction results
• Though separated into column families – CFs act as secondary indices
• Raw data stored in HBase (< 10 MB; HDFS otherwise)
• Data stored as rows (versioned)
• Values are typed
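The row layout above – a (column family, qualifier) pair holding timestamped versioned values – can be mimicked with plain Java collections. A minimal sketch, purely illustrative (class and method names are not Mignify's):

```java
import java.util.*;

// Toy model of an HBase-style row: (CF, qualifier) -> timestamp -> raw byte[] value.
class RowSketch {
    // Reverse-ordered TreeMap so the newest version comes first, as in HBase.
    private final Map<String, NavigableMap<Long, byte[]>> cells = new HashMap<>();

    public void put(String cf, String qualifier, long ts, byte[] value) {
        cells.computeIfAbsent(cf + ":" + qualifier,
                k -> new TreeMap<>(Comparator.reverseOrder())).put(ts, value);
    }

    // Latest version of an attribute, or null if absent.
    public byte[] latest(String cf, String qualifier) {
        NavigableMap<Long, byte[]> versions = cells.get(cf + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Value as of a given timestamp: newest version with timestamp <= ts.
    public byte[] asOf(String cf, String qualifier, long ts) {
        NavigableMap<Long, byte[]> versions = cells.get(cf + ":" + qualifier);
        if (versions == null) return null;
        // In a reverse-ordered map, ceilingEntry(ts) is the greatest real timestamp <= ts.
        Map.Entry<Long, byte[]> e = versions.ceilingEntry(ts);
        return e == null ? null : e.getValue();
    }
}
```

The reverse ordering mirrors HBase's own key layout, where the most recent cell version is read first.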
Types and schema, where and why
• Initially, our collections consist of purely unstructured data
• The whole point of Mignify is to produce a backbone of structured information
• This is done through an iterative process that progressively builds a rich set of interrelated annotations
• Typing is important to ensure the safety of the whole process and its automation
Version1: (A1, …, Ak), Version2: (Ak+1, …, Am), …
Data model II
[Diagram: an HTable in HBase stores each Resource as a row spread over column families CF1 … CFn, persisted in HFiles. A cell is <(CF, Qualifier, timestamp), value> with values as raw byte[]: <(CF1,Qa,t'),v1>, <(CF1,Qb,t''),v2>, …, <(CFn,Qz,t'''),vm>. Versions group the values by timestamp: <t', {V1, …, Vk}>, <t'', {Vk+1, …, Vm}>, …, <t''', {Vm+1, …, Vl}>. The schema maps each attribute A = <CF, Qualifier> to a Type (Writable or byte[]).]
Extraction Platform – Main Principles
• A framework for specifying data extraction from very large datasets
• Easy integration and application of new extractors
• High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
• An extraction process specification combines these elements
• [Currently] a single extractor engine: based on the specification, data extraction is processed by a single, generic MapReduce job
Extraction Platform – Main Concepts
• Important: typing (we care about types and schemas!)
• Input and Output Pipes
– Declare data sources (e.g., an HBase collection) and data sinks (e.g., HBase, HDFS, CSV, …)
• Filters (Boolean operators that apply to input data)
• Extractors
– Take an input Resource, produce Features
• Views
– Combination of input and output pipes, filters and extractors
Data Queries
• Various data sources (HTable, data files, …)
• Projections using column families and qualifiers
• Selections by HBase filters:

FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(
    Bytes.toBytes("meta"),
    Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL,
    Bytes.toBytes("text/html"));
ret.addFilter(f1);
• Query results are either flushed to files or written back to HBase -> materialized views
• Views are defined per collection as a set of pairs of Extractors (user-defined functions) and Filters
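The filter/extractor pairing can be sketched with plain Java functional interfaces; a row is reduced here to a map from "cf:qualifier" to value, and all names are illustrative rather than Mignify's actual API:

```java
import java.util.*;
import java.util.function.*;

// Sketch of a View: a Boolean filter deciding which rows qualify,
// plus an extractor producing new features from qualifying rows.
class ViewSketch {
    static Map<String, String> applyView(Map<String, String> row,
                                         Predicate<Map<String, String>> filter,
                                         Function<Map<String, String>, Map<String, String>> extractor) {
        // Filter first; run the extractor only on rows that pass.
        return filter.test(row) ? extractor.apply(row) : Collections.emptyMap();
    }
}
```

A usage example in the spirit of the slides: filter on detected mime type "text/html", extract plain text from the payload.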
Query Language (DML)
• For each employee with salary of at least 2,000, compute total costs:
– SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
– f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect mime type and language
• For each RSS feed, get a summary
• For each HTML page, extract plain text
• Currently a wizard producing a JSON doc
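For illustration, a view specification emitted by such a wizard might look like the following; every field name here is hypothetical, not Mignify's actual format:

```json
{
  "view": "html_plaintext",
  "input": { "type": "hbase", "table": "web_collection" },
  "filter": {
    "family": "meta",
    "qualifier": "detectedMime",
    "op": "EQUAL",
    "value": "text/html"
  },
  "extractors": [ "PlainTextExtractor", "LanguageDetector" ],
  "output": { "type": "hbase", "family": "analytics" }
}
```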
User functions
• List<byte[]> f(Row r)
• May calculate new attribute values, stored with the Row and reused by other functions
• Execution plan: order matters!
• Each function declares a description of its input and output fields
– Field dependencies give the order
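The dependency-driven ordering can be sketched as a simple planner: a function becomes runnable once all its declared input fields are available, and its output fields become available to later functions. This is an illustration of the idea only, not Mignify's planner:

```java
import java.util.*;

// Orders user functions by field dependencies: run a function once all its
// input fields exist; its output fields then unlock further functions.
class PlanSketch {
    static List<String> order(Map<String, List<String>> inputs,
                              Map<String, List<String>> outputs,
                              Set<String> initialFields) {
        List<String> plan = new ArrayList<>();
        Set<String> available = new HashSet<>(initialFields);
        Set<String> pending = new HashSet<>(inputs.keySet());
        boolean progress = true;
        while (!pending.isEmpty() && progress) {
            progress = false;
            for (Iterator<String> it = pending.iterator(); it.hasNext(); ) {
                String f = it.next();
                if (available.containsAll(inputs.get(f))) {
                    plan.add(f);
                    available.addAll(outputs.getOrDefault(f, List.of()));
                    it.remove();
                    progress = true;
                }
            }
        }
        if (!pending.isEmpty())
            throw new IllegalStateException("unsatisfiable dependencies: " + pending);
        return plan;
    }
}
```

With functions such as mime detection (payload -> mime), text extraction (payload, mime -> plainText) and language detection (plainText -> lang), the plan necessarily runs them in that dependency order.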
Input Pipes
• Defines how to get the data
– Archive files, text files, HBase table
– Format
– The Mappers always receive a Resource on input; several custom InputFormats and RecordReaders
Output pipes
• Defines what to do with (where to store) the query result
– File, table
– Format
– Which columns to materialize
• Most of the time the PutSortReducer is used, so the OP defines the OutputFormat and what to emit
Query Processing
• With typed data, the Resource wrapper, and IPs and OPs – one universal MapReduce job to execute/process queries!
• Most efficient (for data insertion): a MapReduce job with a custom Mapper and the PutSortReducer
• Job init: build the combined filter; IP and OP define the input and output formats
• Mapper setup: initialize the plan of user-function applications and the functions themselves
• Mapper map: apply functions on a row according to the plan, use the OP to emit values
• Not all combinations can leverage the PutSortReducer (writing to one table at a time, …)
Query Processing II
[Diagram: input pipes (archive files, HBase tables, data files) feed a single Map/Reduce job; the Map phase applies Views (Filters + Extractors) to each Resource; results flow through Reduce to HBase, HFiles or plain files; a Co-scanner runs alongside the created HFiles.]
Data Subscriptions / Views

• Data-in-a-View satisfaction can be checked at ingestion time, before the data is inserted
• Mimics client-side coprocessors – allowing the use of bulk loading (no coprocessors for bulk load at the moment)
• When new data arrives, user functions/actions are triggered
– On-demand crawls, focused crawls
Second Level Triggers
• User code run in the reduce phase (when ingesting):
– Put f(Put p, Row r)
– Previous versions are on the input of the code; it can alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the processed region, aligned with the keys in the created HFile
• Example: change detection on a re-crawled resource
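The change-detection trigger can be sketched in miniature: before the new version is flushed, compare it with the previous version and enrich the outgoing record with a change flag. Rows are simplified to maps and all names are illustrative:

```java
import java.util.*;

// Sketch of a second-level trigger, mimicking Put f(Put p, Row r):
// compare the new payload with the previous version and record the result.
class ChangeDetectSketch {
    static Map<String, String> trigger(Map<String, String> put,
                                       Map<String, String> previous) {
        Map<String, String> enriched = new HashMap<>(put);
        String oldPayload = previous == null ? null : previous.get("content:payload");
        // Mark whether the re-crawled resource differs from its last version.
        boolean changed = !Objects.equals(oldPayload, put.get("content:payload"));
        enriched.put("analytics:changed", Boolean.toString(changed));
        return enriched;
    }
}
```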
Aggregation Queries

• Compute the frequency of mime types in the collection
• For a web domain, compute spammicity using word distribution and out-links
• Based on a two-step aggregation process
– Data is extracted in the Map(), emitted with the View signature
– Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated
Aggregation Queries Processing
• Processed using MapReduce; multiple (compatible) aggregations run at once (reading the data is the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), Pair = <Agg_key, value>, Agg_key = <agg_id, agg_on_value, …>
• Aggregation reduce phase: reduce(Agg_key, values, context)
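The shuffle-and-reduce half of a count(*) aggregation can be simulated with plain Java: the map phase emits (aggregation key, value) pairs, and grouping on the composite key yields the counts. A sketch using the mime-per-domain example (names illustrative):

```java
import java.util.*;

// Simulates the reduce side of a two-step count(*) aggregation:
// emitted pairs are grouped on a composite (pld, mime) key and counted.
class AggSketch {
    static Map<String, Long> countByKey(List<String[]> emitted) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] pair : emitted) {
            String key = pair[0] + "|" + pair[1]; // composite aggregation key
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }
}
```

In the real job, multiple compatible aggregations share one pass over the data, since reading is the dominant cost; each emitted pair then also carries its aggregation id.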
Aggregation Processing Numbers
Compute the mime type distribution of web pages per PLD:

SELECT extract_pld(url) AS pld, mime, count(*)
FROM web_collection
GROUP BY extract_pld(url), mime
Data Ingestion
• Our crawler asynchronously writes to HBase:
– Input: archive files (ARC, WARC) in HDFS
– Output: HTable

SELECT *, f1(*), …, fn(*) FROM hdfs://path/*.warc.gz

1. Pre-compute the region split boundaries on a data sample
– MapReduce on an input data sample
2. Process a batch (~0.5 TB) with a MapReduce ingestion job
3. Manually split regions that are too big (or about to become so)
4. If there is still input, go to 2.
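Step 1 above amounts to choosing split points from a sample of row keys so that each region receives a roughly equal share. A minimal sketch of that idea (the real job samples keys with MapReduce; names are illustrative):

```java
import java.util.*;

// Sketch of pre-splitting: sort a sample of row keys and take evenly
// spaced quantiles as region boundaries, so regions get balanced load.
class SplitSketch {
    static List<String> splitPoints(List<String> sampledKeys, int regions) {
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<String> boundaries = new ArrayList<>();
        // n regions need n-1 boundaries, one per quantile.
        for (int i = 1; i < regions; i++) {
            boundaries.add(sorted.get(i * sorted.size() / regions));
        }
        return boundaries;
    }
}
```

Pre-splitting this way avoids the hot-spotting and on-the-fly region splits that a bulk ingestion into a single initial region would cause.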
Data Ingestion Numbers
• Store indexable web resources from WARC files to HBase; detect mime type and language, extract plain text and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of archive files in HDFS)
• HBase is idle most of the time!
– Allows compacting store files in the meanwhile
Web Archive Collection
• Column families (basic set):
1. Raw content (payload)
2. Metadata (mime, IP, …)
3. Baseline analytics (plain text, detected mime, …)
• Usually one additional CF per analytics result
• CFs as secondary indices:
– All analyzed feeds in one place (no need for a filter if I am interested in all such rows)
Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out
HW Architecture
• Tens of small low-consumption nodes with a lot of disk space:
– 15 TB per node, 8 GB RAM, dual-core CPU
– No enclosure -> no active cooling -> no expensive datacenter-ish environment needed
• Low per-PB storage price (70 nodes/PB), car batteries as UPS, commodity (really low-cost) HW (esp. disks)
• Still reasonable computational power
Conclusions
• Conclusions
– Data refinery platform
– Customizable, extensible
– Large scale
• Future work
– Incorporating external secondary indices to filter HBase rows/cells
• Full-text index filtering
• Temporal filtering
– Larger (100s of TBs) scale deployment
Acknowledgments
• European Union projects:
– LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments