DESCRIPTION
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims to provide easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk covers the decisions made during the design and development of the platform, from both a technical and a functional perspective. It introduces the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices are illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
A Big Data Refinery Built on HBase
Stanislav Barton, Internet Memory Research
A Content-oriented platform
• Mignify = a content-oriented platform which
– Continuously (almost) ingests Web documents
– Stores and preserves these documents as such AND
– Produces structured content extracted (from single documents) and aggregated (from groups of documents)
– Organizes raw documents, extracted information and aggregated information in a consistent information space
=> Original documents and extracted information are uniformly stored in HBase
A Service-oriented platform
• Mignify = a service-oriented platform
– Physical storage layer on "custom hardware"
– Crawl on-demand, with sophisticated navigation options
– Able to host third-party extractors, aggregators and classifiers
– Run new algorithms on existing collections as new data arrives
– Supports search, navigation, on-demand query and extraction

=> A "Web Observatory" built around Hadoop/HBase.
Customers/Users
• Web Archivists – store and organize, live search
• Search engineers – organize and refine, big throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
Talk Outline
• Architecture Overview
• Use Cases/Scenarios
• Data Model
• Queries / Query Language / Query Processing
• Usage Examples
• Alternative HW Platform
Overview of Mignify
Typical scenarios
• Full text indexers
– Collect documents, with metadata and graph information
• Wrapper extraction
– Get structured information from web sites
• Entity annotation
– Annotate documents with entity references
• Classification
– Aggregate subsets (e.g., domains); assign them to topics
Data in HBase
• HBase as a first-choice data store:
– Inherent versioning (timestamped values)
– Real-time access (cache, index, key ordering)
– Column-oriented storage
– Seamless integration with Hadoop
– Big community
– Production-ready/mature implementation
Data model
• Data stored in one big table along with metadata and extraction results
• Though separated into column families – CFs act as secondary indices
• Raw data stored in HBase (< 10 MB; HDFS otherwise)
• Data stored as rows (versioned)
• Values are typed
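The row layout above – a (column family, qualifier) pair holding timestamped versioned values – can be mimicked with plain Java collections. A minimal sketch, purely illustrative (class and method names are not Mignify's):

```java
import java.util.*;

// Toy model of an HBase-style row: (CF, qualifier) -> timestamp -> raw byte[] value.
class RowSketch {
    // Reverse-ordered TreeMap so the newest version comes first, as in HBase.
    private final Map<String, NavigableMap<Long, byte[]>> cells = new HashMap<>();

    public void put(String cf, String qualifier, long ts, byte[] value) {
        cells.computeIfAbsent(cf + ":" + qualifier,
                k -> new TreeMap<>(Comparator.reverseOrder())).put(ts, value);
    }

    // Latest version of an attribute, or null if absent.
    public byte[] latest(String cf, String qualifier) {
        NavigableMap<Long, byte[]> versions = cells.get(cf + ":" + qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Value as of a given timestamp: newest version with timestamp <= ts.
    public byte[] asOf(String cf, String qualifier, long ts) {
        NavigableMap<Long, byte[]> versions = cells.get(cf + ":" + qualifier);
        if (versions == null) return null;
        // In a reverse-ordered map, ceilingEntry(ts) is the greatest real timestamp <= ts.
        Map.Entry<Long, byte[]> e = versions.ceilingEntry(ts);
        return e == null ? null : e.getValue();
    }
}
```

The reverse ordering mirrors HBase's own key layout, where the most recent cell version is read first.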
Types and schema, where and why
• Initially, our collections consist of purely unstructured data
• The whole point of Mignify is to produce a backbone of structured information
• This is done through an iterative process that progressively builds a rich set of interrelated annotations
• Typing is important to ensure the safety of the whole process and its automation
Version1: (A1, …, Ak), Version2: (Ak+1, …, Am), …
Data model II
[Diagram: an HTable in HBase stores each Resource as a row spread over column families CF1 … CFn, persisted in HFiles. A cell is <(CF, Qualifier, timestamp), value> with values as raw byte[]: <(CF1,Qa,t'),v1>, <(CF1,Qb,t''),v2>, …, <(CFn,Qz,t'''),vm>. Versions group the values by timestamp: <t', {V1, …, Vk}>, <t'', {Vk+1, …, Vm}>, …, <t''', {Vm+1, …, Vl}>. The schema maps each attribute A = <CF, Qualifier> to a Type (Writable or byte[]).]
Extraction Platform – Main Principles
• A framework for specifying data extraction from very large datasets
• Easy integration and application of new extractors
• High level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
• An extraction process specification combines these elements
• [Currently] a single extractor engine: based on the specification, data extraction is processed by a single, generic MapReduce job
Extraction Platform – Main Concepts
• Important: typing (we care about types and schemas!)
• Input and Output Pipes
– Declare data sources (e.g., an HBase collection) and data sinks (e.g., HBase, HDFS, CSV, …)
• Filters (Boolean operators that apply to input data)
• Extractors
– Take an input Resource, produce Features
• Views
– Combination of input and output pipes, filters and extractors
Data Queries
• Various data sources (HTable, data files, …)
• Projections using column families and qualifiers
• Selections by HBase filters:

FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
Filter f1 = new SingleColumnValueFilter(
    Bytes.toBytes("meta"),
    Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL,
    Bytes.toBytes("text/html"));
ret.addFilter(f1);
• Query results are either flushed to files or written back to HBase -> materialized views
• Views are defined per collection as a set of pairs of Extractors (user-defined functions) and Filters
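The filter/extractor pairing can be sketched with plain Java functional interfaces; a row is reduced here to a map from "cf:qualifier" to value, and all names are illustrative rather than Mignify's actual API:

```java
import java.util.*;
import java.util.function.*;

// Sketch of a View: a Boolean filter deciding which rows qualify,
// plus an extractor producing new features from qualifying rows.
class ViewSketch {
    static Map<String, String> applyView(Map<String, String> row,
                                         Predicate<Map<String, String>> filter,
                                         Function<Map<String, String>, Map<String, String>> extractor) {
        // Filter first; run the extractor only on rows that pass.
        return filter.test(row) ? extractor.apply(row) : Collections.emptyMap();
    }
}
```

A usage example in the spirit of the slides: filter on detected mime type "text/html", extract plain text from the payload.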
Query Language (DML)
• For each employee with salary of at least 2,000, compute total costs:
– SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
– f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect mime type and language
• For each RSS feed, get a summary
• For each HTML page, extract plain text
• Currently a wizard producing a JSON doc
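For illustration, a view specification emitted by such a wizard might look like the following; every field name here is hypothetical, not Mignify's actual format:

```json
{
  "view": "html_plaintext",
  "input": { "type": "hbase", "table": "web_collection" },
  "filter": {
    "family": "meta",
    "qualifier": "detectedMime",
    "op": "EQUAL",
    "value": "text/html"
  },
  "extractors": [ "PlainTextExtractor", "LanguageDetector" ],
  "output": { "type": "hbase", "family": "analytics" }
}
```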
User functions
• List<byte[]> f(Row r)
• May calculate new attribute values, stored with the Row and reused by other functions
• Execution plan: order matters!
• Each function declares a description of its input and output fields
– Field dependencies give the order
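The dependency-driven ordering can be sketched as a simple planner: a function becomes runnable once all its declared input fields are available, and its output fields become available to later functions. This is an illustration of the idea only, not Mignify's planner:

```java
import java.util.*;

// Orders user functions by field dependencies: run a function once all its
// input fields exist; its output fields then unlock further functions.
class PlanSketch {
    static List<String> order(Map<String, List<String>> inputs,
                              Map<String, List<String>> outputs,
                              Set<String> initialFields) {
        List<String> plan = new ArrayList<>();
        Set<String> available = new HashSet<>(initialFields);
        Set<String> pending = new HashSet<>(inputs.keySet());
        boolean progress = true;
        while (!pending.isEmpty() && progress) {
            progress = false;
            for (Iterator<String> it = pending.iterator(); it.hasNext(); ) {
                String f = it.next();
                if (available.containsAll(inputs.get(f))) {
                    plan.add(f);
                    available.addAll(outputs.getOrDefault(f, List.of()));
                    it.remove();
                    progress = true;
                }
            }
        }
        if (!pending.isEmpty())
            throw new IllegalStateException("unsatisfiable dependencies: " + pending);
        return plan;
    }
}
```

With functions such as mime detection (payload -> mime), text extraction (payload, mime -> plainText) and language detection (plainText -> lang), the plan necessarily runs them in that dependency order.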
Input Pipes
• Defines how to get the data
– Archive files, text files, HBase table
– Format
– The Mappers always receive a Resource on input; several custom InputFormats and RecordReaders
Output pipes
• Defines what to do with (where to store) the query result
– File, table
– Format
– Which columns to materialize
• Most of the time the PutSortReducer is used, so the OP defines the OutputFormat and what to emit
Query Processing
• With typed data, the Resource wrapper, and IPs and OPs – one universal MapReduce job to execute/process queries!
• Most efficient (for data insertion): a MapReduce job with a custom Mapper and the PutSortReducer
• Job init: build the combined filter; IP and OP define the input and output formats
• Mapper setup: initialize the plan of user-function applications and the functions themselves
• Mapper map: apply functions on a row according to the plan, use the OP to emit values
• Not all combinations can leverage the PutSortReducer (writing to one table at a time, …)
Query Processing II
[Diagram: input pipes (archive files, HBase tables, data files) feed a single Map/Reduce job; the Map phase applies Views (Filters + Extractors) to each Resource; results flow through Reduce to HBase, HFiles or plain files; a Co-scanner runs alongside the created HFiles.]
Data Subscriptions / Views

• Data-in-a-View satisfaction can be checked at ingestion time, before the data is inserted
• Mimics client-side coprocessors – allowing the use of bulk loading (no coprocessors for bulk load at the moment)
• When new data arrives, user functions/actions are triggered
– On-demand crawls, focused crawls
Second Level Triggers
• User code run in the reduce phase (when ingesting):
– Put f(Put p, Row r)
– Previous versions are on the input of the code; it can alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the processed region, aligned with the keys in the created HFile
• Example: change detection on a re-crawled resource
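The change-detection trigger can be sketched in miniature: before the new version is flushed, compare it with the previous version and enrich the outgoing record with a change flag. Rows are simplified to maps and all names are illustrative:

```java
import java.util.*;

// Sketch of a second-level trigger, mimicking Put f(Put p, Row r):
// compare the new payload with the previous version and record the result.
class ChangeDetectSketch {
    static Map<String, String> trigger(Map<String, String> put,
                                       Map<String, String> previous) {
        Map<String, String> enriched = new HashMap<>(put);
        String oldPayload = previous == null ? null : previous.get("content:payload");
        // Mark whether the re-crawled resource differs from its last version.
        boolean changed = !Objects.equals(oldPayload, put.get("content:payload"));
        enriched.put("analytics:changed", Boolean.toString(changed));
        return enriched;
    }
}
```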
Aggregation Queries

• Compute the frequency of mime types in the collection
• For a web domain, compute spammicity using word distribution and out-links
• Based on a two-step aggregation process
– Data is extracted in the Map(), emitted with the View signature
– Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated
Aggregation Queries Processing
• Processed using MapReduce; multiple (compatible) aggregations run at once (reading the data is the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), Pair = <Agg_key, value>, Agg_key = <agg_id, agg_on_value, …>
• Aggregation reduce phase: reduce(Agg_key, values, context)
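The shuffle-and-reduce half of a count(*) aggregation can be simulated with plain Java: the map phase emits (aggregation key, value) pairs, and grouping on the composite key yields the counts. A sketch using the mime-per-domain example (names illustrative):

```java
import java.util.*;

// Simulates the reduce side of a two-step count(*) aggregation:
// emitted pairs are grouped on a composite (pld, mime) key and counted.
class AggSketch {
    static Map<String, Long> countByKey(List<String[]> emitted) {
        Map<String, Long> counts = new TreeMap<>();
        for (String[] pair : emitted) {
            String key = pair[0] + "|" + pair[1]; // composite aggregation key
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }
}
```

In the real job, multiple compatible aggregations share one pass over the data, since reading is the dominant cost; each emitted pair then also carries its aggregation id.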
Aggregation Processing Numbers
Compute the mime type distribution of web pages per PLD:

SELECT extract_pld(url) AS pld, mime, count(*)
FROM web_collection
GROUP BY extract_pld(url), mime
Data Ingestion
• Our crawler asynchronously writes to HBase:
– Input: archive files (ARC, WARC) in HDFS
– Output: HTable

SELECT *, f1(*), …, fn(*) FROM hdfs://path/*.warc.gz

1. Pre-compute the region split boundaries on a data sample
– MapReduce on an input data sample
2. Process a batch (~0.5 TB) with a MapReduce ingestion job
3. Manually split regions that are too big (or about to become so)
4. If there is still input, go to 2.
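Step 1 above amounts to choosing split points from a sample of row keys so that each region receives a roughly equal share. A minimal sketch of that idea (the real job samples keys with MapReduce; names are illustrative):

```java
import java.util.*;

// Sketch of pre-splitting: sort a sample of row keys and take evenly
// spaced quantiles as region boundaries, so regions get balanced load.
class SplitSketch {
    static List<String> splitPoints(List<String> sampledKeys, int regions) {
        List<String> sorted = new ArrayList<>(sampledKeys);
        Collections.sort(sorted);
        List<String> boundaries = new ArrayList<>();
        // n regions need n-1 boundaries, one per quantile.
        for (int i = 1; i < regions; i++) {
            boundaries.add(sorted.get(i * sorted.size() / regions));
        }
        return boundaries;
    }
}
```

Pre-splitting this way avoids the hot-spotting and on-the-fly region splits that a bulk ingestion into a single initial region would cause.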
Data Ingestion Numbers
• Store indexable web resources from WARC files to HBase; detect mime type and language, extract plain text and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of archive files in HDFS)
• HBase is idle most of the time!
– Allows compacting store files in the meanwhile
Web Archive Collection
• Column families (basic set):
1. Raw content (payload)
2. Metadata (mime, IP, …)
3. Baseline analytics (plain text, detected mime, …)
• Usually one additional CF per analytics result
• CFs as secondary indices:
– All analyzed feeds in one place (no need for a filter if I am interested in all such rows)
Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and counting)
• Crawl to store/process machine ratio is 1:1.2
• Storage scales out
HW Architecture
• Tens of small low-consumption nodes with a lot of disk space:
– 15 TB per node, 8 GB RAM, dual-core CPU
– No enclosure -> no active cooling -> no expensive datacenter-ish environment needed
• Low per-PB storage price (70 nodes/PB), car batteries as UPS, commodity (really low-cost) HW (esp. disks)
• Still reasonable computational power
Conclusions
• Conclusions
– Data refinery platform
– Customizable, extensible
– Large scale
• Future work
– Incorporating external secondary indices to filter HBase rows/cells
• Full-text index filtering
• Temporal filtering
– Larger (100s of TBs) scale deployment
Acknowledgments
• European Union projects:
– LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments