Distributed Indexing of Web Scale Datasets for the Cloud

Email:{ikons, eangelou, dtsouma}@cslab.ece.ntua.gr

Computing Systems Laboratory School of Electrical and Computer Engineering

National Technical University of Athens

Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos

Problem

• Increasing data volumes (e-mail, web logs, historical data, click streams) push classic RDBMSs to their limits.

• Centralized indices are slow to create and can’t scale to a large number of concurrent requests.

• Current MapReduce-based data analysis tools (e.g. Hive, Pig) do not answer user queries in near real time.

NoSQL Systems

• NoSQL databases: horizontally scalable, distributed, non-relational data stores.

• Simple but fast queries.

• Relaxed ACID guarantees.

• Examples: Google's Bigtable, Amazon's Dynamo, Facebook's Cassandra and LinkedIn's Voldemort.

• Perfect candidates for cloud infrastructures: the shared-nothing architecture enables elastic scalability.

Our contribution

• A distributed processing framework to index, store and serve large amounts of content data under heavy request loads.

• Content users provide the raw data along with simple indexing rules.

• NoSQL and MapReduce combination:
– MapReduce jobs process the input to create the index.
– Index and content are served through a NoSQL system.

Goals

• Support of almost any type of data
– Unstructured, semi-structured and fully structured.

• Near real-time query response times
– Query execution times should be in the order of milliseconds.

• Scalability (preferably elastic)
– Both in terms of storage space and concurrent user requests.

• Ease of use
– Simple index rules.

Architecture

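[Figure: architecture diagram. Raw Content and Index rules -> Uploader (MapReduce) -> Content table -> Indexer (MapReduce) -> Index table. The Client API offers "Search objects" against the Index table and "Get object" against the Content table.]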

• Raw content is uploaded to HDFS.

• Content with index rules is fed to the Uploader, to create the Content table.

• The Content table is fed to the Indexer, which extracts the Index table.

• The Client API contacts the Index table to perform searches, and the Content table to serve objects.

Architecture - Index rules

• Instructions on what to index.

• Specify record boundaries to split the input into distinct entities.

• Select content regions to index (granularity).

Uploader class

• Crunches the input data to create the Content table using MapReduce.

• Mappers read input records and create HBase rows (one for each record).

• Reducers sort rows and columns and write back to HDFS in HFiles.

• The HBase API is bypassed for speed reasons.

• MD5Hash total order partitioner.
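
The partitioning idea can be illustrated with a minimal sketch. It assumes that row keys are hexadecimal MD5 digests and that the key space is cut into equal-width ranges, one per reducer; the function names and the equal-width split are illustrative assumptions, not the partitioner used in the actual system.

```python
import hashlib


def md5_row_key(record: bytes) -> str:
    """Row key: hexadecimal MD5 digest of the record content."""
    return hashlib.md5(record).hexdigest()


def md5_partition(row_key: str, num_reducers: int) -> int:
    """Assign a hex MD5 row key to a reducer by key range.

    Every key sent to reducer i sorts before every key sent to
    reducer i + 1, so the reducer outputs are totally ordered.
    (Hypothetical helper: equal-width ranges are an assumption.)
    """
    key_space = 16 ** len(row_key)  # number of possible keys of this length
    return int(row_key, 16) * num_reducers // key_space


records = [b"first record", b"second record", b"third record"]
keys = sorted(md5_row_key(r) for r in records)
parts = [md5_partition(k, 4) for k in keys]  # non-decreasing reducer ids
```

Because MD5 digests are spread roughly uniformly over the key space, equal-width ranges should give each reducer a similar share of rows, and each reducer writes one contiguous, sorted key range.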

Content table

• Row key: MD5Hash of the record content.

• Column ids: granularity levels with an increment number.

• Row key and column id specify an HBase cell.

• Cell values contain the content to be indexed.

• A dedicated cell in each record contains the content that will not be indexed.
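
A minimal sketch of how one Content table row could be laid out, using plain Python dictionaries in place of HBase. The `granularity_increment` column id format, the `raw` column name for non-indexed content, and the helper `content_row` are all assumptions made for illustration.

```python
import hashlib


def content_row(fields, unindexed=""):
    """Build one Content-table row from (granularity, text) pairs.

    Returns (row_key, cells), where cells maps column ids to values.
    """
    raw = "".join(text for _, text in fields) + unindexed
    row_key = hashlib.md5(raw.encode("utf-8")).hexdigest()

    counters = {}  # next increment number per granularity level
    cells = {}
    for granularity, text in fields:
        n = counters.get(granularity, 0)
        counters[granularity] = n + 1
        cells[f"{granularity}_{n}"] = text  # assumed column id format

    cells["raw"] = unindexed  # hypothetical column for non-indexed content
    return row_key, cells


row_key, cells = content_row(
    [("title", "Athens"),
     ("revision", "Capital of Greece"),
     ("revision", "Largest city")],
    unindexed="<id>7</id>",
)
```

Here the second `revision` region lands in column `revision_1`, so a row key plus a column id is enough to address a single indexed region.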

Indexer class

• Creates an inverted list of index terms and term locations (the Index table).

• Mapper input is the Content table; mapper output is HBase cells of the Index table.

• Reducers sort the HBase cells (input) and create the appropriate HFiles (output).

• SimpleTotalOrderPartitioner.

Index table

• Row key: index term followed by granularity (e.g. the key is google_revision if “google” was found in a revision tag).

• Column ids: Row key of content table (MD5Hash) along with the granularity increment number that points to the exact cell in the content record.
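
The key layout described above can be sketched with an in-memory inverted index. The sketch assumes Content table column ids of the form `granularity_increment`, lower-cased whitespace tokenization, and short placeholder row keys instead of real MD5 digests; `build_index` is a hypothetical helper, not part of the system.

```python
from collections import defaultdict


def build_index(content_table):
    """Invert a content table into {term_granularity: {rowkey_increment}}."""
    index = defaultdict(set)
    for row_key, cells in content_table.items():
        for column_id, text in cells.items():
            granularity, increment = column_id.rsplit("_", 1)
            for term in text.lower().split():  # assumed tokenization
                index[f"{term}_{granularity}"].add(f"{row_key}_{increment}")
    return index


# Placeholder row keys stand in for MD5 digests.
content_table = {
    "a1b2": {"revision_0": "Google search", "title_0": "Google"},
    "c3d4": {"revision_0": "google maps"},
}
index = build_index(content_table)
# index["google_revision"] holds the locations "a1b2_0" and "c3d4_0"
```

Each index row then answers "where does this term occur at this granularity", and each column id points back to one cell of one content record.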

Client API

• Search for a keyword using the Index table
– Select a level of granularity (HBase Get), or search for all levels (HBase Scan).

• Retrieve an object from the Content table
– Using a simple HBase Get operation.
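
The difference between the two search modes can be sketched without HBase: a point lookup on one exact row key stands in for a Get, and a range walk over sorted row keys sharing the prefix `term_` stands in for a Scan. The class `IndexClient`, its method names, and the sample tables are hypothetical.

```python
from bisect import bisect_left


class IndexClient:
    """In-memory stand-in for the client API (illustrative only)."""

    def __init__(self, index_table, content_table):
        self.index = index_table
        self.keys = sorted(index_table)  # row keys kept in sorted order
        self.content = content_table

    def search(self, term, granularity=None):
        if granularity is not None:
            # Point lookup on one exact row key (the "Get" case).
            return sorted(self.index.get(f"{term}_{granularity}", ()))
        # Range walk over all row keys starting with "term_" (the "Scan" case).
        prefix = f"{term}_"
        hits = set()
        i = bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            hits.update(self.index[self.keys[i]])
            i += 1
        return sorted(hits)

    def get_object(self, location):
        # A location is "rowkey_increment"; return the whole content row.
        row_key, _increment = location.rsplit("_", 1)
        return self.content[row_key]


index_table = {
    "google_revision": {"a1b2_0", "c3d4_0"},
    "google_title": {"a1b2_0"},
    "googleplex_title": {"c3d4_0"},
}
content_table = {"a1b2": {"title_0": "Google"}, "c3d4": {"title_0": "Googleplex"}}
client = IndexClient(index_table, content_table)
title_hits = client.search("google", "title")
all_hits = client.search("google")
```

Including the `_` separator in the scan prefix keeps longer terms such as `googleplex` out of the results for `google`.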

XML indexer

• Users provide
– A specific tag name used to split records.
– A comma-separated list of tag names to index.
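
A minimal sketch of how these two inputs could drive record splitting, using Python's standard `xml.etree.ElementTree`. It parses the whole document in memory, which a web-scale job would not do; the function `split_records` and the sample tags are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def split_records(xml_text, record_tag, index_tags):
    """Split XML into records and extract the regions to index.

    record_tag: tag name that delimits one record.
    index_tags: comma-separated tag names to index.
    Returns one list of (tag, text) pairs per record.
    """
    wanted = [t.strip() for t in index_tags.split(",")]
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(record_tag):
        fields = []
        for tag in wanted:
            for element in record.iter(tag):
                fields.append((tag, (element.text or "").strip()))
        records.append(fields)
    return records


sample = (
    "<wiki>"
    "<page><title>Athens</title><revision>Capital of Greece</revision></page>"
    "<page><title>HBase</title><revision>Column store</revision></page>"
    "</wiki>"
)
records = split_records(sample, "page", "title, revision")
```

Each `(tag, text)` pair corresponds to one region to index, with the tag name acting as the granularity level.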

MySQL indexer

• Assumptions: the dataset comes in two “splits”
– Database description (e.g. obtained from mysqldump using the --no-data option).
– Full data in a single-row dump.

• Retains original information from MySQL schema to allow similar queries

• Follows similar conventions for indexing as the XML indexer, allowing searching with the same queries, without a priori knowledge as to the source of the data.

Experiments

• Indexing speed

• Time vs dataset size: How well does it scale for big data? Currently testing with 23 GB of Wikipedia data; a 1.7 TB test with the full Wikipedia dump is planned. Intermediate sizes can be achieved by manually splitting the datasets.

Experiments (2)

• Time vs number of nodes: What is the speedup achieved by adding more nodes? Current tests were performed on two Hadoop clusters: clones with 10 nodes and xenons with 5 nodes. More nodes are needed…

• Time vs HBase setup: The experiments were performed on “vanilla” installations of HBase. Can speed benefits be achieved by tailoring the installation to our needs?

Future developments

• Index aggregation

• Indexing multiple HBase tables, or even tables from different HBase masters, thus creating a central repository for similar (and dissimilar) information for an organization.

• Technical challenge: creating an algorithm that can find similarities between the tables, thus allowing “smarter” queries.

Questions
