Extreme computing Databases and MapReduce · Extreme computing Databases and MapReduce Stratis D. Viglas School of Informatics University of Edinburgh Stratis D. Viglas Databases

Extreme computingDatabases and MapReduce

Stratis D. Viglas

School of InformaticsUniversity of Edinburgh

Stratis D. Viglas www.inf.ed.ac.uk

Databases and MapReduce BigTable

Outline

Databases and MapReduceOverviewRelational databasesRelational data processing on Hadoop MRBigTableHive and Pig



A different data model

• BigTable’s data model is not relational• A table is “a sparse, distributed, persistent multidimensional sorted

map”• The map is indexed by a triplet

• (row:string, column:string, time:int64)• row and column are keys, time is a timestamp

• Bigtables are mutable at the row level• Support for insertions, deletions, lookups



Rows and columns in more detail

"<html>..."

"<html>..."

"<html>..."

"CNN" "CNN.com"

t3t5

t6

t9 t8com.cnn.www

contents: anchor:cnnsi.com anchor:my.look.ca

• Rows are maintained in sorted lexicographic order• Applications can exploit this property for efficient row scans• Row ranges dynamically partitioned into tablets

• Columns grouped into column families• Column key = family:qualifier• Column families provide locality hints• Unbounded number of columns per table



Building blocks: SSTable

• The smallest and most basic building block• Persistent immutable map from keys to values

• Stored in GFS• Sequence of disk blocks with a (persistent) index for lookup• Memory-mapped for fast operation

• Two supported operations• Given a key, look up the value associated with it• Iterate over key/value pairs within a given key range

64kBblock

64kBblock

64kBblock

Index

SSTable



Building blocks: Tablets and Tables

• Dynamically partitioned range of rows• Built from multiple SSTables

64kBblock

64kBblock

64kBblock

Index

SSTable

64kBblock

64kBblock

64kBblock

Index

SSTable

Tablet start: aardvark end: apple

• Multiple tablets make up a table• SSTables can be shared beween tablets

SSTable

Tabletaardvark apple

SSTable SSTable SSTable

Tabletapplepie boat



Notes on the architecture

• Similar to GFS

• Single master server, multiple tablet servers

• BigTable master

• Assigns tablets to tablet servers

• Detects addition and expiration of tablet servers

• Balances tablet server load

• Handles garbage collection

• Handles schema evolution

• Bigtable tablet servers

• Each tablet server manages a set of tablets

• Typically between ten to a thousand tablets

• Each 100− 200MB by default

• Handles read and write requests to the tablets

• Splits tablets when they grow too large



Location dereferencing

Chubby file ...

...

...

...

...

...

...

...

...

...

...

Other metadatatablets

Root tablet(1st metadata level)master file

User table 1

User table nchubby: replicated, persistent lock service; maintains tablet server locations

root tablet: root of the metadata tree

at most three levels in the metadata hierarchy

B-tree like structure, indexed by table identifier and end row



Tablet assignment

• Master keeps track of• Set of live tablet servers• Assignment of tablets to tablet servers• Unassigned tablets

• Each tablet is assigned to one tablet server at a time• Tablet server maintains an exclusive lock on a file in Chubby• Master monitors tablet servers and handles assignment

• Changes to tablet structure• Table creation/deletion (master initiated)• Tablet merging (master initiated)• Tablet splitting (tablet server initiated)



Tablet serving and I/O flow

SSTable SSTable SSTable

memtable read

write

memory

GFS

tablet log

write operations arelogged (in redo records)

recent updates kept sorted in main memory

memtable and SSTablesare merged to servethe read request



Tablet management

• Minor compaction• Converts the memtable into an SSTable• Reduces memory footprint and log traffic on restart

• Merging compaction• Reads the contents of a few SSTables and the memtable, and writes

out a new SSTable• Reduces number of SSTables

• Major compaction• Merging compaction that results in only one SSTable• No deletion records, only live data


Databases and MapReduce Hive and Pig

Outline

Databases and MapReduceOverviewRelational databasesRelational data processing on Hadoop MRBigTableHive and Pig



High-level data processing

• Hive: data warehousing application in Hadoop

• Query language is HQL, variant of SQL

• Tables stored on HDFS as flat files

• Developed by Facebook, now open source

• Pig: large-scale data processing system

• Scripts are written in Pig Latin, a dataflow language

• Developed by Yahoo!, now open source

• Roughly 1/3 of all Yahoo! internal jobs

• Common idea

• Provide higher-level language to facilitate large-data processing

• Higher-level language is compiled to Hadoop jobs



Hive: background and components

• Started at Facebook1

• Data was collected by nightly cron jobs into Oracle DB• Extract-transform-load (ETL) via hand-coded python• Grew from 10s of GBs (2006) to 1TB/day new data (2007), now 10x that

• Shell: allows interactive queries• Driver: session handles, fetch, execute• Compiler: parse, plan, optimize• Execution engine: DAG of stages (MR, HDFS, metadata processing)• Metastore: schema, location in HDFS, SerDe

1It had to be good for something apart from wasting my PhD students’ timeStratis D. Viglas www.inf.ed.ac.uk


Logical and physical models

• Tables

• Typed columns (int, float, string, boolean)

• Also: list, map

• Partitions

• For example, range-partition tables by date

• Buckets

• Hash partitions within ranges (useful for sampling, join optimization)

• Metastore

• Database: namespace containing a set of tables

• Holds table definitions (column types, physical layout)

• Holds partitioning information

• Can be stored in Derby, MySQL, and many other relational databases



Hive processing

• Hive uses HQL, a declarative query language close to SQL

• HQL statements are translated into a syntax tree

• Syntax tree is compiled into an execution plan of MapReduce jobs,

executed by Hadoop

SELECT s.word, s.freq, k.freqFROM shakespeare s JOIN bible kON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 10;

HQL query Abstract Syntax Tree

map

reduce

map

reduce

map

reduce

map

reduce

map

reduce

map

reduce

MapReduce plan



Pig and Pig Latin

• Similar idea to Hive, but more tailored towards efficiency and a

DB-like setting

• Script interface to deploy MapReduce jobs

• Maintains schema and performs type checking

• Rudimentary optimiser to translate Pig scripts into an efficient

physical dataflow

• Sequence of one or more MapReduce jobs

• Exploit heuristics and cost model to reduce intermediate data

• Dataflow is scheduled and executed

• Runtime tracks job progress and any errors



Example Pig Latin script

Visits = load ’/data/visits’ as (user, url, time);Visits = foreach Visits

generate user, Canonicalize(url), time;

Pages = load ’/data/pages’ as (url, pagerank);

VP = join Visits by url, Pages by url;UserVisits = group VP by user;UserPageranks = foreach UserVisits

generate user, AVG(VP.pagerank) as avgpr;

GoodUsers = filter UserPageranks by avgpr > ’0.5’;store GoodUsers into ’/data/good_users’;



Java vs. Pig Latin

20406080

100120140160180

Hadoop Pig

lines

of c

ode

50

100

150

200

250

300

Hadoop Pig

min

utes

• Performance on par with raw Hadoop

• But with 1/20 of the lines of code

• And with 1/16 of the developement time


Documents

Extreme computing Databases and MapReduce · Extreme computing Databases and MapReduce Stratis D. Viglas School of Informatics University of Edinburgh Stratis D. Viglas Databases