Talk at the CS department in IIT 04/02/09.
Hadoop and Hive
Large Scale Data Processing using Commodity HW/SW
Joydeep Sen Sarma
Outline
• Introduction
• Hadoop
• Hive
• Hadoop/Hive Usage @ Facebook
• Wishlists/Projects
• Questions
Data and Computing Trends
• Explosion of Data
– Web Logs, Ad-Server logs, Sensor Networks, Seismic Data, DNA sequences (?)
– User generated content / Web 2.0
– Data as BI => Data as product (Search, Ads, Digg, Quantcast, …)
• Declining Revenue/GB
– Milk @ $3/gallon => $15M / GB
– Ads @ 20c / 10^6 impressions => $1/GB
– Google Analytics, Facebook Lexicon == Free!
• Hardware Trends
– Commodity Rocks: $4K 1U box = 8 cores + 16GB mem + 4x1TB
– CPU: SMP => NUMA; Storage: $ Shared-Nothing << $ Shared; Networking: Ethernet
• Software Trends
– Open Source, SaaS
– LAMP, Compute on Demand (EC2)
Hadoop
• Parallel Computing platform
– Distributed FileSystem (HDFS)
– Parallel Processing model (Map/Reduce)
– Express Computation in any language
– Job execution for Map/Reduce jobs (scheduling + localization + retries/speculation)
• Open Source
– Most popular Apache project!
– Highly Extensible Java Stack (@ expense of Efficiency)
– Develop/Test on EC2!
• Ride the commodity curve:
– Cheap (but reliable) shared-nothing storage
– Data Local computing (don't need high speed networks)
– Highly Scalable (@ expense of Efficiency)
Looks like this ..

[Diagram: racks of commodity nodes, each node attached to its own local disks; 1 Gigabit links to each node, 4-8 Gigabit between racks. Node = DataNode + Map-Reduce]
HDFS

• Separation of Metadata from Data
– Metadata == inodes, attributes, block locations, block replication
• File = Σ data blocks (typically 128MB)
– Architected for large files and streaming reads
• Highly Reliable
– Each data block typically replicated 3x to different datanodes
– Clients compute and verify block checksums (end-to-end)
• Single namenode
– All metadata stored in-memory; passive standby
• Client talks to both namenode and datanodes
– Bulk data from datanode to client => linear scalability
– Custom client library in Java/C/Thrift
– Not POSIX, not NFS
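The end-to-end checksum idea can be sketched as follows. This is a toy model only: real HDFS keeps a CRC32 per small chunk of each block rather than one checksum per block, and all names below are illustrative.

```python
import zlib

BLOCK_SIZE = 4  # toy block size; real HDFS blocks are typically 128MB

def write_blocks(data):
    """Split data into blocks, storing a CRC32 alongside each block."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [(b, zlib.crc32(b)) for b in blocks]

def read_file(stored):
    """Client re-computes each block's checksum and verifies it
    end-to-end; silent corruption is detected rather than trusted."""
    out = b""
    for block, crc in stored:
        if zlib.crc32(block) != crc:
            raise IOError("checksum mismatch: corrupt block")
        out += block
    return out

stored = write_blocks(b"hello hdfs")
assert read_file(stored) == b"hello hdfs"
```

Because the client both computes the checksum on write and verifies it on read, corruption anywhere along the path (disk, datanode, network) is caught.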
In pictures ..

[Diagram: NameNode (32GB RAM, local disks) with a Secondary NameNode standby; a DFS Client sends getLocations to the NameNode, receives block locations, and then reads/writes bulk data directly from/to the DataNodes.]
Map and Reduce
• Map Function:
– Applied to input data
– Emits reduction key and value
• Reduce Function:
– Applied to data grouped by reduction key
– Often 'reduces' data (for example: sum(values))
• Hadoop groups data by sorting
• User can choose to apply reductions multiple times
– Combiner
• Partitioning, Sorting, Grouping are different concepts
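The sort-based grouping and the optional combiner described above can be sketched in Python. This is a toy in-memory model, not Hadoop's actual execution engine; all names are illustrative, and the map/reduce functions here implement a word count.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # Map: applied to input data, emits (reduction key, value) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: applied to values grouped by key; summing is safe to
    # apply multiple times, so it can double as a combiner
    return (key, sum(values))

def run_job(records, combine=True):
    pairs = [kv for r in records for kv in map_fn(r)]
    if combine:
        # optional map-side reduction (Hadoop's Combiner)
        pairs.sort(key=itemgetter(0))
        pairs = [reduce_fn(k, [v for _, v in g])
                 for k, g in groupby(pairs, key=itemgetter(0))]
    pairs.sort(key=itemgetter(0))  # Hadoop groups data by sorting
    return [reduce_fn(k, [v for _, v in g])
            for k, g in groupby(pairs, key=itemgetter(0))]
```

Note the combiner changes only the volume of intermediate data, not the result: `run_job(records)` and `run_job(records, combine=False)` produce the same output.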
Programming with Map/Reduce
Find the most imported package in Hive source:

$ find . -name '*.java' -exec egrep '^import' '{}' \; | awk '{print $2}' | sort | uniq -c | sort -nr +0 -1 | head -1
  208 org.apache.commons.logging.LogFactory;

In Map-Reduce:
1a. Map using: egrep '^import' | awk '{print $2}'
1b. Reduce on first column (package name)
1c. Reduce Function: uniq -c
2a. Map using: awk '{printf "%05d\t%s\n", 100000-$1, $2}'
2b. Reduce using first column (inverse counts), 1 reducer
2c. Reduce Function: Identity
Scales to Terabytes
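The two-stage decomposition above can be simulated in miniature in Python. The input lines below are illustrative stand-ins, not taken from the actual Hive source tree.

```python
from collections import Counter

# Stand-in for `egrep '^import' | awk '{print $2}'` over the source files
lines = [
    "import org.apache.commons.logging.LogFactory;",
    "import java.util.List;",
    "import org.apache.commons.logging.LogFactory;",
    "package org.apache.hadoop.hive;",  # not an import: dropped by the map
]

# Job 1: map extracts package names; reduce plays the role of `uniq -c`
counts = Counter(l.split()[1] for l in lines if l.startswith("import "))

# Job 2: map re-keys each package by its inverted count ("%05d" % (100000-n))
# so an ascending sort on the key puts the most-imported package first;
# a single identity reducer then just takes the first row
keyed = sorted("%05d\t%s" % (100000 - n, pkg) for pkg, n in counts.items())
top = keyed[0].split("\t")[1]
```

The inverted zero-padded count is what lets the framework's ordinary ascending key sort deliver a descending frequency order.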
Map/Reduce Data Flow
Why HIVE?
• Large installed base of SQL users
– i.e. map-reduce is for ultra-geeks
– much, much easier to write a SQL query
• Analytics SQL queries translate really well to map-reduce
• Files are an insufficient data management abstraction
– Tables, Schemas, Partitions, Indices
– Metadata allows optimization, discovery, browsing
• Love the programmability of Hadoop
• Hate that RDBMSs are closed
– Why not work on data in any format?
– Complex data types are the norm
Rubbing it in ..
hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
HIVE: Components

[Diagram: the Hive CLI and a Mgmt. Web UI (DDL, Queries, Browsing) sit on top of a Thrift API; Hive QL flows through a Parser and Planner into the Execution engine; a SerDe layer (Thrift, Jute, JSON, ..) handles row formats; the MetaStore holds metadata; execution runs as Map Reduce over HDFS.]
Data Model

[Diagram: a table such as `clicks` maps to an HDFS directory tree: /hive/clicks, with logical partitions like /hive/clicks/ds=2008-03-25 and hash-partitioned buckets like /hive/clicks/ds=2008-03-25/0, … The MetaStore holds the schema library, #Buckets=32, bucketing info, and partitioning columns.]
Hive Query Language
• Basic SQL
– From clause subquery
– ANSI JOIN (equi-join only)
– Multi-table Insert
– Multi group-by
– Sampling
– Object traversal
• Extensibility
– Pluggable Map-reduce scripts using TRANSFORM
Running Custom Map/Reduce Scripts
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt
) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
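For illustration, a streaming 'map_script' like the one named above is just a program that reads tab-separated rows on stdin and writes tab-separated rows on stdout. The body below is a hypothetical sketch, not the actual script from the talk.

```python
# Hypothetical body for 'map_script': consumes (userid, date) rows and
# emits (dt, uid) rows, matching the AS (dt, uid) clause in the query.
import sys

def transform(line):
    # input row: userid \t date; output row: dt \t uid (reordered)
    userid, date = line.rstrip("\n").split("\t")
    return "%s\t%s" % (date, userid)

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```

CLUSTER BY dt then guarantees all rows with the same dt reach the same instance of 'reduce_script'.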
Hive QL – Join
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

page_view X user = pv_users:
pageid  age
1       25
2       25
1       32
Hive QL – Join in Map Reduce
Map (tag each row with its source table):

page_view (pageid, userid, time):
1  111  9:08:01
2  111  9:08:13
1  222  9:08:14

=> key  value
   111  <1,1>
   111  <1,2>
   222  <1,1>

user (userid, age, gender):
111  25  female
222  32  male

=> key  value
   111  <2,25>
   222  <2,32>

Shuffle/Sort (group by key):

key 111: <1,1> <1,2> <2,25>
key 222: <1,1> <2,32>

Reduce (cross page_view values with user values per key):

pageid  age
1       25
2       25
1       32
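The tagged-value join above can be simulated in Python; this is a toy in-memory model of the reduce-side join, with the shuffle replaced by a sort.

```python
from itertools import groupby
from operator import itemgetter

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

# Map: key each row on userid and tag the value with its source table
pairs = [(uid, (1, pageid)) for pageid, uid, _ in page_view]
pairs += [(uid, (2, age)) for uid, age, _ in user]

# Shuffle/sort: group the tagged values by key
pairs.sort(key=itemgetter(0))

# Reduce: per key, cross the page_view values with the user values
pv_users = []
for uid, group in groupby(pairs, key=itemgetter(0)):
    vals = [v for _, v in group]
    pageids = [x for tag, x in vals if tag == 1]
    ages = [x for tag, x in vals if tag == 2]
    pv_users += [(p, a) for p in pageids for a in ages]
```

The table tag is what lets a single reducer separate the two input relations after they are mixed together in the shuffle.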
Joins
• Outer Joins:
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
Hive Optimizations – Merge Sequential Map Reduce Jobs
• SQL:
– FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key SELECT …

[Diagram: A (key, av: 1, 111) joins B (key, bv: 1, 222) in one Map Reduce to give AB (key, av, bv: 1, 111, 222); AB joins C (key, cv: 1, 333) in a second Map Reduce to give ABC (key, av, bv, cv: 1, 111, 222, 333). Since both joins use the same key, the two sequential jobs can be merged into one.]
Join To Map Reduce
• Only equality joins with conjunctions supported
• Future
– Pruning of values sent from map to reduce on the basis of projections
– Make Cartesian product more memory efficient
– Map-side joins: hash joins if one of the tables is very small; exploit pre-sorted data by doing a map-side merge join
• Estimate number of reducers
– Hard to measure effect of filters
– Run map side for small part of input to estimate #reducers
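The map-side hash join mentioned under future work might be sketched like this: when one table fits in memory, every mapper loads it into a hash table and joins its own split of the big table locally, so no shuffle or reduce phase is needed. Toy in-memory model; the data mirrors the earlier pv_users example.

```python
# Small table, broadcast to every mapper as a hash table: userid -> age
user = {111: 25, 222: 32}
# One mapper's split of the big table: (pageid, userid) rows
page_view_split = [(1, 111), (2, 111), (1, 222)]

def map_side_join(split, small):
    # Inner equi-join done entirely in the map phase via hash lookup
    return [(pageid, small[uid]) for pageid, uid in split if uid in small]

pv_users = map_side_join(page_view_split, user)
```

Skipping the shuffle is the whole win: the sort and network transfer of the big table disappear.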
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

pv_users:
pageid  age
1       25
2       25
1       32
2       25

=> result:
pageid  age  count
1       25   1
2       25   2
1       32   1
Hive QL – Group By in Map Reduce
Map (each mapper emits the composite key <pageid, age> with a partial count of 1):

Mapper 1 input (pageid, age): (1, 25), (2, 25)
=> key <1,25> value 1; key <2,25> value 1

Mapper 2 input (pageid, age): (1, 32), (2, 25)
=> key <1,32> value 1; key <2,25> value 1

Shuffle/Sort (group by key):

Reducer 1 input: <1,25> => 1; <1,32> => 1
Reducer 2 input: <2,25> => 1; <2,25> => 1

Reduce (sum counts per key):

Reducer 1 output:
pageid  age  count
1       25   1
1       32   1

Reducer 2 output:
pageid  age  count
2       25   2
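The same group-by data flow can be simulated in Python (toy model: the shuffle is a sort, and the reducers' work is a single groupby pass):

```python
from itertools import groupby

pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]

# Map: emit the composite key <pageid, age> with a partial count of 1
pairs = [((pageid, age), 1) for pageid, age in pv_users]

# Shuffle/sort: Hadoop groups the pairs by sorting on the key
pairs.sort(key=lambda kv: kv[0])

# Reduce: sum the partial counts within each group
result = [(k[0], k[1], sum(v for _, v in g))
          for k, g in groupby(pairs, key=lambda kv: kv[0])]
```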
Hive QL – Group By with Distinct
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

=> result:
pageid  count_distinct_userid
1       2
2       1
Hive QL – Group By with Distinct in Map Reduce
Map (the key is the full <pageid, userid> pair):

Mapper 1 input (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13)
=> keys <1,111>, <2,111>

Mapper 2 input: (1, 222, 9:08:14), (2, 111, 9:08:20)
=> keys <1,222>, <2,111>

Shuffle and Sort (duplicate <2,111> keys land together and collapse):

Reducer for pageid 1: <1,111>, <1,222>
Reducer for pageid 2: <2,111>, <2,111>

Reduce (count distinct userids per pageid):

pageid  count
1       2
2       1
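The distinct-count trick, putting the full <pageid, userid> pair in the key so duplicates collapse during the shuffle, can be sketched as follows (toy in-memory model):

```python
from itertools import groupby

page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]  # (pageid, userid)

# Map: use the full <pageid, userid> pair as the key, so duplicate
# pairs land next to each other after the shuffle's sort
keys = sorted(set(page_view))  # sort + dedupe stands in for shuffle/reduce

# Reduce: count the distinct userids that survive for each pageid
result = [(pageid, sum(1 for _ in g))
          for pageid, g in groupby(keys, key=lambda k: k[0])]
```

Since deduplication happens on the composite key, the reducer never needs to hold the full set of userids for a pageid in memory at once.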
Group by Future optimizations
• Map side partial aggregations
– Hash-based aggregates (done)
– Serialized key/values in hash tables
– Exploit pre-sorted data for distinct counts
• Partial aggregations in Combiner
• Be smarter about how to avoid multiple stages (done)
• Exploit table/column statistics for deciding strategy
Dealing with Structured Data
• Type system
– Primitive types
– Recursively built up using Composition/Maps/Lists
• ObjectInspector interface for user-defined types
– To recursively list schema
– To recursively access fields within a row object
• Generic (De)Serialization Interface (SerDe)
• Serialization families implement the interface
– Thrift DDL based SerDe
– Delimited text based SerDe
– You can write your own SerDe (XML, JSON, …)
MetaStore
• Stores Table/Partition properties:
– Table schema and SerDe library
– Table location on HDFS
– Logical partitioning keys and types
– Partition level metadata
– Other information
• Thrift API
– Current clients in PHP (Web Interface), Python (interface to Hive), Java (Query Engine and CLI)
• Metadata stored in any SQL backend
• Future
– Statistics
– Schema Evolution
Future Work
• Cost-based optimization
• Multiple interfaces (JDBC, …) / Integration with BI
• SQL compliance (order by, nested queries, …)
• Indexing
• Data Compression
– Columnar storage schemes
– Exploit lazy/functional Hive field retrieval interfaces
• Better data locality
– Co-locate hash partitions on the same rack
– Exploit high intra-rack bandwidth for merge joins
• Advanced operators
– Cubes / Frequent Item Sets
Hive Status
• Available as a Hadoop sub-project
– http://svn.apache.org/repos/asf/hadoop/hive/trunk
• [email protected]
• IRC: #hive
• VLDB demo submission
• Hivers @ Facebook:
– Ashish Thusoo
– Zheng Shao
– Prasad Chakka
– Namit Jain
– Raghu Murthy
– Suresh Anthony
Hive/Hadoop Usage @ Facebook
• Summarization
– E.g.: Daily/Weekly aggregations of impression/click counts
– Complex measures of user engagement
• Ad hoc Analysis
– E.g.: how many group admins broken down by state/country
• Data Mining (assembling training data)
– E.g.: User engagement as a function of user attributes
• Spam Detection
– Anomalous patterns in UGC
– Application API usage patterns
• Ad Optimization
• Too many to count ..
Data Warehousing at Facebook Today
[Diagram: Web Servers feed Scribe Servers, which land logs on Filers; from there data flows into the Hive on Hadoop cluster, Oracle RAC, and federated MySQL.]
Hadoop Usage @ Facebook
• Data statistics:
– Total Data: ~2.5PB
– Net Data added/day: ~15TB
  6TB of uncompressed source logs
  4TB of uncompressed dimension data reloaded daily
– Compression Factor: ~5x (gzip; more with bzip)
• Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day
Hadoop Challenges @ Facebook
• QoS/Isolation:
– Big jobs can hog the cluster
– JobTracker memory as a limited resource
– Limit memory impact of runaway tasks
=> Fair Scheduler (Matei)
• Protection
– What if a software bug/disaster destroys NameNode metadata?
=> HDFS Snapshots (Dhruba)
• Data Archival
– Not all data is hot and needs colocation with Compute
=> Hadoop Data Archival Layer
Hadoop Challenges @ Facebook
• Performance
– Really hard to understand what the systemic bottlenecks are
– Workloads are variable during daytime
• Small Job Performance
– Sampling encourages small test queries
– Hadoop is awful at locality for small jobs
– Need to reduce task startup time (JVM reuse)
– Large number of mappers, each with small output, produces terrible performance
=> Global Scheduling for better locality (Matei)
Hadoop Wish List
• HDFS:
– 3-way replication is unsustainable
  N+k erasure codes
– Snapshots: design worked out, but need people to work on it
– Namenode on-disk metadata: the in-memory model poses fundamental limits on growth
– Application hints on block/file co-location
• Map-Reduce
– Performance
– Resource-aware scheduling
– Multi-Stage Map-Reduce