Runs on unmodified Hadoop
Aimed at workloads where the data fits a star schema
Draws on existing techniques: columnar storage, tailored join plans, block iteration
Introduction
InputFormats and OutputFormats: an InputFormat implements two methods, getSplits() and getRecordReader()
MapRunners: drive execution of the map function over a split; the scheduler supports JVM reuse
Background
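The InputFormat contract mentioned above can be illustrated with a simplified sketch. This is not Hadoop's actual API (the real interfaces live in org.apache.hadoop.mapred and carry job-configuration and reporter parameters); it only mirrors the two-method shape of getSplits() and getRecordReader():

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Hadoop's InputFormat/RecordReader contract,
// trimmed to the two methods named in the text (illustrative only).
interface RecordReader {
    String next();          // next record in the split, or null at end
}

interface InputFormat {
    List<int[]> getSplits(int fileLength, int splitSize); // {offset, length} pairs
    RecordReader getRecordReader(int[] split, String[] data);
}

class LineInputFormat implements InputFormat {
    // getSplits(): carve the input into ranges, one per map task.
    public List<int[]> getSplits(int fileLength, int splitSize) {
        List<int[]> splits = new ArrayList<>();
        for (int off = 0; off < fileLength; off += splitSize) {
            splits.add(new int[]{off, Math.min(splitSize, fileLength - off)});
        }
        return splits;
    }

    // getRecordReader(): iterate over the records inside one split.
    public RecordReader getRecordReader(int[] split, String[] data) {
        return new RecordReader() {
            int pos = split[0];
            public String next() {
                return pos < split[0] + split[1] ? data[pos++] : null;
            }
        };
    }
}
```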
Avoid I/O for columns that are not used
Store each column in a separate HDFS file
ColumnInputFormat (CIF) ensures that the different columns of a row are co-located on the same datanode
Columnar Storage
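The column-per-file idea can be sketched with a toy store where each column lives in its own array (standing in for one HDFS file per column), so a scan opens only the columns the query references. The names below are illustrative, not Clydesdale's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of per-column storage: one "file" (here an array) per
// column. Reading a row touches only the requested columns, so the
// I/O for unused columns is avoided entirely.
class ColumnStore {
    private final Map<String, long[]> columnFiles = new HashMap<>();

    void writeColumn(String name, long[] values) {
        columnFiles.put(name, values);
    }

    // Materialize row i from only the needed columns; the other
    // column "files" are never opened.
    long[] readRow(int i, String... neededColumns) {
        long[] row = new long[neededColumns.length];
        for (int c = 0; c < neededColumns.length; c++) {
            row[c] = columnFiles.get(neededColumns[c])[i];
        }
        return row;
    }
}
```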
SQL-like structured data processing
The map phase is responsible for joining the fact table with the dimension tables
The reduce phase is responsible for the grouping and aggregation
Join Strategy
Map phase: build a hash table for each dimension table, applying its predicates; each map task probes the hash tables with its input records and outputs the records that satisfy the join conditions, keyed on the subset of columns needed for grouping
Reduce phase: aggregate the values that share the same key
Sorting is done at the client
Execution process
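The build/probe/aggregate flow above can be sketched with plain HashMaps standing in for the per-node dimension hash tables and the reduce-side grouping (an illustrative sketch, not Clydesdale's code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the star-join execution: build filtered dimension hash
// tables, probe them with fact rows in the map phase, then group and
// sum in the reduce phase.
class StarJoinSketch {
    // Map phase, step 1: hash each dimension table, keeping only rows
    // that pass that dimension's predicate (here: second column >= min).
    static Map<Integer, String> buildDimTable(int[][] dim, String[] attrs, int predicateMin) {
        Map<Integer, String> h = new HashMap<>();
        for (int i = 0; i < dim.length; i++)
            if (dim[i][1] >= predicateMin) h.put(dim[i][0], attrs[i]);
        return h;
    }

    // Map phase, step 2 + reduce phase: probe with each fact row; rows
    // that find a match are grouped on the dimension attribute and
    // their measure is summed (the reduce-side aggregation).
    static Map<String, Long> joinAndAggregate(int[][] fact, Map<Integer, String> dimTable) {
        Map<String, Long> groups = new HashMap<>();
        for (int[] row : fact) {                 // row = {dimKey, measure}
            String attr = dimTable.get(row[0]);  // null means the row fails the join
            if (attr != null) groups.merge(attr, (long) row[1], Long::sum);
        }
        return groups;
    }
}
```

Fact rows whose key misses every dimension hash table are dropped in the map phase, which is how the predicates on dimension tables prune the fact table early.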
Exploit multi-core parallelism: a single map task per node
Uses a custom MapRunner class to run a multi-threaded map task
MultiCIF packs multiple input splits into a single multi-split
The multi-split is shared across consecutive map tasks that run on the same node
Task scheduling
Block iteration
Optimizing for the Native Implementation
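The multi-threaded map task can be sketched with a thread pool draining a shared queue of splits, so a single task keeps all of a node's cores busy. This is a hypothetical stand-in using java.util.concurrent, not Clydesdale's MapRunner:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a multi-threaded map runner: the splits of one multi-split
// go into a shared queue, and one worker thread per core drains it.
class MultiThreadedMapRunner {
    static long run(List<int[]> multiSplit, int cores) {
        BlockingQueue<int[]> queue = new LinkedBlockingQueue<>(multiSplit);
        AtomicLong total = new AtomicLong();   // stands in for the map output
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int c = 0; c < cores; c++) {
            pool.execute(() -> {
                int[] split;
                while ((split = queue.poll()) != null)
                    for (int v : split) total.addAndGet(v); // "map" each record
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.get();
    }
}
```

Pulling splits from a shared queue rather than pre-partitioning them gives dynamic load balancing across cores, which matters when splits take uneven time to process.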
Schedule only one map task from the join job on a given node
Schedule subsequent map tasks on the node where the dimension hash table has already been built
Communicate to the map task the number of slots (processor cores) it can use on the node
Task Scheduling
Supports two join plans: repartition join and mapjoin
Repartition join is a robust technique that works with any combination of table sizes
Mapjoin is designed for the case where one table is significantly smaller than the other
Hive Background
Cluster A: 9 nodes, two quad-core processors, 16 GB memory, 8 x 250 GB disks, 1 Gb Ethernet switch
Cluster B: 42 nodes, two quad-core processors, 32 GB memory, 5 x 500 GB disks, 1 Gb Ethernet switch
Clydesdale ran on Hadoop 0.21, Hive on Hadoop 0.20.2
Workload storage format: the Clydesdale fact tables were stored in Multi-CIF, Hive used RCFile
Experimental Setup
Hive joins one dimension table at a time with the fact table
Hive maintains many copies of the hash table
Hive creates the hash tables on a single node and pays the cost of disseminating it to the entire cluster
Each task in Hive has to load and deserialize the hash table when it starts.
Result Analysis