© Hortonworks Inc. 2013
Apache Tez : Accelerating Hadoop Query Processing
Bikas Saha  (@bikassaha)
Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache incubator project and Apache licensed.
Tez – Hadoop 1 → Hadoop 2

HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (SQL), others (Cascading) run as MapReduce jobs
• Monolithic: resource management, execution engine and user API all live in MapReduce

HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine), hosting batch MapReduce, Pig (data flow), Hive (SQL) and others (Cascading), alongside real-time stream processing (Storm) and online data processing (HBase, Accumulo) on YARN
• Layered: resource management – YARN, execution engine – Tez, user API – Hive, Pig, Cascading, your app!

Tez – Empowering Applications

• Tez solves the hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain-specific problems
• This design is essential for Tez to serve as a platform for a variety of applications

The application layer provides:
• Custom application logic
• Custom data format
• Custom data transfer technology

The Tez layer provides:
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Horizontal scalability
• Resource elasticity
• Shared library of ready-to-use components
• Built-in performance optimizations
• Security

Tez – End User Benefits

• Better performance of applications
  – Built-in performance + application-defined optimizations
• Better predictability of results
  – Minimization of overheads and queuing delays
• Better utilization of compute capacity
  – Efficient use of allocated resources
• Reduced load on the distributed filesystem (HDFS)
  – Fewer unnecessary replicated writes
• Reduced network usage
  – Better locality and data transfer using new data patterns
• Higher application developer productivity
  – Focus on application business logic rather than Hadoop internals

Tez – Design considerations
Don’t solve problems that have already been solved. Or else you will have to solve them again!
• Leverage discrete task based compute model for elasticity, scalability and fault tolerance
• Leverage several man years of work in Hadoop Map-Reduce data shuffling operations
• Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN
• Leverage built-in security mechanisms in Hadoop for privacy and isolation
Look to the Future with an eye on the Past
Tez – Problems that it addresses

• Expressing the computation
  – Direct and elegant representation of the data processing flow
  – Interfacing with application code and new technologies
• Performance
  – Late binding: make decisions as late as possible, using real data available at runtime
  – Leverage the resources of the cluster efficiently
  – Just work out of the box!
  – Customizable engine that lets applications tailor the job to meet their specific requirements
• Operational simplicity
  – Painless to operate, experiment with and upgrade

Tez – Simplifying Operations

• No deployments to do. No side effects. Easy and safe to try it out!
• Tez is a completely client-side application.
• Simply upload the Tez libraries to any accessible FileSystem and change the local Tez configuration to point to that location (as sketched after the diagram below).
• Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
• Leverages YARN local resources.

[Diagram: client machines with TezClient point to different Tez library versions (Tez Lib 1, Tez Lib 2) stored on HDFS; NodeManagers launch TezTasks that localize the corresponding version.]

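To make the client-side setup concrete, here is a minimal configuration sketch. The property tez.lib.uris is the standard Tez setting for pointing at the uploaded libraries; the HDFS path below is only an example location, not a required one.

  <!-- tez-site.xml on the client machine: point Tez at the uploaded libraries.
       The path is an illustrative example; any accessible FileSystem URI works. -->
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez-0.4/tez-libs.tar.gz</value>
  </property>

Switching the value between two such locations is how different Tez versions can be run concurrently from different clients.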
Tez – Expressing the computation

Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers

[Diagram: Distributed Sort example – a Preprocessor Stage, a Partition Stage and an Aggregate Stage, each running multiple tasks (Task-1, Task-2); a Sampler collects samples and feeds partition ranges to the downstream stages.]

Tez – Expressing the computation

Tez provides the following APIs to define the processing:
• DAG API
  – Defines the structure of the data processing and the relationship between producers and consumers
  – Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
  – This is how all the tasks in the job get specified
• Runtime API
  – Defines the interfaces through which the framework and app code interact with each other
  – App code transforms data and moves it between tasks
  – This is how we specify what actually executes in each task on the cluster nodes

Tez – DAG API

  // Define DAG
  DAG dag = new DAG();

  // Define Vertex
  Vertex Map1 = new Vertex(Processor.class);

  // Define Edge
  Edge edge = new Edge(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL,
                       Output.class, Input.class);

  // Connect them
  dag.addVertex(Map1).addEdge(edge)…

Defines the global processing flow

[Diagram: example DAG – Map1 and Map2 feed Reduce1 and Reduce2 over Scatter-Gather edges, and Reduce1 and Reduce2 feed a Join vertex.]

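As a slightly fuller illustration, here is a hedged sketch that builds the DAG pictured above in the same simplified style as the snippet; the real Tez classes take additional arguments (names, processor descriptors, parallelism, resources), so these constructor signatures are illustrative rather than the exact API.

  // Sketch only: builds the Map1/Map2 -> Reduce1/Reduce2 -> Join DAG shown above.
  // Signatures follow the slide's simplified style, not the exact Tez API.
  DAG dag = new DAG();

  Vertex map1    = new Vertex(MapProcessor.class);
  Vertex map2    = new Vertex(MapProcessor.class);
  Vertex reduce1 = new Vertex(ReduceProcessor.class);
  Vertex reduce2 = new Vertex(ReduceProcessor.class);
  Vertex join    = new Vertex(JoinProcessor.class);

  // Scatter-gather edges shuffle partitioned data from producers to consumers
  Edge m1r1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
  Edge m2r2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
  Edge r1j  = new Edge(reduce1, join, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
  Edge r2j  = new Edge(reduce2, join, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);

  dag.addVertex(map1).addVertex(map2)
     .addVertex(reduce1).addVertex(reduce2).addVertex(join)
     .addEdge(m1r1).addEdge(m2r2).addEdge(r1j).addEdge(r2j);
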
Tez – DAG API

Edge properties define the connection between producer and consumer tasks in the DAG.

• Data movement – Defines routing of data between tasks
  – One-To-One: Data from the i-th producer task routes to the i-th consumer task.
  – Broadcast: Data from a producer task routes to all consumer tasks.
  – Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The i-th shard from all producer tasks routes to the i-th consumer task.
• Scheduling – Defines when a consumer task is scheduled
  – Sequential: Consumer task may be scheduled after a producer task completes.
  – Concurrent: Consumer task must be co-scheduled with a producer task.
• Data source – Defines the lifetime/reliability of a task output
  – Persisted: Output will be available after the task exits. Output may be lost later on.
  – Persisted-Reliable: Output is reliably stored and will always be available.
  – Ephemeral: Output is available only while the producer task is running.

A sketch of how these three dimensions combine when declaring edges follows below.

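The snippet below pairs the three dimensions for a few common cases, in the same simplified style as the DAG API snippet; the vertex names (map, reduce, smallScan, join, partitionL) are hypothetical.

  // Sketch only: typical combinations of the edge properties listed above.
  // Classic MapReduce-style shuffle between a map and a reduce vertex:
  Edge shuffle   = new Edge(map, reduce, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
  // Broadcast a small dataset (e.g. a dimension table) to every consumer task:
  Edge broadcast = new Edge(smallScan, join, BROADCAST, PERSISTED, SEQUENTIAL, Output.class, Input.class);
  // Pass partition i straight through to consumer task i:
  Edge passThru  = new Edge(partitionL, join, ONE_TO_ONE, PERSISTED, SEQUENTIAL, Output.class, Input.class);
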
Tez – Logical DAG expansion at Runtime

[Diagram: the logical DAG (Map1, Map2 → Reduce1, Reduce2 → Join) expanded at runtime into multiple parallel tasks per vertex, with edges fanned out between the individual tasks.]

Tez – Runtime API

Flexible Inputs-Processor-Outputs model
• Thin API layer to wrap around arbitrary application code
• Compose inputs, processor and outputs to execute arbitrary processing
• Event routing based control plane architecture
• Applications decide logical data format and data transfer technology
• Customize for performance
• Built-in implementations for Hadoop 2.0 data services – HDFS and the YARN Shuffle Service. Built on the same API. Your implementations are as first class as ours!

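As a rough illustration of the Inputs-Processor-Outputs model, the sketch below shows a processor that reads key-value pairs from a named input and writes to a named output. The interface and reader/writer names are modeled loosely on Tez's runtime library but are placeholders rather than the exact API.

  // Sketch only: Processor, Input, Output, KeyValueReader and KeyValueWriter are
  // Tez-like placeholders here, not the exact 0.4 interfaces.
  public class TokenCountProcessor implements Processor {
    @Override
    public void run(Map<String, Input> inputs, Map<String, Output> outputs) throws Exception {
      KeyValueReader reader = (KeyValueReader) inputs.get("text").getReader();
      KeyValueWriter writer = (KeyValueWriter) outputs.get("counts").getWriter();
      while (reader.next()) {
        // Application code owns the transformation; the framework only moves the data.
        for (String token : reader.getCurrentValue().toString().split("\\s+")) {
          writer.write(token, 1);
        }
      }
    }
  }
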
Tez – Library of Inputs and Outputs

[Diagram: three task compositions –
 Classical 'Map': HDFS Input → Map Processor → Sorted Output;
 Classical 'Reduce': Shuffle Input → Reduce Processor → HDFS Output;
 Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output.]

• What is built in?
  – Hadoop InputFormat/OutputFormat
  – SortedGroupedPartitioned Key-Value Input/Output
  – UnsortedGroupedPartitioned Key-Value Input/Output
  – Key-Value Input/Output

Tez – Composable Task Model

[Diagram: Adopt → Evolve → Optimize. Adopt: a Hive Processor composed with HDFS Input, Remote File Server Input, HDFS Output and Local Disk Output. Evolve: Your Processor composed with the same inputs and outputs. Optimize: Your Processor composed with RDMA Input, Native DB Input, Kafka Pub-Sub Output and Amazon S3 Output.]

Tez – Performance

• Benefits of expressing the data processing as a DAG
  – Reducing overheads and queuing effects
  – Gives the system the global picture for better planning
• Efficient use of resources
  – Re-use resources to maximize utilization
  – Pre-launch, pre-warm and cache
  – Locality & resource aware scheduling
• Support for application-defined DAG modifications at runtime for optimized execution
  – Change task concurrency
  – Change task scheduling
  – Change DAG edges
  – Change DAG vertices

Tez – Benefits of DAG execution

Faster execution and higher predictability
• Eliminate the replicated write barrier between successive computations.
• Eliminate the job launch overhead of workflow jobs.
• Eliminate the extra stage of map reads in every workflow job.
• Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
• Better locality because the engine has the global picture.

[Diagram: the same multi-stage workflow as a chain of Pig/Hive MR jobs with intermediate HDFS writes versus a single Pig/Hive Tez DAG.]

Tez – Container Re-Use

• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution

[Diagram: the Tez Application Master (in its own YARN container) exchanges Start Task / Task Done messages with a TezTask Host running in a reused YARN container/JVM; TezTask1 and TezTask2 run in that JVM and share in-memory objects.]

Tez – Sessions

Sessions
• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• Represents a connection between the user and the cluster
• Multiple DAGs executed in the same session
• Containers re-used across queries
• Takes care of data locality and releasing resources when idle

[Diagram: a Client starts a session and submits DAGs to a long-lived Application Master, whose Task Scheduler draws on a container pool of pre-warmed JVMs and a Shared Object Registry.]

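As a rough sketch of how a session is used, the snippet below submits several DAGs through one long-lived client. The TezClient methods shown (create, start, submitDAG, stop) follow the later public Tez client API and may differ in detail from the 0.4-era API, and queryPlans stands in for whatever produces the per-query DAGs; treat the whole thing as illustrative.

  // Sketch only: one session, many DAGs, containers re-used across queries.
  TezConfiguration tezConf = new TezConfiguration();
  TezClient tezClient = TezClient.create("InteractiveSession", tezConf, true /* session mode */);
  tezClient.start();                          // AM, container pool and object registry come up once

  for (DAG dag : queryPlans) {                // e.g. successive interactive queries
    DAGClient dagClient = tezClient.submitDAG(dag);   // runs on pre-warmed containers
    dagClient.waitForCompletion();
  }

  tezClient.stop();                           // release resources when the session is done
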
Tez – Re-Use in Action

[Chart: task execution timeline, with containers on the y-axis and time on the x-axis, showing tasks running back-to-back in a small set of reused containers.]

Tez – Customizable Core Engine

• Vertex Manager
  – Determines task parallelism
  – Determines when tasks in a vertex can start
• DAG Scheduler
  – Determines priority of tasks
• Task Scheduler
  – Allocates containers from YARN and assigns them to tasks

[Diagram: the Vertex Manager starts vertices and their tasks (Vertex-1, Vertex-2), the DAG Scheduler is consulted for task priority, and the Task Scheduler is asked for containers.]

Tez – Event Based Control Plane

• Events are used to communicate between tasks and between a task and the framework
• Data Movement Event: used by a producer task to inform the consumer task about data location, size, etc.
• Input Error Event: sent by a task to the engine to report errors in reading input. The engine then takes action by re-generating the input
• Other events carry task completion notifications, data statistics and other control plane information

[Diagram: on a Scatter-Gather edge, the outputs of Map Task 1 and Map Task 2 generate Data Events that are routed through the AM Router to the inputs of Reduce Task 2; an Error Event from the reduce task's input flows back to the AM.]

Tez – Automatic Reduce Parallelism

Event Model
Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager
Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.

[Diagram: inside the App Master, the Vertex Manager receives data size statistics from the Map Vertex and tells the Reduce Vertex's state machine to set parallelism, cancel surplus tasks and re-route data.]

A hedged sketch of what such a vertex manager plugin might look like follows below.

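The sketch below shows the shape of a vertex manager that auto-tunes reduce parallelism from observed map output sizes. It is modeled loosely on Tez's VertexManagerPlugin idea, but the class, method and context names here are illustrative stand-ins, not the exact plugin API.

  // Sketch only: auto-tune reduce parallelism from data statistics events.
  public class AutoReduceVertexManager {
    private long observedBytes = 0;
    private int completedMaps = 0;
    private final int totalMaps;
    private final long targetBytesPerReducer;

    public AutoReduceVertexManager(int totalMaps, long targetBytesPerReducer) {
      this.totalMaps = totalMaps;
      this.targetBytesPerReducer = targetBytesPerReducer;
    }

    // Called when a map task reports its output size via a data statistics event.
    public void onSourceTaskCompleted(long outputBytes, VertexContext reduceVertex) {
      observedBytes += outputBytes;
      completedMaps++;
      if (completedMaps == totalMaps / 2) {               // decide once enough data has been seen
        long projected = observedBytes * totalMaps / completedMaps;
        int parallelism = (int) Math.max(1, projected / targetBytesPerReducer);
        reduceVertex.setParallelism(parallelism);         // framework cancels surplus tasks, re-routes events
      }
    }
  }

  // Hypothetical stand-in for the plugin context object the real framework passes in.
  interface VertexContext {
    void setParallelism(int parallelism);
  }
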
Tez – Reduce Slow Start/Pre-launch

Event Model
Map completion events are sent to the Reduce Vertex Manager.

Vertex Manager
Pluggable application logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed, so that shuffle can start.

[Diagram: inside the App Master, the Vertex Manager receives task completion events from the Map Vertex and tells the Reduce Vertex's state machine to start tasks early.]

Tez – Theory to Practice

• Performance
• Scalability

Tez – Hive TPC-DS Scale 200GB latency

[Chart: query latency in seconds (0–5000), MR vs Tez – Replicated Join (2.8x faster), Join + Groupby (1.5x), Join + Groupby + Orderby (1.5x), 3-way Split + Join + Groupby + Orderby (2.6x).]

Tez – Pig performance gains
• Demonstrates performance gains from a basic translation to a Tez DAG
• Deeper integration in the works for further improvements
Tez – Observations on Performance

• Number of stages in the DAG
  – The higher the number of stages in the DAG, the better Tez performs relative to MR.
• Cluster/queue capacity
  – The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
• Size of intermediate output
  – The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage.
• Size of data in the job
  – For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger fraction of total time for smaller jobs.
• Offload work to the cluster
  – Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster, e.g. MR split calculation.
• Vertex caching
  – The more re-computation can be avoided, the better the performance.

Tez – Data at scale

Hive TPC-DS Scale 10TB
Tez – DAG definition at scale

Hive : TPC-DS Query 64 Logical DAG
Tez – Container Reuse at Scale
78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)
Tez – Real World Use Cases for the API

Tez – Broadcast Edge

Hive : Broadcast Join

SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk
      from store_sales group by ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);

[Diagram: Hive – MR runs two jobs with an HDFS write in between – a store_sales scan with group-by and aggregation, then a scan of inventory plus the aggregated store_sales output followed by a shuffle join. Hive – Tez runs one DAG – the group-by and aggregation reduce the size of the store_sales input, which is sent over a broadcast edge to the inventory scan and join tasks, avoiding the intermediate HDFS write.]

Tez – Multiple Outputs

Pig : Split & Group-by

f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;

[Diagram: Pig – MR loads foo once and multiplexes the split into group-by-y and group-by-z within one job, writes both groupings to HDFS, then loads g1 and g2 again and de-multiplexes to join them. Pig – Tez loads foo once, the group-by-y and group-by-z vertices produce multiple outputs, and the join runs as a reduce following the reduces, with no intermediate HDFS writes.]

Tez – Multiple Outputs

Hive : Multi-insert queries

FROM (SELECT * FROM store_sales, date_dim
      WHERE ss_sold_date_sk = d_date_sk and d_year = 2000)
INSERT INTO TABLE t1 SELECT distinct ss_item_sk
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;

[Diagram: Hive – MR map-joins date_dim with store_sales, materializes the join on HDFS, then runs two more MR jobs to compute the distinct items and customers. Hive – Tez performs the broadcast join (scan date_dim, join store_sales) and the distinct for customers and items in a single DAG, with the join vertex producing multiple outputs and no intermediate HDFS materialization.]

Tez – One to One Edge

Pig : Skewed Join

l = LOAD 'left' AS (x, y);
r = LOAD 'right' AS (x, z);
j = JOIN l BY x, r BY x USING 'skewed';

[Diagram: Pig – MR samples the left input, aggregates the samples, stages the sample map on the distributed cache via HDFS, then partitions L and R and joins. Pig – Tez loads and samples in one vertex, aggregates, broadcasts the sample map to the partitioning vertices, passes the loaded input through via a 1-1 edge, and joins Partition L with Partition R in a single DAG.]

Tez – Custom Edge

Hive : Dynamically Partitioned Hash Join

SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);

[Diagram: Hive – MR runs the inventory scan as a single local map task, writes the hash table to HDFS, and the store_sales scan + join tasks read it as a side file. Hive – Tez runs the inventory scan on the cluster (potentially more than one mapper), and a custom edge routes its outputs to the correct store_sales scan + join mappers, whose custom vertex reads both inputs directly with no side-file reads.]

Tez – Bringing it all together

TPC-DS Query 27 with Hive on Tez
• The Tez session populates the container pool
• Dimension table calculation and HDFS split generation run in parallel
• Dimension tables are broadcast to the Hive MapJoin tasks
• The final reducer is pre-launched and fetches completed inputs

Tez – Bridging the Data Spectrum

[Diagram: two query plans for the typical TPC-DS pattern of joining a fact table with several dimension tables. One plan chains the joins, producing Result Table 1, 2 and 3 using a mix of broadcast and shuffle joins; the other broadcasts the small dimension tables directly to the fact table scan for a broadcast join. Based on data size, the query optimizer can run either plan as a single Tez job.]

Tez – Current status

• Apache Incubator project
  – Rapid development. Over 1000 jiras opened. Over 700 resolved
  – Growing community of contributors and users
  – Latest release is 0.4
• Focus on stability
  – Testing and quality are the highest priority
  – Code ready and deployed on multi-node environments at scale
• Support for a vast topology of DAGs
  – Already functionally equivalent to MapReduce. Existing MapReduce jobs can be executed on Tez with few or no changes
  – Apache Hive 0.13 release supports Tez as an execution engine (HIVE-4660)
  – Apache Pig port to Tez close to completion (PIG-3446)

Tez – Adoption

• Apache Hive
  – Hadoop standard for declarative access via a SQL-like interface
• Apache Pig
  – Hadoop standard for procedural scripting and pipeline processing
• Cascading
  – Developer-friendly Java API and SDK
  – Scalding (Scala API on Cascading)
• Commercial vendors
  – ETL: use Tez instead of MR or custom pipelines
  – Analytics vendors: use Tez as a target platform for scaling parallel analytical tools to large data sets

Tez – Adoption Path

Pre-requisite: Hadoop 2 with YARN
Tez has zero deployment pain. No side effects or traces left behind on your cluster. Low risk and low effort to try out.

• Using Hive, Pig, Cascading, Scalding
  – Try them with Tez as the execution engine
• Already have a MapReduce based pipeline
  – Use configuration to run MapReduce on Tez by setting 'mapreduce.framework.name' to 'yarn-tez' in mapred-site.xml (see the snippet below)
  – Consolidate the MR workflow into an MR-DAG on Tez
  – Change the MR-DAG to use more efficient Tez constructs
• Have a custom pipeline
  – Wrap custom code/scripts into Tez inputs-processor-outputs
  – Translate your custom pipeline topology into a Tez DAG
  – Change the custom topology to use more efficient Tez constructs

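As a concrete illustration of the configuration switch mentioned above, a minimal mapred-site.xml entry might look like the sketch below; the property name and value are as stated on the slide, and it assumes the Tez libraries and client configuration are already in place.

  <!-- mapred-site.xml: route existing MapReduce jobs through Tez -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
  </property>
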
Tez – Roadmap

• Richer DAG support
  – Addition of vertices at runtime
  – Shared edges for shared outputs
  – Enhanced Input/Output library
• Performance optimizations
  – Improve support for high concurrency
  – Improve locality-aware scheduling
  – Add framework-level data statistics
  – HDFS memory storage integration
• Usability
  – Stability and testability
  – API ease of use
  – Tools for performance analysis and debugging

Tez – Community

• Early adopters and code contributors welcome
  – Adopters to drive more scenarios. Contributors to make them happen.
• Tez meetup for developers and users
  – http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
  – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/incubator-tez
  – Developer list: [email protected]
  – User list: [email protected]
  – Issues list: [email protected]

Tez – Takeaways

• Distributed execution framework that models processing as dataflow graphs
• Customizable execution architecture designed for extensibility and user-defined performance optimizations
• Works out of the box, with the platform figuring out the hard stuff
• No install – safe and easy to evaluate and experiment with
• Addresses common needs of a broad spectrum of applications and usage patterns
• Open source Apache project – your use-cases and code are welcome
• It works and is already being used by Apache Hive and Pig

Tez

Thanks for your time and attention!

Video with a deep dive on Tez:
http://youtu.be/-7YhVwqky6M
http://www.infoq.com/presentations/apache-tez

Questions?
@bikassaha