© Hortonworks Inc. 2011
Running Non-MapReduce Applications on Apache Hadoop Hitesh Shah & Siddharth Seth
Hortonworks Inc.
Page 1
Who am I?
• Hitesh Shah
  – Member of Technical Staff at Hortonworks Inc.
  – Apache Hadoop PMC member and committer
  – Apache Tez and Apache Ambari PPMC member and committer
• Siddharth Seth
  – Member of Technical Staff at Hortonworks Inc.
  – Apache Hadoop PMC member and committer
  – Apache Tez PPMC member and committer
Page 2 - Architecting the Future of Big Data
Agenda
• Apache Hadoop v1
• YARN
• Applications on YARN
• YARN Best Practices
Apache Hadoop v1
(Diagram: the Job Client submits a job to the JobTracker, which assigns work to map and reduce slots on the TaskTrackers.)
Apache Hadoop v1
• Pros:
  – A framework for running MapReduce jobs that lets you run the same piece of code on a single-node cluster or on one spanning thousands of machines.
• Cons:
  – It is a framework for running MapReduce jobs – and only MapReduce jobs.
Apache Giraph
• Iterative graph processing on a Hadoop cluster
• An iterative approach on MapReduce would require running multiple jobs.
• To avoid MR overheads, Giraph runs everything as a single map-only job.
(Diagram: a single map-only job in which one Map Task acts as the master and the remaining Map Tasks act as workers.)
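Giraph's map-only trick can be illustrated with a tiny, framework-free sketch: a single long-lived process runs supersteps in a loop until the computation converges, instead of paying job-startup overhead once per iteration. Everything below (the label-propagation example, the function names) is illustrative Python, not Giraph's actual API.

```python
# Toy superstep loop in the style of Giraph: one long-lived process iterates
# until no vertex changes, instead of one MapReduce job per iteration.

def propagate_min_labels(edges, vertices):
    """Label each vertex with the smallest vertex id reachable from it."""
    labels = {v: v for v in vertices}
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    superstep = 0
    changed = True
    while changed:                      # each pass is one "superstep"
        changed = False
        for v in vertices:
            best = min([labels[v]] + [labels[n] for n in neighbors[v]])
            if best < labels[v]:
                labels[v] = best
                changed = True
        superstep += 1
    return labels, superstep

# Two components: {1, 2, 3} and {5, 6}; converges in a couple of supersteps.
labels, steps = propagate_min_labels([(1, 2), (2, 3), (5, 6)], [1, 2, 3, 5, 6])
```

On a real cluster the workers would exchange labels over the network between supersteps; the key point is that the iteration lives inside one job rather than across many.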
Apache Oozie
• Workflow scheduler system to manage Hadoop jobs.
• Running a Pig script through Oozie:
(Diagram: (1) Oozie submits a launcher job to the JobTracker; (2) the launcher runs as a map task that executes the Pig script; (3) the script submits subsequent MR jobs.)
Apache Hadoop v2
YARN: The Operating System of a Hadoop Cluster
The YARN Stack
YARN Glossary
• Installer – Application Installer or Application Client
• Client – Application Client
• Supervisor – Application Master
• Workers – Application Containers
YARN Architecture
(Diagram: clients submit applications to the ResourceManager; for each application, the ResourceManager has a NodeManager launch an AppMaster in a container, and the AppMaster then runs the application's work in further containers spread across the NodeManagers.)
YARN Application Flow
(Diagram: the Application Client uses YarnClient to talk to the ResourceManager over the Application Client Protocol, and an app-specific API to talk to its Application Master; the Application Master uses AMRMClient to talk to the ResourceManager over the Application Master Protocol, and NMClient to talk to NodeManagers over the Container Management Protocol to launch app containers.)
YARN Protocols & Client Libraries
• Application Client Protocol: Client to RM interaction
  – Library: YarnClient
  – Application lifecycle control
  – Access cluster information
• Application Master Protocol: AM to RM interaction
  – Library: AMRMClient / AMRMClientAsync
  – Resource negotiation
  – Heartbeat to the RM
• Container Management Protocol: AM to NM interaction
  – Library: NMClient / NMClientAsync
  – Launching allocated containers
  – Stopping running containers
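The three protocols can be sketched as a toy, single-process simulation. The class and method names below are illustrative stand-ins, not the real Java client libraries; the point is the shape of the interactions (client submits to the RM, the AM negotiates resources with the RM, the AM launches containers on NMs).

```python
# Toy simulation of the three YARN interactions:
# client -> RM (submit), AM <-> RM (negotiate), AM -> NM (launch).
# All names here are illustrative, not the real YARN API.

class ResourceManager:
    def __init__(self, cluster_slots):
        self.free = cluster_slots
        self.apps = {}

    def submit_application(self, name):          # Application Client Protocol
        app_id = f"application_{len(self.apps) + 1:04d}"
        self.apps[app_id] = name
        return app_id

    def allocate(self, requested):               # Application Master Protocol
        granted = min(requested, self.free)      # you may not get all you asked for
        self.free -= granted
        return granted

class NodeManager:
    def __init__(self):
        self.running = []

    def launch_container(self, command):         # Container Management Protocol
        self.running.append(command)
        return len(self.running) - 1             # container id

rm = ResourceManager(cluster_slots=3)
nm = NodeManager()

app_id = rm.submit_application("wordcount")      # client submits the app
granted = rm.allocate(requested=5)               # AM asks the RM for 5 containers
ids = [nm.launch_container("run-task") for _ in range(granted)]  # AM -> NM
```

Note how the AM asked for five containers but only received three: this mirrors the incremental-allocation behavior discussed under Best Practices later in the deck.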
Applications on YARN
YARN Applications
• Categorizing Applications
  – What does the application do?
  – Application lifetime
  – How applications accept work
  – Language
• Application Lifetime
  – Job submit to completion
  – Long-running services
• Job Submissions
  – One job: one application
  – Multiple jobs per application
Language Considerations
• Hadoop RPC uses Google Protobuf
  – Protobuf bindings: C/C++, Go, Java, Python…
• Accessing HDFS
  – WebHDFS
  – libhdfs for C
  – Python client by Spotify Labs: Snakebite
• YARN Application Logic
  – ApplicationMaster in Java, containers in any language
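Because WebHDFS is a plain REST interface, any language with an HTTP client can reach HDFS. A minimal sketch of building WebHDFS request URLs follows; the host, port, and paths are placeholders, not a real cluster.

```python
# Build WebHDFS REST URLs; any HTTP client in any language can then call them.
# The host, port, and file paths below are placeholders for illustration.

def webhdfs_url(host, port, path, op, user=None):
    url = f"http://{host}:{port}/webhdfs/v1{path}?op={op}"
    if user:
        url += f"&user.name={user}"
    return url

# e.g. list a directory, or open a file for reading:
list_url = webhdfs_url("namenode.example.com", 50070, "/data", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", 50070, "/data/part-0", "OPEN",
                       user="alice")
```

`LISTSTATUS` and `OPEN` are standard WebHDFS operations; issuing a GET against these URLs (with, say, curl or urllib) is all a non-Java container needs in order to read input from HDFS.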
Tez (App Submission)
• Distributed execution framework – computation is expressed as a DAG
• Takes MapReduce to the next level – each job is no longer limited to a Map and/or Reduce stage.
(Diagram: the Tez Client handles job submission and monitoring – it asks the ResourceManager to launch the Tez AM and submits the DAG to it; the Tez AM handles DAG execution logic, task coordination, and local task scheduling – it heartbeats to the RM to request and receive resources, then launches tasks on the NodeManagers.)
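The DAG idea can be sketched with a toy executor that runs each stage once all of its upstream stages have finished. This is illustrative Python, not Tez's API, and the sample DAG is made up.

```python
# Toy DAG executor: run each stage when all of its inputs are done, the way a
# DAG framework generalizes MapReduce's fixed map -> reduce pipeline.

def run_dag(deps, run_stage):
    """deps maps stage -> list of upstream stages; run_stage executes one stage."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [s for s in deps
                 if s not in done and all(d in done for d in deps[s])]
        if not ready:
            raise ValueError("cycle in DAG")
        for stage in sorted(ready):
            run_stage(stage)
            done.add(stage)
            order.append(stage)
    return order

# A small diamond-shaped DAG: two map-like stages feed a join, then a sink.
deps = {"map1": [], "map2": [], "join": ["map1", "map2"], "sink": ["join"]}
executed = run_dag(deps, run_stage=lambda s: None)
```

Expressing this as a single application avoids the per-job launch overhead that chaining MapReduce jobs (as in the Giraph and Oozie examples earlier) would incur.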
HOYA (Long-Running App)
• On-demand HBase cluster setup
• Share cluster resources – persist and shut down the cluster when not needed
• Dynamically handles node failures
• Allows resizing of a running HBase cluster
Samza on YARN (Failure Handling App)
• Stream processing system – uses YARN as the execution framework
• Makes use of CGroups support in YARN for CPU isolation
• Uses Kafka as the underlying store
(Diagram: the Samza AM gets new containers from the ResourceManager and launches task containers on the NodeManagers; when a container finishes or fails, the AM is notified and launches a replacement; task containers consume from and write to Kafka streams.)
YARN Ecosystem
Powered by YARN
• Apache Giraph – graph processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – batch
• Apache Tez – batch/interactive
• Apache S4 – stream processing
• Apache Samza – stream processing
• Apache Storm – stream processing
• Apache Spark – iterative/interactive applications
• Cloudera Llama
• DataTorrent
• HOYA – HBase on YARN
• RedPoint Data Management

YARN Utilities/Frameworks
• Weave by Continuuity
• REEF by Microsoft
• Spring support for Hadoop 2
YARN Best Practices
Best Practices
• Use the provided client libraries
• Resource Negotiation
  – You may ask, but you may not get what you want – immediately.
  – Locality requests may not always be met.
  – Resources like memory/CPU are guaranteed.
• Failure Handling
  – Remember, anything can fail (or YARN can preempt your containers).
  – AM failures are handled by YARN, but container failures must be handled by the application.
• Checkpointing
  – Checkpoint AM state for AM recovery.
  – If tasks are long-running, checkpoint task state too.
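The checkpointing advice can be sketched minimally: the AM records completed work in durable storage, so a restarted AM attempt resumes instead of starting over. A local JSON file stands in for durable storage such as HDFS here, and all names are illustrative.

```python
import json, os

# Minimal checkpoint/recover sketch: the "AM" records which tasks are done,
# so a restarted attempt resumes from the checkpoint instead of from scratch.

CHECKPOINT = "am_state.json"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"completed": []}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def run_tasks(tasks, fail_after=None):
    """Run tasks, checkpointing after each; optionally die partway through."""
    state = load_state()
    for i, task in enumerate(tasks):
        if task in state["completed"]:
            continue                       # already done in a previous attempt
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("AM died")  # simulated failure / preemption
        state["completed"].append(task)
        save_state(state)
    return state["completed"]

tasks = ["t1", "t2", "t3", "t4"]
try:
    run_tasks(tasks, fail_after=2)      # first attempt dies after two tasks
except RuntimeError:
    pass
completed = run_tasks(tasks)            # second attempt resumes and finishes
os.remove(CHECKPOINT)
```

The same shape applies to long-running task state: checkpoint after each unit of completed work, and make recovery a matter of skipping what the checkpoint already covers.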
Best Practices
• Cluster Dependencies
  – Try to make zero assumptions about the cluster.
  – Your application bundle should deploy everything it requires using YARN's local resources.
• Client-Only Installs if Possible
  – Simplifies cluster deployment and multi-version support
• Securing Your Application
  – YARN does not secure communications between the AM and its containers.
Testing/Debugging your Application
• MiniYARNCluster
  – Regression tests
• Unmanaged AM
  – Support for running the AM outside of a YARN cluster for manual testing
• Logs
  – Log aggregation support to push all logs into HDFS
  – Accessible via the CLI and UI
Future work in YARN
• ResourceManager High Availability and Work-Preserving Restart
  – Work in progress
• Scheduler Enhancements
  – SLA-driven scheduling, gang scheduling
  – Multiple resource types – disk/network/GPUs/affinity
• Rolling Upgrades
• Long-Running Services
  – Better support for running services like HBase
  – Discovery of services, upgrades without downtime
• More Utilities/Libraries for Application Developers
  – Failover/checkpointing
• http://hadoop.apache.org
Questions?