PaaS on Hadoop YARN
Idea and Prototype
ABSTRACT
This document describes a prototype implementation of a simple PAAS built on the Hadoop YARN framework.
SAP Labs
Table of Contents
Why was HADOOP YARN considered as PAAS?
Comparison with VMware's Cloud Foundry
Hadoop PAAS Prototype
  Prototype Scope
  Architecture Diagram
  Application War File Provisioning Flow
  Starting-up Application Instances Flow
  Stopping Application Instances Flow
Why was HADOOP YARN considered as PAAS?

A large-scale PAAS offering needs to solve many of the same problems that Hadoop already addresses. Hadoop is a framework for running applications on large clusters of commodity hardware. This commodity characteristic drives down costs and avoids vendor lock-in.

Many companies have already deployed a Hadoop cluster for big data processing, so using it as a PAAS can be easier in terms of development and operational adaptation than adopting a separate PAAS solution. It can also improve resource utilization, since one Hadoop cluster can be used for both big data processing and the PAAS.
• Hadoop YARN's three-layer architecture (client, Hadoop YARN core, application instances) looks like a good fit for a PAAS.
• Hadoop YARN's monitoring and resource management capabilities (i.e. the Resource Manager and Node Managers) seem promising for a PAAS.
• Hadoop is a Java framework, which is good for many enterprise companies in the sense that they already have enough Java developers.
• With HBase, Hadoop can be used as a NoSQL database, which can be exposed as a PAAS service.
Comparison with VMware's Cloud Foundry

Before we jump to the prototype design and implementation, let's briefly take a look at a well-known PAAS, VMware's Cloud Foundry. The following gap analysis highlights the components missing from Hadoop YARN that would need to be added in order to create a YARN-based PAAS offering. Note that we consider the CloudFoundry.org open source offering here, not the CloudFoundry.com commercial offering.
Figure 1 VMware's Cloud Foundry
Cloud Foundry has three distinctive architectural layers: the client layer, the core layer, and the services and applications layer. Interestingly, this three-layer architecture is very similar to the Hadoop YARN architecture.

Client Layer

Management Client
  Cloud Foundry: Has a command-line management client called 'vmc' to provision/update applications, manage services, and get information about applications.
  Hadoop PAAS proposal: Develop a generic YARN client to handle commands (e.g. push, start, stop, ...).

Application Provisioning
  Cloud Foundry: Supports WAR file based application provisioning (e.g. vmc push myapp.war).
  Hadoop PAAS proposal: We can mimic the same push behavior in the YARN client mentioned in the previous item, assuming that web containers (e.g. Jetty or Tomcat) are launched and run as application containers on the nodes.
Cloud Foundry System Layer – Core Architecture

Request Routing
  Cloud Foundry: Has a Router component, which is actually Nginx, to route request traffic.
  Hadoop PAAS proposal: There is nothing like the Router in Hadoop YARN. We need a high-performance HTTP proxy similar to Nginx.

Health Management
  Cloud Foundry: The Health Manager wakes up periodically, performs health checks, and applies a remedy if necessary.
  Hadoop PAAS proposal: In Hadoop YARN, the Application Master, in coordination with the Resource Manager, can perform a similar role.

Messaging
  Cloud Foundry: Uses NATS as a messaging backbone among the various components to coordinate synchronization and update states.
  Hadoop PAAS proposal: We need a synchronization service spanning Hadoop components and non-Hadoop components (e.g. the Router). HBase uses ZooKeeper to provide similar functionality.

Main Controller
  Cloud Foundry: The Cloud Controller of Cloud Foundry.
  Hadoop PAAS proposal: The Resource Manager and Node Managers play a similar role.
Droplet Execution Engines Layer – Running Services and Applications

App Running Environment
  Cloud Foundry: The Droplet Execution Engine is the environment that runs service and application instances.
  Hadoop PAAS proposal: A container is YARN's shell environment for running an application instance.

Java Web Container
  Cloud Foundry: Uses Tomcat as the application container in the Droplet Execution Engine by default.
  Hadoop PAAS proposal: We can use any embeddable web container (e.g. Jetty).

Service Support
  Cloud Foundry: Allows registering/unregistering services (MySQL, NoSQL services, and so on) and provisioning applications integrated with those services.
  Hadoop PAAS proposal: One of the biggest gaps between Cloud Foundry and Hadoop YARN; there is no service support in Hadoop YARN.

Multi-tenancy
  Cloud Foundry: Supports service-level multi-tenancy. For example, a user can provision a database service (e.g. MySQL) and bind it to his/her application, which means that the database service instance is dedicated to that application's instances.
  Hadoop PAAS proposal: Has limited support; only container-level multi-tenancy is available.
Hadoop PAAS Prototype

As we saw in the previous section, Hadoop YARN is missing some PAAS components and capabilities when compared to Cloud Foundry, but it still has many desirable features for a PAAS. In order to better understand the characteristics of such a system, we implemented a basic PAAS prototype, leveraging built-in functionality as much as possible and filling gaps with additional services as needed.
Prototype Scope

The prototype was scoped to implement very basic PAAS functionality:

• A generic command-line YARN client to provision applications, start application instances, and stop a number of running application instances.
• Only Java web applications are supported, and only WAR files are accepted as application provisioning packages.
The following features were not requirements of the prototype:

• Auto scaling (elasticity)
• Multi-tenancy support
• Service (relational or NoSQL database) provisioning and integration
Architecture Diagram
Figure 2 Prototype Architecture
To work with the Hadoop YARN framework, four Java projects were created:

• PAAS Client (PaasClient): hadoop-yarn-applications-paas-client
• PAAS Application Master (PaasAppMaster): hadoop-yarn-applications-paas-master
• PAAS Application Container (PaasAppContainer): hadoop-yarn-applications-paas-container
• PAAS ZooKeeper Client Library: hadoop-yarn-applications-paas-zkclient
Beyond that, since we needed a router to serve as an entry point for incoming application requests and route them to the corresponding application instances, we slightly modified and used LittleProxy, a Java implementation of an HTTP proxy. (Note that this LittleProxy implementation is not part of the PAAS application.) Furthermore, ZooKeeper was used to synchronize the states of application instances among the PAAS components.
PAAS Client (PaasClient): hadoop-yarn-applications-paas-client
PaasClient is a dedicated YARN client, written in Java, that works like a command shell to process PAAS commands. Users (i.e. operators) can issue commands in the client to provision application files (i.e. war files), start or stop application instances, get a list of instance information, and get YARN application status. Here are some example commands:
push /Users/PAAS/HelloWorld.war
  Provisions the HelloWorld application to the Hadoop file system so that the PAAS Application Container can later pick it up.

start HelloWorld 2048 2
  Starts 2 instances of the HelloWorld service with a maximum memory limit of 2 GB.

stop HelloWorld 3
  Stops 3 instances of the HelloWorld application.

instances HelloWorld
  Returns information about running HelloWorld instances.
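The shell-style commands above can be tokenized very simply: the first token selects the action and the rest are positional arguments. The following is a minimal sketch of such parsing; the `PaasCommand` type and its field names are illustrative, not the prototype's actual API.

```java
import java.util.Arrays;

// Minimal sketch of command tokenization for a PaasClient-style shell.
public class PaasCommandParser {

    public static final class PaasCommand {
        public final String action;   // e.g. push, start, stop, instances
        public final String[] args;   // remaining tokens, e.g. app name, memory, count
        PaasCommand(String action, String[] args) {
            this.action = action;
            this.args = args;
        }
    }

    public static PaasCommand parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            throw new IllegalArgumentException("empty command");
        }
        return new PaasCommand(tokens[0], Arrays.copyOfRange(tokens, 1, tokens.length));
    }

    public static void main(String[] args) {
        PaasCommand cmd = parse("start HelloWorld 2048 2");
        System.out.println(cmd.action + " app=" + cmd.args[0]
                + " memMb=" + cmd.args[1] + " count=" + cmd.args[2]);
    }
}
```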
PAAS Application Master (PaasAppMaster): hadoop-yarn-applications-paas-master

PaasAppMaster is a YARN application master that manages the lifecycle of PAAS application instances. For each 'start' command, one instance of PaasAppMaster is created, which then tries to create the requested number of PaasAppContainer instances. It lives until all the PaasAppContainer instances stop or fail.

PAAS Application Container (PaasAppContainer) – Jetty Web Container: hadoop-yarn-applications-paas-container

PaasAppContainer is instantiated by PaasAppMaster as a YARN container according to the requested resource limit (e.g. a memory limit of 2 GB). It is a wrapper around the embedded Jetty web container: it loads the war file for the requested service (e.g. 'start HelloWorld 2048 2') and interacts with ZooKeeper to register and unregister itself so that the Router can update its routing table.

Router (modified LittleProxy)

The Router's main function is to route and distribute application requests to the running application instances. We used LittleProxy, a Java HTTP proxy implementation built on the Netty asynchronous networking framework, and modified it to maintain a routing table in ZooKeeper and route requests based on that table. Whenever a PaasAppContainer instance registers or unregisters with ZooKeeper, the Router is notified of the event and updates the routing table accordingly.

ZooKeeper: hadoop-yarn-applications-paas-zkclient

ZooKeeper is a centralized service that provides distributed synchronization, configuration maintenance, and group services. ZooKeeper was used in the prototype for two main reasons:
• To keep track of running application instances. A PaasAppContainer instance registers or unregisters itself with ZooKeeper when it starts or stops. This information is used by the Router.
• To stop application instances in coordination with PaasAppContainer instances, as YARN does not provide functionality to stop individual containers.
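The Router's routing table can be modeled as a map from service name to registered instance URLs. The sketch below shows that model in memory with a round-robin pick; in the prototype the table lives in ZooKeeper and register/unregister happen via watch events, whereas here they are direct method calls. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model of the Router's routing table.
public class RoutingTable {

    // service name -> registered instance URLs, in registration order
    private final Map<String, List<String>> table = new LinkedHashMap<>();
    private final Map<String, Integer> cursor = new LinkedHashMap<>();

    // Called when a PaasAppContainer registers itself (via ZooKeeper in the prototype).
    public synchronized void register(String service, String url) {
        table.computeIfAbsent(service, s -> new ArrayList<>()).add(url);
    }

    // Called when an instance unregisters, or its registration node is deleted.
    public synchronized void unregister(String service, String url) {
        List<String> urls = table.get(service);
        if (urls != null) {
            urls.remove(url);
            if (urls.isEmpty()) table.remove(service);
        }
    }

    // Round-robin choice of the next instance for an incoming request.
    public synchronized String route(String service) {
        List<String> urls = table.get(service);
        if (urls == null || urls.isEmpty()) return null;
        int i = cursor.merge(service, 1, Integer::sum) % urls.size();
        return urls.get(i);
    }
}
```

Round-robin is one plausible distribution policy; the text does not specify which policy the modified LittleProxy actually used.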
Application War File Provisioning Flow

For simplicity, application war files are provisioned to the Hadoop file system (HDFS) so that PaasAppContainer instances can pick them up when a 'start' command is executed. This happens in the following order:
• An admin user issues a push command in PaasClient, such as "push /Users/PAAS/svcA.war", where svcA.war is the web application file of the service named 'svcA'. PaasClient then uploads the file to the Hadoop server using 'scp' and 'ssh'. There is a dedicated directory in the Hadoop file system for war files (e.g. /PAAS/), while non-application libraries and jar files are stored in a different directory.
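The naming convention implied by the push flow can be sketched as follows: the service name is derived from the war file name, and the file lands in the dedicated war directory. The /PAAS/ directory is the example given in the text; the method names are illustrative.

```java
import java.nio.file.Paths;

// Sketch of the push command's path convention: service name from the
// war file name, destination under a dedicated HDFS war directory.
public class WarProvisioning {

    static final String WAR_DIR = "/PAAS/"; // example war directory from the text

    public static String serviceName(String localWarPath) {
        String file = Paths.get(localWarPath).getFileName().toString();
        if (!file.endsWith(".war")) {
            throw new IllegalArgumentException("not a war file: " + localWarPath);
        }
        return file.substring(0, file.length() - ".war".length());
    }

    public static String hdfsDestination(String localWarPath) {
        return WAR_DIR + serviceName(localWarPath) + ".war";
    }
}
```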
Starting-up Application Instances Flow

Assuming that all the PAAS system jar files (e.g. PaasAppContainer and PaasAppMaster) and an application war file "svcA.war" are already provisioned, here are the steps to start up application instances:
1. An admin user issues a start command in PaasClient, such as "start svcA 2048 3", which starts 3 "svcA" application instances with a 2 GB memory limit.
2. PaasClient connects to the Hadoop Resource Manager and requests one container for PaasAppMaster. Once the container is allocated, PaasClient issues a request to start PaasAppMaster in it, with parameters such as which application to run and how many instances to start.
3. PaasAppMaster connects to the Resource Manager and requests as many containers for PaasAppContainer instances as the number of instances requested. After the containers are allocated, it issues requests to start up PaasAppContainer instances for the application name "svcA".
4. When the PaasAppContainer instances start, they load the corresponding WAR file from the Hadoop file system and start the embedded Jetty web container. After the Jetty web container has started successfully, each PaasAppContainer instance registers itself with ZooKeeper using the service URL of its Jetty web container.
5. The Router is notified of this registration event and updates its routing table. At this point, the application instance is ready to serve app users' requests.
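The steps above can be condensed into a toy walkthrough, with the YARN and ZooKeeper interactions replaced by in-memory stand-ins. Everything here is illustrative; the real prototype talks to the Resource Manager and ZooKeeper over their respective APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Toy walkthrough of the start flow, YARN/ZooKeeper replaced by stand-ins.
public class StartFlowSketch {

    // Stand-in for the ZooKeeper registry that the Router watches.
    static final List<String> registry = new ArrayList<>();

    // Stand-in for steps 3-4: one container per requested instance; each
    // "container" loads the war, starts Jetty, then registers its service URL.
    static void startInstances(String app, int memMb, int count) {
        for (int i = 0; i < count; i++) {
            String serviceUrl = "http://node" + i + ":8080/" + app; // hypothetical URL layout
            registry.add(serviceUrl);                               // step 4: register with ZooKeeper
        }
    }

    public static void main(String[] args) {
        startInstances("svcA", 2048, 3); // corresponds to "start svcA 2048 3"
        System.out.println(registry.size() + " instances registered");
    }
}
```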
Stopping Application Instances Flow

Because Hadoop YARN does not provide a way to stop an individual container, the prototype uses ZooKeeper's synchronization capability to process stop commands:
1. An admin user issues a stop command in PaasClient, such as "stop svcA 2", which stops 2 instances of the svcA application.
2. PaasClient randomly chooses as many instances as requested from among the running app instances and deletes the corresponding registration information from ZooKeeper.
3. These deletion (i.e. unregistration) events are delivered to the corresponding PaasAppContainer instances, which then stop themselves. Although a more intelligent stopping process is possible, such as waiting a reasonable duration for existing sessions to complete their work before stopping the container, we chose the simplest approach of stopping right away for the prototype. This is certainly a limitation of the prototype.
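The random selection in step 2 can be sketched as a shuffle-and-take over the currently registered instances. In the prototype the chosen instances' registration nodes would then be deleted from ZooKeeper; only the selection logic is shown here, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of step 2 of the stop flow: pick the requested number of
// instances at random from those currently registered.
public class StopSelector {

    public static List<String> chooseInstancesToStop(List<String> running, int count, Random rnd) {
        List<String> copy = new ArrayList<>(running);
        Collections.shuffle(copy, rnd);           // randomize order
        return copy.subList(0, Math.min(count, copy.size()));
    }
}
```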