Lajos Papp / DevOps / SequenceIQ Inc.
OVERVIEW
! GOAL / MOTIVATION
! TECHNOLOGY STACK
! PROBLEM RESOLUTION / HOW IT WORKS
! RESULTS / ACHIEVEMENTS
GOAL / MOTIVATION
! Ease Hadoop provisioning – everywhere
! Automate and unify the process
! Arbitrary cluster size
! Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
! (Auto) scaling Hadoop
! QoS
OUR APPROACH
! Use Docker
! Build cloud-specific ‘Dockerized’ images
! Provision the cluster
! Use Ambari
DOCKER
! Lightweight, portable
! Build once, run anywhere
! VM – without the overhead of a VM
! Isolated containers
! Automated and scripted
DOCKER – CONTAINERS vs. VMs
! Containers are isolated, but share OS and, where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
! Easy Hadoop cluster provisioning
! Management and monitoring
! Key features – blueprints
! REST API
APACHE AMBARI – CREATE CLUSTER
! Define a blueprint (POST /api/v1/blueprints)
! Create cluster (POST /api/v1/clusters/mycluster)
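The two calls above can be sketched as plain HTTP requests. This is a minimal illustration, not the presenters' code: the server address, blueprint name, host name, and the deliberately tiny component list are assumptions, and the blueprint name is placed in the URL as current Ambari releases expect. A real server also requires authentication, which is omitted here.

```python
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical server address


def ambari_post(path, payload):
    """Build (but do not send) a JSON POST request for the Ambari REST API."""
    return urllib.request.Request(
        url=AMBARI + path,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        # Ambari rejects modifying requests that lack an X-Requested-By header.
        headers={"X-Requested-By": "ambari", "Content-Type": "application/json"},
    )


# Step 1: the blueprint names the stack and the components of each host group.
# The component list is intentionally incomplete; it only shows the shape.
blueprint = {
    "Blueprints": {
        "blueprint_name": "single-node",
        "stack_name": "HDP",
        "stack_version": "2.1",
    },
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
        }
    ],
}
define_blueprint = ambari_post("/api/v1/blueprints/single-node", blueprint)

# Step 2: the cluster request maps real hosts onto the blueprint's host groups.
cluster = {
    "blueprint": "single-node",
    "host_groups": [{"name": "master", "hosts": [{"fqdn": "node1.example.com"}]}],
}
create_cluster = ambari_post("/api/v1/clusters/mycluster", cluster)

# Sending is left out on purpose; against a reachable server it would be
# urllib.request.urlopen(define_blueprint) followed by
# urllib.request.urlopen(create_cluster).
```

Keeping the blueprint separate from the host mapping is what lets the same blueprint be reused for clusters of different sizes.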
HADOOP PROVISIONING ISSUES
! Each cloud provider has a proprietary API
! Create images for each provider
! Network configuration
! Service discovery
! Resize, failover, member join support
OUR APPROACH – DETAILS
! Build your Docker image
! Install or pre-install Hadoop services with Ambari
! Install Serf and dnsmasq
! Build your cloud image
! Use Ansible to create an image
! Provision the cluster
BUILD DOCKER IMAGES
! Create the Dockerfile
! Have Docker.io build the image
! Optionally pre-install services
! Use Ambari
! Push image to Docker.io
! Licensing questions
BUILD CLOUD IMAGES
! Use a Docker ready base image
! Use Ansible to provision the image template
! Pull the Docker images
! Apply custom infrastructure
! Use cloud provider specific playbooks
! AWS EC2
! Azure
ANSIBLE
! Configuration as data
! Simplest way to automate IT
! Secure and agentless
! Goal oriented
! One playbook – multiple modules
! We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
! FQDN
! /etc/hosts is read-only in Docker
! Everybody needs to know everybody
! DNS
! Single point of failure
! Dynamic cluster – nodes joining, leaving, failing
! Routing
! Cloud – ability to route between containers on different hosts
! Collision free private IP range for Docker bridge
! We need predefined host names/IP addresses
! Start a DNS server
! Use it as a reference: docker run --dns <IP_OF_DNS>
! Nodes need to know each other
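Pointing a container at a DNS server and giving it a fixed host name can be sketched as a small command builder. The image name, FQDN, and IP address below are made-up placeholders; the flags are the standard `docker run` options (`-d`, `-h`, and `--dns`, which is the current spelling of the DNS flag).

```python
import shlex


def docker_run_command(image, fqdn, dns_ip):
    """Return the argv for a detached container with a fixed hostname and DNS server."""
    return [
        "docker", "run",
        "-d",             # run detached
        "-h", fqdn,       # hostname inside the container
        "--dns", dns_ip,  # resolver written to the container's /etc/resolv.conf
        image,
    ]


argv = docker_run_command("example/ambari-node", "node1.mycluster.local", "172.17.0.2")
print(shlex.join(argv))
# prints: docker run -d -h node1.mycluster.local --dns 172.17.0.2 example/ambari-node
```

Building the argument list instead of a shell string keeps the values safe to pass to `subprocess.run(argv)` without quoting problems.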
PROVISIONING – SOLUTION
! FQDN
! Use -h and --dns Docker params
! DNS
! dnsmasq is running on each Docker container
! Serf member-xxx events trigger dnsmasq reconfiguration
! Routing
! Docker bridge configuration – follows a convention
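How Serf membership events could drive dnsmasq can be sketched as follows. This is an assumption-laden illustration, not the SequenceIQ handler: it relies on Serf passing one member per line on the handler's standard input (tab-separated, name first, address second), and it renders a hosts file that dnsmasq could load through its `addn-hosts` option. The domain name is hypothetical.

```python
def apply_member_event(hosts, event, payload):
    """Update a name -> IP table from one Serf membership event.

    `payload` is the text Serf writes to the handler's stdin: one member
    per line, tab-separated, starting with name and address.
    """
    for line in payload.splitlines():
        if not line.strip():
            continue
        name, address = line.split("\t")[:2]
        if event in ("member-join", "member-update"):
            hosts[name] = address
        elif event in ("member-leave", "member-failed", "member-reap"):
            hosts.pop(name, None)
    return hosts


def render_hosts_file(hosts, domain="mycluster.local"):
    """Render hosts-file lines (IP, FQDN, short name), sorted by node name."""
    return "".join(
        f"{ip} {name}.{domain} {name}\n" for name, ip in sorted(hosts.items())
    )


hosts = {}
apply_member_event(hosts, "member-join", "node1\t172.17.0.2\t\t\nnode2\t172.17.0.3\t\t\n")
apply_member_event(hosts, "member-failed", "node2\t172.17.0.3\t\t\n")
print(render_hosts_file(hosts), end="")
# prints: 172.17.0.2 node1.mycluster.local node1
```

After rewriting the file, a real handler would still have to tell dnsmasq to reload it, for example by sending the process a SIGHUP.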
SERF
! Gossip based membership
! Service discovery
! Decentralized
! Lightweight, fault tolerant
! Highly available
! DevOps friendly
! Keep an eye on Consul, Open vSwitch, pipework
SERF - DECENTRALIZED SERVICE DISCOVERY
! Gossip instead of heartbeat
! LAN, WAN profiles
! Provides membership information
! Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
! Query
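A handler usually starts by working out which kind of event it was invoked for. The sketch below assumes Serf's documented environment variables (`SERF_EVENT`, `SERF_USER_EVENT`, `SERF_QUERY_NAME`) and its convention of passing the payload on standard input; the function name and the returned summary strings are purely illustrative.

```python
import io

MEMBER_EVENTS = {
    "member-join", "member-leave", "member-failed", "member-update", "member-reap",
}


def describe_event(env, stdin):
    """Summarise one handler invocation from its environment and stdin."""
    event = env.get("SERF_EVENT", "")
    if event in MEMBER_EVENTS:
        # Membership events carry one tab-separated member per line.
        members = [line.split("\t")[0] for line in stdin.read().splitlines() if line]
        return f"{event}: {', '.join(members)}"
    if event == "user":
        return f"user event {env.get('SERF_USER_EVENT', '')}: {stdin.read().strip()}"
    if event == "query":
        return f"query {env.get('SERF_QUERY_NAME', '')}: {stdin.read().strip()}"
    return f"unhandled event {event!r}"


print(describe_event({"SERF_EVENT": "member-join"},
                     io.StringIO("node1\t172.17.0.2\t\t\n")))
# prints: member-join: node1
print(describe_event({"SERF_EVENT": "user", "SERF_USER_EVENT": "deploy"},
                     io.StringIO("v42\n")))
# prints: user event deploy: v42
```

In a real handler script the arguments would be `os.environ` and `sys.stdin`; taking them as parameters keeps the logic testable without a running Serf agent.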
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
! Network infrastructure for small networks
! Lightweight DNS, DHCP server
! Comes with most Linux distributions
AWS EC2 – HADOOP CLUSTER
! Use EC2 REST API to provision instances (from Dockerized image)
! Start Docker containers
! One Ambari server
! N-1 Ambari agents connecting to server
! Connect ambari-shell to
! Define blueprint
! Provision the cluster
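The "one server, N-1 agents" layout can be expressed as a small planning step. This is only a sketch: the role labels and the choice of the first host as the server are assumptions, not something the slides specify.

```python
def assign_roles(hosts):
    """Plan one Ambari server (the first host) and an agent on every other host."""
    if not hosts:
        raise ValueError("at least one host is required")
    server, *agents = hosts
    plan = [{"host": server, "role": "ambari-server"}]
    # Each agent records which server it should register with.
    plan += [{"host": h, "role": "ambari-agent", "server": server} for h in agents]
    return plan


for entry in assign_roles(["node1", "node2", "node3"]):
    print(entry)
```

With three hosts this yields one `ambari-server` entry for `node1` and two `ambari-agent` entries that point back at `node1`.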
AWS EC2 – NETWORK SECURITY
! Create a VPC
! Configure subnets
! Routing tables
! Security gateway
! Set ACL
! Configure VPN
AWS EC2 - CLOUDFORMATION
! Setting up a VPC manually is too complicated
! Use CloudFormation
! Manage the stack together
! Template-based
! Environments under version control
! Customizable at runtime
! No extra charge
    "VpcId" : {
      "Type" : "String",
      "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
    },
    "SubnetId" : {
      "Type" : "String",
      "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
    },
    "SecondaryIPAddressCount" : {
      "Type" : "Number",
      "Default" : "1",
      "MinValue" : "1",
      "MaxValue" : "5",
      "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
      "ConstraintDescription": "must be a number from 1 to 5."
    },
    "SSHLocation" : {
      "Description" : "The IP address range that can be used to SSH to the EC2 instances",
      "Type": "String",
      "MinLength": "9",
      "MaxLength": "18",
      "Default": "0.0.0.0/0",
      "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
      "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
    }
  },
  "Mappings" : {
    "RegionMap" : {
      "us-east-1" : { "AMI" : "ami-7f418316" },
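The parameter constraints in the template fragment can be checked locally before a stack is launched. The sketch below mirrors the `SSHLocation` and `SecondaryIPAddressCount` rules from the fragment; the function names are illustrative, and the check assumes the pattern must match the whole value.

```python
import re

# Same pattern as the template's AllowedPattern, with JSON escaping removed.
SSH_LOCATION_PATTERN = r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"


def valid_ssh_location(value):
    """Apply MinLength 9, MaxLength 18 and the CIDR-shaped pattern."""
    return 9 <= len(value) <= 18 and re.fullmatch(SSH_LOCATION_PATTERN, value) is not None


def valid_secondary_ip_count(value):
    """Apply MinValue 1 and MaxValue 5."""
    return 1 <= value <= 5


print(valid_ssh_location("0.0.0.0/0"))     # True, the template default
print(valid_ssh_location("10.0.0.0"))      # False, the prefix length is missing
print(valid_ssh_location("999.1.1.1/99"))  # True, the pattern checks shape only
print(valid_secondary_ip_count(6))         # False
```

As the third call shows, the pattern accepts strings that are not valid CIDR ranges, so a stricter check (for example with Python's `ipaddress` module) may be worth adding on the client side.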
CLOUDBREAK
Cloudbreak is a powerful left surf that breaks over a coral reef, a mile
southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts
the provisioning and eases the management and monitoring of
on-demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
! Benefits
! Elastic
! Scalable
! Blueprints
! Flexible
! Main REST resources
! /template – specifies a cluster infrastructure
! /stack – creates a cloud infrastructure built from a template
! /blueprint – describes a Hadoop cluster
! /cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
! Hadoop as a Service API
! Available for EC2 and Azure cloud
! OpenStack and bare metal are coming soon
! Open source under Apache 2 licence
! Same goals as Apache Ambari Launchpad project
! What's next?
HADOOP SERVICES - AS A SERVICE
! Leverage YARN
! Slider (Hoya) providers
! HBase, Accumulo
! SequenceIQ providers - Flume, Tomcat
! YARN-1964
! QoS for YARN – heuristic scheduler
! Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in
Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful application development platform for building
on-demand data and job pipelines running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
! Get the code: https://github.com/sequenceiq
! Read about: http://blog.sequenceiq.com
! Facebook: http://facebook.com/sequenceiq
! Twitter: http://twitter.com/sequenceiq
! LinkedIn: http://linkedin.com/sequenceiq
! Contact: [email protected]
FEEL FREE TO CONTRIBUTE