Jan 23, 2019 15-719/18-709: Advanced Cloud Computing 1
Building a Carnegie Mellon cloud
15-719/18-709: Advanced Cloud Computing
Greg Ganger, Majd Sakr
George Amvrosiadis TA Office Hours
CMU’s dilemma
• Multiple computing resources available at CMU
  • PRObE clusters: Susitna, Marmot, Narwhal, …
  • Kiska OpenStack cluster
  • Orca BigLearning cluster
  • Stoat Hadoop cluster
…and that’s just the Parallel Data Lab!
• Problem: pretty much used only during deadlines
  • How can we take advantage of spare cycles?
Monetizing spare cycles
• Option 1: rent out entire machines
  • For users/projects willing to pay for whatever is available
  • Similar to renting a car
  • Problem: machines are heterogeneous, and users have different demands
• Option 2: rent out portions of machines
  • Portion equal to capability wanted by a user
  • Similar to renting a suite in a hotel
  • A {time, space} sharing approach
CMU cloud gets off the ground
• Business model
  • Users rent portions of machines
• User workloads supported
  • Software development
  • Running experiments in panic right before a conference deadline
  • Data analytics
  • Training AI/ML models
  • Bitcoin mining
The initial CMU cloud
[Figure: Provisioner & Scheduler receiving user rent requests; resources shown: Compute, Memory, Storage, Shared Storage; resource amounts marked as User 1 allocation and User 2 and 3 allocations]
The Provisioner: Input
• Basic: C compute cores, M MBs memory, U GBs storage
• Other constraints: network, special hardware (GPUs?), …
• AWS creates bins of these (e.g., m1.large)
  • Similar to hotel room types: Single, Double, Queen, King, Junior Suite, Master Suite, …
• Optional features
  • Persistent storage
  • Failure domains and Auto-scaling
  • Elastic IP addresses
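To make the idea of "bins" concrete, here is a minimal sketch that maps a (C, M, U) request onto the smallest predefined bin that satisfies it. The bin names and sizes are hypothetical and are not real AWS instance types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    cores: int        # C compute cores
    memory_mb: int    # M MBs memory
    storage_gb: int   # U GBs storage


# Hypothetical bins, ordered from smallest to largest.
BINS = [
    InstanceType("small", 1, 2048, 20),
    InstanceType("medium", 2, 4096, 40),
    InstanceType("large", 4, 8192, 80),
]


def smallest_fitting_bin(cores: int, memory_mb: int, storage_gb: int) -> Optional[InstanceType]:
    """Return the first (smallest) bin that covers every requested dimension."""
    for b in BINS:
        if b.cores >= cores and b.memory_mb >= memory_mb and b.storage_gb >= storage_gb:
            return b
    return None  # request is larger than any offered bin
```

A request such as `smallest_fitting_bin(2, 3000, 10)` is rounded up to the "medium" bin, which is why fixed bins can leave part of a rented allocation unused.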
The Provisioner: Output
• Assignment of users to machines
  • Matching: requested resources ↔ available machines
• Bin-packing problem: satisfy max possible user requests
  • Solving is NP-hard; deciding if there is a solution: NP-complete
• Heuristics and assumptions used to reduce complexity: e.g., distribution of request sizes, duration, inter-arrival times, …
• Some degrees of freedom: migrating existing users
  • For efficiency: allocate more reliable resources
  • For capacity: fit more people in the cloud
• Cost-benefit problem: migration can increase job runtimes!
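One common heuristic for bin packing is first-fit decreasing. The sketch below is an illustration only, not the provisioner described in the lecture: it assumes a single resource dimension (cores), identical machines, and no migration, and all names are hypothetical.

```python
def first_fit_decreasing(requests, capacity, num_machines):
    """Place requests (user -> cores) on machines, largest request first.

    Returns (placement, rejected): placement maps user -> machine index,
    rejected lists users whose request fits on no machine.
    """
    free = [capacity] * num_machines
    placement = {}
    rejected = []
    # Sorting by decreasing size tends to leave fewer unusable gaps.
    for user, need in sorted(requests.items(), key=lambda kv: kv[1], reverse=True):
        for machine, available in enumerate(free):
            if need <= available:
                free[machine] -= need
                placement[user] = machine
                break
        else:
            rejected.append(user)
    return placement, rejected
```

The heuristic runs in polynomial time but is not guaranteed to satisfy the maximum possible number of requests, which is the trade-off the slide refers to.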
The scheduler
• Purpose: choose which user jobs/processes to run
• Goal 1: Prioritization
  • Some users are more important ($$$) than others
• Goal 2: Oversubscription
  • Allocate more resources than available
  • Why: most users under-utilize resources
• Goal 3: Workload constraints
  • Gang scheduling: must co-schedule distributed software that runs in lock-step
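The first two goals can be sketched together as an admission check. This is a simplified illustration under assumed inputs: the job tuples, the priority values, and the 1.5 oversubscription ratio are all hypothetical.

```python
def admit(jobs, physical_cores, oversubscription=1.5):
    """Admit jobs in priority order until the oversubscribed limit is reached.

    jobs: list of (name, priority, requested_cores); higher priority goes first.
    The limit exceeds the physical capacity on the assumption that most
    users under-utilize what they request.
    """
    limit = physical_cores * oversubscription
    allocated = 0
    admitted = []
    for name, _priority, cores in sorted(jobs, key=lambda job: -job[1]):
        if allocated + cores <= limit:
            allocated += cores
            admitted.append(name)
    return admitted
```

With a ratio of 1.0 the function degenerates to strict capacity accounting, so the ratio is the knob that trades utilization against the risk of contention.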
CMU cloud + scheduler
[Figure: Provisioner & Scheduler receiving user rent requests (User 1: C1, M1, U1; User 2: C2, M2, U2; User 3: C3, M3, U3); resources shown: Compute, Memory, Storage, Shared Storage; User 1 allocation and User 2 and 3 allocations]
Encapsulation
• Goal: encapsulation for compute, storage, networking, and data
• We need to isolate co-located user processes
  • Use selected software/OS environment
  • Avoid issues due to environment/configuration changes
  • Avoid performance interference
  • Keep files and data private
• Ensure that users obtain resources they paid for
Virtualization
• Goal: multiplex single physical resource among multiple software-based resources
  • Hardware, OS → Virtual machines
  • Networks → VLANs
  • Disks → Virtual disks
• Used by: Amazon, Rackspace, Microsoft, …
Does not solve all encapsulation issues (e.g., performance interference)
CMU cloud + encapsulation
[Figure: Provisioner & Scheduler receiving user rent requests (User 1: C1, M1, U1; User 2: C2, M2, U2; User 3: C3, M3, U3); resources shown: Compute, Memory, Storage, Shared Storage; User 1 allocation and User 2 and 3 allocations]
Fault tolerance
• Hardware and software may fail at any time
[“Software Engineering Advice from Building Large-Scale Distributed Systems.” Jeff Dean, Google]
Fault tolerance mechanisms
• For storage
  • Replication (within/across data centers)
  • Redundant Array of Independent Disks (RAID)
• For critical infrastructure and services
  • State-machine replication, checkpointing, logging
  • Use state-free software design
[Figure: RAID layout with data blocks A0..D4 and parity blocks P0..P4 striped across Disk 0, Disk 1, Disk 2, Disk 3, Disk 4]
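The parity blocks in a RAID stripe are typically the byte-wise XOR of the data blocks, which is what lets a single lost block be rebuilt. A minimal sketch, assuming one stripe of four equally sized data blocks (the block contents are made up for illustration):

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: data blocks A, B, C, D and their parity block P.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing the disk that holds block C, then rebuild it
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[1], data[3], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, XOR-ing the survivors cancels A, B, and D out of the parity and leaves exactly the missing block.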
CMU cloud + replication
[Figure: two copies of the cloud, each with a Provisioner & Scheduler and Shared Storage, labeled Replication and Geo-replication]
Can we do better?
• Add more value: provide services as building blocks
• Storage services: scalable, fault tolerant data stores
  • E.g., key-value stores, file systems, databases
• Tools: Programming models and frameworks
  • E.g., analytics, HPC clusters, training AI models
• Automation: Reactive systems and elastic scaling
  • E.g., monitoring and tracking, load balancers
CMU cloud + building blocks
[Figure: Provisioner & Scheduler; Compute, Memory, Storage; Building Blocks (e.g., DynamoDB, SQS); Usage monitor; Shared Storage]
Programming frameworks
• High-level frameworks and languages for efficiently processing different data types
  • E.g., MapReduce, DryadLINQ, Spark
• Built as distributed software
  • Have their own scheduler
  • Have their own fault tolerance mechanisms
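As a rough illustration of the programming model such frameworks expose, here is a single-process word count in the MapReduce style. It is only a sketch: real frameworks run the map and reduce phases across many machines and add their own scheduling and fault tolerance, none of which is modeled here.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

The user writes only `map_phase` and `reduce_phase`; everything else is the framework's responsibility, which is why these systems need their own scheduler.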
CMU cloud + frameworks
[Figure: Provisioner & Scheduler; Compute, Memory, Storage; Programming framework with its own Framework Scheduler (two-level scheduling); Building Blocks; Usage monitor; Shared Storage]
Elastic scaling
• Observed load for a service is variable
  • High-load during Steam game sales, or when Nintendo Smash Bros U is released
• Traditional solution: provision for peak
• The Elastic scaling approach
  • Cloud monitors load
  • Adds application instances as necessary
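The elastic scaling approach can be sketched as a function from observed load to a target instance count. The per-instance capacity and the minimum and maximum bounds below are hypothetical parameters chosen for illustration.

```python
import math


def desired_instances(load_rps, capacity_rps_per_instance,
                      min_instances=1, max_instances=20):
    """Return how many instances are needed for the observed load.

    The result is clamped so the service never scales to zero and
    never exceeds the budgeted maximum.
    """
    needed = math.ceil(load_rps / capacity_rps_per_instance)
    return max(min_instances, min(max_instances, needed))
```

Provisioning for peak corresponds to always running `max_instances`; the elastic scaler instead re-evaluates this function as the monitored load changes.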
CMU cloud + elastic scaling
[Figure: Elastic Scaler; Provisioner & Scheduler; Compute, Memory, Storage; Programming framework with Framework Scheduler; Building Blocks; Usage monitor; Shared Storage]
Monitoring and diagnosis
• Cloud monitors user applications
  • Provides alerts when thresholds are crossed
  • E.g., Ganglia, AWS CloudWatch
• Cloud provides tracing libraries
  • Analyzes traces for problem root causes
  • E.g., Dapper
CMU cloud + monitoring
[Figure: Elastic Scaler; Provisioner & Scheduler; Monitoring & Diagnosis; Compute, Memory, Storage; Programming framework with Framework Scheduler; Building Blocks; Usage monitor; Shared Storage]
Pop Quiz: OpenNebula
• What type or types of cloud is OpenNebula targeted to?
  • Private / hybrid IaaS cloud. (Section: Virtual Infrastructure Management)
• What is the difference between OpenNebula and Haizea?
  • OpenNebula is a VI manager used to deploy and manage VMs. Haizea is a resource (lease) manager and scheduling backend. (Section: Virtual Infrastructure Management)
• What was the primary goal of developing OpenNebula, which was not covered by existing solutions?
  • A VI management solution with a flexible and open architecture for building private/hybrid clouds. (Section: The Cloud Ecosystem)
Uh oh…
Open-source VI mgmt. software
• OpenNebula: deploying/managing groups of 1+ VMs
• Automates aspects of VM setup on physical hosts
• E.g., preparing disk images, setting up networking, etc.
• Supports many VM types (KVM, VMware); external clouds
• Note: external cloud support enables hybrid cloud deployments
• Haizea: managing “leasing” of resources
• Can be used as a scheduling component of OpenNebula
• Decides which resources to assign a VM group (and when)
• Reservation and preemption/migration support
OpenNebula Architecture [Sotomayor2009]
• OpenNebula core: orchestrates use of driver plugins
  • Virtualization drivers: functions for setting up, starting, stopping VMs, etc.
  • Network drivers: functions for assigning network addresses
  • Storage drivers: functions for attaching network storage resources
  • External cloud drivers: functions for putting VMs on an external cloud (e.g., EC2)
• Scheduler: decides which VMs get which physical resources
Readings
• Required: “Virtual infrastructure management in private and hybrid clouds.” B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster. IEEE Internet Computing, 2009.
• Optional: “The Eucalyptus Open-Source Cloud-Computing System.” D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. CCGrid, 2009.
• Optional: “Beyond virtual data centers: Toward an open resource control architecture.” J. Chase, L. Grit, D. Irwin, V. Marupadi, P. Shivam, A. Yumerefendi. ACM International Conference on the Virtual Computing Initiative, 2007.
• Optional: “Get started with OpenStack.” OpenStack Documentation, Liberty Release (October 2015).
15-719/18-709 Advanced Cloud Computing
Lecture 03
OpenStack
January 23, 2019
http://www.cs.cmu.edu/~15719/
Motivation for Resource Provisioning
• Users locate computing resources based on:
  – HW architecture
  – Memory capacity
  – Storage capacity
  – Network connectivity
  – Geo location
• Involves:
  – Resource availability
  – Application and performance profiling
  – Software service requirements
Resource Provisioning
• Resource provisioning for large # of resources
  – Contact several providers
  – Get heterogeneous resources
    • Performance profiling becomes difficult
    • Efficient use of resources becomes difficult
• Few users can exploit heterogeneity
  – Uniformity makes application development and deployment easier
IAAS Cloud
• Convert a manual large-scale resource provisioning and programming problem into elastic utility (cloud computing)
  – Self deployment model
• Current public cloud IAAS offerings are proprietary
  – Do not allow
    • Experimentation
    • Instrumentation
    • Deployment of a private cloud
• Private cloud software systems allow users to customize, extend, and experiment with management infrastructure
IAAS
• To offer cloud computing services running on standard hardware.
  – Software defined datacenter as a service
• Deploy a private cloud
• Offer an IAAS public cloud service
• Instrument to answer open questions:
  1. What is the right distributed architecture for a cloud computing system?
  2. What resource characteristics must VM schedulers consider to make the most efficient use of the resources?
Cloud Interfaces & Abstractions
• Cloud offers a spectrum
  – From IaaS
    • Dynamically provision VMs, Storage, networking
  – To SaaS
    • Flexible access to hosted services
• All resources
  – Should be well defined
  – Provide reasonably deterministic performance
  – Can be allocated and de-allocated on demand
  – Should provide some level of security guarantees through isolation
Cloud Objectives and Design Goals
• Rapidly scale up and down as load fluctuates
• Support a large number of users requiring resources on demand
• Provide stable access to provided resources over the Internet
• Optimize use of capital computing infrastructure
  – Avoid having partially idle computers when they could be doing something useful instead
Cloud Users and Services
[Figure: User; Cloud Service; Cloud Service Configuration Management; Health/Status Monitoring]
Interfaces
• Application development/deployment user
  – Sign up, credentialing, query system
  – Create, modify, interrogate data
  – Provision, submit job
• Administrator
  – Manage user accounts, inspect availability
  – Monitor and manage resources
Services Needed for IAAS?
[Figure: IAAS services: Compute, Networking, Dashboard, Storage, OS Images, Orchestration, Identity, Telemetry]
OpenStack – 1
• Controls pools of compute, storage, and networking resources in a datacenter.
• Managed through a dashboard that gives
  – administrators control
  – users ability to provision resources through a web interface.
https://www.openstack.org/software/
OpenStack – 2
• Joint effort by Rackspace and NASA in 2010
• Community, OpenStack Foundation, defines official OpenStack components
• Open philosophy
  – Open, no trade secrets
  – Companies are contributing to open source
  – Can be integrated with proprietary components
• Release cycle is 6 months
OpenStack – 3
• Open source cloud computing framework that uses computing and storage infrastructure to provide a platform to offer cloud services on standard hardware.
  – Offer a common open-source framework and a community will form
    • As of 2018, 82k members, 667 companies have joined
  – Offers compatibility with Amazon Web Services
    • Such as EC2, S3, EBS, …
• Provides management infrastructure for existing underlying technologies
  – E.g., manages usual QEMU/KVM or Xen hypervisor, LXC, Docker, Linux bridge and VXLAN networking, etc.
OpenStack – 4
• OpenStack consists of several independent parts, named the OpenStack services.
• All services authenticate through a common Identity service.
• Individual services interact with each other through public APIs, except where privileged administrator commands are necessary.
• Provides a modular architecture to reuse existing infrastructure to manage various types of resources (e.g., virtual machine, container, “bare metal” physical machine)
Independent Services
[Figure: independent services connected through public APIs: Compute, Networking, Dashboard, Storage, OS Images, Orchestration, Identity, Telemetry]
OpenStack Services
Service                              Codename
OpenStack Compute                    Nova
OpenStack Object Storage             Swift
OpenStack Block Storage              Cinder
OpenStack Networking                 Neutron
OpenStack Dashboard                  Horizon
OpenStack Identity                   Keystone
OpenStack Image service              Glance
OpenStack Telemetry service          Ceilometer
OpenStack Orchestration service      Heat
OpenStack Database service           Trove
OpenStack Data processing service    Sahara
Conceptual Architecture
https://ilearnstack.files.wordpress.com/2013/04/openstack-conceptual-arch.jpg
Conceptual Architecture
[Figure legend, service (codename): Compute (Nova), Object Storage (Swift), Block Storage (Cinder), Networking (Neutron), Dashboard (Horizon), Identity (Keystone), Image service (Glance), Telemetry service (Ceilometer), Orchestration service (Heat), Database service (Trove), Data processing service (Sahara)]
http://docs.openstack.org/admin-guide-cloud/common/get_started_conceptual_architecture.html
Access, Communication and APIs
• All OpenStack services have at least one API process
  – Listens for API requests, preprocesses them and passes them on to other parts of the service.
• For communication between the processes of one service
  – An advanced message queuing protocol (AMQP) message broker is used.
• Users can access OpenStack via
  – the web-based user interface implemented by the dashboard
  – command-line clients
  – issuing API requests through tools, browser plug-ins or curl
• For applications, several SDKs are available.
• All access methods issue REST API calls to the various OpenStack services.
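Since every access method ends in a REST call, it can help to see what one looks like. The sketch below builds, but does not send, a password-authentication request for the Identity v3 API (`POST /v3/auth/tokens`). The host name, user, password, and project are placeholders, and the helper function name is hypothetical.

```python
import json
import urllib.request


def build_token_request(auth_url, username, password, project, domain="Default"):
    """Build a Keystone v3 password-auth request scoped to one project."""
    body = {
        "auth": {
            "identity": {
                "methods": ["password"],
                "password": {
                    "user": {
                        "name": username,
                        "domain": {"name": domain},
                        "password": password,
                    }
                },
            },
            "scope": {"project": {"name": project, "domain": {"name": domain}}},
        }
    }
    return urllib.request.Request(
        url=auth_url.rstrip("/") + "/auth/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and credentials; nothing is sent over the network here.
req = build_token_request("http://keystone.example.org:5000/v3", "alice", "secret", "demo")
# Sending it would be urllib.request.urlopen(req); on success the token is
# returned in the X-Subject-Token response header.
```

The dashboard, the command-line clients, and the SDKs all issue requests of this shape and then pass the resulting token to the other services.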
https://ilearnstack.files.wordpress.com/2013/04/openstack-arch-grizzly-v1-logical.jpg
[Figure: logical architecture: Dashboard, Compute Service, Block Storage, Object Store, Network Service, Image Service, Identity Service]
http://docs.openstack.org/ops-guide/_images/osog_0001.png
Logical Architecture
http://docs.openstack.org/admin-guide-cloud/common/get_started_logical_architecture.html
Keystone – OpenStack Identity
• Provides authentication and authorization for other OpenStack services and users
• Provides a catalog of endpoints for all OpenStack services (service discovery)
• Central to most OpenStack operations, hence the name
(openstack) endpoint list --service identity -c "Service Name" -c "Service Type" -c "Interface" -c "URL"
+--------------+--------------+-----------+------------------------------------------------+
| Service Name | Service Type | Interface | URL                                            |
+--------------+--------------+-----------+------------------------------------------------+
| keystone     | identity     | admin     | http://kiska.pdl.local.cmu.edu:35357/v3        |
| keystone     | identity     | internal  | http://kiska-control.pdl.local.cmu.edu:5000/v3 |
| keystone     | identity     | public    | http://kiska.pdl.local.cmu.edu:5000/v3         |
+--------------+--------------+-----------+------------------------------------------------+
(openstack) endpoint list --service nova -c "Service Name" -c "Service Type" -c "Interface" -c "URL"
+--------------+--------------+-----------+----------------------------------------------------------------+
| Service Name | Service Type | Interface | URL                                                            |
+--------------+--------------+-----------+----------------------------------------------------------------+
| nova         | compute      | internal  | http://kiska-control.pdl.local.cmu.edu:8774/v2.1/%(tenant_id)s |
| nova         | compute      | admin     | http://kiska.pdl.local.cmu.edu:8774/v2.1/%(tenant_id)s         |
| nova         | compute      | public    | http://kiska.pdl.local.cmu.edu:8774/v2.1/%(tenant_id)s         |
+--------------+--------------+-----------+----------------------------------------------------------------+
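The `%(tenant_id)s` placeholder in the nova URLs uses the same syntax as Python's `%` string formatting with a mapping, so service discovery can be sketched as a catalog lookup followed by a substitution. The in-memory `CATALOG` structure and the `endpoint_for` helper below are hypothetical; only the URLs are taken from the listing above.

```python
# Hypothetical in-memory copy of two catalog rows from the listing above.
CATALOG = [
    {"service": "nova", "type": "compute", "interface": "public",
     "url": "http://kiska.pdl.local.cmu.edu:8774/v2.1/%(tenant_id)s"},
    {"service": "keystone", "type": "identity", "interface": "public",
     "url": "http://kiska.pdl.local.cmu.edu:5000/v3"},
]


def endpoint_for(service_type, interface, tenant_id):
    """Find the endpoint for a service type and fill in the tenant placeholder."""
    for entry in CATALOG:
        if entry["type"] == service_type and entry["interface"] == interface:
            # URLs without a placeholder are returned unchanged.
            return entry["url"] % {"tenant_id": tenant_id}
    raise LookupError(f"no {interface} endpoint for service type {service_type!r}")
```

A client therefore only needs the Identity endpoint up front; every other service URL can be discovered from the catalog.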
Courtesy of Chad Dougherty
Nova – OpenStack Compute
• Use nova to host and manage cloud computing systems.
• Nova is a major part of an Infrastructure-as-a-Service (IaaS) system. Main modules are implemented in Python (as are most components).
• OpenStack Compute, nova, interacts with
  – OpenStack Identity for authentication;
  – OpenStack Image service for disk and server images; and
  – OpenStack dashboard for the user and administrative interface.
• OpenStack Compute can scale horizontally on standard hardware, and download images to launch instances.
Nova – API
• nova-api service
  – Accepts and responds to end user compute API calls. The service supports the OpenStack Compute API, the Amazon EC2 API, and a special Admin API for privileged users to perform administrative actions. It enforces some policies and initiates most orchestration activities, such as running an instance.
• nova-compute service
  – A worker daemon that creates and terminates virtual machine instances through hypervisor APIs. For example:
    • XenAPI for XenServer/XCP
    • libvirt for KVM or QEMU
    • VMware API for VMware
  – Processing is fairly complex. Basically, the daemon accepts actions from the queue and performs a series of system commands such as launching a KVM instance and updating its state in the database.
• nova-scheduler service
  – Takes a virtual machine instance request from the queue and determines on which compute server host it runs.
http://docs.openstack.org/admin-guide-cloud/common/get_started_compute.html
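One way to picture the host-selection step of nova-scheduler is as a filter stage followed by a weighing stage. The sketch below is an assumption-laden illustration, not nova's code: the host dictionary layout, the two resource fields, and the "most free RAM" weighing rule are all chosen for the example.

```python
def schedule(request, hosts):
    """Pick a host for one VM request, or return None if nothing fits.

    request: {"vcpus": int, "ram_mb": int}
    hosts:   {host_name: {"vcpus": free_vcpus, "ram_mb": free_ram_mb}}
    """
    # Filter: keep only hosts with enough free resources.
    candidates = {
        name: free for name, free in hosts.items()
        if free["vcpus"] >= request["vcpus"] and free["ram_mb"] >= request["ram_mb"]
    }
    if not candidates:
        return None
    # Weigh: prefer the host with the most free RAM, which spreads load.
    return max(candidates, key=lambda name: candidates[name]["ram_mb"])
```

Preferring the emptiest host spreads instances out; preferring the fullest host that still fits would instead pack them, echoing the capacity goal of the provisioner discussed earlier.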
Ceilometer – OpenStack Telemetry
• Performs the following functions:
  – Polls metering data related to OpenStack services.
  – Collects event and metering data by monitoring notifications sent from services.
  – Publishes collected data to various targets including data stores and message queues.
  – Creates alarms when collected data breaks defined rules.
Ceilometer – API
• ceilometer-agent-compute - A compute agent
  – Runs on each compute node and polls for resource utilization statistics.
• ceilometer-agent-central - A central agent
  – Runs on a central management server to poll for resource utilization statistics for resources not tied to instances or compute nodes. Multiple agents can be started to scale service horizontally.
• ceilometer-agent-notification - A notification agent
  – Runs on a central management server(s) and consumes messages from the message queue(s) to build event and metering data.
• ceilometer-collector - A collector
  – Runs on central management server(s) and dispatches collected telemetry data to a data store or external consumer without modification.
• ceilometer-alarm-evaluator - An alarm evaluator
  – Runs on one or more central management servers to determine when alarms fire due to the associated statistic trend crossing a threshold over a sliding time window.
• ceilometer-alarm-notifier - An alarm notifier
  – Runs on one or more central management servers to allow alarms to be set based on the threshold evaluation for a collection of samples.
• ceilometer-api - An API server
  – Runs on one or more central management servers to provide data access from the data store.
http://docs.openstack.org/admin-guide-cloud/common/get_started_telemetry.html
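The alarm evaluator's rule, a statistic crossing a threshold over a sliding window, can be sketched in a few lines. This is an illustration only and not Ceilometer code: the class name, the use of the mean as the statistic, and counting the window in samples rather than in time are assumptions.

```python
from collections import deque


class ThresholdAlarm:
    """Fire when the mean of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # old samples fall out automatically

    def observe(self, value):
        """Record one sample and return True if the alarm should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        mean = sum(self.samples) / len(self.samples)
        return window_full and mean > self.threshold
```

Evaluating a window instead of single samples keeps one short spike from firing the alarm, at the cost of reacting a few samples later.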
Storage Concepts
On-instance / ephemeral
• Runs operating systems and provides scratch space
• Persists until VM is terminated
• Access associated with a VM
• Implemented as a filesystem underlying OpenStack Compute
• Encryption is available
• Administrator configures size setting, based on flavors
• Example: 10 GB first disk, 30 GB/core second disk
Block storage (cinder)
• Used for adding additional persistent storage to a virtual machine (VM)
• Persists until deleted
• Access associated with a VM
• Mounted via OpenStack Block Storage controlled protocol (for example, NFS or iSCSI)
• Encryption is available
• Sizings based on need
• Example: 1 TB “extra hard drive”
Object Storage (swift)
• Used for storing virtual machine images and data
• Persists until deleted
• Available from anywhere
• REST API
• Easily scalable for future growth
• Example: 10s of TBs of data set storage
http://docs.openstack.org/admin-guide-cloud/common/get_started_storage_concepts.html
Horizon – OpenStack Dashboard
• A modular Django web application that provides a graphical interface to OpenStack services.
• Not a necessary component to deploy OpenStack.
OpenStack and AWS
Service            OpenStack     AWS
Object Storage     Swift         S3
Block Storage      Cinder        Elastic Block Store (EBS)
Compute            Nova          EC2
Telemetry          Ceilometer    CloudWatch
Data Processing    Sahara        EMR
OpenStack, Web Application
https://www.openstack.org/assets/software/mitaka/OpenStack-WorkloadRefArchWebApps-v7.pdf
Use Case, Web Application
https://www.openstack.org/software/sample-configs#web-applications
Web Tier: render content for browser.
App Tier: process content and business logic.
DB Tier: store data persistently.
Use Case, Big Data Analytics
https://www.openstack.org/software/sample-configs#big-data
Use Case, eCommerce
https://www.openstack.org/software/sample-configs#ecommerce
OpenStack – Example Architecture
OpenStack, evolving…
Next Time• Monday, 1/28/2019:– Encapsulation