KCCMG 2005 IMPACTPAGE 1 Fujitsu Computer Systems
A Framework for Grid and Utility Computing
Automatic Provisioning and Load Management for Server Farms and Blade Servers
Joe Bell, Bob Equitz
Fujitsu Computer Systems
KCCMG - Fall IMPACT 2005
TRADEMARKS
• SuperDome and HP Open View Automation Manager are trademarks of HP
• PRIMEPOWER, PRIMEQUEST and Adaptive Services Control Center are trademarks of
Fujitsu Ltd., Fujitsu Computer Systems and Fujitsu-Siemens Corp.
• eServer, pSeries, P5, zSeries, z990, Tivoli Provisioning Manager (TPM), and Tivoli
Intelligent Orchestrator (TIO), JES2, JES3, NJE, YES/MVS, S/360 and SYSPLEX are
Trademarks of IBM.
• SunFire E25k, N1 Grid System, N1 Grid Service Provisioning System (GSPS), N1
Provisioning Server Blades Edition, N1 Grid Engine, and N1 System Manager are
Trademarks of SUN.
• OpForce is a Trademark of Veritas.
• EMC (Legato) AutoStart is a Trademark of EMC (Legato).
• All automated provisioning and virtualization concepts and notes relating to such, in this
presentation, are based on Fujitsu Siemens Corporation documentation material.
What are we going to cover?
- Joe Bell -
• Define Grid and/or Utility Computing
• What’s wrong with today’s typical environment
• The factors pushing Grid and Utility designs
– Business drivers
– Scientific & engineering
• Why can’t we just make everything run faster?
– Speed of light
– Moore’s Law
– Parallelism
– Bottlenecks
What are we going to cover?
- Bob Equitz -
• Review today’s environment
• What are the new technologies we need for Utilities/Grids?
• Why virtualize, and what’s the general process?
– Remove boundaries
– Implement management
– Assign/provision services
• The Road Map to Autonomic Systems
• The process to maintain Autonomic Systems
– Why are they important to our goals?
– General configuration end-to-end
• Product review
• Summary
What is Grid or Utility Computing Anyway?
• Bell’s simple-minded approach: start with the basics
• Merriam-Webster:
– Grid: a framework, lattice, network, net, or web …
1 : GRATING
2 a (1) : a perforated or ridged metal plate used as a conductor in a storage battery (2) : an electrode consisting of a mesh or a spiral of fine wire in an electron tube (3) : a network of conductors for distribution of electric power; also : a network of radio or television stations
b : a network of uniformly spaced horizontal and perpendicular lines (as for locating points on a map); also : something resembling such a network
What is Grid or Utility Computing Anyway?
• Merriam-Webster:
– Utility: usefulness, service, convenience, function, …
1 : fitness for some purpose or worth to some end
2 : something useful or designed for use
3 a : PUBLIC UTILITY
b (1) : a service (as light, power, or water) provided by a public utility (2) : equipment or a piece of equipment to provide such service or a comparable service
4 : a program or routine designed to perform or facilitate especially routine operations (as copying files or editing text) on a computer
What is Grid or Utility Computing Anyway?
• Bell’s feeble mind:
– A Computing or IT Grid Utility:
• A network of heterogeneous computer server and storage platforms that provides the framework and infrastructure for useful, convenient, and functional computational services supporting a given set of business and/or scientific IT requirements.
– Or roll your own variation within the boundaries of the previous definitions.
Is this really new stuff?
• Shared Disk – Loosely Coupled Systems
• What was JES2/NJE? JES3?
• Home Grown Load Balancers?
• Resource Affinity Scheduling?
• Serially Reusable Resource Controls Across Multiple Systems?
• SYSPLEX
• YES/MVS
Today’s computing geography – Static and unshared islands
• Dedicated IT systems
• Inflexible use of resources
• Low resource utilization
• Hard to manage
• High TCO, low ROI
Business Imperative: Increase Utilization
Other Business Requirements & Issues
– Databases and files too large for the required nightly linear or sequential massaging and reorganization, so they get split across more isolated systems.
– Peak processing requirements that demand huge amounts of CPU, which then sits unused during off-peak periods.
– Outage costs that are prevented with duplicate idle hardware resources and software licenses (HA clustering).
– Transactions that must process too much data to meet SLAs: either more smarts are required in the application or middleware, or a very big CPU is needed for a relatively small gain.
– OSs and program products that don’t manage more than one instance very well and/or don’t share resources (all for one, one for all).
• All of the above and more press IT toward lower resource utilization, or toward additional resources that are isolated and underutilized: 180° away from the goal of higher utilization!
Business Objectives
[Figure: four charts. "Utilization before Grid Utility" and "Utilization after Grid Utility" (axes 0% to 60%); "Service before Grid Utility – typical bimodal" and "Service with Grid Utility" (axes 0% to 100%).]
better utilization of hundreds of pooled systems more achievable.
• Administering and maintaining workloads - Provisioning
Today’s Low Agility & Efficiency of IT
– “To prevent the data center from consuming the entire IT budget, increased manageability and utilization through standardization and automation are essential”
• Source: 2003 META Group – The Data Center of the Future
– 75% of all IT staff is absorbed by maintaining the existing IT
• Source: Andy Butler, Gartner Group, April 2004
– Utilization of UNIX/Windows servers is low (< 25% over 24 hours across all servers)
• Source: 2003 META Group – The Data Center of the Future
Scientific and Engineering Requirements
• The traditional scientific paradigm
– First do theory on paper
– Then perform experiments to confirm or deny the theory.
• The traditional engineering paradigm
– First do a design
– Then build a laboratory prototype.
• These paradigms are being replaced by numerical experiments and digital prototyping. Why?
– Real phenomena are too complicated to model on paper (e.g. climate prediction).
– Real experiments are
• too hard, too expensive, too slow, or too dangerous for a laboratory
• e.g. oil reservoir simulation, large wind tunnels, overall aircraft design, galactic evolution, whole factory or product life cycle design and optimization, weather prediction, nuclear fusion control, etc.
Why even use a grid parallel process?
• Scientific and engineering problems requiring the most computing power to simulate are commonly called “Grand Challenges” or “largest problems”
– For example, predicting the climate 50 years ahead is estimated to require computers computing at the rate of 1 TFLOP with a memory size of 1 TB
• 1 MFLOP = 10^6 floating point operations per second
• 1 GFLOP = 10^9 floating point operations per second
• 1 TFLOP = 10^12 floating point operations per second
A stake in the ground - Weather forecasting
• Weather prediction for one week requires 56 GFLOPS.
– Climate prediction for 50 years requires 4.8 TFLOPS.
• The actual grid resolution used in climate codes today is 4 degrees of latitude by 5 degrees of longitude, or about 450 km by 560 km.
– A near term goal is to improve this resolution to 2 degrees by 2.5 degrees, which is four times as much data.
• NASA has launched weather satellites expecting to collect 1 TB of data per day for a period of years
– Totaling > 6 PB (petabytes, 10^15 bytes each) of data over time.
– No existing system is large enough to store this data today.
– The Sequoia 2000 Global Change Research Project is concerned with building this database.
– http://appl.nasa.gov/pdf/61537main_eosdis_case_study_602904.pdf
• Some other sites: http://www.cio.noaa.gov/hpcc/ http://www.noaa.gov/
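The resolution and data-volume figures on this slide can be checked with a few lines of arithmetic. This is a rough sketch: the 40,075 km circumference and the decimal units (1 PB = 1000 TB) are assumptions added here, not figures from the slides, and the function names are illustrative only.

```python
EARTH_CIRCUMFERENCE_KM = 40_075  # assumed equatorial circumference, not from the slides


def km_per_degree() -> float:
    """Approximate length of one degree of arc along the equator."""
    return EARTH_CIRCUMFERENCE_KM / 360


def grid_cells(lat_step_deg: float, lon_step_deg: float) -> int:
    """Number of cells in a global latitude/longitude grid with the given spacing."""
    return round(180 / lat_step_deg) * round(360 / lon_step_deg)


def days_to_collect(total_tb: float, tb_per_day: float) -> float:
    """Days needed to accumulate total_tb at a constant daily rate."""
    return total_tb / tb_per_day


coarse = grid_cells(4, 5)    # 45 * 72 cells
fine = grid_cells(2, 2.5)    # 90 * 144 cells

print(f"4 x 5 degree cell is roughly {4 * km_per_degree():.0f} km by {5 * km_per_degree():.0f} km")
print(f"cells at 4 x 5 degrees: {coarse}, at 2 x 2.5 degrees: {fine}, ratio: {fine / coarse}")
print(f"days to reach 6 PB at 1 TB/day: {days_to_collect(6000, 1):.0f}")
```

Halving the spacing in both directions doubles the cell count along each axis, which is where the factor of four comes from. At 1 TB per day, 6 PB corresponds to several thousand days of collection, consistent with "a period of years".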
Why Parallelism is Essential
• The clock speed is increasing. Can’t we just push it all the way up to 1 THz?
– At 1 flop per cycle, 1 THz would give 1 TFLOPS
– No: the speed of light sets a limit upon the speed of a computer.
• Now assume a completely sequential computer with 1 TB of memory running at 1 TFLOP.
– If the data has to travel a distance d to get from the memory to the CPU, and it has to travel this distance 10^12 times per second at the speed of light c = 3×10^8 m/s, then d <= 3×10^8 / 10^12 m = 0.3 mm.
– So the computer theoretically has to fit into a 0.3 mm cube.
• Now consider the 1 TB memory.
– Memory is conventionally built as a planar grid of bits, in our case say a 10^6 × 10^6 grid of words.
– If this grid is 0.3 mm by 0.3 mm, then one word occupies about 3 Angstroms (Å) by 3 Angstroms (0.3×10^-3 m / 10^6 per side), or the size of a typical atom.
• Getting close to 3 Angstroms? 1 nm = 10 Å
– 45 nm (current Fujitsu leading edge chip etching) = 450 Å
– 450 Å / 3 Å => 150 atoms for the etching size
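The speed-of-light argument on this slide is a short chain of divisions, sketched below with the slide's own rounded values. The function and constant names are illustrative, and the model (one memory-to-CPU trip per operation, a square planar memory) is the slide's simplification, not a statement about real hardware.

```python
import math

C_M_PER_S = 3e8        # speed of light, rounded as on the slide
ANGSTROM_M = 1e-10     # 1 Angstrom in metres


def max_distance_m(trips_per_second: float) -> float:
    """Farthest a signal can travel per trip if it makes this many trips each second."""
    return C_M_PER_S / trips_per_second


def cell_side_m(grid_side_m: float, cells_per_side: float) -> float:
    """Side length of one cell when a square grid is split evenly."""
    return grid_side_m / cells_per_side


d = max_distance_m(1e12)                 # 1 TFLOP, one trip per operation
word = cell_side_m(d, 1e6)               # 10^6 x 10^6 words for 1 TB
words_per_feature = 45e-9 / word         # 45 nm feature size from the slide

print(f"max memory-to-CPU distance: {d * 1e3:.1f} mm")
print(f"side of one word: {word / ANGSTROM_M:.1f} Angstroms")
print(f"45 nm feature spans about {words_per_feature:.0f} such cells")
```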
How Small Can We Go?
From the beginning to the present: on the left an early computing machine built from mechanical gears, on the right a state-of-the-art IBM chip with 0.25 micron features. The production version will contain 200 million transistors.
http://www.qubit.org/library/intros/nano/nano.html
Nanocomputing with Quantum Effects
The transition from microtechnology to nanotechnology. The structure on the right is a single-electron transistor (SET) which was carved by the tip of a scanning tunneling microscope (STM). According to classical physics, there is no way that electrons can get from the 'source' to the 'drain', because of the two barrier walls on either side of the 'island'. But the structure is so small that quantum effects occur, and electrons can, under certain circumstances, tunnel through the barriers (but only one electron at a time can do this!). Thus the SET wouldn't work without quantum mechanics.
http://www.qubit.org/library/intros/nano/nano.html
Fast and Parallel
• As of Jan 1996, the fastest machine was an Intel Paragon with 6768 processors and a peak speed of 50 MFLOPS per processor, for an overall peak speed of 6768 × 50 MFLOPS = 338 GFLOPS.
– Doing Gaussian elimination, the machine got 281 GFLOPS on a 128600×128600 matrix; the whole problem takes 84 min.
• The Linpack Benchmark (component of SPECFP) sorts all machines by the speed with which they can solve systems of linear equations Ax=b, of various dimensions, using Gaussian elimination.
• In the Netlib repository there is a long list of computers, together with performance benchmark information.
• As of June 2005, the fastest machine on the TOP500 list (see Top500.org) is the IBM Blue Gene with a peak speed of 183 TFLOPS
• Trips, or the Tera-op Reliable Intelligently Adaptive Processing System. “Our goal is to exploit concurrency, …”
– Funded by the Defense Advanced Research Projects Agency under its Polymorphous Computing Architectures project. DARPA, which is contributing $15.4 million to Trips, is looking for a chip that is able to scale to 1 trillion sustained operations (tera-op) per second on many applications. http://www.computerworld.com/hardwaretopics/hardware/story/0,10801,104911,00.html?source=NLT_EMC&nid=104911
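The Paragon figures on this slide are mutually consistent, which a quick check shows. The sketch assumes the standard leading-order operation count for Gaussian elimination, 2n^3/3, which is not stated on the slide; the function names are illustrative.

```python
def gaussian_elimination_flops(n: int) -> float:
    """Leading-order floating point operation count for an n x n system (assumed 2n^3/3)."""
    return 2 * n**3 / 3


def peak_flops(processors: int, flops_per_processor: float) -> float:
    """Aggregate peak rate if every processor ran at its own peak."""
    return processors * flops_per_processor


peak = peak_flops(6768, 50e6)                        # slide: about 338 GFLOPS
seconds = gaussian_elimination_flops(128600) / 281e9  # at the measured 281 GFLOPS
minutes = seconds / 60
efficiency = 281e9 / peak

print(f"peak: {peak / 1e9:.0f} GFLOPS")
print(f"run time: {minutes:.0f} minutes")
print(f"fraction of peak achieved: {efficiency:.2f}")
```

The computed run time lands on the 84 minutes quoted on the slide, and the measured rate is a little over 80% of peak.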
Writing Fast Programs is Hard
• Where do the FLOPS go? Why does the speed depend so much on the problem size? The answer lies in understanding the memory hierarchy. All computers, even cheap ones, look something like this (since IBM S/360 – or 3rd generation):
Writing Fast Programs is Hard
• The memory at the bottom level of the hierarchy, disk, is large, slow and cheap
– Useful work, such as floating point operations, can only be done on the data at the top of the hierarchy.
– Transferring data among levels is slow, much slower than the rate at which we can do useful work on data in the registers.
• In fact, this data transfer is the bottleneck in almost all computation and numerical analyses
– More time is spent moving data in the hierarchy than doing useful work.
– These are the non-compute related tasks that significantly impact scalability of compute clusters
– Thus enhancing these systems provides future potential
Writing Fast Programs is Hard
• Good algorithmic designs require keeping active data near the top of the hierarchy for as long as possible, as well as minimizing movement between levels.
– For many problems, like Gaussian elimination, only if the problem is large enough is there enough work to do at the top of the hierarchy to mask the time spent transferring data between lower levels.
– Else, you’re no better than a few sequential processors…
• The more processors one has, the larger the problem has to be to mask this transfer time.
• These mechanisms are inherently inefficient
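One common way to keep active data near the top of the hierarchy is cache blocking (tiling), a technique the slides do not name; it is offered here only as an illustration of the idea. The sketch below multiplies matrices one small tile at a time so each tile is reused while it is still "hot". The function names and the `block` parameter are hypothetical, and in pure Python the interpreter overhead hides any cache benefit, so only the loop structure is the point.

```python
def matmul_naive(a, b):
    """Reference triple loop: streams whole rows and columns with little reuse."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def matmul_blocked(a, b, block=2):
    """Multiply a (n x m) by b (m x p) tile by tile, reusing each small tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                # Work on one block x block tile of each matrix at a time.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))
```

Both functions compute the same product; the blocked version only changes the order in which the data is touched.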
Writing Fast Programs is Hard
• Moore’s Law
– Speeds of basic microprocessors grow by approximately a factor of 2 every 18 months because:
– The number of transistors doubles every 18 months
– One of the reasons Moore's Law is true is that microprocessor manufacturers are adopting many of the tricks of parallel computing and accounting for memory hierarchies.
– Getting the peak speed from the processor is becoming increasingly difficult.
– Fact: there is no way around the issue today without radical new technology.
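As a quick illustration of what "a factor of 2 every 18 months" compounds to, here is a minimal sketch. The function name and the sample horizons are illustrative; only the 18-month doubling period comes from the slide.

```python
def moore_factor(years: float, doubling_months: float = 18.0) -> float:
    """Growth factor after the given number of years at a fixed doubling period."""
    return 2.0 ** (years * 12 / doubling_months)


for years in (1.5, 3, 15):
    print(f"after {years} years: x{moore_factor(years):.0f}")
```

Fifteen years is ten doublings, so roughly a thousandfold increase: large, but it arrives over a decade and a half, which is part of why the talk turns to parallelism instead of waiting.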
To COMPUTE or To COMMUNICATE?
• Which takes longer always depends upon
– The application in hand
– The speed of the processor-memory architecture
– The speed of the network
• For a given problem, any of the above can be a huge “bottleneck”, whether in Business or Scientific Computing
• The bottleneck can be reduced – maybe ..
– At least partially, by introducing large SMP-based entities as elements of the Grid/Utility, with massive interconnect backbones (e.g., HP SuperDome, Fujitsu PRIMEPOWER & PRIMEQUEST, eServer p5 Series, zSeries z990, Sun Fire E25K), to reconcile these mutually exclusive grid design constraints.
– Analyze the requirement for speed and availability versus costs: several large SMPs versus large clusters of 1U commodity servers, both arranged into a grid structure.
– Potential to produce a hybrid of both
– Good R&D Project!
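The compute-versus-communicate question can be framed with a very simple cost model: compute time is work divided by processor speed, and communication time is per-message latency plus data volume divided by bandwidth. The sketch below is a back-of-envelope model only; the class names, fields, and all numbers are hypothetical and do not describe any of the products named on the slide.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    flops: float         # floating point operations to perform
    bytes_moved: float   # data exchanged over the interconnect
    messages: int        # number of separate transfers


@dataclass
class Platform:
    flops_per_s: float
    bandwidth_bytes_per_s: float
    latency_s: float     # fixed cost per message


def compute_seconds(w: Workload, p: Platform) -> float:
    return w.flops / p.flops_per_s


def communicate_seconds(w: Workload, p: Platform) -> float:
    return w.messages * p.latency_s + w.bytes_moved / p.bandwidth_bytes_per_s


def bottleneck(w: Workload, p: Platform) -> str:
    """Name whichever side of the model takes longer."""
    if compute_seconds(w, p) >= communicate_seconds(w, p):
        return "compute"
    return "communicate"


# Hypothetical platform and workloads, for illustration only.
cluster = Platform(flops_per_s=1e10, bandwidth_bytes_per_s=1e8, latency_s=1e-4)
number_crunching = Workload(flops=1e12, bytes_moved=1e9, messages=1000)
data_shuffling = Workload(flops=1e9, bytes_moved=1e9, messages=1000)

print(bottleneck(number_crunching, cluster))
print(bottleneck(data_shuffling, cluster))
```

The same platform is compute-bound for one workload and communication-bound for the other, which is the slide's point that the answer depends on the application as much as on the hardware.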
Which one would you challenge?
James Montgomery Doohan (March 3, 1920 – July 20, 2005) was an Irish-Canadian character and voice actor best known for his portrayal of Scotty in the television and movie series Star Trek.
Pig in Mud
“Arguing with an engineer is like wrestling in mud with a pig: After a while you realize the pig likes it!!” --Mark Simmons, Sr. Consulting Engineer and Marketing Product Specialist, FCS
Today’s computing geography – Static and unshared islands
• Inefficient
• Over / Under provisioned
• Hard to manage
• Inflexible
Remember: this is where most of us are today…
Required Core Technologies
Virtualization
Separation of business applications and data from the need for dedicated technology
Automation
Automatic adjustment of platforms and infrastructure to changes in operation & environment
Integration
Low-cost, low-risk implementations & upgrades, re-usable technology, unified processes and services as validated product integration templates
Business Efficiency through Virtualization
• IT resources are shared, not isolated as in today’s “islands of computing” model
• Business priorities determine the allocation of IT resources
• Service levels are predictable and consistent, despite the unpredictable demands for IT services
[Figure labels: Application Services; Server Virtualization; Storage Virtualization]
Pooling and sharing of the overall resource
• Remove Server Boundaries
[Figure: two rows of Service A, Service B, Service C, Service D, Service E]
Pooling and sharing of the overall resource
• Remove Server Boundaries
• Consolidate Storage
[Figure: Service A, Service B, Service C, Service D, Service E]
Pooling and sharing of the overall resource
• Remove Server Boundaries
• Establish overall management
• Consolidate Storage
[Figure: Service A, Service B, Service C, Service D, Service E]
Pooling and sharing of the overall resource
• Remove Server Boundaries
• Establish overall management
• Assign Services
• Consolidate Storage
[Figure: Services A to E assigned across the pooled resource, with a bar chart of Load (0 to 60) for A, B, C, D, E]
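Why pooling raises utilization can be shown with a toy calculation: dedicated servers must each be sized for their own service's peak, while a shared pool only has to cover the peak of the combined load. The load figures below are invented for illustration (they are not read from the slide's chart), and the function names are hypothetical.

```python
# Hypothetical load per time slot (arbitrary units) for five services.
loads = {
    "A": [50, 10, 10, 10],
    "B": [10, 40, 10, 10],
    "C": [10, 10, 30, 10],
    "D": [10, 10, 10, 20],
    "E": [10, 10, 10, 10],
}


def dedicated_capacity(loads: dict) -> float:
    """Each service gets its own server sized for that service's peak."""
    return sum(max(series) for series in loads.values())


def pooled_capacity(loads: dict) -> float:
    """One shared pool sized for the peak of the combined load."""
    return max(sum(slot) for slot in zip(*loads.values()))


def mean_demand(loads: dict) -> float:
    """Average combined load per time slot."""
    slots = len(next(iter(loads.values())))
    return sum(sum(series) for series in loads.values()) / slots


demand = mean_demand(loads)
print(f"dedicated: capacity {dedicated_capacity(loads)}, utilization {demand / dedicated_capacity(loads):.0%}")
print(f"pooled:    capacity {pooled_capacity(loads)}, utilization {demand / pooled_capacity(loads):.0%}")
```

The gain comes entirely from the peaks not coinciding; if every service peaked in the same time slot, pooling would save nothing in this model.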
Automatic provisioning & load management for Applications
[Figure labels: Application QoS Metrics; Application instances; Workload Graph; Resource allocation]
QoS Monitoring & Management
[Figure: QoS Metric plotted against Time, with a High Water Mark and a Low Water Mark bounding the Target Metric Range]
• Measured QoS metric exceeds the specified maximum acceptable value
– Allocate more satellite nodes and deploy the needed application to meet the QoS target
• Measured QoS metric is below the specified minimal acceptable value (too many resources)
– Perform an orderly shutdown of some instances, thus reducing cost and freeing the resources for other work.
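The high and low water mark rule on this slide amounts to a small threshold controller. The sketch below is one possible reading, not the logic of any product named in this deck: it assumes a metric where higher is worse (for example response time), a step of one instance per decision, and hypothetical instance limits.

```python
def adjust_instances(metric: float, instances: int,
                     low_water: float, high_water: float,
                     min_instances: int = 1, max_instances: int = 16) -> int:
    """Return the new instance count for one control decision."""
    if metric > high_water and instances < max_instances:
        return instances + 1   # QoS too poor: allocate another node
    if metric < low_water and instances > min_instances:
        return instances - 1   # over-provisioned: shut one instance down
    return instances           # inside the target range, or at a limit


# Hypothetical response times in milliseconds, with a 100 to 200 ms target range.
instances = 4
for measured in (150, 250, 260, 180, 60, 50):
    instances = adjust_instances(measured, instances, low_water=100, high_water=200)
    print(f"metric={measured} ms -> {instances} instances")
```

Keeping a gap between the two marks is what stops the controller from adding and removing the same node on every measurement.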
Virtualization: On the way to Autonomic Systems
[Figure: road map plotted on FLEXIBILITY and COST axes, with three stages]
Resource management
• Standardization
• Consolidation
• Automation
Virtualization
• Dynamic provisioning
• Allocation policies
• Consistent QoS
Autonomic Systems
• Self configuring
• Self optimizing
• Self protecting
• Self healing
Managing the autonomic cycle
• Autonomic system monitoring & control
[Figure: the Autonomic Cycle (self configuring, self optimizing, self protecting, self healing) operating on Resource A and Resource B, each comprising HW, FW, OS, Middleware, Application, System. Stage labels: Interface for Monitoring, Monitoring, Event Generation, Event Handling, Measures, Execution, Interface for Control. Annotations: SNMP values, commands, system parameters, ...; event rules, thresholds, ...; “autonomic” rules, ...; scripts, policies; commands, SNMP set, ...]
Importance of Autonomic Functions
• Benefits include:
– 1) fast and unattended adaptation of IT infrastructure to changing business requirements;
– 2) automated monitoring and immediate reaction to changing workloads without operator intervention and with lower operating risks;
– 3) no changes to applications or operating systems are required; Linux, Windows, and other operating systems, and their applications, are managed transparently;
– 4) better response to changing demands;
– 5) easier accommodation of SLAs.
• Total effect: further reduction of wasted or over-utilized resources, fewer personnel monitoring and manually adjusting the systems, and a further reduction of TCO per unit of work accomplished.
Tie it all together: End-to-End Solution
• Automatic provisioning
• Efficient management of large environment
• Automatic workload management
• Automated policy-based management
• Centralized fully automated operating system and application deployment
• Single image administration
[Figure labels: PRIMERGY (e.g. BX600); Adaptive Service Manager; Monitoring; Adaptation (Restart, Remote Deploy); Actions; Policies; Inventory; OS Deployment Server; Spare; Terminal Server; Deploy Console; Clients; Client network; Control network; Storage network; Shared storage (NAS) with Images and Data Areas]
Products for Investigation
• Fujitsu Siemens Adaptive Services Control Center 1.1: Automatically provisions,
deploys, monitors, and allocates resources, controlling load, utilization, and
service level and quality metrics for each application service, per user
requirements.
• HP Open View Automation Manager is a data center automation solution that
extends the Open View Change and Configuration Management solution to
automatically re-provision resources in accordance with business priorities.
Automation Manager runs under Windows and supports Windows and Linux
servers.
• IBM Tivoli Provisioning Manager (TPM) combined with Tivoli Intelligent
Orchestrator (TIO) automates tasks in anticipation of, or in response to,
changing conditions. TIO manages pooled resources and prioritizes
allocations. TPM provisions resources. TIO monitors performance and decides
what actions to take in order to maintain committed application service levels.
Products for Investigation
• EMC (Legato) AutoStart is a cluster solution integrating EMC’s suite of
storage products with application availability. AutoStart supports
automated switching of servers, networks, and data.
• SUN N1 Grid System is “a collection of architectures, products, and
services...” Products available today: N1 Grid Service Provisioning
System (GSPS), N1 Provisioning Server Blades Edition, N1 Grid
Engine, and N1 System Manager. GSPS automates application provisioning on
Solaris, Linux, AIX, and Windows servers.
• Veritas OpForce is based on software acquired from Jareva Technologies
in 2003. VERITAS positions OpForce for server automation and
provisioning, and managing IT resource lifecycle. OpForce automates
tasks associated with controlling, provisioning, and updating
heterogeneous data center environments, including bare-metal discovery,
resource pooling, and application and OS software deployment.
Summary
Requirements for Utility/Grid are driven by
• Business TCO pressures
• Scientific/Engineering problem solving requirements
Full-featured Framework for Utility/Grid Computing
• High availability
• High scalability
• Disaster recovery
• Automatic provisioning
• Enterprise applications
• Multi-platform support: Solaris, Linux, Windows, VMware, ...
... And easy installation / operation with instrumentation
• Self managing, self scaling, self healing, self adapting, auto configuration updates