
Cluster Computing Overview CS241 Winter 01 © 1999-2001 Armando Fox fox@cs.stanford.edu



Today’s Outline

Clustering: the Holy Grail
The Case For NOW
Clustering and Internet Services
Meeting the Cluster Challenges
Cluster case studies: GLUnix, SNS/TACC, DDS?


Cluster Prehistory: Tandem NonStop

Early (1974) foray into transparent fault tolerance through redundancy
Mirror everything (CPU, storage, power supplies…), can tolerate any single fault (later: processor duplexing)
“Hot standby” process pair approach
What’s the difference between high availability and fault tolerance?

Noteworthy:
“Shared nothing”--why?
Performance and efficiency costs?
Later evolved into Tandem Himalaya, which used clustering for both higher performance and higher availability


Pre-NOW Clustering in the 90’s

IBM Parallel Sysplex and DEC OpenVMS
Targeted at conservative (read: mainframe) customers
Shared disks allowed under both (why?)
All devices have cluster-wide names (shared everything?)
1500 installations of Sysplex, 25,000 of OpenVMS Cluster

Programming the clusters
All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
OpenVMS: cluster support exists even in the single-node OS!
An advantage of locking into a proprietary interface


Networks of Workstations: Holy Grail

Use clusters of workstations instead of a supercomputer.

The case for NOW:
Difficult for custom designs to track technology trends (e.g. uniprocessor performance increases at 50%/yr, but design cycles are 2-4 yrs)
No economy of scale in 100s => +$
Software incompatibility (OS & apps) => +$$$$
“Scale makes availability affordable” (Pfister)
“Systems of systems” can aggressively use off-the-shelf hardware and OS software

New challenges (“the case against NOW”):
Performance and bug-tracking vs. a dedicated system
The underlying system is changing underneath you
The underlying system is poorly documented


Clusters: “Enhanced Standard Litany”

Hardware redundancy
Aggregate capacity
Incremental scalability
Absolute scalability
Price/performance sweet spot

Software engineering
Partial failure management
Incremental scalability
System administration
Heterogeneity


Clustering and Internet Services

Aggregate capacity
TB of disk storage, THz of compute power (if we can harness it in parallel!)

Redundancy
Partial failure behavior: only small fractional degradation from the loss of one node
Availability: the industry average across “large” sites during the 1998 holiday season was 97.2% (source: CyberAtlas)
Compare: mission-critical systems have “four nines” (99.99%)
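To put those two numbers in perspective, here is a quick back-of-the-envelope conversion into downtime per year (my own arithmetic, not from the slides):

```python
# Rough downtime-per-year implied by each availability figure.
HOURS_PER_YEAR = 365 * 24

for name, availability in [("1998 industry average", 0.972),
                           ("'four nines'", 0.9999)]:
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{name}: {availability:.2%} up "
          f"=> about {downtime_hours:.1f} hours of downtime per year")

# 97.2%  => ~245 hours (roughly ten days) of downtime per year
# 99.99% => ~0.9 hours (under one hour) of downtime per year
```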


Clustering and Internet Workloads

Internet vs. “traditional” workloads
e.g. database workloads (TPC benchmarks)
e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)

Some characteristic differences
Read mostly
Quality of service (best-effort vs. guarantees)
Task granularity
“Embarrassingly parallel”…why? HTTP is stateless with short-lived requests
The Web’s architecture has already forced app designers to work around this! (not obvious in 1990)


Meeting the Cluster Challenges

Software & programming models
Partial failure and application semantics
System administration

Two case studies to contrast programming models:
GLUnix goal: support “all” traditional Unix apps, providing a single system image
SNS/TACC goal: simple programming model for Internet services (caching, transformation, etc.), with good robustness and easy administration


Software Challenges

What is the programming model for clusters?
Explicit message passing (e.g. Active Messages)
RPC (but remember the problems that make RPC hard)
Shared memory/network RAM (e.g. Yahoo! directory)
Traditional OOP with object migration (“network transparency”): not relevant for Internet workloads?

The programming model should support decent failure semantics and exploit the inherent modularity of clusters (see the sketch below)
Traditional uniprocessor programming idioms/models don’t seem to scale up to clusters
Question: Is there a “natural to use” cluster model that scales down to uniprocessors, at least for Internet-like workloads?
Later in the quarter we’ll take a shot at this
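As a concrete illustration of why failure semantics belong in the programming model, here is a minimal, hypothetical sketch (not GLUnix or SNS code) of an RPC-style call in which the possibility of a dead or unreachable node is surfaced to the caller instead of being hidden behind location transparency; the server side and the request format are assumptions:

```python
import pickle
import socket

class NodeUnavailable(Exception):
    """Raised instead of hanging forever when a cluster node is unreachable."""

def cluster_call(host, port, request, timeout=2.0):
    # An RPC wrapper that makes partial failure explicit: the caller must
    # decide whether to retry, fail over to another node, or degrade.
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(pickle.dumps(request))
            sock.shutdown(socket.SHUT_WR)
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
            return pickle.loads(data)
    except OSError as err:          # covers timeouts, resets, refused connections
        raise NodeUnavailable(f"{host}:{port} did not answer") from err
```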


Partial Failure Management

What does partial failure mean for…
a transactional database?
a read-only database striped across cluster nodes?
a compute-intensive shared service?

What are appropriate “partial failure abstractions”?
Incomplete/imprecise results? (see the sketch below)
Longer latency?

What current programming idioms make partial failure hard?
Hint: remember the original RPC papers?
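One way to make “incomplete/imprecise results” concrete: for the read-only database striped across cluster nodes, a front end can query every stripe in parallel, return whatever arrives before a deadline, and report how complete the answer is. A minimal sketch under assumed interfaces (the `stripe_clients` objects and their `search` method are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def partial_query(stripe_clients, query, deadline_s=0.5):
    """Query every stripe; return (results, fraction of stripes that answered)."""
    pool = ThreadPoolExecutor(max_workers=len(stripe_clients))
    futures = [pool.submit(client.search, query) for client in stripe_clients]
    done, _ = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False, cancel_futures=True)   # abandon slow or dead stripes

    answered = [f for f in done if f.exception() is None]
    results = [row for f in answered for row in f.result()]
    coverage = len(answered) / len(futures)          # e.g. 0.8 => 80% of the data
    return results, coverage
```

The caller gets a degraded answer plus an honest coverage figure rather than an all-or-nothing error, which is the spirit of the harvest vs. yield discussion mentioned in the summary slide.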


System Administration on a Cluster

Thanks to Eric Anderson (1998) for some of this material.

Total cost of ownership (TCO) is way high for clusters due to administration costs

Previous solutions:
Pay someone to watch
Ignore, or wait for someone to complain
“Shell Scripts From Hell” (not general => vast repeated work)

Need an extensible and scalable way to automate the gathering, analysis, and presentation of data


System Administration, cont’d.

Extensible Scalable Monitoring for Clusters of Computers (Anderson & Patterson, UC Berkeley)

Relational tables allow properties & queries of interest to evolve as the cluster evolves (see the sketch below)
Extensive visualization support allows humans to make sense of masses of data
Multiple levels of caching decouple data collection from aggregation
Data updates can be “pulled” on demand or triggered by push
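A toy version of the relational idea (my own illustration, not Anderson & Patterson’s actual schema): collected metrics land in an ordinary table, so new properties and new queries can be added without touching the monitoring framework itself.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE node_stats (
                node TEXT, metric TEXT, value REAL, collected_at REAL)""")

# Collection side: daemons just append (node, metric, value, timestamp) rows.
samples = [("now-17", "load_1min", 3.2, 1000.0),
           ("now-17", "disk_free_gb", 1.4, 1000.0),
           ("now-42", "load_1min", 0.1, 1000.5)]
db.executemany("INSERT INTO node_stats VALUES (?, ?, ?, ?)", samples)

# Analysis side: queries of interest evolve independently,
# e.g. "which nodes look overloaded right now?"
rows = db.execute("""SELECT node, value FROM node_stats
                     WHERE metric = 'load_1min' AND value > 2.0""").fetchall()
print(rows)   # [('now-17', 3.2)]
```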


Visualizing Data: Example

Display aggregates of various interesting machine properties on the NOW’s
Note use of aggregation, color


Case Study: The Berkeley NOW

History and pictures of an early research cluster:
NOW-0: four HP-735’s
NOW-1: 32 headless Sparc-10’s and Sparc-20’s
NOW-2: 100 UltraSparc 1’s, Myrinet interconnect
inktomi.berkeley.edu: four Sparc-10’s
www.hotbot.com: 160 Ultra’s, 200 CPU’s total
NOW-3: eight 4-way SMP’s

Myrinet interconnection
In addition to commodity switched Ethernet
Originally Sparc SBus, now available on PCI bus


The Adventures of NOW: Applications

AlphaSort: 8.41 GB in one minute, 95 UltraSparcs
(runner-up: Ordinal Systems nSort on SGI Origin, 5 GB)
pre-1997 record: 1.6 GB on an SGI Challenge

40-bit DES key crack in 3.5 hours
“NOW+”: headless and some headed machines

inktomi.berkeley.edu (now inktomi.com)
now the fastest search engine, largest aggregate capacity

TranSend proxy & Top Gun Wingman Pilot browser
~15,000 users, 3-10 machines


NOW: GLUnix

Original goals:
High availability through redundancy
Load balancing, self-management
Binary compatibility
Both batch and parallel-job support

I.e., single system image for NOW users
Cluster abstractions == Unix abstractions
This is both good and bad…what’s missing compared to early 90’s proprietary cluster systems?

For portability and rapid development, build on top of off-the-shelf OS (Solaris)


GLUnix Architecture

Master collects load, status, etc. info from daemons (see the sketch below)
Repository of cluster state, centralized resource allocation
Pros/cons of this approach?

Glib app library talks to the GLUnix master as an app proxy
Signal catching, process mgmt, I/O redirection, etc.
Death of a daemon is treated as a SIGKILL by the master

[Figure: one GLUnix master per cluster, with a glud daemon on each NOW node]
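To make the master/daemon split concrete, here is a deliberately simplified sketch of the reporting path (my own illustration; the real GLUnix protocol, host names, and port are not on the slide and are invented here): each per-node daemon periodically pushes its load to the single master, which keeps a table it can use for placement decisions.

```python
import json
import os
import socket
import time

MASTER_ADDR = ("glunix-master", 9999)   # hypothetical host and port

def daemon_report_loop(node_name, period_s=5):
    """Runs on every node (the 'glud' role): push load reports over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        report = {"node": node_name,
                  "load": os.getloadavg()[0],
                  "ts": time.time()}
        sock.sendto(json.dumps(report).encode(), MASTER_ADDR)
        time.sleep(period_s)

def master_loop(port=9999):
    """Runs once per cluster: collect reports into a cluster-wide load table."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    load_table = {}
    while True:
        data, _ = sock.recvfrom(4096)
        report = json.loads(data)
        load_table[report["node"]] = report
        # A stale entry here is what the slide calls "death of daemon treated
        # as SIGKILL": the master must notice silence, not wait for an error.
```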


GLUnix Retrospective

Trends that changed the assumptions:
SMP’s have replaced MPP’s, and are tougher to compete with for MPP workloads
Kernels have become extensible

Final features vs. initial goals:
Tools: glurun, glumake (2nd most popular use of NOW!), glups/glukill, glustat, glureserve
Remote execution--but not total transparency
Load balancing/distribution--but not transparent migration/failover
Redundancy for high availability--but not for the “GLUnix master” node

Philosophy: Did GLUnix ask the right question (for our purposes)?


TACC/SNS

Specialized cluster runtime to host Web-like workloads
TACC: transformation, aggregation, caching and customization--elements of an Internet service
Build apps from composable modules, Unix-pipeline-style (see the sketch below)

Goal: complete separation of *ility concerns from application logic
Legacy code encapsulation, multiple language support
Insulate programmers from nasty engineering
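The “composable modules, Unix-pipeline-style” idea can be pictured as function composition over request data. A minimal sketch (mine, not the actual TACC API; the worker names and their behavior are invented stand-ins):

```python
def compose(*workers):
    """Chain TACC-style workers: the output of one is the input of the next."""
    def pipeline(data, **user_prefs):
        for worker in workers:
            data = worker(data, **user_prefs)
        return data
    return pipeline

# Hypothetical workers: each knows only its own transformation, not caching,
# load balancing, or failure handling -- those are the runtime's job
# (the "*ility" concerns the slide wants separated from application logic).
def fetch(url, **prefs):         return b"<html>...</html>"       # stand-in fetch
def strip_images(page, **prefs): return page.replace(b"<img", b"<!--img")
def summarize(page, **prefs):    return page[: prefs.get("max_bytes", 1024)]

handle_request = compose(fetch, strip_images, summarize)
result = handle_request("http://example.com/", max_bytes=512)
```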


TACC Examples

Simple search engine
Query crawler’s DB
Cache recent searches
Customize UI/presentation

Simple transformation proxy (see the sketch below)
On-the-fly lossy compression of inline images (GIF, JPG, etc.)
Cache original & transformed
User specifies aggressiveness, “refinement” UI, etc.

[Figure: dataflow diagrams for the two examples, built from T (transformation), A (aggregation), $ (cache), and C components, a DB, and html output]
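For the transformation-proxy example, the core worker is just lossy recompression keyed by the user’s aggressiveness setting, with a cache keeping the transformed bytes around. A hedged sketch using Pillow (my illustration; this is not TranSend’s actual distiller code, and the unbounded dictionary cache is a simplification):

```python
from io import BytesIO

from PIL import Image   # Pillow

cache = {}   # (url, quality) -> transformed bytes; a real cache would be bounded

def distill_image(url, original_bytes, quality=30):
    """Lossy-recompress an inline image; lower quality = more aggressive."""
    key = (url, quality)
    if key not in cache:
        img = Image.open(BytesIO(original_bytes)).convert("RGB")
        out = BytesIO()
        img.save(out, format="JPEG", quality=quality)   # GIF/JPG/PNG in, JPEG out
        cache[key] = out.getvalue()
    return cache[key]
```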


Cluster-Based TACC Server

Component replication for scaling and availability
High-bandwidth, low-latency interconnect
Incremental scaling: commodity PC’s

[Figure: front ends, caches, workers, a user profile database, and a load balancing & fault tolerance module joined by the interconnect, plus an administration interface (GUI)]


“Starfish” Availability: LB Death

FE detects LB death via broken pipe/timeout and restarts the LB (see the sketch below)
The new LB announces itself (multicast), is contacted by workers, and gradually rebuilds its load tables
FE’s operate using cached LB info during the failure
If a partition heals, extra LB’s commit suicide

[Figure: animation sequence over the cluster diagram — the LB/FT module dies, a replacement LB/FT starts, and the extra one exits after the partition heals]

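A sketch of the front end’s side of this recovery (hypothetical code; the real SNS implementation, message format, and binary name are not on the slides): the FE treats a broken pipe or timeout from the LB as “the LB is dead,” keeps serving from its cached load table, and spawns a replacement LB.

```python
import socket
import subprocess

cached_load_table = {}   # last load info received from the LB, kept at the FE

def ask_lb_for_worker(lb_addr, request_kind, timeout=1.0):
    try:
        with socket.create_connection(lb_addr, timeout=timeout) as s:
            s.sendall(request_kind.encode())
            return s.recv(256).decode()          # name of the chosen worker
    except OSError:                              # broken pipe, timeout, refused
        restart_lb()
        return pick_from_cache(request_kind)     # degrade, don't stop serving

def restart_lb():
    # Any FE may do this; duplicate LBs later "commit suicide" when the
    # partition heals, as the slide describes.
    subprocess.Popen(["./lb_ft_daemon"])         # hypothetical binary name

def pick_from_cache(request_kind):
    workers = cached_load_table.get(request_kind, [])
    return min(workers, key=lambda w: w["load"])["name"] if workers else None
```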



SNS Availability Mechanisms

Soft state everywhere
Multicast-based announce/listen to refresh the state (see the sketch below)
Idea stolen from multicast routing in the Internet!

Process peers watch each other
Because there is no hard state, “recovery” == “restart”
Because of the multicast level of indirection, we don’t need a location directory for resources

Load balancing, hot updates, migration are “easy”
Shoot down a worker, and it will recover
Upgrade == install new software, shoot down the old

Mostly graceful degradation
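The announce/listen pattern reduces to a periodic beacon plus a table with expiry. A minimal sketch (assumptions: the multicast group address, port, message format, and timeout are invented for illustration):

```python
import json
import socket
import time

GROUP, PORT, TTL_S = "239.1.1.1", 5007, 15   # invented multicast group & timeout

def announce(sock, component, load):
    """Each worker/FE/LB periodically multicasts a small beacon."""
    msg = json.dumps({"name": component, "load": load, "ts": time.time()})
    sock.sendto(msg.encode(), (GROUP, PORT))

def refresh(table, raw_msg):
    """Listener side: beacons refresh soft state; silence lets it expire."""
    beacon = json.loads(raw_msg)
    table[beacon["name"]] = beacon

def expire(table, now=None):
    now = now or time.time()
    for name in [n for n, b in table.items() if now - b["ts"] > TTL_S]:
        del table[name]   # "recovery == restart": no repair protocol, just forget
```

Because state is rebuilt from beacons, a restarted component simply starts announcing again and the rest of the system relearns about it, which is why shooting down a worker is also the upgrade mechanism.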


SNS Availability Mechanisms, cont’d.

Orthogonal mechanisms
Composition without interfaces
Example: Scalable Reliable Multicast (SRM) group state management with SNS
Eliminates O(n²) complexity of composing modules

The state space of failure mechanisms is easy to reason about
What’s the cost?

More on orthogonal mechanisms later


Administering SNS

Multicast means the monitor can run anywhere on the cluster
Extensible via self-describing data structures and mobile code in Tcl


Clusters Summary

Many approaches to clustering, software transparency, failure semantics
An end-to-end problem that is often application-specific
We’ll see this again at the application level in the harvest vs. yield discussion

Internet workloads are a particularly good match for clusters
What software support is needed to mate these two things?
What new abstractions do we want for writing failure-tolerant applications in light of these techniques?