[email protected] foil 1 last update: 11/04/23 22:53
CERN
What is High Throughput Distributed Computing?
CERN Computing Summer School 2001
Santander
Les Robertson, CERN – IT Division
[email protected] foil 2 last update 11/04/23 22:53
CERN
Outline
High Performance Computing (HPC) and High Throughput Computing (HTC)
Parallel processing – why it is so difficult with HPC applications and so easy with HTC
Some models of distributed computing
HEP applications
Offline computing for LHC
Extending HTC to the Grid
[email protected] foil 3 last update 11/04/23 22:53
CERN
“Speeding Up” the Calculation?
Use the fastest processor available
-- but this gives only a small factor over modest (PC) processors
Use many processors, performing bits of the problem in parallel
-- and since quite fast processors are inexpensive, we can think of using very many processors in parallel
[email protected] foil 4 last update 11/04/23 22:53
CERN
High Performance – or – High Throughput?
The key questions are granularity & degree of parallelism – have you got one big problem or a bunch of little ones?
To what extent can the “problem” be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?
Granularity
  fine-grained parallelism – the independent bits are small, need to exchange information and synchronise often
  coarse-grained – the problem can be decomposed into large chunks that can be processed independently
Practical limits on the degree of parallelism – how many grains can be processed in parallel?
  degree of parallelism v. grain size
  grain size limited by the efficiency of the system at synchronising grains
[email protected] foil 5 last update 11/04/23 22:53
CERN
High Performance – v. – High Throughput?
fine-grained problems need a high performance system
  that enables rapid synchronisation between the bits that can be processed in parallel
  and runs the bits that are difficult to parallelise as fast as possible
coarse-grained problems can use a high throughput system
  that maximises the number of parts processed per minute
High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected
while High Performance Systems use a smaller number of more expensive processors, expensively interconnected
[email protected] foil 6 last update 11/04/23 22:53
CERN
High Performance – v. – High Throughput?
There is nothing fundamental here – it is just a question of financial trade-offs like:
  how much more expensive is a “fast” computer than a bunch of slower ones?
  how much is it worth to get the answer more quickly?
  how much investment is necessary to improve the degree of parallelisation of the algorithm?
But the target is moving –
  Since the cost chasm first opened between fast and slower computers 12-15 years ago, an enormous effort has gone into finding parallelism in “big” problems
  Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters, fabrics, (inter)national grids – demanding ever-greater degrees of parallelism
[email protected] foil 8 last update 11/04/23 22:53
CERN
A quick look at HPC problems
Classical high-performance applications
  numerical simulations of complex systems such as weather, climate, combustion, mechanical devices and structures, crash simulation, electronic circuits, manufacturing processes, chemical reactions
  image processing applications like medical scans, military sensors, earth observation, satellite reconnaissance, seismic prospecting
[email protected] foil 9 last update 11/04/23 22:53
CERN
Approaches to parallelism
Domain decomposition
Functional decomposition
graphics from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 10 last update 11/04/23 22:53
CERN
Of course – it’s not that simple
graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 11 last update 11/04/23 22:53
CERN
The design process
Data or functional decomposition – building an abstract task model
Building a model for communication between tasks – interaction patterns
Agglomeration – to fit the abstract model to the constraints of the target hardware
  interconnection topology
  speed, latency, overhead of communications
Mapping the tasks to the processors
  load balancing
  task scheduling
graphic from Designing and Building Parallel Programs (Online),by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 12 last update 11/04/23 22:53
CERN
Large scale parallelism – the need for standards
The “supercomputer” market is in trouble – diminishing number of suppliers; questionable future
Increasingly risky to design for specific tightly coupled architectures like SGI (Cray, Origin), NEC, Hitachi
Require a standard for communication between partitions/tasks that also works on loosely coupled systems (“massively parallel processors” – MPP – IBM SP, Compaq)
Paradigm is message passing rather than shared memory – tasks rather than threads
  Parallel Virtual Machine – PVM
  MPI – Message Passing Interface
[email protected] foil 13 last update 11/04/23 22:53
CERN
MPI – Message Passing Interface
industrial standard – http://www.mpi-forum.org
source code portability
widely available; efficient implementations
SPMD (Single Program Multiple Data) model
Point-to-point communication (send/receive/wait; blocking/non-blocking)
Collective operations (broadcast; scatter/gather; reduce)
Process groups, topologies
comprehensive and rich functionality
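As an illustration only, here is a minimal SPMD sketch of point-to-point and collective operations using the mpi4py binding; the binding, and Python itself, are assumptions – the slides do not prescribe a language.

```python
# Minimal SPMD sketch of MPI point-to-point and collective operations,
# using the mpi4py binding (an assumption; the slides do not prescribe one).
# Run with, for example:  mpiexec -n 4 python mpi_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # this process's id within the group
size = comm.Get_size()              # total number of processes

# Point-to-point: rank 0 sends a message to rank 1 (blocking send/recv)
if rank == 0 and size > 1:
    comm.send({"work_unit": 42}, dest=1, tag=0)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print("rank 1 received", msg)

# Collective: every rank contributes a partial result, the root gets the sum
partial = rank * rank
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of partial results:", total)
```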
[email protected] foil 14 last update 11/04/23 22:53
CERN
MPI – Collective operations
IBM Redbook - http://www.redbooks.ibm.com/redbooks/SG245380.html
Defining high level data functions allows highly efficient implementations, e.g. minimising data copies
[email protected] foil 15 last update 11/04/23 22:53
CERN
The limits of parallelism - Amdahl’s Law
s – time spent in a serial processor on the serial parts of the code
p – time spent in a serial processor on the parts that could be executed in parallel

If we have N processors:

    Speedup = (s + p) / (s + p/N)

Taking s as the fraction of the time spent in the sequential part of the program (s + p = 1):

    Speedup = 1 / (s + (1-s)/N)  →  1/s as N → ∞
Amdahl, G.M., Validity of the single-processor approach to achieving large scale computing capability, Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
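A small, hedged illustration of the formula above – compute the Amdahl speedup for a few values of s and N (the values are chosen arbitrarily):

```python
# Hedged illustration of the formula above: Amdahl speedup for a program
# with sequential fraction s run on N processors (values chosen arbitrarily).
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup = 1 / (s + (1 - s)/N); tends to 1/s as N grows."""
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.01, 0.05, 0.10):
    print(f"s = {s:.0%}: " + ", ".join(
        f"N={n}: {amdahl_speedup(s, n):.1f}x" for n in (10, 100, 1000)))
```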
[email protected] foil 16 last update 11/04/23 22:53
CERN
Amdahl’s Law - maximum speedup
[Chart: maximum Amdahl speedup (0 to 120) versus the sequential part as a percentage of total time (1% to 10%)]
[email protected] foil 17 last update 11/04/23 22:53
CERN
Load Balancing - real life is (much) worse
Often have to use barrier synchronisation between each step, and different cells require different amounts of computation

Real time sequential part: s = Σi si
Real time parallelisable part on a sequential processor: p = Σk Σj pkj
Real time parallelised: T = s + Σk maxj(pkj)  >>  s + p/N

[Diagram: timeline alternating sequential parts s1 … sN with parallel steps; in each step k the grains pk1 … pkM run in parallel and all must finish before the next sequential part starts]
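To make the effect concrete, a toy sketch (random grain sizes, illustrative numbers only) comparing the ideal time s + p/N with the barrier-synchronised time s + Σk maxj(pkj):

```python
# Toy illustration of the barrier-synchronisation effect described above:
# in each step the processors must wait for the slowest grain, so the real
# elapsed time is s + sum_k max_j(p_kj), not s + p/N.
import random

random.seed(1)
N_PROC, N_STEPS, s = 8, 20, 1.0
steps = [[random.uniform(0.5, 1.5) for _ in range(N_PROC)]
         for _ in range(N_STEPS)]                # p_kj: work per grain

p = sum(sum(step) for step in steps)             # total parallelisable work
ideal = s + p / N_PROC                           # perfect load balance
barriers = s + sum(max(step) for step in steps)  # wait for the slowest grain
print(f"ideal: {ideal:.1f}   with barrier synchronisation: {barriers:.1f}")
```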
[email protected] foil 18 last update 11/04/23 22:53
CERN
Gustafson’s Interpretation
The problem size scales with the number of processors
With a lot more processors (computing capacity) available you can and will do much more work in less time
The complexity of the application rises to fill the capacity available
But the sequential part remains approximately constant
Gustafson, J.L., Re-evaluating Amdahl’s Law, CACM 31(5), 1988, pp. 532-533
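As a hedged numerical sketch of the contrast: Amdahl's fixed-size speedup saturates at 1/s, while Gustafson's scaled speedup, s + (1-s)·N, keeps growing as the problem size grows with N (the 1% sequential fraction below is arbitrary):

```python
# Sketch contrasting fixed-size (Amdahl) and scaled (Gustafson) speedup.
# Assumption spelled out here: Gustafson keeps the sequential fraction s of
# the run fixed while the parallel work grows with N, giving s + (1 - s)*N.
def amdahl(s: float, n: int) -> float:
    return 1.0 / (s + (1.0 - s) / n)

def gustafson(s: float, n: int) -> float:
    return s + (1.0 - s) * n

for n in (10, 100, 1000):
    print(f"N={n:4d}   Amdahl: {amdahl(0.01, n):7.1f}x   "
          f"Gustafson (scaled): {gustafson(0.01, n):7.1f}x")
```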
[email protected] foil 19 last update 11/04/23 22:53
CERN
Amdahl’s Law - maximum speedup with Gustafson’s appetite
[Chart: maximum speedup (0 to 2,500) versus the sequential part as a percentage of total time (0.0% to 3.0%)]

potential 1,000 X speedup with 0.1% sequential code
[email protected] foil 20 last update 11/04/23 22:53
CERN
The importance of the network
Communication overhead adds to the inherent sequential part of the program to limit the Amdahl speedup
Latency – the round-trip time (RTT) to communicate between two processors

communications overhead: c = latency + data_transfer_time

    Speedup = (s + p) / (s + c + p/N)

For fine grained parallel programs the problem is latency, not bandwidth

[Diagram: timeline showing the sequential parts, the communications overhead and the parallelisable parts]
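A small sketch of the formula above, with illustrative (not measured) values of the per-step communication overhead c, showing how latency caps the speedup of fine-grained programs:

```python
# Sketch of the formula above: Speedup = (s + p) / (s + c + p/N), with the
# communication overhead c added to the sequential part. The values of s
# and c below are illustrative placeholders, not measurements.
def speedup_with_overhead(s: float, c: float, n: int) -> float:
    p = 1.0 - s                       # normalise so that s + p = 1
    return (s + p) / (s + c + p / n)

for c in (0.0, 0.001, 0.01):
    print(f"c = {c}: N=1000 gives {speedup_with_overhead(0.01, c, 1000):.1f}x")
```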
[email protected] foil 21 last update 11/04/23 22:53
CERN
Latency
Comparison – Efficient MPI implementation on Linux cluster (source: Real World Computing Partnership, Tsukuba Research Center)
Network                          Bandwidth (MByte/sec)   RTT Latency (microsecond)
Myrinet                          146                     20
Gigabit Ethernet (Sysconect)     73                      61
Fast Ethernet (EEPRO100)         11                      100
[email protected] foil 23 last update 11/04/23 22:53
CERN
High Throughput Computing - HTC
Roughly speaking –
HPC – deals with one large problem
HTC – is appropriate when the problem can be decomposed into many (very many) smaller problems that are essentially independent, e.g.
  Build a profile of all MasterCard customers who purchased an airline ticket and rented a car in August
  Analyse the purchase patterns of Walmart customers in the LA area last month
  Generate 10^6 CMS events
  Web surfing, Web searching
  Database queries
HPC – problems that are hard to parallelise – single processor performance is important
HTC – problems that are easy to parallelise – can be adapted to very large numbers of processors
[email protected] foil 24 last update 11/04/23 22:53
CERN
HTC - HPC
High Performance
  Granularity largely defined by the algorithm, limitations in the hardware
  Load balancing difficult
  Hard to schedule different workloads
  Reliability is all important
    if one part fails the calculation stops (maybe even aborts!)
    check-pointing essential – all the processes must be restarted from the same synchronisation point
    hard to dynamically re-configure for a smaller number of processors

High Throughput
  Granularity can be selected to fit the environment
  Load balancing easy
  Mixing workloads is easy
  Sustained throughput is the key goal
    the order in which the individual tasks execute is (usually) not important
    if some equipment goes down the work can be re-run later
    easy to dynamically re-schedule the workload to different configurations
[email protected] foil 26 last update 11/04/23 22:53
CERN
Distributed Computing
Local distributed systems
  Clusters
  Parallel computers (IBM SP)
Geographically distributed systems
  Computational Grids
HPC – as we have seen, needs low latency AND good communication bandwidth
HTC distributed systems
  The bandwidth is important, the latency is less significant
  If latency is poor, more processes can be run in parallel to cover the waiting time
[email protected] foil 27 last update 11/04/23 22:53
CERN
Shared Data
If the granularity is coarse enough, the different parts of the problem can be synchronised simply by sharing data

Example – event reconstruction
  all of the events to be reconstructed are stored in a large data store
  processes (jobs) read successive raw events, generating processed event records, until there are no raw events left
  the result is the concatenation of the processed events (and folding together some histogram data)
  synchronisation overhead can be minimised by partitioning the input and output data
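As an illustration of synchronising through shared data only, here is a minimal sketch of such a job; the directory layout, file names and the stand-in “reconstruction” are assumptions, not a CERN format:

```python
# Minimal sketch of synchronisation through shared data only: each job claims
# every n_jobs-th raw-event file from a shared directory and writes its own
# output file; the overall result is just the concatenation of the outputs.
# Directory names, file names and the stand-in "reconstruction" are
# illustrative assumptions, not a CERN format.
import glob
import os

RAW_DIR, OUT_DIR = "raw_events", "processed_events"

def reconstruct(raw_bytes: bytes) -> bytes:
    return raw_bytes[::-1]            # placeholder for the real reconstruction

def run_job(job_id: int, n_jobs: int) -> None:
    os.makedirs(OUT_DIR, exist_ok=True)
    files = sorted(glob.glob(os.path.join(RAW_DIR, "*.raw")))
    for i, raw_path in enumerate(files):
        if i % n_jobs != job_id:      # static partitioning of the input data
            continue
        out_path = os.path.join(OUT_DIR, os.path.basename(raw_path) + ".rec")
        with open(raw_path, "rb") as fin, open(out_path, "wb") as fout:
            fout.write(reconstruct(fin.read()))
```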
[email protected] foil 28 last update 11/04/23 22:53
CERN
Data Sharing - Files
Global file namespace – maps a universal name to network node, local name
Remote data access
Caching strategies
  Local or intermediate caching
  Replication
  Migration
Access control, authentication issues
Locking issues
NFS, AFS, Web folders
Highly scalable for read-only data
[email protected] foil 29 last update 11/04/23 22:53
CERN
Data Sharing – Databases, Objects
File sharing is probably the simplest paradigm for building distributed systems
Database and object sharing look the same
But –
  Files are universal, fundamental systems concepts – standard interfaces, functionality
  Databases are not yet fundamental, built-in – and there are only a few standards
  Objects even less so – still at the application level – so harder to implement efficient and universal caching, remote access, etc.
[email protected] foil 30 last update 11/04/23 22:53
CERN
Client-server
Examples: Web browsing, online banking, order entry, ……..
The functionality is divided between the two parts – for example
  exploit locality of data (e.g. perform searches, transformations on the node where the data resides)
  exploit different hardware capabilities (e.g. central supercomputer, graphics workstation)
  security concerns – restrict sensitive data to defined geographical locations (e.g. account queries)
  reliability concerns (e.g. perform database updates on highly reliable servers)
Usually the server implements pre-defined, standardised functions

[Diagram: client sends request to server; server returns response]
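A minimal sketch of the request/response pattern using only Python's standard library; the data, port and query path are illustrative. The server holds the data and performs the search locally, so only small messages cross the network:

```python
# Minimal request/response sketch with Python's standard library only; the
# data, port and query path are illustrative. The server holds the data and
# performs the search locally, so only small messages cross the network.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = {"alice": 12, "bob": 7, "carol": 42}       # lives on the server side

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")                # e.g. GET /bob
        body = json.dumps({key: DATA.get(key)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SearchHandler).serve_forever()
```

A client then needs nothing more than, say, urllib.request.urlopen("http://localhost:8000/bob") to obtain the small JSON answer.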
[email protected] foil 31 last update 11/04/23 22:53
CERN
3-Tier client-server

[Diagram: groups of clients connect to intermediate servers, which in turn connect to a central database server]

• data extracts replicated on intermediate servers
• changes batched for asynchronous treatment by the database server

Enables –
• scaling up client query capacity
• isolation of the main database
[email protected] foil 32 last update 11/04/23 22:53
CERN
Peer-to-Peer - P2P
Peer-to-Peer – decentralisation of function and control
Taking advantage of the computational resources at the edge of the network
The functions are shared between the distributed parts – without central control
Programs cooperate without being designed as a single application
So P2P is just a democratic form of parallel programming –
  SETI
  The parallel HPC problems we have looked at, using MPI
All the buzz of P2P is because new interfaces promise to bring this to the commercial world, allowing different communities and businesses to collaborate through the internet
  XML, SOAP, .NET, JXTA
[email protected] foil 33 last update 11/04/23 22:53
CERN
Simple Object Access Protocol - SOAP
SOAP – a simple, lightweight mechanism for exchanging objects between peers in a distributed environment, using XML carried over HTTP
SOAP consists of three parts:
  The SOAP envelope – what is in a message, who should deal with it, and whether it is optional or mandatory
  The SOAP encoding rules – serialisation definition for exchanging instances of application-defined datatypes
  The SOAP Remote Procedure Call representation
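For illustration, a hedged sketch of posting a SOAP 1.1 envelope over HTTP with Python's standard library; the host, path, SOAPAction header and message body are placeholders, not a real service:

```python
# Hedged sketch of posting a SOAP 1.1 envelope over HTTP using only the
# standard library; the host, path, SOAPAction header and message body are
# placeholders, not a real service.
import http.client

ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetTemperature xmlns="http://example.org/weather">
      <City>Geneva</City>
    </GetTemperature>
  </soap:Body>
</soap:Envelope>"""

conn = http.client.HTTPConnection("example.org")
conn.request("POST", "/soap", body=ENVELOPE.encode("utf-8"),
             headers={"Content-Type": "text/xml; charset=utf-8",
                      "SOAPAction": "http://example.org/GetTemperature"})
print(conn.getresponse().status)
```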
[email protected] foil 34 last update 11/04/23 22:53
CERN
Microsoft’s .NET
.NET is a framework, or environment for building, deploying and running Web services and other internet applications
Common Language Runtime - C++, C#, Visual Basic and JScript
Framework classes
Aiming at a standard – but Windows only
[email protected] foil 35 last update 11/04/23 22:53
CERN
JXTA
Interoperability – locating JXTA peers, communication
Platform, language and network independence
Implementable on anything – phone, VCR, PDA, PC
A set of protocols
Security model
Peer discovery
Peer groups
XML encoding
http://www.jxta.org/project/www/docs/TechOverview.pdf
[email protected] foil 36 last update: 11/04/23 22:53
CERN
End of Part 1
Tomorrow:
HEP applications
Offline computing for LHC
Extending HTC to the Grid
[Diagram: Data Handling and Computation for Physics Analysis – detector, raw data, event filter (selection & reconstruction), processed data, event summary data, event reprocessing, event simulation, batch physics analysis, analysis objects (extracted by physics topic), interactive physics analysis]
[email protected] foil 39 last update 11/04/23 22:53
CERN
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Modest floating point requirement - SPECint performance
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Simulation
  cpu-intensive
  mostly static input data
  very low output data rate
Reconstruction
  very modest I/O
  easy to partition input data
  easy to collect output data
[email protected] foil 40 last update 11/04/23 22:53
CERN
Analysis
• ESD analysis
  • modest I/O rates
  • read-only ESD
  BUT
  • very large input database
  • chaotic workload – unpredictable, no limit to the requirements
• AOD analysis
  • potentially very high I/O rates
  • but modest database
[email protected] foil 41 last update 11/04/23 22:53
CERN
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Modest floating point requirement - SPECint performance
Chaotic workload –
• research environment – unpredictable, no limit to the requirements
Very large aggregate requirements – computation, data
• Scaling up is not just big – it is also complex
• …and once you exceed the capabilities of a single geographical installation ………?
[email protected] foil 43 last update 11/04/23 22:53
CERN
Task Farming
Decompose the data into large independent chunks
Assign one task (or job) to each chunk
Put all the tasks in a queue for a scheduler, which manages a large “farm” of processors, each of which has access to all of the data
The scheduler runs one or more jobs on each processor
When a job finishes, the next job in the queue is started
…until all the jobs have been run
Collect the output files
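A minimal single-machine sketch of the task-farming pattern using Python's multiprocessing pool; the chunk-processing function is a placeholder for real work, and the “farm” here is just local worker processes rather than a cluster scheduler:

```python
# Minimal single-machine sketch of the task-farming pattern with Python's
# multiprocessing pool; process_chunk is a placeholder for the real work and
# the "farm" here is local worker processes rather than a cluster scheduler.
from multiprocessing import Pool

def process_chunk(chunk_id: int) -> str:
    # stand-in for: read one data chunk, process it, write an output file
    return f"chunk {chunk_id} done"

if __name__ == "__main__":
    chunks = range(100)                  # large, independent chunks of data
    with Pool(processes=8) as farm:      # the "farm" of processors
        for result in farm.imap_unordered(process_chunk, chunks):
            print(result)                # collect output as each job finishes
```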
[email protected] foil 44 last update 11/04/23 22:53
CERN
Task Farming
Task farming is good for
  a very large problem
  which has
    selectable granularity
    largely independent tasks
    loosely shared data
HEP –
  Simulation
  Reconstruction
  and much of the Analysis
[email protected] foil 45 last update 11/04/23 22:53
CERN
The SHIFT Software Model (1990)
[Diagram: disk servers, application servers, stage (migration) servers, tape servers and queue servers interconnected by an IP network]

From the application’s viewpoint – this is simply file sharing – all data available to all processes
standard APIs – disk I/O; mass storage; job scheduler; can be implemented over an IP network
mass storage model – tape data cached on disk (stager)
physical implementation – transparent to the application/user
scalable, heterogeneous
flexible evolution – scalable capacity; multiple platforms; seamless integration of new technologies
[email protected] foil 46 last update 11/04/23 22:53
CERN
Current Implementation of SHIFT
[Diagram: application servers – racks of dual-cpu Linux PCs; data cache – Linux PC controllers with IDE disks; mass storage – Linux PC controllers, robots (STK Powderhorn), drives (STK 9840, STK 9940, IBM 3590); network – 100BaseT and Gigabit Ethernet; WAN]
[email protected] foil 47 last update 11/04/23 22:53
CERN
Fermilab Reconstruction Farms
1991 – farms of RISC workstations introduced for reconstruction
  replaced special purpose processors (emulators, ACP)
  Ethernet network
  integrated with tape systems
  cps – job scheduler, event manager
[email protected] foil 48 last update 11/04/23 22:53
CERN
Condor – a hunter of unused cycles
The hunter of idle workstations (1986)
ClassAd Matchmaking
  users advertise their requirements
  systems advertise their capabilities & constraints
Directed Acyclic Graph Manager – DAGMan
  define dependencies between jobs
Checkpoint – reschedule – restart
  if the owner of the workstation returns, or if there is some failure
Share data through files
  global shared files
  Condor file system calls
Flocking
  interconnecting pools of Condor workstations
http://www.cs.wisc.edu/condor/
[email protected] foil 49 last update 11/04/23 22:53
CERN
Layout of the Condor Pool
[Diagram: the Central Manager runs master, collector, negotiator, schedd and startd daemons; each Desktop runs master, schedd and startd; each Cluster Node runs master and startd; arrows show ClassAd communication pathways and spawned processes]
http://www.cs.wisc.edu/condor
[email protected] foil 50 last update 11/04/23 22:53
CERN
How Flocking Works
Add a line to your condor_config: FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Diagram: the Submit Machine’s schedd talks to the collector and negotiator of its home Central Manager (CONDOR_HOST) and, via flocking, to the Central Managers of Pool-Foo and Pool-Bar]
http://www.cs.wisc.edu/condor
[email protected] foil 51 last update 11/04/23 22:53
CERN
[Diagram: 600 Condor jobs shared between a Home Condor Pool and a Friendly Condor Pool]
http://www.cs.wisc.edu/condor
[email protected] foil 53 last update 11/04/23 22:53
CERN
The food chain in reverse –
  the PC has consumed the market for larger computers, destroying the species
  there is no choice but to harness the PCs
[email protected] foil 54 last update 11/04/23 22:53
CERN
Berkeley - Networks of Workstations (1994)
Single system view
  Shared resources
  Virtual machine
  Single address space
Global Layer Unix – GLUnix
Serverless Network File Service – xFS
Research project
A Case for Networks of Workstations: NOW, IEEE Micro, Feb, 1995, Thomas E. Anderson, David E. Culler, David A. Patterson
http://now.cs.berkeley.edu
[email protected] foil 55 last update 11/04/23 22:53
CERN
Beowulf
NASA Goddard (Thomas Sterling, Donald Becker) – 1994
  16 Intel PCs – Ethernet – Linux
Caltech/JPL, Los Alamos
  parallel applications from the supercomputing community
Oak Ridge – 1996 – the Stone SouperComputer
  problem – generate an eco-region map of the US, 1 km grid
  64-way PC cluster proposal rejected
  re-cycled rejected desktop systems
The experience, the emphasis on do-it-yourself, the packaging of some of the tools, and probably the name, stimulated wide-spread adoption of clusters in the super-computing world
[email protected] foil 56 last update 11/04/23 22:53
CERN
Parallel ROOT Facility - Proof
ROOT object oriented analysis tool
Queries are performed in parallel on an arbitrary number of processors
Load balancing:
  slaves receive work from the master process in “packets”
  packet size is adapted to the current load, number of slaves, etc.
[email protected] foil 58 last update 11/04/23 22:53
CERN
CERN's Users in the World
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
[email protected] foil 59 last update 11/04/23 22:53
CERN
The Large Hadron Collider Project
4 detectors – ATLAS, CMS, LHCb, ALICE
Storage –
  raw recording rate 0.1 – 1 GBytes/sec
  accumulating at 5-8 PetaBytes/year
  10 PetaBytes of disk
Processing –
  200,000 of today’s fastest PCs
[email protected] foil 60 last update 11/04/23 22:53
CERN
source: CERN/LHCC/2001-004 - Report of the LHC Computing Review - 20 February 2001
Summary of Computing Capacity Required for all LHC Experiments in 2007
(ATLAS with 270 Hz trigger)

                        ---------- CERN ----------    Regional    Grand
                        Tier 0    Tier 1    Total      Centres     Total
Processing (K SI95)      1,727       832    2,559        4,974     7,533
Disk (PB)                  1.2       1.2      2.4          8.7      11.1
Magnetic tape (PB)        16.3       1.2     17.6         20.3      37.9

Worldwide distributed computing system
  small fraction of the analysis at CERN
  ESD analysis – using 12-20 large regional centres
    how to use the resources efficiently
    establishing and maintaining a uniform physics environment
  data exchange – with tens of smaller regional centres, universities, labs
[email protected] foil 61 last update: 11/04/23 22:53
CERN
Planned capacity evolution at CERN

[Charts: Estimated Disk Capacity at CERN (TeraBytes), Estimated Mass Storage at CERN (PetaBytes) and Estimated CPU Capacity at CERN (K SI95), 1998-2010, showing LHC and other experiments, with Moore’s law for comparison on the CPU chart]
[email protected] foil 62 last update 11/04/23 22:53
CERN
Are Grids a solution?
The Grid – Ian Foster, Carl Kesselman – The Globus Project
“Dependable, consistent, pervasive access to [high-end] resources”
  • Dependable: provides performance and functionality guarantees
  • Consistent: uniform interfaces to a wide variety of resources
  • Pervasive: ability to “plug in” from anywhere
[email protected] foil 63 last update 11/04/23 22:53
CERN
The Grid
The GRID
ubiquitous access to computation
in the sense that the WEB provides
ubiquitous access to information
[email protected] foil 64 last update 11/04/23 22:53
CERN
Globus Architecture – www.globus.org

[Diagram: layered architecture]
  Applications
  High-level Services and Tools – DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status
  Core Services – Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS
  Local Services – LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX

middleware
Uniform application program interface to grid resources
Grid infrastructure primitives
Mapped to local implementations, architectures, policies
[email protected] foil 65 last update 11/04/23 22:53
CERN
The nodes of the Grid are managed by different people, so they have different access and usage policies, and may have different architectures
The geographical distribution means that there cannot be a central status
  status information and resource availability is “published” (remember Condor Classified Ads)
  Grid schedulers can only have an approximate view of resources
The Grid middleware tries to present this as a coherent virtual computing centre
[email protected] foil 66 last update 11/04/23 22:53
CERN
Core Services
Security
Information Service
Resource Management – Grid scheduler, standard resource allocation
Remote Data Access – global namespace, caching, replication
Performance and Status Monitoring
Fault detection
Error Recovery Management
[email protected] foil 67 last update 11/04/23 22:53
CERN
The Promise of Grid Technology
What does the Grid do for you?
You submit your work, and the Grid
  finds convenient places for it to be run
  optimises use of the widely dispersed resources
  organises efficient access to your data – caching, migration, replication
  deals with authentication to the different sites that you will be using
  interfaces to local site resource allocation mechanisms, policies
  runs your jobs
  monitors progress
  recovers from problems
  .. and .. tells you when your work is complete
[email protected] foil 68 last update 11/04/23 22:53
CERN
LHC Computing Model (2001 – evolving)

[Diagram: CERN Tier 0 and CERN Tier 1 at the centre, surrounded by Tier 1 regional centres (Germany, USA, UK, France, Italy, ……….), Tier 2 centres (Lab a, Uni a, Lab b, Uni b, Lab c, Uni x, Uni y, Uni n, Lab m, ……….), Tier 3 physics department resources and desktops, serving the CMS, ATLAS and LHCb physics groups and regional groups – “The LHC Computing Centre” – the opportunity of Grid technology]