[email protected] foil 1 last update: 11/04/23 22:53
CERN
What is High Throughput Distributed Computing?
CERN Computing Summer School 2001
Santander
Les Robertson, CERN – IT Division
[email protected] foil 2 last update 11/04/23 22:53
CERN
Outline
High Performance Computing (HPC) and High Throughput Computing (HTC)
Parallel processing – why it is so difficult with HPC applications and so easy with HTC
Some models of distributed computing
HEP applications
Offline computing for LHC
Extending HTC to the Grid
[email protected] foil 3 last update 11/04/23 22:53
CERN
“Speeding Up” the Calculation?
Use the fastest processor available
-- but this gives only a small factor over modest (PC) processors
Use many processors, performing bits of the problem in parallel
-- and since quite fast processors are inexpensive, we can think of using very many processors in parallel
[email protected] foil 4 last update 11/04/23 22:53
CERN
High Performance – or – High Throughput?
The key questions are granularity & degree of parallelism – have you got one big problem or a bunch of little ones?
To what extent can the “problem” be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?
Granularity
  fine-grained parallelism – the independent bits are small, need to exchange information and synchronise often
  coarse-grained – the problem can be decomposed into large chunks that can be processed independently
Practical limits on the degree of parallelism – how many grains can be processed in parallel?
  degree of parallelism v. grain size
  grain size limited by the efficiency of the system at synchronising grains
[email protected] foil 5 last update 11/04/23 22:53
CERN
High Performance – v. – High Throughput?
fine-grained problems need a high performance system
  that enables rapid synchronisation between the bits that can be processed in parallel
  and runs the bits that are difficult to parallelise as fast as possible
coarse-grained problems can use a high throughput system
  that maximises the number of parts processed per minute
High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected
while High Performance Systems use a smaller number of more expensive processors, expensively interconnected
[email protected] foil 6 last update 11/04/23 22:53
CERN
High Performance – v. – High Throughput?
There is nothing fundamental here – it is just a question of financial trade-offs like:
  how much more expensive is a “fast” computer than a bunch of slower ones?
  how much is it worth to get the answer more quickly?
  how much investment is necessary to improve the degree of parallelisation of the algorithm?
But the target is moving –
  Since the cost chasm first opened between fast and slower computers 12-15 years ago, an enormous effort has gone into finding parallelism in “big” problems
  Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters, fabrics, (inter)national grids – demanding ever-greater degrees of parallelism
[email protected] foil 8 last update 11/04/23 22:53
CERN
A quick look at HPC problems
Classical high-performance applications
  numerical simulations of complex systems such as weather, climate, combustion, mechanical devices and structures, crash simulation, electronic circuits, manufacturing processes, chemical reactions
  image processing applications like medical scans, military sensors, earth observation, satellite reconnaissance, seismic prospecting
[email protected] foil 9 last update 11/04/23 22:53
CERN
Approaches to parallelism
Domain decomposition
Functional decomposition
graphics from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 10 last update 11/04/23 22:53
CERN
Of course – it’s not that simple
graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 11 last update 11/04/23 22:53
CERN
The design process
Data or functional decomposition – building an abstract task model
Building a model for communication between tasks – interaction patterns
Agglomeration – to fit the abstract model to the constraints of the target hardware
  interconnection topology
  speed, latency, overhead of communications
Mapping the tasks to the processors
  load balancing
  task scheduling
graphic from Designing and Building Parallel Programs (Online),by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/
[email protected] foil 12 last update 11/04/23 22:53
CERN
Large scale parallelism – the need for standards
The “supercomputer” market is in trouble – diminishing number of suppliers; questionable future
Increasingly risky to design for specific tightly coupled architectures like SGI (Cray, Origin), NEC, Hitachi
Require a standard for communication between partitions/tasks that also works on loosely coupled systems (“massively parallel processors” – MPP – IBM SP, Compaq)
Paradigm is message passing rather than shared memory – tasks rather than threads
  Parallel Virtual Machine – PVM
  MPI – Message Passing Interface
[email protected] foil 13 last update 11/04/23 22:53
CERN
MPI – Message Passing Interface
industrial standard – http://www.mpi-forum.org
source code portability
widely available; efficient implementations
SPMD (Single Program Multiple Data) model
Point-to-point communication (send/receive/wait; blocking/non-blocking)
Collective operations (broadcast; scatter/gather; reduce)
Process groups, topologies
comprehensive and rich functionality
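As an illustration only, here is a minimal SPMD sketch of point-to-point and collective operations using the mpi4py binding; the binding, and Python itself, are assumptions – the slides do not prescribe a language.

```python
# Minimal SPMD sketch of MPI point-to-point and collective operations,
# using the mpi4py binding (an assumption; the slides do not prescribe one).
# Run with, for example:  mpiexec -n 4 python mpi_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # this process's id within the group
size = comm.Get_size()              # total number of processes

# Point-to-point: rank 0 sends a message to rank 1 (blocking send/recv)
if rank == 0 and size > 1:
    comm.send({"work_unit": 42}, dest=1, tag=0)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print("rank 1 received", msg)

# Collective: every rank contributes a partial result, the root gets the sum
partial = rank * rank
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of partial results:", total)
```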
[email protected] foil 14 last update 11/04/23 22:53
CERN
MPI – Collective operations
IBM Redbook - http://www.redbooks.ibm.com/redbooks/SG245380.html
Defining high level data functions allows highly efficient implementations, e.g. minimising data copies
[email protected] foil 15 last update 11/04/23 22:53
CERN
The limits of parallelism - Amdahl’s Law
s – time spent in a serial processor on the serial parts of the code
p – time spent in a serial processor on the parts that could be executed in parallel

If we have N processors:

    Speedup = (s + p) / (s + p/N)

Taking s as the fraction of the time spent in the sequential part of the program (s + p = 1):

    Speedup = 1 / (s + (1-s)/N)  →  1/s as N → ∞
Amdahl, G.M., Validity of the single-processor approach to achieving large scale computing capability, Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
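A small, hedged illustration of the formula above – compute the Amdahl speedup for a few values of s and N (the values are chosen arbitrarily):

```python
# Hedged illustration of the formula above: Amdahl speedup for a program
# with sequential fraction s run on N processors (values chosen arbitrarily).
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup = 1 / (s + (1 - s)/N); tends to 1/s as N grows."""
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.01, 0.05, 0.10):
    print(f"s = {s:.0%}: " + ", ".join(
        f"N={n}: {amdahl_speedup(s, n):.1f}x" for n in (10, 100, 1000)))
```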
[email protected] foil 16 last update 11/04/23 22:53
CERN
Amdahl’s Law - maximum speedup
[Chart: maximum Amdahl speedup (0 to 120) versus the sequential part as a percentage of total time (1% to 10%)]
[email protected] foil 17 last update 11/04/23 22:53
CERN
Load Balancing - real life is (much) worse
Often have to use barrier synchronisation between each step, and different cells require different amounts of computation

Real time sequential part: s = Σi si
Real time parallelisable part on a sequential processor: p = Σk Σj pkj
Real time parallelised: T = s + Σk maxj(pkj)  >>  s + p/N

[Diagram: timeline alternating sequential parts s1 … sN with parallel steps; in each step k the grains pk1 … pkM run in parallel and all must finish before the next sequential part starts]
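To make the effect concrete, a toy sketch (random grain sizes, illustrative numbers only) comparing the ideal time s + p/N with the barrier-synchronised time s + Σk maxj(pkj):

```python
# Toy illustration of the barrier-synchronisation effect described above:
# in each step the processors must wait for the slowest grain, so the real
# elapsed time is s + sum_k max_j(p_kj), not s + p/N.
import random

random.seed(1)
N_PROC, N_STEPS, s = 8, 20, 1.0
steps = [[random.uniform(0.5, 1.5) for _ in range(N_PROC)]
         for _ in range(N_STEPS)]                # p_kj: work per grain

p = sum(sum(step) for step in steps)             # total parallelisable work
ideal = s + p / N_PROC                           # perfect load balance
barriers = s + sum(max(step) for step in steps)  # wait for the slowest grain
print(f"ideal: {ideal:.1f}   with barrier synchronisation: {barriers:.1f}")
```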
[email protected] foil 18 last update 11/04/23 22:53
CERN
Gustafson’s Interpretation
The problem size scales with the number of processors
With a lot more processors (computing capacity) available you can and will do much more work in less time
The complexity of the application rises to fill the capacity available
But the sequential part remains approximately constant
Gustafson, J.L., Re-evaluating Amdahl’s Law, CACM 31(5), 1988, pp. 532-533
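As a hedged numerical sketch of the contrast: Amdahl's fixed-size speedup saturates at 1/s, while Gustafson's scaled speedup, s + (1-s)·N, keeps growing as the problem size grows with N (the 1% sequential fraction below is arbitrary):

```python
# Sketch contrasting fixed-size (Amdahl) and scaled (Gustafson) speedup.
# Assumption spelled out here: Gustafson keeps the sequential fraction s of
# the run fixed while the parallel work grows with N, giving s + (1 - s)*N.
def amdahl(s: float, n: int) -> float:
    return 1.0 / (s + (1.0 - s) / n)

def gustafson(s: float, n: int) -> float:
    return s + (1.0 - s) * n

for n in (10, 100, 1000):
    print(f"N={n:4d}   Amdahl: {amdahl(0.01, n):7.1f}x   "
          f"Gustafson (scaled): {gustafson(0.01, n):7.1f}x")
```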
[email protected] foil 19 last update 11/04/23 22:53
CERN
Amdahl’s Law - maximum speedup with Gustafson’s appetite
[Chart: maximum speedup (0 to 2,500) versus the sequential part as a percentage of total time (0.0% to 3.0%)]

potential 1,000 X speedup with 0.1% sequential code
[email protected] foil 20 last update 11/04/23 22:53
CERN
The importance of the network
Communication overhead adds to the inherent sequential part of the program to limit the Amdahl speedup
Latency – the round-trip time (RTT) to communicate between two processors

communications overhead: c = latency + data_transfer_time

    Speedup = (s + p) / (s + c + p/N)

For fine grained parallel programs the problem is latency, not bandwidth

[Diagram: timeline showing the sequential parts, the communications overhead and the parallelisable parts]
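A small sketch of the formula above, with illustrative (not measured) values of the per-step communication overhead c, showing how latency caps the speedup of fine-grained programs:

```python
# Sketch of the formula above: Speedup = (s + p) / (s + c + p/N), with the
# communication overhead c added to the sequential part. The values of s
# and c below are illustrative placeholders, not measurements.
def speedup_with_overhead(s: float, c: float, n: int) -> float:
    p = 1.0 - s                       # normalise so that s + p = 1
    return (s + p) / (s + c + p / n)

for c in (0.0, 0.001, 0.01):
    print(f"c = {c}: N=1000 gives {speedup_with_overhead(0.01, c, 1000):.1f}x")
```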
[email protected] foil 21 last update 11/04/23 22:53
CERN
Latency
Comparison – Efficient MPI implementation on Linux cluster (source: Real World Computing Partnership, Tsukuba Research Center)
Network                          Bandwidth (MByte/sec)   RTT Latency (microsecond)
Myrinet                          146                     20
Gigabit Ethernet (Sysconect)     73                      61
Fast Ethernet (EEPRO100)         11                      100
[email protected] foil 23 last update 11/04/23 22:53
CERN
High Throughput Computing - HTC
Roughly speaking –
HPC – deals with one large problem
HTC – is appropriate when the problem can be decomposed into many (very many) smaller problems that are essentially independent, e.g.
  Build a profile of all MasterCard customers who purchased an airline ticket and rented a car in August
  Analyse the purchase patterns of Walmart customers in the LA area last month
  Generate 10^6 CMS events
  Web surfing, Web searching
  Database queries
HPC – problems that are hard to parallelise – single processor performance is important
HTC – problems that are easy to parallelise – can be adapted to very large numbers of processors
[email protected] foil 24 last update 11/04/23 22:53
CERN
HTC - HPC
High Performance
  Granularity largely defined by the algorithm, limitations in the hardware
  Load balancing difficult
  Hard to schedule different workloads
  Reliability is all important
    if one part fails the calculation stops (maybe even aborts!)
    check-pointing essential – all the processes must be restarted from the same synchronisation point
    hard to dynamically re-configure for a smaller number of processors

High Throughput
  Granularity can be selected to fit the environment
  Load balancing easy
  Mixing workloads is easy
  Sustained throughput is the key goal
    the order in which the individual tasks execute is (usually) not important
    if some equipment goes down the work can be re-run later
    easy to dynamically re-schedule the workload to different configurations
[email protected] foil 26 last update 11/04/23 22:53
CERN
Distributed Computing
Local distributed systems
  Clusters
  Parallel computers (IBM SP)
Geographically distributed systems
  Computational Grids
HPC – as we have seen, needs low latency AND good communication bandwidth
HTC distributed systems
  The bandwidth is important, the latency is less significant
  If latency is poor, more processes can be run in parallel to cover the waiting time
[email protected] foil 27 last update 11/04/23 22:53
CERN
Shared Data
If the granularity is coarse enough, the different parts of the problem can be synchronised simply by sharing data

Example – event reconstruction
  all of the events to be reconstructed are stored in a large data store
  processes (jobs) read successive raw events, generating processed event records, until there are no raw events left
  the result is the concatenation of the processed events (and folding together some histogram data)
  synchronisation overhead can be minimised by partitioning the input and output data
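As an illustration of synchronising through shared data only, here is a minimal sketch of such a job; the directory layout, file names and the stand-in “reconstruction” are assumptions, not a CERN format:

```python
# Minimal sketch of synchronisation through shared data only: each job claims
# every n_jobs-th raw-event file from a shared directory and writes its own
# output file; the overall result is just the concatenation of the outputs.
# Directory names, file names and the stand-in "reconstruction" are
# illustrative assumptions, not a CERN format.
import glob
import os

RAW_DIR, OUT_DIR = "raw_events", "processed_events"

def reconstruct(raw_bytes: bytes) -> bytes:
    return raw_bytes[::-1]            # placeholder for the real reconstruction

def run_job(job_id: int, n_jobs: int) -> None:
    os.makedirs(OUT_DIR, exist_ok=True)
    files = sorted(glob.glob(os.path.join(RAW_DIR, "*.raw")))
    for i, raw_path in enumerate(files):
        if i % n_jobs != job_id:      # static partitioning of the input data
            continue
        out_path = os.path.join(OUT_DIR, os.path.basename(raw_path) + ".rec")
        with open(raw_path, "rb") as fin, open(out_path, "wb") as fout:
            fout.write(reconstruct(fin.read()))
```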
[email protected] foil 28 last update 11/04/23 22:53
CERN
Data Sharing - Files
Global file namespace – maps a universal name to network node, local name
Remote data access
Caching strategies
  Local or intermediate caching
  Replication
  Migration
Access control, authentication issues
Locking issues
NFS, AFS, Web folders
Highly scalable for read-only data
[email protected] foil 29 last update 11/04/23 22:53
CERN
Data Sharing – Databases, Objects
File sharing is probably the simplest paradigm for building distributed systems
Database and object sharing look the same
But –
  Files are universal, fundamental systems concepts – standard interfaces, functionality
  Databases are not yet fundamental, built-in – and there are only a few standards
  Objects even less so – still at the application level – so harder to implement efficient and universal caching, remote access, etc.
[email protected] foil 30 last update 11/04/23 22:53
CERN
Client-server
Examples: Web browsing, online banking, order entry, ……..
The functionality is divided between the two parts – for example
  exploit locality of data (e.g. perform searches, transformations on the node where the data resides)
  exploit different hardware capabilities (e.g. central supercomputer, graphics workstation)
  security concerns – restrict sensitive data to defined geographical locations (e.g. account queries)
  reliability concerns (e.g. perform database updates on highly reliable servers)
Usually the server implements pre-defined, standardised functions

[Diagram: client sends request to server; server returns response]
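A minimal sketch of the request/response pattern using only Python's standard library; the data, port and query path are illustrative. The server holds the data and performs the search locally, so only small messages cross the network:

```python
# Minimal request/response sketch with Python's standard library only; the
# data, port and query path are illustrative. The server holds the data and
# performs the search locally, so only small messages cross the network.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

DATA = {"alice": 12, "bob": 7, "carol": 42}       # lives on the server side

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")                # e.g. GET /bob
        body = json.dumps({key: DATA.get(key)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SearchHandler).serve_forever()
```

A client then needs nothing more than, say, urllib.request.urlopen("http://localhost:8000/bob") to obtain the small JSON answer.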
[email protected] foil 31 last update 11/04/23 22:53
CERN
3-Tier client-server

[Diagram: groups of clients connect to intermediate servers, which in turn connect to a central database server]

• data extracts replicated on intermediate servers
• changes batched for asynchronous treatment by the database server

Enables –
• scaling up client query capacity
• isolation of the main database
[email protected] foil 32 last update 11/04/23 22:53
CERN
Peer-to-Peer - P2P
Peer-to-Peer – decentralisation of function and control
Taking advantage of the computational resources at the edge of the network
The functions are shared between the distributed parts – without central control
Programs cooperate without being designed as a single application
So P2P is just a democratic form of parallel programming –
  SETI
  The parallel HPC problems we have looked at, using MPI
All the buzz of P2P is because new interfaces promise to bring this to the commercial world, allowing different communities and businesses to collaborate through the internet
  XML, SOAP, .NET, JXTA
[email protected] foil 33 last update 11/04/23 22:53
CERN
Simple Object Access Protocol - SOAP
SOAP – a simple, lightweight mechanism for exchanging objects between peers in a distributed environment, using XML carried over HTTP
SOAP consists of three parts:
  The SOAP envelope – what is in a message, who should deal with it, and whether it is optional or mandatory
  The SOAP encoding rules – serialisation definition for exchanging instances of application-defined datatypes
  The SOAP Remote Procedure Call representation
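For illustration, a hedged sketch of posting a SOAP 1.1 envelope over HTTP with Python's standard library; the host, path, SOAPAction header and message body are placeholders, not a real service:

```python
# Hedged sketch of posting a SOAP 1.1 envelope over HTTP using only the
# standard library; the host, path, SOAPAction header and message body are
# placeholders, not a real service.
import http.client

ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetTemperature xmlns="http://example.org/weather">
      <City>Geneva</City>
    </GetTemperature>
  </soap:Body>
</soap:Envelope>"""

conn = http.client.HTTPConnection("example.org")
conn.request("POST", "/soap", body=ENVELOPE.encode("utf-8"),
             headers={"Content-Type": "text/xml; charset=utf-8",
                      "SOAPAction": "http://example.org/GetTemperature"})
print(conn.getresponse().status)
```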
[email protected] foil 34 last update 11/04/23 22:53
CERN
Microsoft’s .NET
.NET is a framework, or environment for building, deploying and running Web services and other internet applications
Common Language Runtime - C++, C#, Visual Basic and JScript
Framework classes
Aiming at a standard – but Windows only
[email protected] foil 35 last update 11/04/23 22:53
CERN
JXTA
Interoperability – locating JXTA peers, communication
Platform, language and network independence
Implementable on anything – phone, VCR, PDA, PC
A set of protocols
Security model
Peer discovery
Peer groups
XML encoding
http://www.jxta.org/project/www/docs/TechOverview.pdf
[email protected] foil 36 last update: 11/04/23 22:53
CERN
End of Part 1
Tomorrow:
HEP applications
Offline computing for LHC
Extending HTC to the Grid
[Diagram: Data Handling and Computation for Physics Analysis – detector, raw data, event filter (selection & reconstruction), processed data, event summary data, event reprocessing, event simulation, batch physics analysis, analysis objects (extracted by physics topic), interactive physics analysis]
[email protected] foil 39 last update 11/04/23 22:53
CERN
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Modest floating point requirement - SPECint performance
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Simulation
  cpu-intensive
  mostly static input data
  very low output data rate
Reconstruction
  very modest I/O
  easy to partition input data
  easy to collect output data
[email protected] foil 40 last update 11/04/23 22:53
CERN
Analysis
• ESD analysis
  • modest I/O rates
  • read-only ESD
  BUT
  • very large input database
  • chaotic workload – unpredictable, no limit to the requirements
• AOD analysis
  • potentially very high I/O rates
  • but modest database
[email protected] foil 41 last update 11/04/23 22:53
CERN
HEP Computing Characteristics
Large numbers of independent events - trivial parallelism – “job” granularity
Large data sets - smallish records, mostly read-only
Modest I/O rates - few MB/sec per fast processor
Modest floating point requirement - SPECint performance
Chaotic workload –
• research environment – unpredictable, no limit to the requirements
Very large aggregate requirements – computation, data
• Scaling up is not just big – it is also complex
• …and once you exceed the capabilities of a single geographical installation ………?
[email protected] foil 43 last update 11/04/23 22:53
CERN
Task Farming
Decompose the data into large independent chunks
Assign one task (or job) to each chunk
Put all the tasks in a queue for a scheduler, which manages a large “farm” of processors, each of which has access to all of the data
The scheduler runs one or more jobs on each processor
When a job finishes, the next job in the queue is started
…until all the jobs have been run
Collect the output files
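A minimal single-machine sketch of the task-farming pattern using Python's multiprocessing pool; the chunk-processing function is a placeholder for real work, and the “farm” here is just local worker processes rather than a cluster scheduler:

```python
# Minimal single-machine sketch of the task-farming pattern with Python's
# multiprocessing pool; process_chunk is a placeholder for the real work and
# the "farm" here is local worker processes rather than a cluster scheduler.
from multiprocessing import Pool

def process_chunk(chunk_id: int) -> str:
    # stand-in for: read one data chunk, process it, write an output file
    return f"chunk {chunk_id} done"

if __name__ == "__main__":
    chunks = range(100)                  # large, independent chunks of data
    with Pool(processes=8) as farm:      # the "farm" of processors
        for result in farm.imap_unordered(process_chunk, chunks):
            print(result)                # collect output as each job finishes
```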
[email protected] foil 44 last update 11/04/23 22:53
CERN
Task Farming
Task farming is good for
  a very large problem
  which has
    selectable granularity
    largely independent tasks
    loosely shared data
HEP –
  Simulation
  Reconstruction
  and much of the Analysis
[email protected] foil 45 last update 11/04/23 22:53
CERN
The SHIFT Software Model (1990)
[Diagram: disk servers, application servers, stage (migration) servers, tape servers and queue servers interconnected by an IP network]

From the application’s viewpoint – this is simply file sharing – all data available to all processes
standard APIs – disk I/O; mass storage; job scheduler; can be implemented over an IP network
mass storage model – tape data cached on disk (stager)
physical implementation – transparent to the application/user
scalable, heterogeneous
flexible evolution – scalable capacity; multiple platforms; seamless integration of new technologies
[email protected] foil 46 last update 11/04/23 22:53
CERN
Current Implementation of SHIFT
[Diagram: application servers – racks of dual-cpu Linux PCs; data cache – Linux PC controllers with IDE disks; mass storage – Linux PC controllers, robots (STK Powderhorn), drives (STK 9840, STK 9940, IBM 3590); network – 100BaseT and Gigabit Ethernet; WAN]
[email protected] foil 47 last update 11/04/23 22:53
CERN
Fermilab Reconstruction Farms
1991 – farms of RISC workstations introduced for reconstruction
  replaced special purpose processors (emulators, ACP)
  Ethernet network
  integrated with tape systems
  cps – job scheduler, event manager
[email protected] foil 48 last update 11/04/23 22:53
CERN
Condor – a hunter of unused cycles
The hunter of idle workstations (1986)
ClassAd Matchmaking
  users advertise their requirements
  systems advertise their capabilities & constraints
Directed Acyclic Graph Manager – DAGMan
  define dependencies between jobs
Checkpoint – reschedule – restart
  if the owner of the workstation returns, or if there is some failure
Share data through files
  global shared files
  Condor file system calls
Flocking
  interconnecting pools of Condor workstations
http://www.cs.wisc.edu/condor/
[email protected] foil 49 last update 11/04/23 22:53
CERN
Layout of the Condor Pool
[Diagram: the Central Manager runs master, collector, negotiator, schedd and startd daemons; each Desktop runs master, schedd and startd; each Cluster Node runs master and startd; arrows show ClassAd communication pathways and spawned processes]
http://www.cs.wisc.edu/condor
[email protected] foil 50 last update 11/04/23 22:53
CERN
How Flocking Works
Add a line to your condor_config: FLOCK_HOSTS = Pool-Foo, Pool-Bar
[Diagram: the Submit Machine’s schedd talks to the collector and negotiator of its home Central Manager (CONDOR_HOST) and, via flocking, to the Central Managers of Pool-Foo and Pool-Bar]
http://www.cs.wisc.edu/condor
[email protected] foil 51 last update 11/04/23 22:53
CERN
[Diagram: 600 Condor jobs shared between a Home Condor Pool and a Friendly Condor Pool]
http://www.cs.wisc.edu/condor
[email protected] foil 53 last update 11/04/23 22:53
CERN
The food chain in reverse –
  the PC has consumed the market for larger computers, destroying the species
  there is no choice but to harness the PCs
[email protected] foil 54 last update 11/04/23 22:53
CERN
Berkeley - Networks of Workstations (1994)
Single system view
  Shared resources
  Virtual machine
  Single address space
Global Layer Unix – GLUnix
Serverless Network File Service – xFS
Research project
A Case for Networks of Workstations: NOW, IEEE Micro, Feb, 1995, Thomas E. Anderson, David E. Culler, David A. Patterson
http://now.cs.berkeley.edu
[email protected] foil 55 last update 11/04/23 22:53
CERN
Beowulf
NASA Goddard (Thomas Sterling, Donald Becker) – 1994
  16 Intel PCs – Ethernet – Linux
Caltech/JPL, Los Alamos
  parallel applications from the supercomputing community
Oak Ridge – 1996 – the Stone SouperComputer
  problem – generate an eco-region map of the US, 1 km grid
  64-way PC cluster proposal rejected
  re-cycled rejected desktop systems
The experience, the emphasis on do-it-yourself, the packaging of some of the tools, and probably the name, stimulated wide-spread adoption of clusters in the super-computing world
[email protected] foil 56 last update 11/04/23 22:53
CERN
Parallel ROOT Facility - Proof
ROOT object oriented analysis tool
Queries are performed in parallel on an arbitrary number of processors
Load balancing:
  slaves receive work from the master process in “packets”
  packet size is adapted to the current load, number of slaves, etc.
[email protected] foil 58 last update 11/04/23 22:53
CERN
CERN's Users in the World
Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users
[email protected] foil 59 last update 11/04/23 22:53
CERN
The Large Hadron Collider Project
4 detectors – ATLAS, CMS, LHCb, ALICE
Storage –
  raw recording rate 0.1 – 1 GBytes/sec
  accumulating at 5-8 PetaBytes/year
  10 PetaBytes of disk
Processing –
  200,000 of today’s fastest PCs
[email protected] foil 60 last update 11/04/23 22:53
CERN
source: CERN/LHCC/2001-004 - Report of the LHC Computing Review - 20 February 2001
Summary of Computing Capacity Required for all LHC Experiments in 2007
(ATLAS with 270 Hz trigger)

                        ---------- CERN ----------    Regional    Grand
                        Tier 0    Tier 1    Total      Centres     Total
Processing (K SI95)      1,727       832    2,559        4,974     7,533
Disk (PB)                  1.2       1.2      2.4          8.7      11.1
Magnetic tape (PB)        16.3       1.2     17.6         20.3      37.9

Worldwide distributed computing system
  small fraction of the analysis at CERN
  ESD analysis – using 12-20 large regional centres
    how to use the resources efficiently
    establishing and maintaining a uniform physics environment
  data exchange – with tens of smaller regional centres, universities, labs
[email protected] foil 61 last update: 11/04/23 22:53
CERN
Planned capacity evolution at CERN

[Charts: Estimated Disk Capacity at CERN (TeraBytes), Estimated Mass Storage at CERN (PetaBytes) and Estimated CPU Capacity at CERN (K SI95), 1998-2010, showing LHC and other experiments, with Moore’s law for comparison on the CPU chart]
[email protected] foil 62 last update 11/04/23 22:53
CERN
Are Grids a solution?
The Grid – Ian Foster, Carl Kesselman – The Globus Project
“Dependable, consistent, pervasive access to [high-end] resources”
  • Dependable: provides performance and functionality guarantees
  • Consistent: uniform interfaces to a wide variety of resources
  • Pervasive: ability to “plug in” from anywhere
[email protected] foil 63 last update 11/04/23 22:53
CERN
The Grid
The GRID
ubiquitous access to computation
in the sense that the WEB provides
ubiquitous access to information
[email protected] foil 64 last update 11/04/23 22:53
CERN
Globus Architecture – www.globus.org

[Diagram: layered architecture]
  Applications
  High-level Services and Tools – DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status
  Core Services – Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS
  Local Services – LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX

middleware
Uniform application program interface to grid resources
Grid infrastructure primitives
Mapped to local implementations, architectures, policies
[email protected] foil 65 last update 11/04/23 22:53
CERN
The nodes of the Grid are managed by different people, so they have different access and usage policies, and may have different architectures
The geographical distribution means that there cannot be a central status
  status information and resource availability is “published” (remember Condor Classified Ads)
  Grid schedulers can only have an approximate view of resources
The Grid middleware tries to present this as a coherent virtual computing centre
[email protected] foil 66 last update 11/04/23 22:53
CERN
Core Services
Security
Information Service
Resource Management – Grid scheduler, standard resource allocation
Remote Data Access – global namespace, caching, replication
Performance and Status Monitoring
Fault detection
Error Recovery Management
[email protected] foil 67 last update 11/04/23 22:53
CERN
The Promise of Grid Technology
What does the Grid do for you?
You submit your work, and the Grid
  finds convenient places for it to be run
  optimises use of the widely dispersed resources
  organises efficient access to your data – caching, migration, replication
  deals with authentication to the different sites that you will be using
  interfaces to local site resource allocation mechanisms, policies
  runs your jobs
  monitors progress
  recovers from problems
  .. and .. tells you when your work is complete
[email protected] foil 68 last update 11/04/23 22:53
CERN
LHC Computing Model (2001 – evolving)

[Diagram: CERN Tier 0 and CERN Tier 1 at the centre, surrounded by Tier 1 regional centres (Germany, USA, UK, France, Italy, ……….), Tier 2 centres (Lab a, Uni a, Lab b, Uni b, Lab c, Uni x, Uni y, Uni n, Lab m, ……….), Tier 3 physics department resources and desktops, serving the CMS, ATLAS and LHCb physics groups and regional groups – “The LHC Computing Centre” – the opportunity of Grid technology]