
Page 1

Trigger Software Upgrades

John Baines & Tomasz Bold

Page 2

Introduction

• High Level Trigger challenges:
  – Faster-than-linear scaling of execution times with luminosity, e.g. tracking (see the plot below)
  – Some HLT rejection power is moved to L1:
    • addition of L1Topo in Phase-I
    • addition of L1Track in Phase-II
  – Need to maintain current levels of rejection (otherwise a problem for offline)
  => HLT needs to move closer to offline
• In Phase-II the L1 rate increases to 400 kHz => more computing power needed at the HLT
• But rack space & cooling are limited => need to use computing technologies efficiently

(Plot: EF ID tracking time in a muon RoI)

Page 3

Technologies

• CPU: increased core counts; currently 18 cores (36 threads), e.g. Xeon E5-2600 v3 series
  – Trend to more cores, possibly with lower memory per core
  – Run-2: one job/thread (AthenaMP saves memory), but this may not be sustainable long-term
  => Develop a new framework supporting concurrent execution
  => Ensure algorithms support concurrent execution (thread-safe or can be cloned); a minimal sketch follows this list
• Accelerators:
  – Rapid increase in the power of GPGPUs, e.g. Nvidia K40: 2880 cores, 12 GB memory
  – Increased power & ease of programming of FPGAs
  => Need to monitor & evaluate key technologies
  => Ensure ATLAS code doesn't preclude the use of accelerators
  => Integrate accelerator support into the framework, e.g. OffLoadSvc
  => Ensure the EDM doesn't impose big overheads => flattening of the EDM (xAOD helps)
• Software tools:
  – New compilers & language standards, e.g. support for multi-threading, accelerators etc.
  – Faster libraries (also existing libraries becoming unsupported)
  – New code-optimisation tools: profiling
  => Assess new tools
  => Recommendations, documentation, core help for migration
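To make "thread-safe or can be cloned" concrete, here is a minimal sketch of a concurrency-friendly algorithm; the names (MyAlgorithm, EventContext) are hypothetical illustrations, not the actual Athena/Gaudi interfaces:

```cpp
// Minimal sketch (hypothetical interfaces, not the real Athena/Gaudi API):
// an algorithm is safe for concurrent execution if execute() is const and
// all per-event state lives in the event context, not in data members.
#include <thread>
#include <vector>
#include <cstdio>

struct EventContext { int eventNumber; };

class MyAlgorithm {
public:
  // Configuration is set once, before the event loop starts.
  explicit MyAlgorithm(double ptCut) : m_ptCut(ptCut) {}

  // const execute(): no mutable members are touched, so many threads
  // may call it at once on the same instance.
  bool execute(const EventContext& ctx) const {
    double pt = 10.0 + ctx.eventNumber;   // stand-in for reconstruction
    return pt > m_ptCut;                  // per-event decision
  }

private:
  const double m_ptCut;  // immutable after configuration
};

int main() {
  MyAlgorithm alg{25.0};
  std::vector<std::thread> workers;
  for (int ev = 0; ev < 4; ++ev)
    workers.emplace_back([&alg, ev] {
      bool pass = alg.execute(EventContext{ev});
      std::printf("event %d: %s\n", ev, pass ? "pass" : "reject");
    });
  for (auto& t : workers) t.join();
}
```

An algorithm that cannot be made stateless in this way can instead be cloned, one instance per concurrent event, which is the second option named above.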


Page 4

Concurrent Framework

(Figure: L1 muon RoI)

Some key differences online c.f. offline:
• Don't reconstruct the whole event:
  – because we run at a 100 kHz input rate, we can only afford ~250 ms/event (for a 25k-core farm)
  – the trigger rejects 99 in 100 events
  => use Regions of Interest; a chain terminates when its selection fails (see the sketch below)
• Error handling: algorithm errors force routing of events to the debug stream
• Configuration: from the database rather than python (=> reproducible)
  – 3 integers specify: menu & algorithm parameters, L1 prescales, HLT prescales
=> Need additional framework functionality:
  – Run-1&2: provided by trigger-specific additions to the framework (HLT Steering & HLT navigation)
  – Run-3 goal: functionality provided by the common framework

Key questions:
• How to implement Event Views?
• What extra Scheduler functionality is required?
=> Address through requirements capture (FFReq) and prototyping (see Ben's talk)
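As a minimal sketch of the early-rejection pattern named above (all names here are hypothetical illustrations, not the actual HLT Steering interfaces): a chain is a sequence of selection steps run on a Region of Interest, and it stops at the first failing step, so rejected events cost the minimum work.

```cpp
// Sketch of early rejection: the chain terminates at the first failing
// step, so later (more expensive) steps never run for rejected events.
#include <cmath>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct RoI { double eta, phi; };

struct Step {
  std::string name;
  std::function<bool(const RoI&)> select;  // reconstruction + hypothesis test
};

bool runChain(const std::vector<Step>& chain, const RoI& roi) {
  for (const auto& step : chain) {
    if (!step.select(roi)) {
      std::printf("chain terminated at step '%s'\n", step.name.c_str());
      return false;              // early rejection
    }
  }
  return true;                   // event accepted by this chain
}

int main() {
  // Illustrative muon chain; the step names and cuts are invented.
  std::vector<Step> muonChain = {
    {"L2MuonFast", [](const RoI& r) { return std::abs(r.eta) < 2.4; }},
    {"EFTracking", [](const RoI& r) { return r.phi > 0.0; }},  // fails here
    {"EFMuonHypo", [](const RoI&)   { return true; }},
  };
  runChain(muonChain, RoI{1.1, -0.5});
}
```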

Page 5

HLT Farm

• What will the HLT farm look like in 2020? In 2025?
  – When & how do we narrow the technology options? The choice affects software design as well as farm infrastructure.
  – How do we evaluate the costs & benefits of different technologies?
• Key criteria:
  – Cost: financial, effort
  – Benefit: throughput per rack (events/s)
  – Constraints: cooling per rack, network, ...
• e.g. Important questions for assessing GPU technology:
  – Are GPUs useful? What is the cost? What is the benefit?
  – What is the optimum balance of GPU to CPU?
  – What fraction of code (by CPU time) could realistically be ported to GPU?
  – What fraction of code must be ported to make GPUs cost-effective?
  – What is the overhead imposed by the EDM? How could it be reduced?
• See Dmitry's talk at FFReq on a possible GPU-friendly Identifiable Container
=> Aim to get some answers through a Trigger Demonstrator: see Dmitry's talk

Page 6

GPGPU

Assume 50 HLT racks; max. power 12 kW per rack; usable space 47 U per rack.
• Compare a) CPU-only and b) CPU+GPU systems, where each rack has:
  a) 10 x (2U with 4 motherboards, 8 CPU): 80 CPU; 11 kW; ~40 TFLOPS
  b) 16 x (Supermicro 1027GR-TR2 server): 32 CPU; 32 GPU; ~12 kW
• CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS
• GPU: Nvidia K20: 2496 cores, 13 SMX, 3.5 (1.1) TFLOPS for SP (DP)

Assume fixed cost and fixed power per rack
=> the CPU+GPU solution wins when throughput per CPU increases by more than a factor ~2.5,
i.e. when ~65% of the work (by CPU time) is transferred to GPU.

(Plot: throughput gain vs the speed-up of GPU code relative to CPU code, t(CPU)/t(GPU); break-even marked at a factor 2.5, i.e. a fraction 0.65 of work offloaded. Needs to be redone using results of the demonstrator.)

• A toy cost-benefit analysis has been conducted based on today's technology.
• Done to illustrate the process; there is not enough information to draw any firm conclusions. (A numeric check of the break-even figures is sketched below.)
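The ~2.5 / ~65% break-even pair can be reproduced with a simple Amdahl-style model, assuming (as stated on the backup slides) that the CPU code is serial and waits for GPU completion, and taking the overall speed-up of 12 measured in the GPU tracking demonstrator. The helper gain() and the scan below are illustrative, not part of any analysis code:

```cpp
// Reproduces the toy break-even numbers on this slide. Model: offloading a
// fraction f of the work (measured as CPU time) to a GPU with speed-up s
// raises per-CPU throughput by G(f, s) = 1 / (1 - f + f/s), because the
// CPU is serial and waits for the GPU.
#include <cstdio>

double gain(double f, double s) { return 1.0 / (1.0 - f + f / s); }

int main() {
  // Rack configs from this slide: 80 CPUs (CPU-only) vs 32 CPUs (CPU+GPU),
  // so each remaining CPU must deliver 80/32 = 2.5x the throughput.
  const double required = 80.0 / 32.0;
  const double s = 12.0;  // overall speed-up from the GPU demonstrator

  // Scan for the smallest offloaded fraction that reaches the target.
  for (double f = 0.0; f <= 1.0; f += 0.001) {
    if (gain(f, s) >= required) {
      std::printf("break-even: offload %.1f%% of work (G = %.2f)\n",
                  100.0 * f, gain(f, s));
      break;
    }
  }
}
```

The scan returns f ≈ 65%, matching the figure quoted above; in the limit of a very large GPU speed-up the break-even fraction drops to 1 − 1/2.5 = 60%.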

Page 7

Timescales: Framework, Steering & New Technologies

(Timeline chart spanning 2014 Q3/Q4, LS1, a commissioning run, and Run 3, with three work streams:)
• Framework: requirements capture (FFReq) complete -> design & prototype -> implement core functionality -> extend to full functionality. Milestones: initial framework & HLT components available; framework core functionality complete; complete incl. HLT components & new-tech support.
• New Tech.: evaluate -> implement infrastructure -> exploit new technologies in algorithms. Milestones: narrow h/w choices (e.g. use GPU or not); fix PC architecture.
• Algs & Menus: speed up code, thread-safety, investigate possibilities for internal parallelisation -> implement algorithms in the new framework (prototype with 1 or 2 chains -> simple menu -> full menu complete) -> HLT software commissioning complete -> final software complete.

Page 8

Summary

• For Run 3 we need:
  – a framework supporting concurrent execution of algorithms
  – to make efficient use of computing technology (h/w & s/w)
• Work has started:
  – FFReq & framework demonstrators
  – GPU demonstrator
• Success requires significant developments in core software, reconstruction and the EDM:
  – algorithms must support concurrent execution (thread-safe or able to be cloned)
  – the EDM must become data-oriented
  – maintain and increase the gains made via code optimisation
• Vital for Trigger & Offline to work together on common solutions

Page 9

Additional Material

Page 10

Top CPU Consumers in Run-1

(Pie chart, combined HLT: 60% tracking, 20% calo, 10% muon, 10% other)

Page 11

GPUs

Example of a complete L2 ID tracking chain implemented on GPU (Dmitry Emeliyanov).
Times for a tau RoI of 0.6x0.6, ttbar events at 2x10^34:

| Step             | C++ on 2.4 GHz CPU (ms) | CUDA on Tesla C2050 (ms) | Speed-up CPU/GPU |
| Data prep.       | 27                      | 3                        | 9                |
| Seeding          | 8.3                     | 1.6                      | 5                |
| Seed ext.        | 156                     | 7.8                      | 20               |
| Triplet merging  | 7.4                     | 3.4                      | 2                |
| Clone removal    | 70                      | 6.2                      | n/a: 11          |
| CPU-GPU transfer | n/a                     | 0.1                      | n/a              |
| Total            | 268                     | 22                       | 12               |

Max. speed-up: x26. Overall speed-up t(CPU)/t(GPU): 12.
(Plots: data preparation and L2 tracking times, with annotations x2.4 and x5.)

Page 12

Sharing of GPU resource

• With a balanced load on CPU & GPU, several CPU cores can share one GPU, e.g. a test of L2 ID tracking with 8 CPU cores sharing one Tesla C2050 GPU. (One way such sharing could be organised is sketched below.)

(Plot: blue: tracking running on the CPU; red: most tracking steps on the GPU, with final ambiguity solving on the CPU; annotation x2.4.)
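A minimal sketch of how such sharing could be organised, assuming a mutex-guarded broker serialises access to the device; this is illustrative only, not the mechanism used in the C2050 test:

```cpp
// Sketch of several CPU workers sharing one accelerator (hypothetical code).
// Each worker prepares its own input, then queues a request on the shared
// device; a mutex stands in for the per-device command queue, so the GPU
// stays busy while each CPU core prepares its next request.
#include <cstdio>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

class SharedGpu {
public:
  // Pretend "kernel": sums the input. The lock models exclusive use of
  // the device while one worker's kernel is in flight.
  double run(const std::vector<double>& data) {
    std::lock_guard<std::mutex> lock(m_mutex);
    return std::accumulate(data.begin(), data.end(), 0.0);
  }
private:
  std::mutex m_mutex;
};

int main() {
  SharedGpu gpu;
  std::vector<std::thread> cores;
  for (int core = 0; core < 8; ++core)        // 8 CPU cores, 1 GPU, as in the test
    cores.emplace_back([&gpu, core] {
      std::vector<double> input(1000, core);  // per-core data preparation
      double result = gpu.run(input);         // serialized offload
      std::printf("core %d: result %.0f\n", core, result);
    });
  for (auto& t : cores) t.join();
}
```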

Page 13

Power & Cooling

SDX racks:
• Max. power: 12 kW; usable space: 47 U
• Current power ~300 W per motherboard => max. 40 motherboards per rack
• Compare 2U units: a) 4 motherboards, 8 CPU: 1.1 kW; b) 1 motherboard, 2 CPU with 2 GPU (750 W) or 4 GPU (1.2 kW)

Based on max. power: K20 GPU: 225 W c.f. E5-2697v2 CPU: 130 W (need to measure typical power).

Illustrative farm configurations (50 racks total):

| Configuration                           | Farm nodes | CPUs  | CPU cores (max threads) | GPUs (SMX)     | Required throughput per node (per CPU core) |
| 40 nodes/rack, ~300 W/node              | 2,000      | 4,000 | 48,000 (96,000)         | 0              | 50 Hz (2.1 Hz)                              |
| 10 nodes/rack, 4 GPU/node, ~1200 W/node | 500        | 1,000 | 12,000 (24,000)         | 2,000 (26,000) | 200 Hz (8.3 Hz)                             |
| 16 nodes/rack, 2 GPU/node, ~750 W/node  | 800        | 1,600 | 19,200 (38,400)         | 1,600 (20,800) | 125 Hz (5.2 Hz)                             |

(The x4 and x2.5 annotations are the factors by which per-core throughput must increase for the 4-GPU and 2-GPU configurations relative to the CPU-only row, matching the reduction in CPU cores.) A quick arithmetic check of the throughput column follows below.
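The throughput column is just the 100 kHz HLT input rate (slide 4) divided over the nodes and cores of each configuration; a quick check, hard-coding the numbers from the table above:

```cpp
// Quick check of the "required throughput" column: the 100 kHz HLT input
// rate shared over the nodes/cores of each farm configuration.
#include <cstdio>

int main() {
  const double inputRate = 100000.0;  // Hz, HLT input rate
  struct Config { const char* name; double nodes; double cores; };
  const Config configs[] = {
    {"40 nodes/rack, CPU only", 2000, 48000},
    {"10 nodes/rack, 4 GPU   ",  500, 12000},
    {"16 nodes/rack, 2 GPU   ",  800, 19200},
  };
  for (const auto& c : configs)
    std::printf("%s: %5.0f Hz/node (%.1f Hz/core)\n",
                c.name, inputRate / c.nodes, inputRate / c.cores);
}
```

This reproduces the 50 Hz (2.1 Hz), 200 Hz (8.3 Hz) and 125 Hz (5.2 Hz) entries in the table.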


Page 16

Packaging

Examples:
• 1U: 2 x E5-2600 or E5-2600v2, 3 GPU
• 2U: 2 x E5-2600 or E5-2600v2, 4 GPU

Chassis: Supermicro 1027GR-TR2 or 2027GR-TR2
CPU: Intel E5-2697v2: 12 cores, ~0.5 TFLOPS, ~2.3k CHF
GPU: Nvidia K20: 2496 cores, 13 SMX (192 cores per SMX), 3.5 (1.1) TFLOPS for SP (DP), ~2.4k CHF

Total for a 1027 or 2027 with 2 K20 GPUs: ~15k CHF => 12 CPU cores/GPU
Total for a 2027 with 4 K20 GPUs: ~20k CHF => 6 CPU cores/GPU


Page 18

Summary

• The current limiting factor is cooling: 12 kW/rack
  => adding GPUs means removing CPUs
  => for fixed cooling there would be a factor 2.5 (4) less CPU when adding 2 (4) GPUs
• Financial cost is ~25% (70%) more per 2U with 2 CPU and 2 (4) GPU than for a 2U with 8 CPU
  => for fixed cooling and fixed cost there would be a factor 5-7 less CPU
  => the CPU+GPU solution wins when throughput per CPU increases by more than a factor 5-7, i.e. when 80-85% of the work (by CPU time) is transferred to GPU
• Whether we need 1 or 2 GPUs per CPU depends on the relative CPU & GPU load

Page 19

Increase in throughput per CPU when a GPU is added

(Plot: throughput gain vs the fraction of work offloaded, for several values of the speed-up t(CPU)/t(GPU). Assumes the CPU code is serial and waits for GPU completion; the fraction is defined in terms of execution time on the CPU.) The underlying formula is given below.
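For reference, the plotted curves follow directly from the stated assumptions (serial CPU that waits for the GPU; f the offloaded fraction of CPU time; s = t(CPU)/t(GPU) the speed-up of the ported code):

```latex
% Throughput gain per CPU when a fraction f of the per-event CPU time is
% offloaded to a GPU with speed-up s, and the CPU waits for the GPU:
G(f, s) = \frac{t_{\mathrm{CPU}}}{(1-f)\,t_{\mathrm{CPU}} + f\,t_{\mathrm{CPU}}/s}
        = \frac{1}{1 - f + f/s},
\qquad
\lim_{s \to \infty} G(f, s) = \frac{1}{1-f}.
```

For example, the break-even factor 2.5 on page 6 corresponds to G = 2.5; with s = 12 this gives f ≈ 0.65, matching the 65% quoted there.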

Page 20

(Plot: 6 jobs per GPU)
