Online Event Selection in the CBM Experiment I. Kisel GSI, Darmstadt, Germany RT'09, Beijing May 12, 2009

May 12, 2009, RT'09, Beijing; Ivan Kisel, GSI, Germany

Tracking Challenge in CBM (FAIR/GSI, Germany)

* Future fixed-target heavy-ion experiment
* 10^7 Au+Au collisions/sec
* ~1000 charged particles/collision
* Non-homogeneous magnetic field
* Double-sided strip detectors (85% fake space points)

Track reconstruction in STS/MVD and displaced vertex search required in the first trigger level

Open Charm Event Selection

D± (cτ = 312 μm): D+ → K-π+π+ (9.5%)
D0 (cτ = 123 μm): D0 → K-π+ (3.8%), D0 → K-π+π+π- (7.5%)
Ds± (cτ = 150 μm): Ds+ → K+K-π+ (5.3%)
Λc+ (cτ = 60 μm): Λc+ → pK-π+ (5.0%)

No simple trigger primitive, like high pt, available to tag events of interest. The only selective signature is the detection of the decay vertex.
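Why a displaced vertex is a usable signature can be seen from the mean lab-frame decay length, L = βγ·cτ = (p/m)·cτ. The sketch below is only an illustration of that arithmetic: the function names, the significance cut and its threshold are hypothetical and are not taken from the CBM software.

```cpp
#include <cassert>
#include <cmath>

// Mean decay length in the lab frame, L = (p / m) * c*tau.
// p in GeV/c, m in GeV/c^2, ctauUm in micrometres; result in micrometres.
inline double meanDecayLengthUm(double p, double m, double ctauUm) {
    assert(m > 0.0);
    return (p / m) * ctauUm;
}

// Hypothetical selection: accept a secondary-vertex candidate whose distance
// from the primary vertex exceeds nSigma times the vertex resolution.
inline bool isDisplaced(double distanceUm, double sigmaUm, double nSigma) {
    return distanceUm > nSigma * sigmaUm;
}
```

For example, a D0 (mass about 1.865 GeV/c², a value not given on the slide) with p = 3.73 GeV/c has βγ = 2, so `meanDecayLengthUm(3.73, 1.865, 123.0)` gives about 246 μm.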

First level event selection will be done in a processor farm fed with data from the event building network

Many-core HPC

• Heterogeneous systems of many cores
• Uniform approach to all CPU/GPU families
• Similar programming languages (CUDA, Ct, OpenCL)
• Parallelization of the algorithm (vectors, multi-threads, many-cores)

• On-line event selection
• Mathematical and computational optimization
• Optimization of the detector

[Figure: many-core platforms compared by cores, HW threads and SIMD width; "? OpenCL ?" marked as an open question]

• Gaming, STI: Cell
• GP CPU, Intel: Larrabee
• CPU, Intel: XX-cores
• FPGA, Xilinx: Virtex
• CPU/GPU, AMD: Fusion
• GP GPU, Nvidia: Tesla

Cellular Automaton Track Finder

1 Create triplets

2 Collect tracks

Three types of tracks and three stages of track finding: fast primary tracks / all primary tracks / all tracks

Hits:     30000 / 9000 / 5000

Triplets: 20000 / 10000 / 3000

Tracks:   500 / 200 / 10
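The two stages above (create triplets, collect tracks) can be sketched in a toy model. Everything here is an assumption made for illustration: hits are one-dimensional, tracks are straight lines without a magnetic field, and the names `Hit`, `Triplet`, `createTriplets`, `evolve` and `collectTracks` are hypothetical rather than the CBM implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Hit { int station; double x; };  // toy 1D hit on a numbered station
struct Triplet { int h[3]; int level = 0; int prev = -1; };

// Stage 1: combine hits on three consecutive stations that are collinear
// within `tol` (second difference of x close to zero).
std::vector<Triplet> createTriplets(const std::vector<Hit>& hits, double tol) {
    assert(tol >= 0.0);
    std::vector<Triplet> out;
    const int n = static_cast<int>(hits.size());
    for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b) {
            if (hits[b].station != hits[a].station + 1) continue;
            for (int c = 0; c < n; ++c) {
                if (hits[c].station != hits[b].station + 1) continue;
                const double bend = hits[a].x - 2.0 * hits[b].x + hits[c].x;
                if (std::fabs(bend) > tol) continue;
                Triplet t;
                t.h[0] = a; t.h[1] = b; t.h[2] = c;
                out.push_back(t);
            }
        }
    return out;
}

// Automaton evolution: a triplet's level becomes one more than the highest
// level among its upstream neighbours (triplets sharing two of its hits).
// The loop repeats until no level changes any more.
void evolve(std::vector<Triplet>& t) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < t.size(); ++i)
            for (std::size_t j = 0; j < t.size(); ++j)
                if (t[j].h[1] == t[i].h[0] && t[j].h[2] == t[i].h[1] &&
                    t[j].level + 1 > t[i].level) {
                    t[i].level = t[j].level + 1;
                    t[i].prev = static_cast<int>(j);
                    changed = true;
                }
    }
}

// Stage 2: collect tracks, longest chains first; each hit is used only once.
// A candidate that touches an already used hit is simply dropped.
std::vector<std::vector<int>> collectTracks(const std::vector<Triplet>& t,
                                            std::size_t nHits) {
    std::vector<std::vector<int>> tracks;
    std::vector<bool> used(nHits, false);
    int maxLevel = 0;
    for (const Triplet& x : t) if (x.level > maxLevel) maxLevel = x.level;
    for (int level = maxLevel; level >= 0; --level)
        for (std::size_t i = 0; i < t.size(); ++i) {
            if (t[i].level != level) continue;
            std::vector<int> track{t[i].h[2], t[i].h[1], t[i].h[0]};
            for (int p = t[i].prev; p >= 0; p = t[p].prev)
                track.push_back(t[p].h[0]);  // extend upstream by one hit
            bool ok = true;
            for (int h : track) if (used[h]) ok = false;
            if (!ok) continue;
            for (int h : track) used[h] = true;
            tracks.push_back(std::vector<int>(track.rbegin(), track.rend()));
        }
    return tracks;
}
```

With two straight tracks crossing five stations, the sketch builds three triplets per track and returns two five-hit tracks. The brute-force loops are chosen for brevity; a real-time finder would restrict the search to geometrically compatible hits.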

[Figure labels: Cellular Automaton, Kalman Filter; FPGA, GPU, CPU, ?]

Core of Reconstruction: Kalman Filter based Track Fit

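Detector measurements enter the Kalman filter; the output is the track parameters and their errors (r, C).

Steps: initial estimates for r and C, then a prediction step and a filtering step, each updating the state estimate and the error covariance.

State vector (position, momentum):

r = { x, y, z, p_x, p_y, p_z }

Covariance matrix (σ²_x is the error of x):

C = \begin{pmatrix}
\sigma_x^2 &            &            &                &                & \dots \\
           & \sigma_y^2 &            &                &                &       \\
           &            & \sigma_z^2 &                &                &       \\
           &            &            & \sigma_{p_x}^2 &                &       \\
           &            &            &                & \sigma_{p_y}^2 &       \\
\dots      &            &            &                &                & \sigma_{p_z}^2
\end{pmatrix}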
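The prediction and filtering steps can be written out for a reduced case. The sketch below is not the six-parameter fit of the slide: it assumes a straight-line track with only two parameters (position x and slope tx), a measurement of x on each detector plane, and no material or field. The names `State`, `predict`, `filter` and `fit` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct State {
    double x, tx;          // position and slope at the current z
    double Cxx, Cxt, Ctt;  // symmetric 2x2 covariance matrix
};

// Prediction step: r' = F r and C' = F C F^T with F = [[1, dz], [0, 1]].
inline void predict(State& s, double dz) {
    s.x += s.tx * dz;
    s.Cxx += dz * (2.0 * s.Cxt + dz * s.Ctt);
    s.Cxt += dz * s.Ctt;
}

// Filtering step for a measurement m of x with variance V (H = [1, 0]):
// gain K = C H^T / (H C H^T + V), r += K (m - x), C = (I - K H) C.
inline void filter(State& s, double m, double V) {
    const double S = s.Cxx + V;
    assert(S > 0.0);
    const double Kx = s.Cxx / S, Kt = s.Cxt / S;
    const double res = m - s.x;
    s.x += Kx * res;
    s.tx += Kt * res;
    s.Ctt -= Kt * s.Cxt;  // uses the old Cxt
    s.Cxt -= Kx * s.Cxt;
    s.Cxx -= Kx * s.Cxx;
}

// Fit n measurements m[i] taken at planes z[i], starting from a vague
// initial estimate (zero parameters, very large variances).
inline State fit(const double* z, const double* m, int n, double V) {
    State s{0.0, 0.0, 1e6, 0.0, 1e6};
    double zPrev = z[0];
    for (int i = 0; i < n; ++i) {
        predict(s, z[i] - zPrev);
        filter(s, m[i], V);
        zPrev = z[i];
    }
    return s;
}
```

Fed with points lying exactly on x = 1 + 2z at z = 0, 1, 2, 3, the fit returns a state close to x = 7, tx = 2 at the last plane, with a position variance smaller than the single-hit variance V.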

Code (Part of the Kalman Filter)

    inline void AddMaterial( TrackV &track, Station &st, Fvec_t &qp0 )
    {
      cnst mass2 = 0.1396*0.1396;  // charged pion mass squared, GeV^2

      // track slopes and derived quantities
      Fvec_t tx = track.T[2];
      Fvec_t ty = track.T[3];
      Fvec_t txtx = tx*tx;
      Fvec_t tyty = ty*ty;
      Fvec_t txtx1 = txtx + ONE;
      Fvec_t h = txtx + tyty;
      Fvec_t t = sqrt(txtx1 + tyty);
      Fvec_t h2 = h*h;
      Fvec_t qp0t = qp0*t;

      cnst c1=0.0136, c2=c1*0.038, c3=c2*0.5, c4=-c3/2.0, c5=c3/3.0, c6=-c3/4.0;

      Fvec_t s0 = (c1+c2*st.logRadThick + c3*h + h2*(c4 + c5*h +c6*h2) )*qp0t;
      Fvec_t a = (ONE+mass2*qp0*qp0t)*st.RadThick*s0*s0;

      CovV &C = track.C;

      // add the material contribution to the slope covariances
      C.C22 += txtx1*a;
      C.C32 += tx*ty*a;
      C.C33 += (ONE+tyty)*a;
    }

Use headers to overload +, -, *, / operators --> the source code is unchanged!
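The idea that the source stays unchanged can be illustrated with a portable sketch: one function template, instantiated once with `float` and once with a packed type. The type `F4` and the function `materialTerm` are hypothetical stand-ins; `F4` uses plain loops instead of SIMD registers, and the function reproduces only the arithmetic of the listing above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 4-wide packed float; a production header would wrap a SIMD register.
struct F4 {
    float v[4];
    F4(float a = 0.f) : v{a, a, a, a} {}  // broadcast one scalar to all lanes
    F4(float a, float b, float c, float d) : v{a, b, c, d} {}
};
inline F4 operator+(const F4& a, const F4& b) {
    F4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline F4 operator*(const F4& a, const F4& b) {
    F4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}
inline F4 sqrt(const F4& a) {
    F4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = std::sqrt(a.v[i]);
    return r;
}

// The same source text serves both T = float and T = F4.
template <class T>
T materialTerm(const T& tx, const T& ty, const T& qp0,
               const T& logRadThick, const T& radThick) {
    using std::sqrt;  // scalar case; the F4 overload is found by argument type
    const float k1 = 0.0136f, k2 = k1 * 0.038f, k3 = k2 * 0.5f;
    const T ONE(1.f), mass2(0.1396f * 0.1396f);
    const T c1(k1), c2(k2), c3(k3), c4(-k3 / 2.f), c5(k3 / 3.f), c6(-k3 / 4.f);
    T txtx = tx * tx, tyty = ty * ty;
    T h = txtx + tyty;
    T t = sqrt(txtx + ONE + tyty);
    T h2 = h * h;
    T qp0t = qp0 * t;
    T s0 = (c1 + c2 * logRadThick + c3 * h + h2 * (c4 + c5 * h + c6 * h2)) * qp0t;
    return (ONE + mass2 * qp0 * qp0t) * radThick * s0 * s0;
}
```

Calling `materialTerm<float>` processes one track, while `materialTerm<F4>` processes four tracks at once, one per lane; each lane should agree with the scalar result up to floating-point rounding.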

Header (Intel’s SSE)

    typedef F32vec4 Fvec_t;

    /* Arithmetic Operators */
    friend F32vec4 operator +(const F32vec4 &a, const F32vec4 &b) { return _mm_add_ps(a,b); }
    friend F32vec4 operator -(const F32vec4 &a, const F32vec4 &b) { return _mm_sub_ps(a,b); }
    friend F32vec4 operator *(const F32vec4 &a, const F32vec4 &b) { return _mm_mul_ps(a,b); }
    friend F32vec4 operator /(const F32vec4 &a, const F32vec4 &b) { return _mm_div_ps(a,b); }

    /* Functions */
    friend F32vec4 min( const F32vec4 &a, const F32vec4 &b ){ return _mm_min_ps(a, b); }
    friend F32vec4 max( const F32vec4 &a, const F32vec4 &b ){ return _mm_max_ps(a, b); }

    /* Square Root */
    friend F32vec4 sqrt ( const F32vec4 &a ){ return _mm_sqrt_ps (a); }

    /* Absolute value */
    friend F32vec4 fabs( const F32vec4 &a){ return _mm_and_ps(a, _f32vec4_abs_mask); }

    /* Logical */
    friend F32vec4 operator&( const F32vec4 &a, const F32vec4 &b ){ // mask returned
      return _mm_and_ps(a, b);
    }
    friend F32vec4 operator|( const F32vec4 &a, const F32vec4 &b ){ // mask returned
      return _mm_or_ps(a, b);
    }
    friend F32vec4 operator^( const F32vec4 &a, const F32vec4 &b ){ // mask returned
      return _mm_xor_ps(a, b);
    }
    friend F32vec4 operator!( const F32vec4 &a ){ // mask returned
      return _mm_xor_ps(a, _f32vec4_true);
    }
    friend F32vec4 operator||( const F32vec4 &a, const F32vec4 &b ){ // mask returned
      return _mm_or_ps(a, b);
    }

    /* Comparison */
    friend F32vec4 operator<( const F32vec4 &a, const F32vec4 &b ){ // mask returned
      return _mm_cmplt_ps(a, b);
    }

SIMD instructions
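The "mask returned" comments in the header matter because a SIMD comparison cannot branch per lane: it yields an all-ones or all-zeros bit pattern per lane, which is then combined with logical operations. A minimal self-contained sketch, assuming an x86 compiler with SSE; the wrapper `V4` and the helpers `select` and `clampBelow` are hypothetical and are not part of Intel's `F32vec4` class.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Hypothetical thin wrapper around one SSE register holding four floats.
struct V4 {
    __m128 v;
    V4(__m128 x) : v(x) {}
    explicit V4(float a) : v(_mm_set1_ps(a)) {}  // broadcast
    float lane(int i) const {
        float tmp[4];
        _mm_storeu_ps(tmp, v);
        return tmp[i];
    }
};

inline V4 operator+(const V4& a, const V4& b) { return _mm_add_ps(a.v, b.v); }
inline V4 operator*(const V4& a, const V4& b) { return _mm_mul_ps(a.v, b.v); }

// Comparison returns a per-lane mask: all bits set where a < b, else zero.
inline V4 operator<(const V4& a, const V4& b) { return _mm_cmplt_ps(a.v, b.v); }

// Branch-free selection: take a where the mask is set, b elsewhere.
inline V4 select(const V4& mask, const V4& a, const V4& b) {
    return _mm_or_ps(_mm_and_ps(mask.v, a.v), _mm_andnot_ps(mask.v, b.v));
}

// Example use: raise every lane that lies below `lo` up to `lo`.
inline V4 clampBelow(const V4& x, const V4& lo) {
    return select(x < lo, lo, x);
}
```

For lanes (1, 2, 3, 4) and a lower bound of 2.5, `clampBelow` yields (2.5, 2.5, 3, 4) without any per-lane branch, which is the pattern a vectorized fit needs wherever the scalar code would use `if`.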

Kalman Filter Track Fit on Intel Xeon, AMD Opteron and Cell

Motivated, but not restricted by Cell!

lxg1411@GSI

eh102@KIP

blade11bc4 @IBM

• 2 Intel Xeon Processors with Hyper-Threading enabled and 512 kB cache at 2.66 GHz;
• 2 Dual Core AMD Opteron Processors 265 with 1024 kB cache at 1.8 GHz;
• 2 Cell Broadband Engines with 256 kB local store at 2.4 GHz.

[Figure labels: Intel P4, Cell]

Comp. Phys. Comm. 178 (2008) 374-383

10000x faster!

Performance of the KF Track Fit on CPU/GPU Systems

CBM Progress Report, 2008

Speed-up 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1

Real-time performance on different Intel CPU platforms

Real-time performance on NVIDIA for a single track

[Figure: scaling with cores, HW threads and SIMD width (scalar, double, single -> 2, 4, 8, 16, 32); legend: 2xCell SPE (16), Woodcrest (2), Clovertown (4), Dunnington (6)]

Real-time performance (μs) on different CPU architectures – speed-up 100 with 32 threads

Cellular Automaton Track Finder: Pentium 4

1000x faster!

Summary

• A standalone package for online event selection is ready for investigation
• Cellular Automaton track finder takes 5 ms per minimum bias event
• Kalman Filter track fit is a benchmark for modern CPU/GPU architectures
• SIMDized multi-threaded KF track fit takes 0.1 μs/track on Intel Core i7
• Throughput of 2.2·10^7 tracks/sec is reached on NVIDIA GTX 280

[Figure: platform overview repeated from the Many-core HPC slide (STI Cell, Intel Larrabee, Intel XX-cores, Xilinx Virtex, AMD Fusion, Nvidia Tesla), with "? OpenCL ?" marked as an open question]