Event Reconstruction in STS

I. Kisel, GSI

CBM-RF-JINR Meeting, Dubna, May 21, 2009

21 May 2009, Dubna · Ivan Kisel, GSI · 2/24
Many-core HPC

- Heterogeneous systems of many cores
- Uniform approach to all CPU/GPU families
- Similar programming languages (CUDA, Ct, OpenCL)
- Parallelization of the algorithm (vectors, multi-threads, many-cores)
- On-line event selection
- Mathematical and computational optimization
- Optimization of the detector
[Diagram: hardware families, each marked "? OpenCL ?" as a possible common programming language:
- Gaming (STI: Cell)
- GP CPU (Intel: Larrabee)
- CPU (Intel: XX-cores)
- FPGA (Xilinx: Virtex)
- CPU/GPU (AMD: Fusion)
- GP GPU (Nvidia: Tesla)]
Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.

- Future programming is three-dimensional
- The amount of data is doubling every 18-24 months
- Massive data streams
- The RMS (Recognition, Mining, Synthesis) workload in real time
- Supercomputer-level performance in ordinary servers and PCs
- Applications such as real-time decision-making analysis
The three dimensions of parallel programming:
- Cores
- HW threads
- SIMD width
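Viewed together, the three dimensions multiply; a small sketch with assumed example numbers (the core, thread, and lane counts are illustrative, not taken from the slides):

```python
# Total hardware parallelism is the product of the three dimensions above
# (illustrative numbers, assuming a hypothetical 2009-era quad-core Xeon
# with 2 HW threads per core and 128-bit SSE in single precision).
cores = 4
hw_threads_per_core = 2
simd_lanes = 4  # 128-bit register / 32-bit single-precision value

parallel_ops = cores * hw_threads_per_core * simd_lanes
print(parallel_ops)  # 32 single-precision operations in flight
```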
Cores and HW Threads

- CPU architecture in 19XX: 1 process per CPU
- CPU architecture in 2000: 2 threads per process per CPU
- CPU architecture in 2009
- CPU of your laptop in 2015

[Diagram: a process with two threads, each alternating execute (exe) and read/write (r/w) phases]
Cores and HW threads are seen by the operating system as CPUs:
> cat /proc/cpuinfo

At most half of the threads are executed.
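The per-logical-CPU view that `cat /proc/cpuinfo` gives can also be queried programmatically; a minimal Python sketch (the procfs read is guarded, since it is Linux-only by assumption):

```python
import os

# Number of logical CPUs the OS exposes = cores x HW threads per core
# (one "processor" entry per logical CPU in /proc/cpuinfo).
logical_cpus = os.cpu_count()

# On Linux, the same information can be read from /proc/cpuinfo directly;
# guarded so the sketch also runs on systems without procfs.
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        processor_entries = sum(1 for line in f if line.startswith("processor"))
    print(processor_entries)

print(logical_cpus)
```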
SIMD Width

SIMD = Single Instruction, Multiple Data. SIMD uses vector registers and exploits data-level parallelism.

Register layouts and potential speed-up (or slowdown) relative to scalar code:
- Scalar double precision (64 bits): D
- Vector (SIMD) double precision (128 bits): D1 D2 (2 or 1/2)
- Vector (SIMD) single precision (128 bits): S1-S4 (4 or 1/4)
- Intel AVX (2010) vector single precision (256 bits): S1-S8 (8 or 1/8)
- Intel LRB (2010) vector single precision (512 bits): S1-S16 (16 or 1/16)

Faster or slower?
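The 128-bit single-precision case (four lanes S1-S4) can be imitated with NumPy, which applies one operation across a whole array; this is an illustration of data-level parallelism, not a claim about which registers NumPy actually uses:

```python
import numpy as np

# Four single-precision values packed together, as in one 128-bit register.
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0, 40.0], dtype=np.float32)

# One vector operation processes all four lanes together -- the data-level
# parallelism SIMD exploits (NumPy dispatches to vectorized kernels internally).
c = a + b
print(c)  # [11. 22. 33. 44.]
```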
SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

- Real-time performance on different CPU architectures: speed-up of 100 with 32 threads
- Speed-up of 3.7 on the Xeon 5140 (Woodcrest)
- Real-time performance on different Intel CPU platforms
[Plot: fit time per track (s, log scale, 0.01 to 10.00) vs. number of threads (scalar double, single, then 2, 4, 8, 16, 32) for 2xCell SPE (16), Woodcrest (2), Clovertown (4), Dunnington (6)]
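For orientation, the Kalman-filter measurement update that the fit vectorizes can be written out in a few lines; this is a generic textbook update on a toy 2-parameter state, with hypothetical names (`kf_update`, `H`, `R`), not the actual STS track-fit code:

```python
import numpy as np

def kf_update(x, P, z, H, R):
    """One Kalman-filter measurement update (scalar measurement z).

    x: state estimate (n,), P: covariance (n, n),
    z: measured value, H: measurement row (n,), R: measurement variance.
    """
    S = H @ P @ H + R           # innovation variance (scalar here)
    K = (P @ H) / S             # Kalman gain (n,)
    x = x + K * (z - H @ x)     # filtered state
    P = P - np.outer(K, H @ P)  # filtered covariance, (I - K H) P
    return x, P

# Toy example: estimate a track's (position, slope) from one position hit.
x = np.array([0.0, 0.0])
P = np.eye(2) * 100.0   # large initial uncertainty
H = np.array([1.0, 0.0])  # only the position is measured
x, P = kf_update(x, P, z=2.0, H=H, R=1.0)
print(x[0], P[0, 0])  # position pulled towards 2.0, variance shrunk
```

The SIMD version runs this same arithmetic on several tracks per instruction by replacing each scalar with a vector of values from different tracks.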
Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- it will use the x86 instruction set with Larrabee-specific extensions;
- it will feature cache coherency across all its cores;
- it will include very little specialized graphics hardware.

LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's 32 x86 cores will be based on the much simpler Pentium design;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
- LRB includes explicit cache control instructions;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes one fixed-function graphics hardware unit.
General Purpose Graphics Processing Units (GPGPU)

- Substantial evolution of graphics hardware over the past years
- Remarkable programmability and flexibility
- Reasonably cheap
- New branch of research: GPGPU
NVIDIA Hardware

S. Kalcher, M. Bach

- Streaming multiprocessors
- No-overhead thread switching
- FPUs instead of cache/control
- Complex memory hierarchy
- SIMT: Single Instruction, Multiple Threads

GT200:
- 30 multiprocessors
- 30 DP units
- 8 SP FPUs per MP
- 240 SP units
- 16 000 registers per MP
- 16 kB shared memory per MP
- >= 1 GB main memory
- 1.4 GHz clock
- 933 GFlops SP
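The quoted 933 GFlops is consistent with each SP unit issuing a multiply-add plus a multiply (3 flops) per cycle at the GT200 shader clock of about 1.3 GHz (an assumption here; the slide rounds the clock to 1.4 GHz). A quick check:

```python
# Peak single-precision throughput check for GT200 (assumptions: each SP
# unit dual-issues a multiply-add (2 flops) plus a multiply (1 flop) per
# cycle, and the shader domain runs at ~1.296 GHz).
sp_units = 240
flops_per_cycle = 3  # MAD (2) + MUL (1)
shader_clock_ghz = 1.296

peak_gflops = sp_units * flops_per_cycle * shader_clock_ghz
print(round(peak_gflops))  # ~933
```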
SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

[Chart: CPU 1600 vs. GPU 9100]

18x(2x(Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB)
+
27xTesla S1070 (4x(GT200, 4 GB))
CPU/GPU Programming Frameworks

- Cg, OpenGL Shading Language, DirectX
  - Designed to write shaders
  - Require the problem to be expressed graphically
- AMD Brook
  - Pure stream computing
  - Not hardware-specific
- AMD CAL (Compute Abstraction Layer)
  - Generic use of the hardware at assembler level
- NVIDIA CUDA (Compute Unified Device Architecture)
  - Defines the hardware platform
  - Generic programming
  - Extension to the C language
  - Explicit memory management
  - Programming on the thread level
- Intel Ct (C for Throughput)
  - Extension to the C language
  - Intel CPU/GPU specific
  - SIMD exploitation for automatic parallelism
- OpenCL (Open Computing Language)
  - Open standard for generic programming
  - Extension to the C language
  - Supposed to work on any hardware
  - Specific hardware capabilities used via extensions
Cellular Automaton Track Finder

[Figure: track-finding illustration; labels 500, 200, 10]
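The CA track finder's core idea — build "cells" linking hits on neighbouring stations, then let each cell's counter grow from compatible upstream neighbours so the longest chains emerge as track candidates — can be sketched on a toy 1-D geometry; the hit positions and the tolerance `TOL` below are invented for illustration:

```python
# Toy cellular-automaton track-finder sketch (hypothetical 1-D geometry):
# hits[station] = list of x positions (two straight tracks + one noise hit).
hits = [
    [0.0, 5.0],
    [1.0, 5.0, 9.0],  # 9.0 is noise
    [2.0, 5.0],
    [3.0, 5.0],
]
TOL = 1.5  # assumed position tolerance for linking neighbouring hits

# Build cells: (station, i, j) links hits[st][i] -> hits[st + 1][j].
cells = [(st, i, j)
         for st in range(len(hits) - 1)
         for i, xi in enumerate(hits[st])
         for j, xj in enumerate(hits[st + 1])
         if abs(xj - xi) < TOL]

# CA step: a cell's counter = 1 + max counter of compatible cells one
# station upstream (those ending on the hit where this one starts).
counter = {}
for cell in sorted(cells):  # stations in increasing order
    st, i, j = cell
    upstream = [counter[c] for c in cells if c[0] == st - 1 and c[2] == i]
    counter[cell] = 1 + max(upstream, default=0)

# The maximum counter is the longest chain length: 3 cells span 4 stations.
print(max(counter.values()))  # 3
```

The noise hit at 9.0 never enters a cell, so it cannot contaminate a track candidate; that is the filtering effect the CA relies on.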
L1 CA Track Finder: Efficiency

Track category                   Efficiency, %
Reference set (>1 GeV/c)         95.2
All set (>=4 hits, >100 MeV/c)   89.8
Extra set (<1 GeV/c)             78.6
Clone                            2.8
Ghost                            6.6
MC tracks/ev found               672
Speed, s/ev                      0.8

I. Rostovtseva

- Fluctuating magnetic field?
- Too large STS acceptance?
- Too large distance between STS stations?
L1 CA Track Finder: Changes
I. Kulakov
L1 CA Track Finder: Timing
I. Kulakov
Time            old   new (1 thread)   new (2 threads)   new (3 threads)
CPU time [ms]   575   278              321               335
Real time [ms]  576   286              233               238
old: old version (from CBMRoot DEC08); new: new parallelized version.
Statistics: 100 central events. Processor: Pentium D, 3.0 GHz, 2 MB.
R [cm]          10    9     8     7     6     5     4     3     2     1     0.5   -
CPU time [ms]   320   285   254   220   192   171   149   132   123   113   106   96
Real time [ms]  233   213   193   175   154   144   129   120   108   100   94    85
Ref set         0.97  0.97  0.97  0.97  0.97  0.97  0.97  0.97  0.97  0.96  0.96  0.96
All set         0.92  0.92  0.92  0.92  0.92  0.92  0.92  0.92  0.91  0.91  0.91  0.91
Extra           0.81  0.81  0.81  0.81  0.81  0.82  0.82  0.82  0.81  0.80  0.80  0.80
Clone           0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
Ghost           0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
Tracks/event    686   687   687   687   688   688   688   688   687   684   682   682
On-line = Off-line Reconstruction?

- Off-line and on-line reconstruction will and should be parallelized
- Both versions will run on similar many-core systems, or even on the same PC farm
- Both versions will (probably) use the same parallel language(s), such as OpenCL
- Can we use the same code, with some physics cuts applied when running on-line, like L1 CA?
- If the final code is fast, can we think about global on-line event reconstruction and selection?
                     Intel SIMD   Intel MIMD   Intel Ct   NVIDIA CUDA   OpenCL
STS                  +            +            +          +             –
MuCh
RICH
TRD
Your Reco
Open Charm Analysis
Your Analysis
Summary

- Think parallel!
- Parallel programming is the key to the full potential of the Tera-scale platforms
- Data parallelism vs. parallelism of the algorithm
- Stream processing: no branches
- Avoid direct access to main memory; no maps, no look-up tables
- Use the SIMD unit in the near future (many-cores, TF/s, ...)
- Use single-precision floating point where possible
- In critical parts, use double precision if necessary
- Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, ...)
- New parallel languages are appearing: OpenCL, Ct, CUDA
- A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
- Should we start buying them for testing the algorithms now?
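The single- vs. double-precision advice can be made concrete; a small NumPy sketch of what single precision loses (and what it buys):

```python
import numpy as np

# float32 carries ~7 decimal digits, float64 ~16: adding 1e-8 to 1.0
# vanishes entirely in single precision.
single = np.float32(1.0) + np.float32(1e-8)
double = np.float64(1.0) + np.float64(1e-8)

print(single == np.float32(1.0))  # True: the increment was lost
print(double == np.float64(1.0))  # False: still representable

# The payoff of single precision: half the memory and bandwidth, twice the
# SIMD width (e.g. 4 vs. 2 lanes in a 128-bit register).
print(np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize)  # 4 8
```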
Back-up Slides (1-5)
Back-up Slides (1/5)
Back-up Slides (2/5)
Back-up Slides (3/5)
Back-up Slides (4/5)
SIMD is out of consideration (I.K.)
Back-up Slides (5/5)
Tracking Workshop

Please be invited to the Tracking Workshop, 15-17 June 2009 at GSI.