Parallel Implementations for Solving Tridiagonal Systems
(Sparse Days’18)
Pedro Valero-Lara, Ph.D. Researcher
27/Sep/2018
Motivation & Introduction
Tridiagonal Systems:
• Accelerate the computation of batches of tridiagonal problems ➔ Simulation of the Human Brain
• Accelerate the computation of one large tridiagonal problem
Motivation:
• Ax=b, where A is a tridiagonal matrix
Thomas Algorithm:
• The optimal algorithm
  ➔ 8n operations in 2n-1 steps
  ➔ Forward & Backward
• Sequential
Thomas (l,d,u,rhs,n)
  // Backward Sweep
  for i = n-1 → 1 do
    factor = u[i] / d[i];
    d[i-1] -= factor × l[i];
    rhs[i-1] -= factor × rhs[i];
  end for
  rhs[0] /= d[0];
  // Forward Sweep
  for i = 1 → n-1 do
    rhs[i] -= l[i] × rhs[i-1];
    rhs[i] /= d[i];
  end for
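The two sweeps of the Thomas algorithm can be sketched in C as follows. This is a minimal illustration, not the author's code: the name `thomas_solve` and the storage convention (for i >= 1, `l[i]` is the entry left of the diagonal in row i and `u[i]` is the entry right of the diagonal in row i-1) are assumptions chosen to match the Hines solver shown later in the deck.

```c
#include <stddef.h>

/* Solve a tridiagonal system in place; on return rhs holds x.
 * Assumed layout (hypothetical, not stated on the slide): for i >= 1,
 * l[i] = A[i][i-1] and u[i] = A[i-1][i]; l[0] and u[0] are unused.
 * d is overwritten. There is no pivoting, so every d[i] must stay
 * nonzero (for example, a diagonally dominant matrix). */
void thomas_solve(const double *l, double *d, const double *u,
                  double *rhs, size_t n)
{
    if (n == 0) return;

    /* Backward sweep: use row i to eliminate u[i] from row i-1. */
    for (size_t i = n - 1; i > 0; --i) {
        double factor = u[i] / d[i];
        d[i - 1]   -= factor * l[i];
        rhs[i - 1] -= factor * rhs[i];
    }
    rhs[0] /= d[0];

    /* Forward sweep: the matrix is now lower bidiagonal. */
    for (size_t i = 1; i < n; ++i) {
        rhs[i] -= l[i] * rhs[i - 1];
        rhs[i] /= d[i];
    }
}
```

Each sweep visits every row once and each iteration depends on the previous one, which is why the method is cheap in operations but sequential within a single system.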
Accelerate the computation of batches of tridiagonal problems
Implementation of cuThomasBatch
Simulation of the Human Brain
Human Brain Project (HBP)
● Computing about 1.5x10¹¹ (150 billion) neurons!!
void hines_solver(double *a, double *b, double *d, double *rhs, int *p, int cell_size)
{
  int i; double factor;
  // backward sweep: eliminate the upper entries, from the last node toward the root
  for(i=cell_size-1; i>0; --i) {
    factor = a[i] / d[i];
    d[p[i]] -= factor * b[i];
    rhs[p[i]] -= factor * rhs[i];
  }
  rhs[0] /= d[0];
  // forward sweep: substitute from the root toward the last node
  for(i=1; i<cell_size; ++i) {
    rhs[i] -= b[i] * rhs[p[i]];
    rhs[i] /= d[i];
  }
}
● Ax=b
  ➔ A is sparse (3 vectors) and symmetric
  ➔ Similar to Tridiagonal System (Thomas algorithm)
● 8×N operations (N = size of the neuron)
● Vector p → branches
  ➔ Jumps in the memory access pattern
  ➔ p stores the “shape” of the neuron
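To make the role of p concrete, here is a small sketch of how the parent vector defines the matrix. It assumes that node i is coupled only to its parent p[i], with p[i] < i, so that `a[i]` is the entry at (p[i], i) and `b[i]` the entry at (i, p[i]). The names `hines_matvec` and `hines_solve` are illustrative; `hines_solve` repeats the routine from the slide so that the block is self-contained.

```c
/* y = A*x for the matrix described by (a, b, d, p):
 * d[i] on the diagonal, a[i] at (p[i], i), b[i] at (i, p[i]).
 * Entry 0 of a, b and p is unused (node 0 is the root). */
void hines_matvec(const double *a, const double *b, const double *d,
                  const int *p, const double *x, double *y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = d[i] * x[i];
    for (int i = 1; i < n; ++i) {
        y[p[i]] += a[i] * x[i];     /* upper entry: row p[i], column i */
        y[i]    += b[i] * x[p[i]];  /* lower entry: row i, column p[i] */
    }
}

/* Same sweeps as the slide; d and rhs are overwritten, rhs returns x. */
void hines_solve(const double *a, const double *b, double *d,
                 double *rhs, const int *p, int n)
{
    for (int i = n - 1; i > 0; --i) {
        double factor = a[i] / d[i];
        d[p[i]]   -= factor * b[i];
        rhs[p[i]] -= factor * rhs[i];
    }
    rhs[0] /= d[0];
    for (int i = 1; i < n; ++i) {
        rhs[i] -= b[i] * rhs[p[i]];
        rhs[i] /= d[i];
    }
}
```

With p = {0, 0, 1, 1} (entry 0 unused), nodes 2 and 3 both hang from node 1, which is a branch. With p[i] = i-1 for every i there are no branches and the routine reduces to the Thomas algorithm on a tridiagonal matrix.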
Implementation of cuThomasBatch
Parallel Tridiagonal Solve on GPU:
• 1 CUDA Block per Tridiagonal System
  ➔ gtsvStridedBatch (cuSparse)
  ➔ Parallel Methods
    • CR, PCR, …
  ➔ Saturate the GPU with a “low” number of Tridiagonal Systems
• 1 CUDA thread per Tridiagonal System
  ➔ cuThomasBatch
  ➔ Thomas Method
  ➔ Modification on data layout
    • Once per simulation
  ➔ Saturate the GPU with a “high” number of Tridiagonal Systems
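The one-thread-per-system mapping and the data layout change can be sketched on the CPU as follows. This is plain C standing in for a CUDA kernel, not the cuThomasBatch or cuSPARSE code: the outer loop over `s` plays the role of the thread index, and the names `idx`, `to_interleaved` and `thomas_batch_interleaved` are hypothetical. The assumed interleaved layout stores element i of every system next to each other, so neighbouring threads would touch neighbouring addresses.

```c
#include <stddef.h>

/* Position of element i of system s in an interleaved batch. */
static size_t idx(size_t i, size_t s, size_t batch)
{
    return i * batch + s;
}

/* Convert from "one system after another" storage to interleaved
 * storage. On the slide this conversion is done once per simulation. */
void to_interleaved(const double *src, double *dst, size_t n, size_t batch)
{
    for (size_t s = 0; s < batch; ++s)
        for (size_t i = 0; i < n; ++i)
            dst[idx(i, s, batch)] = src[s * n + i];
}

/* Thomas solve for every system of the batch; rhs returns the solutions.
 * Assumed convention: l[i] = A[i][i-1], u[i] = A[i-1][i] for i >= 1. */
void thomas_batch_interleaved(const double *l, double *d, const double *u,
                              double *rhs, size_t n, size_t batch)
{
    if (n == 0) return;
    for (size_t s = 0; s < batch; ++s) {   /* one "thread" per system */
        for (size_t i = n - 1; i > 0; --i) {
            double factor = u[idx(i, s, batch)] / d[idx(i, s, batch)];
            d[idx(i - 1, s, batch)]   -= factor * l[idx(i, s, batch)];
            rhs[idx(i - 1, s, batch)] -= factor * rhs[idx(i, s, batch)];
        }
        rhs[idx(0, s, batch)] /= d[idx(0, s, batch)];
        for (size_t i = 1; i < n; ++i) {
            rhs[idx(i, s, batch)] -= l[idx(i, s, batch)] * rhs[idx(i - 1, s, batch)];
            rhs[idx(i, s, batch)] /= d[idx(i, s, batch)];
        }
    }
}
```

Because every system is solved independently and sequentially, no synchronization or temporary buffer is needed; the price is that parallelism comes only from the number of systems in the batch.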
Implementation of cuThomasBatch
Test platform:
• MinoTauro (one K80 node)
  ➔ 1 logical K40
  ➔ 2496 CUDA cores
  ➔ 12 GB GDDR5
  ➔ 240 GB/s
Test case:
• cuThomasBatch vs gtsvStridedBatch (cuSparse)
  ➔ For different system sizes
    • Small (from 64 to 512)
    • Big (from 1024 to 8192)
  ➔ For different numbers of Tridiagonal Systems (Batch count)
    • 256, 2,560, 25,600, 256,000 (small systems)
    • 20, 200, 2,000, 20,000 (big systems)
Implementation of cuThomasBatch
cuThomasBatch vs gtsvStridedBatch (cuSparse)
Conclusions:
• cuThomasBatch needs a “high” Batch count
  ➔ From 2,560 for “small” systems
  ➔ From 200 for “big” systems
Implementation of cuThomasBatch
GPU vs Multicore
Conclusions:
• cuThomasBatch continues scaling even when computing a high Batch count
• gtsvStridedBatch (cuSparse) saturates the GPU with a “low” Batch count
  ➔ From 2,560 for “small” systems
  ➔ From 200 for “big” systems
Implementation of cuThomasBatch
Memory Occupancy
Conclusions:
• gtsvStridedBatch (cuSparse) needs much more memory
  ➔ Temporary buffers to store data of the different levels
• cuThomasBatch does not need temporary buffers
  ➔ 1 CUDA thread per system
• cuThomasBatch is more accurate than gtsvStridedBatch (cuSparse)
  ➔ Thomas method is sequential (error is not propagated)
[Figure: Error]
Implementation of cuThomasBatch
NVIDIA Visual Profiler
• Occupancy (92.6%)
  ➔ No divergence
  ➔ Coalesced memory accesses
• MemBandwidth (140 GB/s)
  ➔ 240 GB/s peak
    • 25% ECC (60 GB/s)
    • 180 GB/s Real
  ➔ 140 GB/s (~80%)
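The bandwidth figures can be checked with a short calculation, assuming the 25% ECC overhead is taken off the 240 GB/s peak. The symbols B_ECC and B_real are introduced here for illustration only.

```latex
\begin{align*}
B_{\text{ECC}}  &= 0.25 \times 240~\text{GB/s} = 60~\text{GB/s} \\
B_{\text{real}} &= 240~\text{GB/s} - 60~\text{GB/s} = 180~\text{GB/s} \\
\frac{140~\text{GB/s}}{B_{\text{real}}} &= \frac{140}{180} \approx 0.78
\end{align*}
```

The measured 140 GB/s is therefore about 78% of the usable bandwidth, which the slide rounds to ~80%.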
cuThomasBatch is part of the NVIDIA cuSparse library
● gtsvInterleavedBatch
Accelerate the computation of one large tridiagonal system
Implementation of dsss_dgtsv@LASs
Implementation of dsss_dgtsv@LASs
Parallel Tridiagonal Solve on Multicore:
• Parallel Methods
  ➔ CR and PCR
  ➔ High number of operations
  ➔ Synchronization between steps
• Our approach
  ➔ Combine PCR and Thomas
  ➔ PCR
    • After step s, PCR has generated 2^s independent systems of size n/2^s
  ➔ Thomas
    • Optimum method in terms of operations (8n)
  ➔ What is the best switch point between PCR and Thomas?
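The combination of PCR and Thomas can be sketched as below. This is a minimal single-threaded illustration under stated assumptions, not the dsss_dgtsv routine from LASs: all function names are hypothetical, the convention is `lower[i] = A[i][i-1]` and `upper[i] = A[i][i+1]` with `lower[0] = upper[n-1] = 0`, and there is no pivoting. After `steps` PCR steps the stride is 2^steps, and the equations with the same index modulo the stride form independent tridiagonal systems that are each solved with Thomas.

```c
#include <stdlib.h>
#include <string.h>

/* One PCR step with stride h: equation i absorbs equations i-h and i+h
 * and afterwards couples x[i-2h], x[i] and x[i+2h]. Output goes to t,
 * laid out as [lower | diag | upper | rhs]. */
static void pcr_step(const double *lower, const double *diag,
                     const double *upper, const double *rhs,
                     double *t, int n, int h)
{
    for (int i = 0; i < n; ++i) {          /* iterations are independent */
        double nl = 0.0, nd = diag[i], nu = 0.0, nr = rhs[i];
        if (i - h >= 0) {
            double alpha = -lower[i] / diag[i - h];
            nl  = alpha * lower[i - h];
            nd += alpha * upper[i - h];
            nr += alpha * rhs[i - h];
        }
        if (i + h < n) {
            double gamma = -upper[i] / diag[i + h];
            nu  = gamma * upper[i + h];
            nd += gamma * lower[i + h];
            nr += gamma * rhs[i + h];
        }
        t[i] = nl; t[n + i] = nd; t[2 * n + i] = nu; t[3 * n + i] = nr;
    }
}

/* Thomas on the subsystem first, first+stride, first+2*stride, ... */
static void thomas_strided(const double *lower, double *diag,
                           const double *upper, double *rhs, double *x,
                           int n, int first, int stride)
{
    int last = first;
    for (int j = first + stride; j < n; j += stride) {
        double w = lower[j] / diag[j - stride];
        diag[j] -= w * upper[j - stride];
        rhs[j]  -= w * rhs[j - stride];
        last = j;
    }
    x[last] = rhs[last] / diag[last];
    for (int j = last - stride; j >= first; j -= stride)
        x[j] = (rhs[j] - upper[j] * x[j + stride]) / diag[j];
}

/* Run `steps` PCR steps, then Thomas on each resulting system.
 * The four input arrays are overwritten. Returns 0 on success. */
int pcr_thomas(double *lower, double *diag, double *upper, double *rhs,
               double *x, int n, int steps)
{
    if (n <= 0) return 0;
    double *t = malloc(4 * (size_t)n * sizeof *t);
    if (t == NULL) return -1;
    int stride = 1;
    for (int s = 0; s < steps && stride < n; ++s) {
        pcr_step(lower, diag, upper, rhs, t, n, stride);
        memcpy(lower, t,         (size_t)n * sizeof *t);
        memcpy(diag,  t + n,     (size_t)n * sizeof *t);
        memcpy(upper, t + 2 * n, (size_t)n * sizeof *t);
        memcpy(rhs,   t + 3 * n, (size_t)n * sizeof *t);
        stride *= 2;
    }
    free(t);
    for (int k = 0; k < stride && k < n; ++k)   /* independent systems */
        thomas_strided(lower, diag, upper, rhs, x, n, k, stride);
    return 0;
}
```

With `steps = 0` this is plain Thomas; each extra step doubles the number of independent systems available to the cores while adding one full pass over the data, which is the trade-off behind the switch point question.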
Implementation of dsss_dgtsv@LASs
Best Switch Point (SP)
• We compute the best SP, in terms of # operations, as the one that minimizes the following expression (theoretical prediction):
  ((12 × n) / #cores) × SP + (8 × (n / 2^SP))
• Test platform
  • One node of MareNostrum IV Supercomputer
  • 2x 24 core Intel Xeon Platinum 8160
• Experimental Results
  • 4 different variants of the PCR+Thomas
  • The best variants are in agreement with the theoretical study
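The theoretical prediction can be evaluated with a few lines of code. The sketch below searches the integer switch points exhaustively; the function names and the `max_sp` bound are assumptions for illustration, and only the cost expression itself comes from the slide.

```c
/* Operation-count model from the slide:
 * cost(SP) = ((12 * n) / cores) * SP + 8 * (n / 2^SP). */
double pcr_thomas_cost(double n, double cores, int sp)
{
    double system_size = n;              /* n / 2^sp, computed by halving */
    for (int s = 0; s < sp; ++s)
        system_size /= 2.0;
    return (12.0 * n / cores) * sp + 8.0 * system_size;
}

/* Switch point in 0..max_sp with the lowest modelled cost.
 * Ties are resolved in favour of the smaller SP. */
int best_switch_point(double n, double cores, int max_sp)
{
    int best = 0;
    double best_cost = pcr_thomas_cost(n, cores, 0);
    for (int sp = 1; sp <= max_sp; ++sp) {
        double cost = pcr_thomas_cost(n, cores, sp);
        if (cost < best_cost) {
            best_cost = cost;
            best = sp;
        }
    }
    return best;
}
```

On a single core the model returns SP = 0, that is, plain Thomas, because every PCR step costs more than it saves. With more cores the PCR term shrinks and the minimum moves to a larger SP.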
Implementation of dsss_dgtsv@LASs
Performance Analysis
• CR, PCR and PCR+Thomas
  ➔ Extrae+Paraver (traces)
• Error
  • PCR+Thomas < Full PCR
• Speedup w.r.t. MKL
  • MKL is sequential and uses pivoting
Conclusions
• Implementation of cuThomasBatch
  ➔ Up to ~2.5x faster than the ref. routine in NVIDIA cuSparse (gtsvStridedBatch)
  ➔ Less memory space required
  ➔ More numerical accuracy
  ➔ Included into NVIDIA cuSparse with the name of gtsvInterleavedBatch
  ➔ Accessible in BSC repo: https://pm.bsc.es/gitlab/run-math/cuThomasBatch
  ➔ Accessible in cuSparse: https://docs.nvidia.com/cuda/cusparse/
• Implementation of dsss_dgtsv@LASs
  ➔ It is possible (and easy) to compute the best “Switch Point”
  ➔ Auto-tuning code
  ➔ Much faster and more accurate numerically than the other parallel variants
    • PCR and CR.
  ➔ Up to ~4x faster than the ref. routine in Intel MKL
  ➔ LASs is not accessible yet, coming soon!
Acknowledgment
• European Flagship project “Human Brain Project”
  ➔ Raül Sirvent
• NVIDIA cuSparse team
  ➔ gtsvInterleavedBatch
  ➔ Lung Sheng Chien, Harun Bayraktar, Alex Fit-Florea
• Announcements
  ➔ PUMPS + AI Summer School
    • Advanced CUDA + AI
    • Wen-Mei Hwu and David Kirk
    • http://pumps.bsc.es
  ➔ Open Positions:
    • https://www.bsc.es/join-us/job-opportunities/