Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
2007/6/14 SOS11-06-2007.ppt 1
An Overview on Cyclops-64 Architecture- A Status Report on the Programming Model
and Software Infrastructure
Guang R. GaoEndowed Distinguished Professor
Electrical & Computer EngineeringUniversity of Delaware
2007/6/14 SOS11-06-2007.ppt 2
Outline
• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Cyclops-64 Programming Model and
System Software• Future Directions• Summary
2007/6/14 SOS11-06-2007.ppt 3
TIPs of compute power operating on Tera-bytes of dataTIPs of compute power operating on Tera-bytes of data
Transistor Growth in the near futureSource: Keynote talk in CGO & PPoPP 03/14/07 by Jesse Fang from Intel
2007/6/14 SOS11-06-2007.ppt 4
Outline
• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Looking Beyond Cyclops-64• Summary
2007/6/14 SOS11-06-2007.ppt 5
Two Types of Multi-Core Architecture Trends
• Type I: Glue “heavy cores” together with minor changes
• Type II: Explore the parallel architecture design space and searching for most suitable chip architecture models.
2007/6/14 SOS11-06-2007.ppt 6
Multi-Core Type II
• New factors to be considered– Flops are cheap!– Memory per core is small– Cache-coherence is expensive!– On-chip bandwidth can be enormous!– Examples: Cyclops-64, and others
2007/6/14 SOS11-06-2007.ppt 7
Flops are Cheap!
An example to illustrate design tradeoffs:
• If fed from small, local register files:– 3200 GB/s, 10 pJ/op
– < $1/Gflop (60 mW/Gflop)
• If fed from global on-chip memory:– 100 GB/s, 1nJ/op
– ~ $30/Gflop (1W/Gflop)
• If fed from off-chip memory:– 16 GB/s
– ~$200/Gflop (many W/Gflop)
64-bit FP unit
(drawn to scale)
14mm x 14mm chip(130nm and 1GH)
a 64-bit FPU is < 1mm^2 and ~= 50pJCan fit over 200 on a chip.
Curtsey: Numbers are due to Steve Scott [PACT2006 Keynote]
500 FPUs on a chip Is possible![M. Denneau: private communication]
2007/6/14 SOS11-06-2007.ppt 8
Examples of Type-II Architectures
– Intel 80-core Terascale Chip & Larrabeemini-cores chip
– IBM 160-core Cyclops-64 chip– ClearSpeed 96-core CSX chip– Cisco 188-core Metro Chip– Many others are coming
2007/6/14 SOS11-06-2007.ppt 9
Outline
• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary
2007/6/14 SOS11-06-2007.ppt 10
Cyclops-64 (C64) Supercomputer
Cabinet(3 BackPlanes)11.52TFlops/144GB
I−Cache
80Gflops
Chip(80 Processors)
Processor
1GflopsProcessor(2 Threads)
4.7MB SRAM
Intra−chip Network
Board(1 Chip)
80Gflops
1GB DRAM
12 x 8
60KB SRAM
C64 System(96 Cabinets)
1.1Pflops/13.8TB
C64
1GB DRAM
Disk
Other Devices
Chip
TU
FPU
GM
SP
TU
GM
SP
3 x 8
3.84TFlops / 48GB
BackPlane(48 Boards)
3 x 8
2007/6/14 SOS11-06-2007.ppt 11
10,000 Feet High System Overview
Front-endProgramming environment
Back-endC64 computing
engineInterconnection
network
Curtsey from E.T. International Inc
1,000 Feet High System Overview
Front-end cluster
Host network Monitoring nodes
Gig Ethernet
C64 computing engine
Admin nodesLogin nodes
Control network
100 Feet High System Overview
Admin
Job scheduler
Monitor System initialization
+Hardware monitoring
ClusterApplication
development+
execution
ETI provides all system softwarefor the C64 supercomputer:● Boot up the system● Monitor the status of HW &SW● System resource manager● Job scheduler● Toolchain for application
development● Etc.
C64 Applications
+C64
microkerneland libraries
2007/6/14 SOS11-06-2007.ppt 14
C-64 Chip Architecture
On-chip bisection BW = 0.38 TB/s, total BW to 6 neighbors = 48GB/sec
2007/6/14 SOS11-06-2007.ppt 15
TiNy-Thread – The API ofA Cyclops-64
Thread Virtual Machine• Multi-chip multiprocessor extension of the
base C64 ISA.• Runs directly on top of C64 HW
architecture.• Takes advantage of C64 HW features to
achieve high scalability.• Three components: thread model, memory
model and synchronization model.
2007/6/14 SOS11-06-2007.ppt 16
Cyclops-64 Thread ModelTiNy Threads (TNT)
• Software thread maps directly to TU.• Thread execution is non-preemptive.
Sleep on wait but don't preempt; no context switch.• Wakeup thru interrupt or wakeup signal.• Each thread controls a region of SPM, allocated at
boot time.• Leverage familiar POSIX thread programming
interface.• Fast thread creation and reuse. Thread creation: 280
cycles.Thread termination: 60 cycles.Thread reuse: 265 cycles.
2007/6/14 SOS11-06-2007.ppt 17
C-64 Memory Consistency Model
• The On-Chip SRAM is sequential consistency compliant (SCC) [ZhangEtAl05]
• Accesses to scratch-pad memory: no hardware coherence is imposed
• A weak memory consistency model (e.g. LC consistency) is effectively studied as a natural choice for the scratch-pad memories.[GaoSarkar00,SarkarGao04]
2007/6/14 SOS11-06-2007.ppt 18
Cyclops-64 Software Toolchain
C C
ompiler
Binutils
(Assem
bler, Linker, etc.)
Simulation Testbed
LASTLASTFASTFAST
TU
FPU
SRAM
TU SRAM
Intra-Chip Network
Chip
Processor I-Cache
TU
FPU
SRAM
TUSRAM
Intra-Chip Network
Chip
ProcessorI-Cache
Function-AccurateSimulation Toolset
Latency-AccurateSimulation Toolset
C64 Kernel
Newlib TNT Lib A-Switch lib
CNet Lib
Prog. APIs (e.g. SHMEM, etc.)
User A
pplicationRegressionTestsuite
Multi-NodeMulti-
ThreadedBenchmark
Mr.Clops Emulator
2007/6/14 SOS11-06-2007.ppt 19
Outline
• Introduction• A New Era – Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64• Future Directions• Summary
2007/6/14 SOS11-06-2007.ppt 20
Challenges and Opportunities
How to exploit the massive on-chip parallelism and bandwidth to tolerate off-chip memory bandwidth and latency?
TU
64 Regs
SRAMSPMDRAM
16GB/s
1GB
2.5MB16KB
1.92TB/s
load 36 cycle / store 18 cycle
load 20 cycle / store 10 cycle320GB/s
640GB/s
Load 2Store 1
Load 1 store 1
2007/6/14 SOS11-06-2007.ppt 21
C64 System Software Features
• Explicit Segmented Shared Memory Hierarchy (hotel vs. cache)
• Non-preemptive multi-threading model(“nap” vs. “sleep”)
• Exploitation of Hardware fine-grain synchronization support
• Plenty of thread units (e.g. helper threads)
2007/6/14 SOS11-06-2007.ppt 22
A Report on Early Experience on C64
• Case Study I: Programming Kernels:– Monte-Carlo– Matrix Multiply– FFT– LU Decomposition– Dynamic programming– SCCA2
• Case Study II: Mapping OpenMP on C64• Other Codes
2007/6/14 SOS11-06-2007.ppt 23
Case Study I: C64 Programming Kernels
• Problem Statement:– How programmable using C64 TNT
programming model ?– What are the set of optimizations that are
effective on C64 ?– What can be learned from this study for C64
compiler writers ?
2007/6/14 SOS11-06-2007.ppt 24
FFT: Performance with Various OptimizationsSpeedup
0
5
10
15
20
25
0 20 40 60 80 100 120
Number of threads
Gflo
ps
Optimizations
0
5
10
15
20
25
1 2 3 4 5 6
Gflo
ps
Data size – 216 double precision 1D FFT
Experiments were conducted on ETI Cyclops-64 Toolchain 1.6.2.
1 – Base parallel version (2-point work unit)
2 – + Using 8-point work unit
3 – + Special approach in the first 4 stages
4 – + Eliminate redundant memory operations (twiddle factors)
5 – + Loop unrolling (bit-reversal permutation)
6 – + Register allocation & Instruction scheduling (manually)
* The experiments were conducted with 128 threads.
ACK: Mike Merrill
To Appear: IPDPS07 Workshop
2007/6/14 SOS11-06-2007.ppt 25
Performance with Optimizations - LU
1 2 4 8 16 32 64 1280
2
4
6
8
10
12
14
16
18
20Execute 1024x1024 LU in SRAM
Number of Threads
Per
form
ance
(GFl
ops)
1 2 3 4 50
2
4
6
8
10
12
14
16
18
20
Performance with Optimizations (128 TUs)
Optimizations
GFl
ops
1 – Base Parallel Version 2 – + Dyn. Repartitioning + Recursion
3 – + Processor Adaptation 4 – + Hardware Barrier
5 – + Reg. Tiling (manually)Work in Progress
Performance of SCCA2 (kernel 4)
#threads C64 SMPs MTA2
4 2917082 5369740 752256
8 5513257 2141457 619357
16 9799661 915617 488894
32 17349325 362390 482681
•reasonable scalability–Scale well with # threads–Linear speedup for #threads < 32
•commodity SMPs has poor performance• Competitive vs. MTA-2
Work In ProgressUnit: TEPS -- Traversed Edges per second
SMPs: 4-way Xeon dual-core, 2MB L2 Cache
2007/6/14 SOS11-06-2007.ppt 27
Outline
• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary
2007/6/14 SOS11-06-2007.ppt 28
Outline
• Introduction• A New Era – Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary
2007/6/14 SOS11-06-2007.ppt 29
Summary
• It is the program execution model that is important
• The "software problem" is not merely a problem to be solved by software engineers
I am dismayed by the tendency to split computer scienceeducation into “programming" and "hardware" with so littlebrought in about the way they interact.
- Jack B. Dennis
2007/6/14 SOS11-06-2007.ppt 30
Some Cyclops-64 Publications• Juan del Cuvillo, Weirong Zhu, and Guang R. Gao, Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a many-core System-on-a-chip, 3rd ACM International Conference on Computing Frontiers (CF'06), May 2 - 5, 2006.• Juan del Cuvillo, Weirong Zhu and Guang R. Gao, Towards a Software Infrastructure for the Cyclops-64 Cellular Architecture, 20th International Symposium on High Performance Computing Systems and Applications (HPCS2006), St. John's, Newfoundland and Labrador, Canada, May 14 - 17, 2006. • Ying M. P. Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, Ronny Nitzsche, and Guang R. Gao, A Study of the On-Chip Interconnection Network for the IBM Cyclops-64 Multi-Core Architecture, 20th International Parallel and Distributed Processing Symposium (IPDPS2006), Rhodes Island, Greece, April 25 - 29, 2006.• Yanwei Niu, Ziang Hu, Kenneth E. Barner, Guang R. Gao, Performance Modelling and Optimization of Memory Access on Cellular Computer Architecture Cyclops64. Network and Parallel Computing, IFIP International Conference, Beijing, China, November 30 - December 3, 2005.
2007/6/14 SOS11-06-2007.ppt 31
Some Cyclops-64 Publications (cont’d)• Juan del Cuvillo, Weirong Zhu, Ziang Hu and Guang R. Gao, FAST: a Functionally
Accurate Simulator Toolset for the Cyclops-64 Cellular Architecture, Workshop on Modeling, Benchmarking, and Simulation (MoBS2005), in conjuction with the 32nd Annual International Symposium on Computer Architecture (ISCA2005), Madison, Wisconsin, June 4, 2005.
• Juan del Cuvillo, Weirong Zhu, Ziang Hu and Guang R. Gao, TiNy Threads: a Thread Virtual Machine for the Cyclops64 Cellular Architecture, 5th Workshop on Massively Parallel Processing (WMPP05), in conjuction with the 19th International Parallel and Distributed Processing Symposium (IPDPS2005), April 4-8, 2005 in Denver, Colorado.
• Yuan Zhang, Weirong Zhu, Fei Chen, Ziang Hu, and Guang R. Gao, Sequential Consistency Revisit: the Sufficient Condition and Method to Reason the Consistency Model of a Multiprocessor-on-a-chip Architecture, The IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN2005), February 15-17, 2005, Innsbruck, Austria.
• Yuanwei Niu, Ziang Hu, and Guang Gao, Parallel Reconstruction for Parallel Imaging Space-RIP on Cellular Computer Architecture, in proceedings of The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, November 9-11, 2004, MIT, Cambridge, MA, USA.
2007/6/14 SOS11-06-2007.ppt 32
Acknowledgements
• Our Sponsors• IBM Cyclops-64 Team (Monty Denneau, et. al.)
• ETI Cyclops-64 Team• Members of CAPSL• Other Collaborators• My Host
2007/6/14 SOS11-06-2007.ppt 33
Contributors to Cyclops-64 Project(Note: this is an incomplete list)
• Hirofumi Sakane• Guangming Tan• Wesley Toland• John Tully• Ioannis Venetis• Matthew Wells• Haiping Wu• Liping Xue• Shuxin Yang• Peiheng Zhang• Yingping Zhang• Yuan Zhang• Weirong Zhu
• Fei Chen• Long Chen• Juan del Cuvillo• Brice Dobry• Alban Douillet• Ge Gan• Guang R. Gao• Geoffery Gerfin• Yuhei Hayashi• Ziang Hu• Dimitrij Krepis• Joseph Manzano• Andrew Russo