An Overview on Cyclops-64 Architecture · 2007/6/14 SOS11-06-2007.ppt 8 Examples of Type-II Architectures – Intel 80-core Terascale Chip & Larrabee mini-cores chip – IBM 160-core

2007/6/14 SOS11-06-2007.ppt 1

An Overview on Cyclops-64 Architecture- A Status Report on the Programming Model

and Software Infrastructure

Guang R. GaoEndowed Distinguished Professor

Electrical & Computer EngineeringUniversity of Delaware

[email protected]

2007/6/14 SOS11-06-2007.ppt 2

Outline

• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Cyclops-64 Programming Model and

System Software• Future Directions• Summary

2007/6/14 SOS11-06-2007.ppt 3

TIPs of compute power operating on Tera-bytes of dataTIPs of compute power operating on Tera-bytes of data

Transistor Growth in the near futureSource: Keynote talk in CGO & PPoPP 03/14/07 by Jesse Fang from Intel

2007/6/14 SOS11-06-2007.ppt 4

Outline

• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Looking Beyond Cyclops-64• Summary

2007/6/14 SOS11-06-2007.ppt 5

Two Types of Multi-Core Architecture Trends

• Type I: Glue “heavy cores” together with minor changes

• Type II: Explore the parallel architecture design space and searching for most suitable chip architecture models.

2007/6/14 SOS11-06-2007.ppt 6

Multi-Core Type II

• New factors to be considered– Flops are cheap!– Memory per core is small– Cache-coherence is expensive!– On-chip bandwidth can be enormous!– Examples: Cyclops-64, and others

2007/6/14 SOS11-06-2007.ppt 7

Flops are Cheap!

An example to illustrate design tradeoffs:

• If fed from small, local register files:– 3200 GB/s, 10 pJ/op

– < $1/Gflop (60 mW/Gflop)

• If fed from global on-chip memory:– 100 GB/s, 1nJ/op

– ~ $30/Gflop (1W/Gflop)

• If fed from off-chip memory:– 16 GB/s

– ~$200/Gflop (many W/Gflop)

64-bit FP unit

(drawn to scale)

14mm x 14mm chip(130nm and 1GH)

a 64-bit FPU is < 1mm^2 and ~= 50pJCan fit over 200 on a chip.

Curtsey: Numbers are due to Steve Scott [PACT2006 Keynote]

500 FPUs on a chip Is possible![M. Denneau: private communication]

2007/6/14 SOS11-06-2007.ppt 8

Examples of Type-II Architectures

– Intel 80-core Terascale Chip & Larrabeemini-cores chip

– IBM 160-core Cyclops-64 chip– ClearSpeed 96-core CSX chip– Cisco 188-core Metro Chip– Many others are coming

2007/6/14 SOS11-06-2007.ppt 9

Outline

• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary

2007/6/14 SOS11-06-2007.ppt 10

Cyclops-64 (C64) Supercomputer

Cabinet(3 BackPlanes)11.52TFlops/144GB

I−Cache

80Gflops

Chip(80 Processors)

Processor

1GflopsProcessor(2 Threads)

4.7MB SRAM

Intra−chip Network

Board(1 Chip)

80Gflops

1GB DRAM

12 x 8

60KB SRAM

C64 System(96 Cabinets)

1.1Pflops/13.8TB

C64

1GB DRAM

Disk

Other Devices

Chip

TU

FPU

GM

SP

TU

GM

SP

3 x 8

3.84TFlops / 48GB

BackPlane(48 Boards)

3 x 8

2007/6/14 SOS11-06-2007.ppt 11

10,000 Feet High System Overview

Front-endProgramming environment

Back-endC64 computing

engineInterconnection

network

Curtsey from E.T. International Inc

1,000 Feet High System Overview

Front-end cluster

Host network Monitoring nodes

Gig Ethernet

C64 computing engine

Admin nodesLogin nodes

Control network

100 Feet High System Overview

Admin

Job scheduler

Monitor System initialization

+Hardware monitoring

ClusterApplication

development+

execution

ETI provides all system softwarefor the C64 supercomputer:● Boot up the system● Monitor the status of HW &SW● System resource manager● Job scheduler● Toolchain for application

development● Etc.

C64 Applications

+C64

microkerneland libraries

2007/6/14 SOS11-06-2007.ppt 14

C-64 Chip Architecture

On-chip bisection BW = 0.38 TB/s, total BW to 6 neighbors = 48GB/sec

2007/6/14 SOS11-06-2007.ppt 15

TiNy-Thread – The API ofA Cyclops-64

Thread Virtual Machine• Multi-chip multiprocessor extension of the

base C64 ISA.• Runs directly on top of C64 HW

architecture.• Takes advantage of C64 HW features to

achieve high scalability.• Three components: thread model, memory

model and synchronization model.

2007/6/14 SOS11-06-2007.ppt 16

Cyclops-64 Thread ModelTiNy Threads (TNT)

• Software thread maps directly to TU.• Thread execution is non-preemptive.

Sleep on wait but don't preempt; no context switch.• Wakeup thru interrupt or wakeup signal.• Each thread controls a region of SPM, allocated at

boot time.• Leverage familiar POSIX thread programming

interface.• Fast thread creation and reuse. Thread creation: 280

cycles.Thread termination: 60 cycles.Thread reuse: 265 cycles.

2007/6/14 SOS11-06-2007.ppt 17

C-64 Memory Consistency Model

• The On-Chip SRAM is sequential consistency compliant (SCC) [ZhangEtAl05]

• Accesses to scratch-pad memory: no hardware coherence is imposed

• A weak memory consistency model (e.g. LC consistency) is effectively studied as a natural choice for the scratch-pad memories.[GaoSarkar00,SarkarGao04]

2007/6/14 SOS11-06-2007.ppt 18

Cyclops-64 Software Toolchain

C C

ompiler

Binutils

(Assem

bler, Linker, etc.)

Simulation Testbed

LASTLASTFASTFAST

TU

FPU

SRAM

TU SRAM

Intra-Chip Network

Chip

Processor I-Cache

TU

FPU

SRAM

TUSRAM

Intra-Chip Network

Chip

ProcessorI-Cache

Function-AccurateSimulation Toolset

Latency-AccurateSimulation Toolset

C64 Kernel

Newlib TNT Lib A-Switch lib

CNet Lib

Prog. APIs (e.g. SHMEM, etc.)

User A

pplicationRegressionTestsuite

Multi-NodeMulti-

ThreadedBenchmark

Mr.Clops Emulator

2007/6/14 SOS11-06-2007.ppt 19

Outline

• Introduction• A New Era – Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64• Future Directions• Summary

2007/6/14 SOS11-06-2007.ppt 20

Challenges and Opportunities

How to exploit the massive on-chip parallelism and bandwidth to tolerate off-chip memory bandwidth and latency?

TU

64 Regs

SRAMSPMDRAM

16GB/s

1GB

2.5MB16KB

1.92TB/s

load 36 cycle / store 18 cycle

load 20 cycle / store 10 cycle320GB/s

640GB/s

Load 2Store 1

Load 1 store 1

2007/6/14 SOS11-06-2007.ppt 21

C64 System Software Features

• Explicit Segmented Shared Memory Hierarchy (hotel vs. cache)

• Non-preemptive multi-threading model(“nap” vs. “sleep”)

• Exploitation of Hardware fine-grain synchronization support

• Plenty of thread units (e.g. helper threads)

2007/6/14 SOS11-06-2007.ppt 22

A Report on Early Experience on C64

• Case Study I: Programming Kernels:– Monte-Carlo– Matrix Multiply– FFT– LU Decomposition– Dynamic programming– SCCA2

• Case Study II: Mapping OpenMP on C64• Other Codes

2007/6/14 SOS11-06-2007.ppt 23

Case Study I: C64 Programming Kernels

• Problem Statement:– How programmable using C64 TNT

programming model ?– What are the set of optimizations that are

effective on C64 ?– What can be learned from this study for C64

compiler writers ?

2007/6/14 SOS11-06-2007.ppt 24

FFT: Performance with Various OptimizationsSpeedup

0

5

10

15

20

25

0 20 40 60 80 100 120

Number of threads

Gflo

ps

Optimizations

0

5

10

15

20

25

1 2 3 4 5 6

Gflo

ps

Data size – 216 double precision 1D FFT

Experiments were conducted on ETI Cyclops-64 Toolchain 1.6.2.

1 – Base parallel version (2-point work unit)

2 – + Using 8-point work unit

3 – + Special approach in the first 4 stages

4 – + Eliminate redundant memory operations (twiddle factors)

5 – + Loop unrolling (bit-reversal permutation)

6 – + Register allocation & Instruction scheduling (manually)

* The experiments were conducted with 128 threads.

ACK: Mike Merrill

To Appear: IPDPS07 Workshop

2007/6/14 SOS11-06-2007.ppt 25

Performance with Optimizations - LU

1 2 4 8 16 32 64 1280

2

4

6

8

10

12

14

16

18

20Execute 1024x1024 LU in SRAM

Number of Threads

Per

form

ance

(GFl

ops)

1 2 3 4 50

2

4

6

8

10

12

14

16

18

20

Performance with Optimizations (128 TUs)

Optimizations

GFl

ops

1 – Base Parallel Version 2 – + Dyn. Repartitioning + Recursion

3 – + Processor Adaptation 4 – + Hardware Barrier

5 – + Reg. Tiling (manually)Work in Progress

Performance of SCCA2 (kernel 4)

#threads C64 SMPs MTA2

4 2917082 5369740 752256

8 5513257 2141457 619357

16 9799661 915617 488894

32 17349325 362390 482681

•reasonable scalability–Scale well with # threads–Linear speedup for #threads < 32

•commodity SMPs has poor performance• Competitive vs. MTA-2

Work In ProgressUnit: TEPS -- Traversed Edges per second

SMPs: 4-way Xeon dual-core, 2MB L2 Cache

2007/6/14 SOS11-06-2007.ppt 27

Outline

• Introduction• Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary

2007/6/14 SOS11-06-2007.ppt 28

Outline

• Introduction• A New Era – Multi-Core Chip Technology• IBM Cyclops-64 Architecture/Software• Programming/Compiling for Cyclops-64 • Future Directions• Summary

2007/6/14 SOS11-06-2007.ppt 29

Summary

• It is the program execution model that is important

• The "software problem" is not merely a problem to be solved by software engineers

I am dismayed by the tendency to split computer scienceeducation into “programming" and "hardware" with so littlebrought in about the way they interact.

- Jack B. Dennis

2007/6/14 SOS11-06-2007.ppt 30

Some Cyclops-64 Publications• Juan del Cuvillo, Weirong Zhu, and Guang R. Gao, Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a many-core System-on-a-chip, 3rd ACM International Conference on Computing Frontiers (CF'06), May 2 - 5, 2006.• Juan del Cuvillo, Weirong Zhu and Guang R. Gao, Towards a Software Infrastructure for the Cyclops-64 Cellular Architecture, 20th International Symposium on High Performance Computing Systems and Applications (HPCS2006), St. John's, Newfoundland and Labrador, Canada, May 14 - 17, 2006. • Ying M. P. Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, Ronny Nitzsche, and Guang R. Gao, A Study of the On-Chip Interconnection Network for the IBM Cyclops-64 Multi-Core Architecture, 20th International Parallel and Distributed Processing Symposium (IPDPS2006), Rhodes Island, Greece, April 25 - 29, 2006.• Yanwei Niu, Ziang Hu, Kenneth E. Barner, Guang R. Gao, Performance Modelling and Optimization of Memory Access on Cellular Computer Architecture Cyclops64. Network and Parallel Computing, IFIP International Conference, Beijing, China, November 30 - December 3, 2005.

2007/6/14 SOS11-06-2007.ppt 31

Some Cyclops-64 Publications (cont’d)• Juan del Cuvillo, Weirong Zhu, Ziang Hu and Guang R. Gao, FAST: a Functionally

Accurate Simulator Toolset for the Cyclops-64 Cellular Architecture, Workshop on Modeling, Benchmarking, and Simulation (MoBS2005), in conjuction with the 32nd Annual International Symposium on Computer Architecture (ISCA2005), Madison, Wisconsin, June 4, 2005.

• Juan del Cuvillo, Weirong Zhu, Ziang Hu and Guang R. Gao, TiNy Threads: a Thread Virtual Machine for the Cyclops64 Cellular Architecture, 5th Workshop on Massively Parallel Processing (WMPP05), in conjuction with the 19th International Parallel and Distributed Processing Symposium (IPDPS2005), April 4-8, 2005 in Denver, Colorado.

• Yuan Zhang, Weirong Zhu, Fei Chen, Ziang Hu, and Guang R. Gao, Sequential Consistency Revisit: the Sufficient Condition and Method to Reason the Consistency Model of a Multiprocessor-on-a-chip Architecture, The IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN2005), February 15-17, 2005, Innsbruck, Austria.

• Yuanwei Niu, Ziang Hu, and Guang Gao, Parallel Reconstruction for Parallel Imaging Space-RIP on Cellular Computer Architecture, in proceedings of The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, November 9-11, 2004, MIT, Cambridge, MA, USA.

2007/6/14 SOS11-06-2007.ppt 32

Acknowledgements

• Our Sponsors• IBM Cyclops-64 Team (Monty Denneau, et. al.)

• ETI Cyclops-64 Team• Members of CAPSL• Other Collaborators• My Host

2007/6/14 SOS11-06-2007.ppt 33

Contributors to Cyclops-64 Project(Note: this is an incomplete list)

• Hirofumi Sakane• Guangming Tan• Wesley Toland• John Tully• Ioannis Venetis• Matthew Wells• Haiping Wu• Liping Xue• Shuxin Yang• Peiheng Zhang• Yingping Zhang• Yuan Zhang• Weirong Zhu

• Fei Chen• Long Chen• Juan del Cuvillo• Brice Dobry• Alban Douillet• Ge Gan• Guang R. Gao• Geoffery Gerfin• Yuhei Hayashi• Ziang Hu• Dimitrij Krepis• Joseph Manzano• Andrew Russo

Documents

An Overview on Cyclops-64 Architecture · 2007/6/14 SOS11-06-2007.ppt 8 Examples of Type-II Architectures – Intel 80-core Terascale Chip & Larrabee mini-cores chip – IBM 160-core