Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
“A next-generation many-core processor with reliability fault tolerance and adaptive powerreliability, fault tolerance and adaptive power management features optimized for embedded and high performance computingembedded and high performance computing applications.”
Simon McIntosh-Smith, VP of Applications simon@clearspeed [email protected]
ClearSpeed company introduction
• ClearSpeed Federal Systems based in San Jose, CA, founded in 2008
• ClearSpeed Technology HQ in Bristol, UK, founded in 2002
• Deliver the world’s most power efficient, high-performance processors
• Over 100 patents on the core technology• Solutions for defence, embedded systems and high
fperformance computing• Licensed IP to BAE Systems for use in future space-based
tsystems• Partnering with HP, SGI, IBM, and Sun
ClearSpeed’s “Smart SIMD” MTAP processor core
• “Smart SIMD” Multi-Threaded Array Processing:MTAP Processing:– Multi-threading for asynchronous,
overlapped I/O with computeS l bl f P
Peripheral Network
M
System Network
– Scalable array of many Processor Elements (PEs)
– Coarse-grained data parallel
Data Cache
Mono Controller Instruc-
tion Cache
Control and
Debugg p
processing– Supports redundancy and resiliency
Poly Controller
• Programmed in an extended version of ANSI C called Cn:
PE 0
PE 1
PE n-1…
version of ANSI C called C :– Rich expressive semantics – Single “poly” data type modifier
Programmable I/O to DRAM System Network
g p y yp
Smart SIMD Processing Elements (PEs)
M lti l ti it PE• Multiple execution units per PE• Floating point adder 32 & 64-bit
IEEE 754}• Floating point multiplier• Fixed-point MAC 16x16 → 32+64
IEEE 754PEn
P M
ul
P A
dd
MA
C
ALU
}
• Integer ALU with shifter• Load/storeRegister File
128 B t
FP FP M A64 64 64
6464 PEn+1
PEn–1
• Fast inter-PE communication path• Closely coupled, ECC-protected SRAM
128 Bytes
PE SRAMe.g. 6 KBytes
for data• High bandwidth per PE DMA (PIO)
Memory address generator (PIO)
32
128
32
• Per PE address generators• Platform design enables PE variants
PIO Collection & Distribution128
The ClearConnect™ “Network on Chip” system bus
• Scalable, System on Chip (SoC) platform interconnect
CC CC
platform interconnect• Used to connect together all the
major blocks on a chip:NODE NODE
major blocks on a chip:– Multiple MTAP smart SIMD cores– Multiple memory controllers– On-chip memories– System interfaces, e.g. PCI Express
SoCBLOCK
SoCBLOCK
• Unified memory architecture• Distributed arbitration
A B • Scalable bandwidths• Low power design
ClearSpeed’s product range
• Processors– CSX700, CSX600
• Boards– e720, e710, e620, X620
• Systems– CATS-700 1TFLOP in 1U
• SoftwareC il– Compiler
– Libraries– Heterogeneous profilerHeterogeneous profiler– GDB-based debugger
I l d d l MTAP
The CSX700 Processor
• Includes dual MTAP cores:– 96 GFLOPS peak (32 & 64-bit)– 48 GMACS peak (16x16 → 32+64)48 GMACS peak (16x16 → 32 64)– 10 power consumption– 250MHz clock speed
192 P i El t (2 96)– 192 Processing Elements (2x96)– 8 spare PEs for resiliency– ECC on all internal memories
• Dual integrated 64-bit DDR2 memory controllers with ECC
• Integrated PCI Express x16• CCBR chip-to-chip bridge port
2 128 KB t f hi SRAM• 2x128 KBytes of on-chip SRAM• IBM 90nm process• 266 million transistors• 266 million transistors
Reliability, fault tolerance & adaptive power management features
Reliability and fault toleranceReliability and fault tolerance• 8 spare PEs can be kept in reserve• Error Correcting Codes (ECC) on all on chip memories• Error Correcting Codes (ECC) on all on-chip memories• ECC with memory scrubber on all external memories• On die temperature sensors• On-die temperature sensors• Low power and modest clock speed design
P i d l t f l d d ti• Programming model supports graceful degradation
Ad ti tAdaptive power management• Ability for software to modify clock speed dynamically, even
while an application is runningwhile an application is running• On-die temperature sensors allow for thermal ceilings to be
set by softwareset by software• Can also throttle clock speed based on power consumption
CSX700 – Internal Architecture
x96
Powerful software development environment
• Version 3.11 launched in June 08Version 3.11 launched in June 08
• Binary compatible across all ClearSpeed products
• ANSI C-based optimising compiler – Cn
GDB b d d b• GDB-based debugger
• Profiling with heterogeneous, system-wide capabilities
• Libraries (BLAS, RNG, FFT, more…) & High level APIs (CSPX)
• ECLIPSE-based IDE
Eclipse IDE – CSX Debug Perspective
Standard Eclipse graphical debug interface for CSX gprocessor debugging.
CSX processor provides full p phardware debugging of application code.
Provides a seamless view of many processor cores in parallel with their associated state.
Allows full symbolic debug of th Cn lthe Cn language.
Enhanced views for CSX specific informationspecific information.
ClearSpeed profiler for heterogeneous multi-processor systemsHOST/BOARD INTERACTION PROFILINGHOST CODE PROFILING
Advance™ Accelerator BoardHostH t Advance™ Accelerator Board
CSX 600
Pipeline
CSX 600
Pipeline
CPU(s)HostCPU(s)Host
CPU(s)
Advance Accelerator Board
HostCPU(s)
CSX
Pipeline
CSX
Pipeline
CSX PIPELINE PROFILING CSX SYSTEM PROFILING
Defense focus
P t h l ith i l i fProcessor technology with a signal-processing focus• Radar, Sonar
Imaging• Imaging• Target discrimination• Electronic warfareElectronic warfare• Communications intelligence
Engineered for optimal Size, Weight, and Power• Best performance per watt in the industry
Deployable across harsh environments• Air, land, and sea• Commercial, rugged, conduction cooled configurations• Embedded form factors
Embedded/Defense Concept XMC and VXS modulesXMC card• Performance
– 96 GFLOPS, 48 GMACS– 192 GBytes/s peak on-chip bandwidth
8 GBytes/s peak off chip– 8 GBytes/s peak off-chip• High bandwidth via PCI Express I/O
– Up to 2 GBytes/sec via PCIe x2,x4,x8– Up to 4 GBytes/sec via PCIe x16p y
• 15 watts typical, 25 watts max
Vita 41 (VME Switched Serial VXS) Module2 GB DDR2
Vita 41 (VME Switched Serial - VXS) Module• Performance
– 192 GFLOPS, 96 GMACS– 384 GBytes/s peak on-chip bandwidth
2 GB
PCIe x4
PCIe x4
384 GBytes/s peak on chip bandwidth– 16 GBytes/s peak off-chip
• High bandwidth – VME interface (P1 & P2) PCIe x4
DDR2 – ClearConnect CCBR for chip-to-chip– (2) 4x PCIe links off-board (Vita 41.4)
• 30 watts typical, 50 watts max
CONCEPTS, NOT COMMITTED PRODUCTS
Benchmark details
• Power consumption figures are for a single• Power consumption figures are for a single CSX700 running at 0.9V and 42°C
• For 2D FFTs, timing and power consumption measurements include the corner turnmeasurements include the corner turn
• All data always starts and finishes in the off-chip DRAM
• FFTs were run simultaneously on both MTAPFFTs were run simultaneously on both MTAP cores of the CSX700
CSX700 1D FFTs per second
CSX700 1D FFT GFLOPS
CSX700 1D FFT core power
7.4
7.0
7.5
6.5
6.6
6.5
W)
5.95.7
5.5
6.0
re pow
er (W
5.24.95.0
CSX7
00 co
r
128
256
4.5
3 84 0
4.5512
1024
2048
3.6
3.8
3.5
4.0
50MHz 100MHz 150MHz 200MHz 250MHz50MHz 100MHz 150MHz 200MHz 250MHz
Core Clock Speed
CSX700 1D FFT core power GFLOPS per watt
2.6
3.0
2.1
2.42.5
W
1.91.7
2.0
ore GFLOPS/W
1.5
1.7
1.5
CSX7
00 co
128
256
1.21.1
1.0
512
1024
2048
0.7
0.5
50MHz 100MHz 150MHz 200MHz 250MHz
Core Clock Speed
CSX700 2D FFTs per second
CSX700 2D FFT GFLOPS
CSX700 convolutions per second
10 000 00010,000,000
128
256
512
1,362,6371,816,039
2,267,810
1,000,000ond
512
1024
2048
454,215
909,053
, ,
398,708
597,996797,284
995,686
348 092435,204
,000,000
ons pe
r seco
199,524
,
174,218
261,323348,092
114 468152,549
190,808
100,000
1D con
voluti
87,118
38,132
76,330
114,468
32 872
49,28265,734
82,093
1
16,455
32,872
10,000
50MH 100MH 150MH 200MH 250MH50MHz 100MHz 150MHz 200MHz 250MHz
Core Clock Speed
25
CSX700 convolution GFLOPS
22.6
25
18.1 19.8
20
PS
13.6 15.915
ution GFLOP
9.111.9
10
1D con
volu
128
4.5
4 0
7.9
5
256
512
1024
20484.0
0
50MHz 100MHz 150MHz 200MHz 250MHz
2048
50MHz 100MHz 150MHz 200MHz 250MHz
Core Clock Speed
CATS-700 FFT performance
With 12 CSX700 processors the CATS-700 can achieve:• 4.2 million 1024 point 1D FFTs per second (215 GFLOPS)• 26,000 1024x1024 2D FFTs per second (194 GFLOPS)• 25 billion 32-bit inverse square roots• 14 billion 32-bit sines• 12 billion 32-bit exponentials• An aggregate STREAM benchmark of ~80 GBytes/s
CSX700 Transcendental Function Performance
2.5E+09
32‐bit
64‐bit
2.0E+09
econ
d
1.5E+09
tion
s pe
r se
1.0E+09
er of ope
rat
5.0E+08
Num
be
0 0E+000.0E+00
cs_sinp cs_cosp cs_sincos cs_tanp cs_atanp cs_sqrtp cs_isqrtp cs_logp cs_expp