PERFORMANCE ANALYSIS OF 3D FINITE DIFFERENCE COMPUTATIONAL STENCILS ON SEAMICRO FABRIC COMPUTE SYSTEMS
JOSHUA MORA
ABSTRACT
SeaMicro fabric compute systems offer an array of low-power compute nodes interconnected with a 3D torus network fabric (branded Freedom Supercomputer Fabric).
This network topology allows very efficient point-to-point communication in which only neighboring compute nodes are involved.
This communication pattern arises in a wide variety of distributed-memory applications, such as 3D finite difference computational stencils, which appear in many computationally expensive scientific applications (e.g. seismic imaging, computational fluid dynamics).
We present a performance analysis (computation, communication, scalability) of a generic 3D finite difference computational stencil on such a system.
With this analysis we aim to demonstrate the suitability of SeaMicro fabric compute systems for HPC applications that exhibit this communication pattern.
AGENDA
HW overview
‒Chassis, compute/storage cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.
HW OVERVIEW CHASSIS: FRONT AND BACK VIEWS
[Photos: chassis front and back views]
HW OVERVIEW CHASSIS: SIDE VIEW
Total of 4 quadrants x 16 compute cards, plugged in at both sides.
Total of 8 storage cards x 8 drives each, plugged in at the front.
HW OVERVIEW
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
AMD Opteron™ 4365 EE processor, up to 64 GB RAM @ 1333 MHz
[Diagram: compute card with CPU, RAM, PCI chipset, and fabric nodes FB1-FB8]
HW OVERVIEW
AMD OpteronTM 4365EE processor
8 “Piledriver” cores, AVX, FMA3/4
2.0GHz core frequency
Max Turbo core frequency up to 3.1GHz
40W TDP
COMPUTE CARDS: CPU + MEMORY + FABRIC NODES 1-8
[Diagram: processor block with 8 cores (cores 0-7, paired into 4 compute units, each pair sharing an L2 cache), a shared L3 cache, northbridge, HT PHY, non-coherent HT link, and a DRAM controller with 2 memory channels]
HW OVERVIEW
Support for RAID and non RAID
8 HDD 2.5” 7.2k-15k rpm, 500GB-1TB,
Or 8 SSD drives, 80GB-2TB
System can operate without disks
STORAGE CARDS: CPU + MEMORY + FABRIC NODES 1-8 + 8 DISKS
[Diagram: storage card with CPU, RAM, PCI chipset, a SATA controller driving disks 1-8, and fabric nodes FB1-FB8]
HW OVERVIEW
MANAGEMENT CARDS: ETHERNET MODULES TO CONNECT TO OTHER CHASSIS OR EXTERNAL STORAGE
2 x 10Gb Ethernet Module
‒ External ports
‒ 2 Mini SAS
‒ 2 x 10GbE SFP+
‒ External Port Bandwidth:
‒ 20 Gbps Full Duplex
‒ Internal Bandwidth to Fabric:
‒ 32 Gbps Full Duplex.
8 x 1Gb Ethernet Module
‒ External ports
‒ 2 Mini SAS
‒ 8x 1GbE 1000BaseT
‒ External Port Bandwidth:
‒ 8 Gbps Full Duplex
‒ Internal Bandwidth to Fabric:
‒ 32 Gbps Full Duplex.
HW OVERVIEW FABRIC TOPOLOGY: 3D TORUS
3D torus network fabric
8 x 8 x 8 Fabric nodes
Diameter (max hop) 4 + 4 + 4 = 12
Theoretical cross-section bandwidth = 2 (periodic) x 8 x 8 (section) x 2 (bidirectional) x 2.0 Gbps/link = 512 Gb/s
Compute, storage, and management cards are plugged into the network fabric.
Support for hot-plugged compute cards.
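As a back-of-the-envelope check of the diameter figure above, here is a minimal C sketch (illustrative only; the function names are ours, not SeaMicro routing code) of the minimum hop count between two fabric nodes on the 8 x 8 x 8 torus; opposite corners give the 4 + 4 + 4 = 12 hop diameter.

#include <stdio.h>
#include <stdlib.h>

#define DIM 8  /* 8 x 8 x 8 fabric nodes */

/* Minimum hops along one periodic (torus) dimension. */
static int torus_hops_1d(int a, int b)
{
    int d = abs(a - b);
    return d < DIM - d ? d : DIM - d;   /* wrap around if that is shorter */
}

/* Minimum hops between fabric nodes (x0,y0,z0) and (x1,y1,z1). */
static int torus_hops_3d(int x0, int y0, int z0, int x1, int y1, int z1)
{
    return torus_hops_1d(x0, x1) + torus_hops_1d(y0, y1) + torus_hops_1d(z0, z1);
}

int main(void)
{
    /* Opposite corners of the torus: 4 + 4 + 4 = 12 hops (the diameter). */
    printf("diameter = %d hops\n", torus_hops_3d(0, 0, 0, 4, 4, 4));
    return 0;
}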
AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.
SW STACK DESCRIPTION
Overall System Management ‒ Command Line Interface
NOTHING AT ALL CUSTOM FOR INSTALLATION
‒OS support
‒Linux (RH, SLES, CentOS, Ubuntu), Windows®
‒Virtualization
‒VMware, Xen, KVM, HyperV
‒Network SW stack
‒Everything that runs on top of Ethernet HW.
‒File systems
‒Local, shared, parallel.
‒Distributed memory programming
‒MPI, UPC, ..
AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.
MICRO-BENCHMARKS CPU, POWER
Benchmark: HPL, leveraging FMA3/FMA4
Single compute card
‒ 2.0 GHz * 4 CUs * 8 DP FLOPs/clk/CU * 0.83 efficiency = 53 DP GFLOPs/sec per compute card
‒ 40 W TDP processor, 60 W per compute card running HPL
=========================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR01L2L4 40000 100 2 4 795.23 5.366e+01 (83% efficiency)
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0033267 ...... PASSED
=========================================================================
Chassis with 64 compute cards
‒ 2.95 DP TFLOPs/sec per chassis
‒ 72% HPL efficiency per chassis (MPI over Ethernet)
‒ 5.6 kW full chassis running HPL (including power for storage, network fabric and fans).
MICRO-BENCHMARKS MEMORY, POWER
Benchmark STREAM
Single Compute card @ 1333MHz memory frequency
‒ 15GB/s
Function Best Rate MB/s Avg time Min time Max time
Copy: 14647.1 0.181456 0.181333 0.181679
Scale: 15221.7 0.175883 0.175615 0.176168
Add: 14741.2 0.270838 0.270557 0.271005
Triad: 15105.2 0.269939 0.269585 0.270251
‒ Power 15 W idle per card.
‒ Power 30 W stream per card.
STREAM chassis
‒ Chassis with 64 compute cards. 960 GB/s (~ 1TB/s) per chassis.
‒ 4.9kW full chassis running Stream (including power for storage, fabric and fans).
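For reference, the Triad kernel that dominates the STREAM numbers above boils down to the loop below; a minimal OpenMP sketch in C (the array size and the single timed pass are illustrative choices of ours, not the official STREAM source).

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (20 * 1000 * 1000)   /* illustrative size: 3 arrays of ~160 MB each */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    /* First-touch initialization so pages land near the threads that use them. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    /* Triad: c = a + scalar * b  (3 arrays streamed, 24 bytes per iteration). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + scalar * b[i];
    double t1 = omp_get_wtime();

    printf("Triad: %.1f MB/s\n", 3.0 * N * sizeof(double) / (t1 - t0) / 1e6);
    free(a); free(b); free(c);
    return 0;
}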
MICRO-BENCHMARKS NIC TUNING
Ethernet related tuning:
- Ethernet Driver, 8.0.35-NAPI
- InterruptThrottleRate 1,1,1,1,1,1,1,1 at e1000.conf (driver options)
- MTU 9000 (ifconfig)
Balance interrupts from the fabric nodes across different cores.
MPI TCP tuning
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
- -mca btl_tcp_eager_limit 1mb (default is 64kb)
UPC tuning
‒ using UDP instead of MPI+TCP
MICRO-BENCHMARKS NIC TUNING
MPI related tuning:
8 Ethernet networks, one per fabric node across all 64 compute cards.
OpenMPI, with Ethernet, TCP, defaults to use all networks.
Can be restricted with arguments passed to mpirun command or in openmpi.conf file
- -mca btl_tcp_if_include eth0,eth2,…eth6,eth7
Point to Point communications
Latency: 30-36 usec
Bandwidth: linear scaling from 1 to 8 fabric nodes
1 fabric node: 120 MB/s unidirectional, 190 MB/s bidirectional
8 fabric nodes: 960 MB/s unidirectional, 1500 MB/s bidirectional
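The point-to-point numbers above can be reproduced with a standard ping-pong micro-benchmark; a minimal MPI sketch (message size and iteration count are arbitrary choices of ours), launched with the mca options listed above to select which Ethernet interfaces the TCP BTL stripes across.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 20;          /* 1 MB messages */
    const int iters  = 100;
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {                 /* ping ... */
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {          /* ... pong */
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* 2 messages of nbytes move per round trip */
        printf("unidirectional bandwidth ~ %.1f MB/s\n",
               2.0 * iters * nbytes / (t1 - t0) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}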
MICRO-BENCHMARKS NETWORK PERFORMANCE
Point-to-point benchmark setup
Measure the aggregated bandwidth of 1 CPU core over 1, 2, 4, and 8 fabric nodes between any 2 compute cards in the chassis.
[Diagram: two compute cards, each with CPU, RAM, and PCI chipset, connected through fabric nodes FB1-FB8]
MICRO-BENCHMARKS NETWORK PERFORMANCE
[Charts: unidirectional and bidirectional MPI bandwidth (MB/s) vs. message size (1 B to 1 MB) for eth0, eth0-1, eth0-eth3, eth0-eth7. Unidirectional bandwidth peaks at 120 MB/s with 1 fabric node and 960 MB/s with 8; bidirectional bandwidth peaks at 195 MB/s and 1500 MB/s.]
MICRO-BENCHMARKS NETWORK PERFORMANCE
Message rate setup:
Every other core sending messages through each fabric node to another core on another compute card.
4 pairs of MPI processes sending data striped across the 8 fabric nodes until maxing out bandwidth of the fabric.
[Diagram: cores c0, c2, c4, c6 on one compute card paired with cores c0, c2, c4, c6 on another compute card, with messages striped across fabric nodes FB1-FB8]
MICRO-BENCHMARKS NETWORK PERFORMANCE
4KB message rate scalability 1,2,4,8 fabric nodes
Maxing out network bandwidth
[Chart: 4 KB MPI message rate vs. number of fabric nodes per compute card (1-8), for 1, 2, and 4 MPI pairs; peaks of 120K, 160K, and 240K 4 KB messages/second as the network bandwidth is maxed out.]
MICRO-BENCHMARKS NETWORK PERFORMANCE
Allreduce setup, for inner products.
<x,y> = Σ (i = 1 to 64) x_i ∗ y_i
Models well with a binary tree algorithm: ~30 usec * log2(64 cards) = 180 usec
[Diagram: Allreduce = MPI reduce + MPI broadcast]
# compute cards | Elapsed time (usec)
2               | 25.97
4               | 54.41
8               | 82.59
16              | 110.31
32              | 138.66
64 (chassis)    | 170.02
Note: the application described later computes the x_i ∗ y_i inner product as a multithreaded (OpenMP) partial reduction, followed by MPI_Allreduce.
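A minimal sketch of that hybrid inner product: each MPI rank reduces its local portion with an OpenMP reduction, and a single MPI_Allreduce then combines the per-card partial sums (vector length and contents here are illustrative).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

/* Local OpenMP reduction followed by one MPI_Allreduce across all cards. */
static double dot(const float *x, const float *y, long n, MPI_Comm comm)
{
    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
    for (long i = 0; i < n; i++)
        local += (double)x[i] * (double)y[i];

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    long n = 200L * 200L * 200L;             /* one 200^3 sub-domain per core */
    float *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    for (long i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    double r = dot(x, y, n, MPI_COMM_WORLD);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("<x,y> = %g\n", r);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}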
MICRO-BENCHMARKS NETWORK PERFORMANCE
Cross-section bandwidth measurement, sectioned in the Z plane.
Aggregated bandwidth in the X plane: MPI multirail, striping messages across all the fabric nodes within a compute card.
Aggregated in the Y plane: distributed across the plane.
2 pairs (green and purple) cross the Z section without congestion on the links (orange); the links are still not saturated.
8 X planes * 8 Y planes * 4 pairs * 1500 Mbit/s bidirectional per ASIC [measured] = 384,000 Mb/s = 48 GB/s.
Measured 43.5 GB/s (90.6% network bandwidth utilization) using only 1 core per compute card.
[Diagram: the Z section viewed across X planes 0-7 and Y planes 0-7, with the fabric nodes of each compute card and the communicating pairs crossing the section]
MICRO-BENCHMARKS STORAGE PERFORMANCE
Sustained writes (OS caching not leveraged)
1 Vdisk, SATA HDD 7.2k rpms, 64MB cache
For checkpointing:
‒ Iozone sustained writes 45MB/s, 2GB file, 1MB record length.
64 Vdisks concurrently, 1 Vdisk per compute card
‒ Iozone sustained writes 2.88GB/s entire chassis, local file systems.
64 x 2GB files , 1MB record length.
Depending on configuration, the same disks can reach up to 95 MB/s sustained writes per compute card: 95 MB/s x 64 disks = 6 GB/s for the entire chassis.
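A minimal sketch of the per-core checkpoint write that these IOzone numbers model: dumping one 200x200x200 single-precision block (~30 MB) to a file on the local vdisk (path and file naming are illustrative).

#include <stdio.h>
#include <stdlib.h>

#define NX 200
#define NY 200
#define NZ 200

/* Write one 200^3 single-precision field (~30 MB) to a local checkpoint file. */
static int write_checkpoint(const char *path, const float *field)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t n = (size_t)NX * NY * NZ;
    size_t written = fwrite(field, sizeof(float), n, f);
    fclose(f);
    return written == n ? 0 : -1;
}

int main(void)
{
    size_t n = (size_t)NX * NY * NZ;
    float *field = calloc(n, sizeof *field);
    /* e.g. one file per core on the local vdisk (illustrative path) */
    if (write_checkpoint("checkpoint_rank0.bin", field) != 0)
        perror("checkpoint");
    free(field);
    return 0;
}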
AGENDA
HW overview
‒Chassis, compute/storage/management cards, fabric
SW stack description
‒OS, Virtualization, MPI, File system
Micro-benchmarks
‒CPU, memory, network, storage
Application
‒Equations, computation, communication, check-pointing, scalability.
APPLICATION EQUATIONS AND HIGH ORDER SCHEMES
Equations
‒ Navier-Stokes, wave, heat-mass transfer, ...
Discretization of 3D Laplace's equation
∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z² = 0
‒ 8th order, central difference scheme
Coefficients for the 2nd derivative at 8th-order accuracy:
Position:    W4      W3     W2    W1   P        E1   E2    E3     E4
Offset:      −4      −3     −2    −1   0        1    2     3      4
Coefficient: −1/560  8/315  −1/5  8/5  −205/72  8/5  −1/5  8/315  −1/560
APPLICATION DISCRETIZATION
Derived equation for the unknown at position P, x(i,j,k): a 25-point stencil
  W4(i,j,k)∗x(i−4,j,k) + W3(i,j,k)∗x(i−3,j,k) + W2(i,j,k)∗x(i−2,j,k) + W1(i,j,k)∗x(i−1,j,k)
+ E4(i,j,k)∗x(i+4,j,k) + E3(i,j,k)∗x(i+3,j,k) + E2(i,j,k)∗x(i+2,j,k) + E1(i,j,k)∗x(i+1,j,k)
+ S4(i,j,k)∗x(i,j−4,k) + S3(i,j,k)∗x(i,j−3,k) + S2(i,j,k)∗x(i,j−2,k) + S1(i,j,k)∗x(i,j−1,k)
+ N4(i,j,k)∗x(i,j+4,k) + N3(i,j,k)∗x(i,j+3,k) + N2(i,j,k)∗x(i,j+2,k) + N1(i,j,k)∗x(i,j+1,k)
+ B4(i,j,k)∗x(i,j,k−4) + B3(i,j,k)∗x(i,j,k−3) + B2(i,j,k)∗x(i,j,k−2) + B1(i,j,k)∗x(i,j,k−1)
+ T4(i,j,k)∗x(i,j,k+4) + T3(i,j,k)∗x(i,j,k+3) + T2(i,j,k)∗x(i,j,k+2) + T1(i,j,k)∗x(i,j,k+1)
+ P(i,j,k)∗x(i,j,k) = 0
The coefficients express how strongly each point in the vicinity of P is coupled to it.
25 coefficients, 25 multiplies, 25 adds... lots of FMAs per equation.
Linear system of equations to map the domain: A∗x = b,
A, square sparse matrix of coefficients (25 diagonals); x, vector of unknowns; b, vector of boundary conditions.
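A minimal sketch of applying this 25-point stencil at one interior point, written here as the 8th-order Laplacian with constant coefficients per axis (the general case above carries spatially varying W/E/S/N/B/T coefficient arrays; the tiny grid size and indexing macro are illustrative).

#include <stdio.h>

#define N 16                    /* tiny grid purely for illustration */
#define IDX(i, j, k) ((((k) * N) + (j)) * N + (i))

/* 8th-order central-difference coefficients for the second derivative
 * (center, +/-1, +/-2, +/-3, +/-4), to be scaled by 1/h^2. */
static const double c[5] = { -205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0,
                             8.0 / 315.0, -1.0 / 560.0 };

/* Apply the 25-point Laplacian stencil at interior point (i,j,k). */
static double laplacian(const float *x, int i, int j, int k, double inv_h2)
{
    double r = 3.0 * c[0] * x[IDX(i, j, k)];     /* center counted once per axis */
    for (int m = 1; m <= 4; m++) {
        r += c[m] * (x[IDX(i - m, j, k)] + x[IDX(i + m, j, k)]    /* W / E */
                   + x[IDX(i, j - m, k)] + x[IDX(i, j + m, k)]    /* S / N */
                   + x[IDX(i, j, k - m)] + x[IDX(i, j, k + m)]);  /* B / T */
    }
    return r * inv_h2;
}

int main(void)
{
    static float x[N * N * N];
    for (int n = 0; n < N * N * N; n++) x[n] = 1.0f;   /* constant field */
    /* The Laplacian of a constant field is ~0 (up to rounding). */
    printf("laplacian = %g\n", laplacian(x, 8, 8, 8, 1.0));
    return 0;
}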
APPLICATION COMPUTING AMOUNT PER CORE AND PER COMPUTE CARD, ARITHMETIC INTENSITY, COMMUNICATION
Typically several linear systems are coupled, depending on the complexity of the phenomenology (CFD usually no fewer than 5 to 7: U, V, W, P, T, k, e).
Compute card: up to 64 GB, 8 cores, i.e. up to 8 GB/core and up to 1 GB per linear system per core.
25-coefficient matrix (25 vectors) + unknown (x) + right-hand side (b) + residual vector (r) + auxiliary vectors (t) ≈ 30 vectors (in 3D).
1 single-precision (SP) float is 4 bytes.
∛( 1 GB ∗ (1 SP float / 4 B) / 30 vectors ) ≈ 200 points in each direction per core.
Each core can crunch the 8 linear systems one after another on a volume of 200x200x200 points.
Each core exchanges halos (data needed for computation but computed on neighbor cores) of width 4 points with its 6 neighbors (West, East, South, North, Bottom, Top) for a 3D partitioning.
200 x 200 x 4 halo points ∗ (4 B / SP float) = 0.61 MB exchanged with each of the 6 neighbors per linear system at every computation of the 200x200x200 volume.
200x200x200 ∗ (4 B / SP float) = 30 MB to checkpoint; remember that the HDDs have a 64 MB cache.
APPLICATION IMPACT OF HIGH ORDER SCHEMES ON COMPUTATION EFFICIENCY AND COMP.-COMM. RATIO
Advantages when using high order schemes:
‒ Reduction of grid size at higher order (2nd, 4th, 8th) for the same accuracy.
‒ Higher FLOP/byte and FLOP/W ratios at higher order, due to better utilization of vector instructions (implementation dependent; otherwise the stencil is extremely memory bound).
‒ Better network bandwidth utilization due to the larger halo message size at higher order.
‒ Tradeoff: higher communication volume at higher order.
‒ Can leverage multirail (MPI over multiple fabric nodes, as shown in the micro-benchmarks) for neighbor communications.
‒ Larger messages provide more chances to overlap communication with computation.
‒ More tolerant of network latency.
APPLICATION P2P COMMUNICATION WITH NEIGHBORS, HALO EXCHANGE
[Diagram: sub-domain with halo faces exchanged with its West, East, South, North, Bottom, and Top neighbors]
8 cores per compute card.
Multithreaded computation with OpenMP threads.
Threading only in k loops (i,j,k)
In general case, 6 exchanges (gray area) with neighbor compute cards
halo (message size) of 200x200x4
Best HW mapping: 1x8x8 partitions
‒ No partitioning across X fabric nodes, 8 partitions across Y fabric nodes, 8 partitions across Z fabric nodes
Best algorithm mapping: 4x4x4 partitions
‒ Less exchange than 1x8x8: 4 ∗ (N∗N/8) = (4/8)∗N² vs 6 ∗ (N/4 ∗ N/4) = (3/8)∗N²
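A minimal sketch of one halo exchange round with non-blocking MPI, as used for the gray halo faces above. The Cartesian communicator setup, face buffers, and tag scheme are illustrative choices of ours; real code would pack the 200x200x4 faces out of the 3D field arrays before sending.

#include <mpi.h>
#include <stdlib.h>

#define FACE (200 * 200 * 4)   /* 200 x 200 points, halo width 4 */

/* Exchange one halo face with each of the 6 neighbors, paired as
 * (0 West, 1 East), (2 South, 3 North), (4 Bottom, 5 Top).
 * Tag d^1 on the send matches tag d on the neighbor's receive. */
static void exchange_halos(float *send[6], float *recv[6],
                           const int nbr[6], MPI_Comm comm)
{
    MPI_Request req[12];
    for (int d = 0; d < 6; d++) {
        MPI_Irecv(recv[d], FACE, MPI_FLOAT, nbr[d], d,     comm, &req[2 * d]);
        MPI_Isend(send[d], FACE, MPI_FLOAT, nbr[d], d ^ 1, comm, &req[2 * d + 1]);
    }
    /* ...interior stencil computation could overlap here... */
    MPI_Waitall(12, req, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Periodic 3D process grid; MPI picks the factorization of the rank count. */
    int nprocs, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1}, nbr[6];
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
    MPI_Cart_shift(cart, 0, 1, &nbr[0], &nbr[1]);   /* West   / East  */
    MPI_Cart_shift(cart, 1, 1, &nbr[2], &nbr[3]);   /* South  / North */
    MPI_Cart_shift(cart, 2, 1, &nbr[4], &nbr[5]);   /* Bottom / Top   */

    float *send[6], *recv[6];
    for (int d = 0; d < 6; d++) {
        send[d] = calloc(FACE, sizeof(float));
        recv[d] = calloc(FACE, sizeof(float));
    }
    exchange_halos(send, recv, nbr, cart);

    for (int d = 0; d < 6; d++) { free(send[d]); free(recv[d]); }
    MPI_Finalize();
    return 0;
}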
APPLICATION ITERATIVE ALGORITHM
[Flowchart: each core (core 1, core 2, ..., core 512 for a full chassis) solves linear systems 1 through 8 one after another, exchanging halos bidirectionally with the neighbor domains hosted on other cores/processors/compute cards and checkpointing values along the way; an overall convergence check follows: "not yet" loops back to linear system 1, "yes" ends the run.]
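The flowchart condenses into the outer loop below; a sketch with placeholder functions only (solve_system, write_checkpoint, and converged stand in for the kernels described on the other slides).

#include <stdio.h>
#include <stdbool.h>

/* Placeholders standing in for the real kernels described on other slides. */
static void solve_system(int sys)     { printf("solve linear system %d\n", sys + 1); }
static void write_checkpoint(int sys) { (void)sys; /* ~30 MB per core to the local vdisk */ }
static bool converged(void)           { return true; /* global check via MPI_Allreduce */ }

/* Outer iteration executed by every core, as in the flowchart above. */
int main(void)
{
    const int max_outer_iters = 100;
    for (int iter = 0; iter < max_outer_iters; iter++) {
        for (int sys = 0; sys < 8; sys++) {   /* 8 coupled linear systems */
            solve_system(sys);                /* stencil sweeps + halo exchanges */
            write_checkpoint(sys);
        }
        if (converged())                      /* overall convergence? */
            break;
    }
    return 0;
}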
APPLICATION PROGRAMMING PARADIGM AND EXECUTION CONFIGURATION
Compute card with 1 CPU = 1 NUMA node.
No chance for NUMA misses.
Easy to leverage OpenMP within MPI code without having to worry about remote memory accesses.
Hybrid MPI+OpenMP to reduce the communication overhead of MPI over Ethernet.
3 compute units can max out the memory controller bandwidth (plenty of computing capability).
1 compute unit/core can be dedicated to I/O (MPI + check-pointing) to fully overlap with the computation stages.
Single core per CPU for MPI communications:
‒ Aggregating the halo data of all threads to send more data per message.
‒ Leveraging MPI non-blocking communications for halo exchange.
‒ Leveraging all the fabric nodes per compute card to aggregate network bandwidth (0.6 MB/message).
‒ Hybrid reduction for inner products using an Allreduce communication + OpenMP reduction.
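A minimal sketch of this execution configuration: MPI initialized for funneled threading, with the master thread reserved for communication while the remaining OpenMP threads compute; the actual halo packing and the interior/boundary split are only indicated by comments.

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    /* Only the master thread makes MPI calls: MPI_THREAD_FUNNELED is enough. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            /* Master thread: aggregate the halo data of all threads into one
             * buffer per neighbor, post non-blocking sends/receives striped
             * across the fabric nodes, then wait for completion. */
        } else {
            /* Worker threads: compute the interior points that do not depend
             * on halo data, overlapping with the communication above. */
        }
        #pragma omp barrier
        /* After the barrier, all threads update the boundary points that
         * need the freshly received halos. */
    }

    MPI_Finalize();
    return 0;
}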
APPLICATION PERFORMANCE SUMMARY
Strong scaling analysis for 4 billion cells (1600x1600x1600) in single precision (no cheating with weak scaling).
Starting with 8 compute cards (1 Z plane), ~55 GB per card (64 GB available per card), scaling all the way to 64 cards (8 Z planes), ~1 GB per core, for 512 cores in the chassis.
# compute cards | Computation (Mcells/s) | Speed up (wrt 8 cards) | Efficiency (wrt 8 cards) | Halo exchange overhead (wrt total time) | Reduction overhead (wrt total time)
8               | 273                    | 8                      | 100%                     | 5.5%                                    | 6%
16              | 536                    | 15.7                   | 98.1%                    | 5.5%                                    | 7%
32              | 1065                   | 31.2                   | 97.5%                    | 5.7%                                    | 8%
64 (chassis)    | 2048                   | 60.0                   | 93.7%                    | 5.8%                                    | 11%
[Chart: 3DFD speed up on SeaMicro vs. number of compute cards (8 to 64), measured speed up against the ideal line]
Halo exchange: the total volume exchanged does not change as the card count increases, so its communication overhead stays roughly constant.
Reduction: its overhead is expected to increase with card count, as shown in the Allreduce micro-benchmark.
CONCLUSIONS
Proven suitability (i.e. scalability) for 3D finite difference stencil computations, leveraging the latest software programming paradigms (MPI + OpenMP).
‒ This is a proxy for many other High Performance Computing applications with similar computational requirements (e.g. manufacturing, oil and gas, weather...)
System Advantages:
‒High computing density (performance and performance per Watt) in 10U form factor
‒Per compute/storage card
‒Scalability provided through Seamicro fabric
‒High flexibility in compute, network, storage configurations adjusted to your workload requirements as demonstrated in this application.
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.