1 © AppliedMicro Proprietary & Confidential
Introduction to X-Gene® Technology
February 2016
AppliedMicro X-Gene® Processor Philosophy
• Few workloads are compute bound
– Most are limited by memory capacity, bandwidth, or I/O
– HPC workloads are better served by GPGPU
• Scale-out versus scale-up
– High density
– Performance per Watt
– Performance per $
• Balance
– Strong CPU with an optimized ARMv8 core
– Large memory – adequate memory is not an upsell
– Low power – power efficiency is not an upsell
• Open Source
– Open Source Software
– Open Source Hardware
X-Gene® 1 Processor Fully Integrated Server-on-a-Chip
• 8 Custom ARMv8 64-bit Cores
– Up to 2.4 GHz
– 8MB shared L3 cache
• Integrated Memory Controllers
– 4 channel DDR3-1600
• Integrated Networking
– Dual 10 Gb Ethernet SFP+
– Quad 1 Gb Ethernet (SGMII)
• Integrated Storage
– 6 lanes of Serial-ATA 3
• Integrated I/O Interfaces
– 17 lanes of PCI-Express® gen3
– 5 controllers
• 45 Watt TDP
[Figure: X-Gene® 1 block diagram. Four PMDs, each with two ARM v8 cores (L1 I$ and L1 D$ per core) and a shared L2 cache, attached to the Coherent Network with 8MB Shared L3 and four 72-Bit DDR3 channels. An I/O Bridge connects to the I/O Network: Serial ATA 3, PCI-Express® 3 (x17 / 5 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, I2C. An ARM Cortex M3 with I/D Cache handles System Management and Power Management; PKE Engine.]
In Production
X-Gene® 2 Processor Scale-Out Optimized Server-on-a-Chip
• 8 Custom ARMv8 64-bit Cores
– Up to 2.8 GHz
– Expanded instruction set
– 10% higher performance
• Genome Coherent Clustering
• Integrated Memory Controllers
– 4 channel DDR3-1866
• Integrated Networking
– Dual 10 Gb Ethernet (KR)
– RDMA over Ethernet support
• Integrated Storage
– 6 channel Serial-ATA 3
• Integrated I/O Interfaces
– x8 PCI-Express®
– Full I/O virtualization (SMMU)
• 35 Watt TDP
– 50% higher performance / Watt
[Figure: X-Gene® 2 block diagram. Four PMDs, each with two ARM v8 cores (L1 I$ and L1 D$ per core) and a shared L2 cache, attached to the Coherent Network with 8MB Shared L3 and four 72-Bit DDR3 channels. An I/O Bridge connects to the I/O Network: Serial ATA 3, PCI-Express® 3 (x8 / 3 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, I2C. An ARM Cortex M3 with I/D Cache handles System Management and Power Management; PKE Engine.]
Sampling Now
X-Gene® Processor Performance
Primary Workloads
Web Tier
Web Serving / Proxy | Apache, NGINX, HAProxy
Web Apps / Hosting | Drupal, WordPress, Rails
Web Caching | Memcached, Redis, Squid
Database | MySQL, MongoDB, Cassandra
Cold Storage / Big Data / Data Analytics
Cold Storage | Ceph, GlusterFS, OpenStack Swift
Big Data | Hadoop MapReduce, Spark
Data Analytics | Lucene, Elasticsearch, Hive
HPC
HPC | CPU / GPU combination workloads
AppliedMicro ARMv8 Core Performance Frequency-Independent
[Chart: Dhrystone MIPS / MHz / Core, axis 1.0 to 6.0; bars for Atom C2750, Cortex A-57, Xeon E3 (Haswell – 22nm), X-Gene® 2, X-Gene® 1; data labels 3.4, 4.2, 4.6, 5.0, 6.5]
Core Performance: Must be competitive… but it does not tell the full story
• Up to 40% faster than Intel Atom
• Up to 10% faster than ARM Cortex A-57… but higher frequency
• Up to 80% of the performance of Xeon®… but with large memory and lower power
Processor | Cores | Memory | Power
X-Gene 1 | 8-core | 128 GB | 45 Watts
X-Gene 2 | 8-core | 128 GB | 35 Watts
Xeon E3 | 4-core | 32 GB | 80 Watts
Enterprise Workload Performance Web Server (WRK Benchmark)
Metric | AppliedMicro X-Gene® 2 | Intel Xeon® E5-2630v3
Performance, KRPS (higher is better) | 1038 | 771
Latency, ms (lower is better) | 2.4 | 4.4
Bandwidth, Gbps (higher is better) | 8.5 | 6.3
X-Gene 2 (8c @ 2.4 GHz)
• 4 node 1U / ½ width sled
• 64GB DDR3-1600
• 4 x 10GbE (integrated)
• Wall power: ~190 Watts
Xeon E5-2630v3 (8c/16t @ 2.4 GHz)
• 2P 1U / ½ width sled
• 64GB DDR4-2133
• 4 x 10GbE (NIC)
• Wall power: ~180 Watts
Up to 35% Higher Performance | Lower TCO
Standard CPU benchmarks do not always translate to delivered workload performance
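One rough way to relate the web-server results to the quoted wall power is throughput per Watt. The sketch below assumes that the 1038 and 771 KRPS figures belong to X-Gene 2 and Xeon E5-2630v3 respectively (which matches the "Up to 35% Higher Performance" headline) and uses the approximate wall-power readings from the slide. The function and variable names are illustrative only.

```python
def krps_per_watt(krps: float, wall_watts: float) -> float:
    """Thousands of requests per second delivered per Watt of wall power."""
    return krps / wall_watts

# Figures from the slide; pairing each value with a system is an assumption.
xgene2 = {"krps": 1038, "watts": 190}
xeon = {"krps": 771, "watts": 180}

# Relative gain in raw throughput.
throughput_gain = xgene2["krps"] / xeon["krps"] - 1

# Relative gain once the (approximate) wall power is taken into account.
efficiency_gain = (
    krps_per_watt(xgene2["krps"], xgene2["watts"])
    / krps_per_watt(xeon["krps"], xeon["watts"])
    - 1
)

print(f"throughput gain: {throughput_gain:.1%}")
print(f"throughput-per-Watt gain: {efficiency_gain:.1%}")
```

Because the X-Gene 2 sled draws slightly more wall power in this comparison, the per-Watt advantage comes out smaller than the raw throughput advantage.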
MongoDB Performance with YCSB Real World In-Memory Database Workload
Hardware Topology
1U / 2P Rack Server: Intel™ Xeon® E5-2630v3
• 16C/32T 2.4GHz Turbo/HT
• 64GB DDR4-1866
• 2-port 10GbE Mellanox NIC
• CentOS 7.1
HP Moonshot m400: AppliedMicro X-Gene® CPU
• 8-Core 2.4GHz
• 64GB DDR3-1333
• 10GbE Integrated Ethernet
• RHELSA 7.1
Client: Intel™ Xeon® 2P E5-2630v2
• 12C/24T 2.6GHz Turbo/HT
• 64GB DDR3-1600
• 2-port 10GbE Mellanox NIC
• CentOS 7.1
Switch: 24-port 10GbE Netgear™ Switch
Server Command
• MongoDB version 2.4.9
• compile options: scons mongodb --usev8=false
• Run options: ./mongod --dbpath /root/mongod/data4 --nojournal --logpath=/root/tuan/mongod4_log --port 27021 --logappend
Benchmark Command*
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27017 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27018 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27019 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27020 -threads $threads
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27021 -threads $threads
*YCSB from GitHub at commit 5ab241
[Figure: hardware topology diagram; link labels: 1x10GbE, 1x10GbE, 1x10GbE, 1GbE]
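The five benchmark commands on this slide differ only in the target port, which suggests one YCSB client per mongod instance on consecutive ports. A minimal sketch of generating those command lines is shown below; it assumes five instances starting at port 27017, and the helper name `ycsb_commands` is hypothetical. Only the flags that appear on the slide are used.

```python
import shlex

def ycsb_commands(host: str, first_port: int, instances: int, threads: int) -> list[str]:
    """Build one YCSB workload-B run command per mongod instance."""
    commands = []
    for port in range(first_port, first_port + instances):
        argv = [
            "./bin/ycsb", "run", "mongodb", "-s",
            "-P", "workloads/workloadb",
            "-p", f"mongodb.url=mongodb://{host}:{port}",
            "-threads", str(threads),
        ]
        # shlex.join quotes arguments only where the shell would need it.
        commands.append(shlex.join(argv))
    return commands

# Host and port range as shown on the slide; the thread count is one of the
# tested values (1, 2, 5, 10).
for line in ycsb_commands("10.66.12.206", 27017, 5, threads=10):
    print(line)
```

Each printed line can be launched in its own shell so that all instances are loaded concurrently.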
MongoDB Performance with YCSB Real World In-Memory Database Workload
Single Moonshot m400 Cartridge = ~50% performance of a full 1U 2P Intel Xeon® E5 Haswell server
Throughput, ops/sec (thousands)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 36 | 55
2 | 48 | 100
5 | 89 | 188
10 | 109 | 252
CPU Utilization (%)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 25 | 7
2 | 48 | 14
5 | 90 | 30
10 | 100 | 38
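The "~50%" headline can be checked against the charted throughput values. The sketch below assumes the first data series on the slide (36, 48, 89, 109) is the m400 and the second (55, 100, 188, 252) is the 2P Xeon, following the legend order; variable names are illustrative.

```python
threads = [1, 2, 5, 10]
m400_kops = [36, 48, 89, 109]    # HP Moonshot m400, thousand ops/sec (assumed series order)
xeon_kops = [55, 100, 188, 252]  # 2P E5-2630v3 Haswell, thousand ops/sec

# Fraction of the 2P Xeon throughput delivered by one m400 cartridge.
ratios = {t: m / x for t, m, x in zip(threads, m400_kops, xeon_kops)}

for t, r in ratios.items():
    print(f"{t:>2} threads: m400 delivers {r:.0%} of the 2P Xeon throughput")
```

Under that assumption the ratio sits between roughly 43% and 65% depending on thread count, with the m400 already at 100% CPU utilization at 10 threads while the Xeon is at 38%.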
MongoDB Performance with YCSB Real World In-Memory Database Workload
Rack-level Scalability = 5x the Performance versus full 1U 2P Intel Xeon® E5 Haswell rack
42U Rack: 9 Moonshot m400 chassis/rack versus 40 2P E5-2630v3 Haswell 1U servers/rack
Rack Level Throughput, ops/sec (millions)
Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 15 | 2
2 | 19 | 4
5 | 36 | 8
10 | 44 | 10
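The rack-level figures appear to be the per-unit results scaled by the number of units per 42U rack. The sketch below reproduces that arithmetic for the 10-thread case. The slide gives 9 chassis and 40 servers per rack but not the number of cartridges per chassis; 45 is assumed here (the capacity commonly quoted for the HP Moonshot 1500 chassis), and all names are illustrative.

```python
CARTRIDGES_PER_CHASSIS = 45   # assumption: not stated on the slide
CHASSIS_PER_RACK = 9          # from the slide
XEON_SERVERS_PER_RACK = 40    # from the slide

def rack_mops(kops_per_unit: float, units_per_rack: int) -> float:
    """Scale per-unit throughput (thousand ops/sec) to a rack (million ops/sec)."""
    return kops_per_unit * units_per_rack / 1000.0

m400_units = CARTRIDGES_PER_CHASSIS * CHASSIS_PER_RACK

# Per-unit throughput at 10 threads: 109K ops/sec (m400), 252K ops/sec (2P Xeon).
m400_rack = rack_mops(109, m400_units)
xeon_rack = rack_mops(252, XEON_SERVERS_PER_RACK)
ratio = m400_rack / xeon_rack

print(f"m400 rack: {m400_rack:.1f} M ops/sec")
print(f"Xeon rack: {xeon_rack:.1f} M ops/sec")
print(f"rack-level ratio: {ratio:.1f}x")
```

With 45 cartridges per chassis the results round to the charted 44 and 10 million ops/sec, which is why that value was chosen for the sketch.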
Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload
Hardware Topology
1U / 2P Rack Server: Intel™ Xeon® E5-2630v3
• 16C/32T 2.4GHz Turbo/HT
• 64GB DDR4-1866
• 2-port 10GbE Mellanox NIC
• CentOS 7.1
HP Moonshot m400: AppliedMicro X-Gene® CPU
• 8-Core 2.4GHz
• 64GB DDR3-1333
• 10GbE Integrated Ethernet
• RHELSA 7.1
Client: Intel™ Xeon® 2P E5-2630v2
• 12C/24T 2.6GHz Turbo/HT
• 64GB DDR3-1600
• 2-port 10GbE Mellanox NIC
• CentOS 7.1
Switch: 24-port 10GbE Netgear™ Switch
[Figure: hardware topology diagram; link labels: 1x10GbE, 1x10GbE, 1x10GbE, 1GbE]
Server Command
• PostgreSQL version 9.4.4
• compile options: ./configure
• Run options: su postgres -c '/usr/local/pgsql/bin/postgres -F -D /home/postgres/data -p 5432 &'
PostgreSQL Database Contents
• 32 Warehouses with 100,000 parts inventory/warehouse
• 10 districts/warehouse
• 3000 customers/district
• 1 terminal/district = 1 operator/district to serve 3000 customers
• Total Customers = 32*10*3000 = 960,000
PostgreSQL Database Operations
• New Order = 45%
• Payment = 43%
• Order Status = 4%
• Delivery = 4%
• Stock Level = 4%
Benchmark Command
• BenchmarkSQL version 4.1.0
cd ~/benchmarksql-4.1.0/run
./runBenchmark.sh props.pg
props.pg file content
driver=org.postgresql.Driver
conn=jdbc:postgresql://10.76.191.182:5432/postgres
user=benchmarksql
password=amcc1234
warehouses=32
terminals=8
runTxnsPerTerminal=0
runMins=2
limitTxnsPerMin=0
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
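Before a run it can be useful to sanity-check the properties file against the workload description: the transaction weights should add up to 100%, and the warehouse count should give the stated customer total. The sketch below does this on a subset of the keys shown on the slide; the parser is a simplified stand-in for Java properties handling, and the helper name `parse_props` is hypothetical.

```python
PROPS_TEXT = """\
warehouses=32
terminals=8
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
"""

def parse_props(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

props = parse_props(PROPS_TEXT)

# The five transaction weights describe the mix and should total 100.
total_weight = sum(int(v) for k, v in props.items() if k.endswith("Weight"))

# Scale factors quoted on the slide.
DISTRICTS_PER_WAREHOUSE = 10
CUSTOMERS_PER_DISTRICT = 3000
total_customers = (
    int(props["warehouses"]) * DISTRICTS_PER_WAREHOUSE * CUSTOMERS_PER_DISTRICT
)

print(f"transaction mix total: {total_weight}%")
print(f"total customers: {total_customers:,}")
```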
Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload
Single Moonshot m400 Cartridge = ~90% performance of a full 1U 2P Intel Xeon® E5 Haswell server
Throughput, Transactions Per Minute (thousands)
Terminals (1 terminal = 1 operator serving 3000 customers/district) | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 8 | 4
2 | 15 | 6
4 | 25 | 10
8 | 38 | 18
16 | 52 | 36
32 | 70 | 76
64 | 69 | 120
CPU Utilization (%)
Terminals (1 terminal = 1 operator serving 3000 customers/district) | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 8% | 1%
2 | 17% | 2%
4 | 30% | 3%
8 | 49% | 5%
16 | 70% | 10%
32 | 97% | 18%
64 | 100% | 28%
Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload
Rack Level Scalability = 8X-9X the performance versus full 1U 2P Intel Xeon® E5 Haswell rack
42U Rack: 9 Moonshot m400 chassis/rack versus 40 2P E5-2630v3 Haswell 1U servers/rack
Rack Level Throughput, Transactions Per Minute (thousands)
Terminals (1 terminal = 1 operator serving 3000 customers/district) | HP Moonshot m400 | 2P E5-2630v3 Haswell
1 | 3170 | 146
2 | 6066 | 222
4 | 10106 | 381
8 | 15255 | 724
16 | 21171 | 1458
32 | 28329 | 3023
64 | 28143 | 4783
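The rack-level comparison depends heavily on which terminal count is used, because the two platforms saturate at different points. The sketch below computes the ratio at each terminal count and finds where each series peaks. It assumes the first charted series is the m400 rack and the second is the Xeon rack, following the legend order; names are illustrative.

```python
terminals = [1, 2, 4, 8, 16, 32, 64]
# Rack-level transactions per minute, in thousands (assumed series order).
m400_rack_ktpm = [3170, 6066, 10106, 15255, 21171, 28329, 28143]
xeon_rack_ktpm = [146, 222, 381, 724, 1458, 3023, 4783]

# m400 rack throughput relative to the Xeon rack at each terminal count.
rack_ratio = {t: m / x for t, m, x in zip(terminals, m400_rack_ktpm, xeon_rack_ktpm)}

# Terminal count at which each rack reaches its highest measured throughput.
peak_m400 = terminals[m400_rack_ktpm.index(max(m400_rack_ktpm))]
peak_xeon = terminals[xeon_rack_ktpm.index(max(xeon_rack_ktpm))]

for t, r in rack_ratio.items():
    print(f"{t:>2} terminals: {r:.1f}x")
print(f"m400 rack peaks at {peak_m400} terminals, Xeon rack at {peak_xeon}")
```

Under these assumptions the m400 rack peaks at 32 terminals, where the ratio is a little over 9x, while the Xeon rack is still scaling at 64 terminals and the ratio narrows there.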
X-Gene® Technology in the Data Center
Moor Insights and Strategy Paper “The First Enterprise Class 64-Bit ARMv8 Server: HP Moonshot System’s HP ProLiant m400 Server Cartridge”
• HP ProLiant™ m400 “Moonshot” Cartridge
– Production shipments as of 4Q, 2014
• AppliedMicro X-Gene® Processor
– First production 64-bit ARMv8-based server SoC
– Server-class performance with mobile efficiency
• Web Tier/Caching Solution
– Service providers and commercial internet providers
– Early adopters of ARM servers
– ARM server/mobile software development community
Copyright ©2014 Moor Insights & Strategy All Rights Reserved
Moor Insights and Strategy Whitepaper outlining product details and
TCO analysis: http://www.moorinsightsstrategy.com/?p=4753
35% Lower TCO for Scale-Out Web-Tier / Caching Environments
X-Gene® Technology in the Public Cloud
“This is probably the first example of Moonshot AArch64 running in Europe outside of HP’s development labs, and certainly the first example of generally available Moonshot backed AArch64 instances in an OpenStack public cloud anywhere in the world.”
Dr. Mike Kelly, CEO and founder of DataCentred
[Figure: X-Gene® 1U Server with eight X-Gene processors, each with four DIMMs, connected through an Ethernet Switch; Xeon® 1U Server with two Xeon E5-2660 v3 processors, sixteen DIMMs, an IO Chipset, and a 10G NIC]
Per 1U server | Xeon® 1U Server | X-Gene® 1U Server
No of vCPUs | 32 (Xeon®) | 64 (X-Gene® 2 ARM)
General Purpose Instances (m3.medium, m3.large, m3.xlarge) | 1x | 2x
Memory Optimized Instances (r3.large, r3.xlarge, r3.2xlarge) | 1x | 2x
X-Gene® 2 vCPU: 2x density per 1U server
X-Gene® 2 vs. Intel Xeon®: Rack Level TCO
2x density; 45-50% Instance Cost Reduction
General Purpose Instances | Xeon® Rack | X-Gene® 2 Rack
Rack Power | 13KW | 13KW
Rack Cost | $145K | $145K
No of vCPUs | 1280 | 2560
Instance Cost | 1x | 0.5x
Memory Optimized Instances | Xeon® Rack | X-Gene® 2 Rack
Rack Power | 14KW | 16KW
Rack Cost | $170K | $190K
No of vCPUs | 1280 | 2560
Instance Cost | 1x | 0.55x
X-Gene® 2 ARM Instances at 45 – 50% lower prices than Xeon® Instances
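The relative instance-cost figures can be approximated from the rack numbers on this slide. The sketch below assumes that "instance cost" scales with rack cost divided by vCPU count, ignoring power and other operating costs; the slide does not spell out its cost model, so this is an illustration rather than the method behind the original analysis. Names are illustrative.

```python
# (rack cost in dollars, vCPUs per rack) taken from the slide.
racks = {
    "general_purpose": {"xeon": (145_000, 1280), "xgene2": (145_000, 2560)},
    "memory_optimized": {"xeon": (170_000, 1280), "xgene2": (190_000, 2560)},
}

def cost_per_vcpu(rack_cost: float, vcpus: int) -> float:
    """Rack capital cost attributed to a single vCPU."""
    return rack_cost / vcpus

# X-Gene 2 cost per vCPU relative to Xeon (1.0 means equal cost).
relative = {}
for name, pair in racks.items():
    xeon = cost_per_vcpu(*pair["xeon"])
    xgene2 = cost_per_vcpu(*pair["xgene2"])
    relative[name] = xgene2 / xeon

for name, r in relative.items():
    print(f"{name}: {r:.2f}x the Xeon cost per vCPU ({1 - r:.0%} lower)")
```

Under this simplified model the general-purpose case comes out at 0.5x and the memory-optimized case at about 0.56x, close to the 0.5x and 0.55x shown on the slide.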
X-Gene® Platform Ecosystem
Application Workloads Web Tier & Storage
[Figure: ecosystem logos grouped under Languages, Web Proxy, Web Apps, Web Caching, Database, Web Server, and Storage; surviving product labels: Swift, Cinder]
NFV Update
[Figure: NFV demo stack; labels: Evolved Packet Core (EPC) with SAE, PGW, SGW; GTP, IP; Tieto TIP Stack; OVS; Open Source Linux; X-Gene 1 Evaluation Platform; APM Software Traffic Generator and Traffic Generator GUI]
• AppliedMicro Working with Tieto to Port TIP Stack to X-Gene
• Demo of EPC and TIP using OVS with Traffic Generator
• Multiple Networking Functions Virtualized
– SAE Core
– Packet Data Network Gateway
– Serving Gateway
• Schedule
– Completion December 2015
X-Gene® ARM Processor Software Ecosystem
[Figure: software ecosystem logos; surviving labels: PEAP, UBOOT, UEFI, Compilers, Management]
X-Gene® 1 Platform Ecosystem Multiple SKUs from Leading OEM and ODM Partners
AppliedMicro Genome™ Platform New Platform for X-Gene® 2 Technology
• Memory coherent framework for scale-out computing
– Enables the positives of scale-up in a scale-out platform
– Utilizes PCIe and a 10 GbE interconnect fabric
– Bare metal hypervisor software approach: avoids hardware SMP complexity and cost
• Benefits
– More performance per node to address more workloads: more cores and more memory, 4 or more X-Gene processors per node
– Single IP address across all processors in a node
– Single Operating System image across all processors in a node
– No software modification required
Genome™ Coherent Solution
[Figure: Genome™ RDMA / PCIe® Low Latency Coherent Fabric. Four 8-core X-Gene 2 nodes, each with 32 GB DDR3, run Application Software on 32-core SMP Linux. The nodes connect over a PCIe Gen3 Fabric (PCIe Switch) and a 10 Gigabit Ethernet Fabric (10 GbE Switch) with 2 x 10GbE to the TOR Switch; BMC; a "1TB" label also appears in the diagram.]
Scalable performance across multiple sockets
Customizable for each application – configurable nodes based on need
Lowest power and price for the target application
½ width 1U, 4 node reference implementation
Options: more nodes, more memory
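The "32-core SMP Linux" label follows from adding up the per-node resources in the four-node reference implementation. The sketch below shows only that arithmetic; it does not model how the Genome fabric works, and the `Node` type and `single_system_image` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    cores: int
    memory_gb: int

def single_system_image(nodes: list[Node]) -> Node:
    """Aggregate resources that one OS image would see across all nodes."""
    return Node(
        cores=sum(n.cores for n in nodes),
        memory_gb=sum(n.memory_gb for n in nodes),
    )

# Four 8-core X-Gene 2 nodes with 32 GB DDR3 each, as drawn on the slide.
nodes = [Node(cores=8, memory_gb=32) for _ in range(4)]
image = single_system_image(nodes)

print(f"{image.cores}-core SMP image with {image.memory_gb} GB of memory")
```

Adding nodes or memory per node, the two options listed on the slide, changes only the inputs to this sum.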
X-Gene® 2 Genome™ Development Platform Gryphon
• Platform
– 4 X-Gene 2 processor nodes
– Up to 2.8 GHz
– 2 DDR3 channels / node: up to 64 Gbytes / node (2 DIMMs), DDR3-1866
• Form Factor
– 1U ½ width
– Dimensions: depth 27.5” (698mm), width 17.64” (448mm), height 1.75” (44.5mm)
• Features
– 1 SATA HDD/SSD / node
– 16-port 10GbE switch / sled: 2x10G XFI to top-of-rack switch
– 1GbE management port / sled
– 8-port PCI-Express® Gen 3 switch
– BMC: IPMI v2.0
– On board thermal sensors
• Power
– 220 Watts
Thank You!