[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Page 1: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

1 © AppliedMicro Proprietary & Confidential

Introduction to X-Gene® Technology

February 2016

Page 2: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

AppliedMicro X-Gene® Processor Philosophy

• Few workloads are compute bound

– Most are limited by memory capacity, bandwidth, or I/O

– HPC workloads are better served by GPGPU

• Scale-out versus scale-up

– High density

– Performance per Watt

– Performance per $

• Balance

– Strong CPU with an optimized ARMv8 core

– Large memory – adequate memory is not an upsell

– Low power – power efficiency is not an upsell

• Open Source

– Open Source Software

– Open Source Hardware

Page 3: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® 1 Processor Fully Integrated Server-on-a-Chip

• 8 Custom ARMv8 64-bit Cores

– Up to 2.4 GHz

– 8MB shared L3 cache

• Integrated Memory Controllers

– 4 channel DDR3-1600

• Integrated Networking

– Dual 10 Gb Ethernet SFP+

– Quad 1 Gb Ethernet (SGMII)

• Integrated Storage

– 6 lanes of Serial-ATA 3

• Integrated I/O Interfaces

– 17 lanes of PCI-Express® gen3

– 5 controllers

• 45 Watt TDP

[Block diagram: four PMDs, each with two ARMv8 cores (L1 I$ and L1 D$ per core) and a shared L2 cache; Coherent Network; 8MB Shared L3; four 72-bit DDR3 channels; I/O Bridge and I/O Network connecting Serial ATA 3, PCI-Express® 3 (x17 / 5 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, and I2C; ARM Cortex M3 with I/D Cache for System Management and Power Management; PKE Engine]

In Production

Page 4: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® 2 Processor Scale-Out Optimized Server-on-a-Chip

• 8 Custom ARMv8 64-bit Cores

– Up to 2.8 GHz

– Expanded instruction set

– 10% higher performance

• Genome Coherent Clustering

• Integrated Memory Controllers

– 4 channel DDR3-1866

• Integrated Networking

– Dual 10 Gb Ethernet (KR)

– RDMA over Ethernet support

• Integrated Storage

– 6 channel Serial-ATA 3

• Integrated I/O Interfaces

– x8 PCI-Express®

– Full I/O virtualization (SMMU)

• 35 Watt TDP

– 50% higher performance / Watt

[Block diagram: four PMDs, each with two ARMv8 cores (L1 I$ and L1 D$ per core) and a shared L2 cache; Coherent Network; 8MB Shared L3; four 72-bit DDR3 channels; I/O Bridge and I/O Network connecting Serial ATA 3, PCI-Express® 3 (x8 / 3 controllers), USB 2.0, 10 Gb Ethernet (2), 1 Gb Ethernet (4), GPIO, SPI, UART, and I2C; ARM Cortex M3 with I/D Cache for System Management and Power Management; PKE Engine]

Sampling Now

Page 5: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® Processor Performance

Page 6: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Primary Workloads

Web Tier

Web Serving / Proxy | Apache, NGINX, HAProxy

Web Apps / Hosting | Drupal, WordPress, Rails

Web Caching | Memcached, Redis, Squid

Database | MySQL, MongoDB, Cassandra

Cold Storage | Ceph, GlusterFS, OpenStack Swift

Big Data | Hadoop MapReduce, Spark

Data Analytics | Lucene, ElasticSearch, Hive

HPC | CPU / GPU combination workloads

Page 7: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

AppliedMicro ARMv8 Core Performance Frequency-Independent

[Bar chart: Dhrystone MIPS / MHz / Core (axis 1.0 to 6.0) for Atom C2750, Cortex A-57, Xeon E3 (Haswell – 22nm), X-Gene® 2, and X-Gene® 1; bar values shown: 3.4, 4.2, 4.6, 5.0, 6.5]

Core Performance: Must be competitive… but it does not tell the full story

• Up to 40% faster than Intel Atom

• Up to 10% faster than ARM Cortex A-57… but higher frequency

• Up to 80% of the performance of Xeon®… but with large memory and lower power

Platform | Cores | Memory | Power

X-Gene 1 | 8-core | 128 GB | 45 Watts

X-Gene 2 | 8-core | 128 GB | 35 Watts

Xeon E3 | 4-core | 32 GB | 80 Watts

Page 8: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Enterprise Workload Performance Web Server (WRK Benchmark)

Metric | AppliedMicro X-Gene® 2 | Intel Xeon® E5-2630v3

Performance (KRPS, higher is better) | 1038 | 771

Latency (ms, lower is better) | 2.4 | 4.4

Bandwidth (Gbps, higher is better) | 8.5 | 6.3

X-Gene 2 (8c @ 2.4 GHz)

• 4 node 1U / ½ width sled

• 64GB DDR3-1600

• 4 x 10GbE (integrated)

• Wall power: ~190 Watts

Xeon E5-2630v3 (8c/16t @ 2.4 GHz)

• 2P 1U / ½ width sled

• 64GB DDR4-2133

• 4 x 10GbE (NIC)

• Wall power: ~180 Watts

Up to 35% Higher Performance | Lower TCO

Standard CPU benchmarks do not always translate to delivered workload performance
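The "up to 35%" figure can be checked from the numbers on this slide. The sketch below is only a rough check: it assumes the first value in each chart pair belongs to the X-Gene 2 sled and the second to the Xeon sled, and that the KRPS figures are per sled so they can be divided by the quoted wall power. The dictionary layout and function names are illustrative.

```python
# Figures read off the WRK slide. The series-to-platform mapping is an
# assumption (first value = X-Gene 2 sled, second value = Xeon sled).
WRK_RESULTS = {
    "xgene2_sled": {"krps": 1038, "latency_ms": 2.4, "gbps": 8.5, "wall_watts": 190},
    "xeon_sled": {"krps": 771, "latency_ms": 4.4, "gbps": 6.3, "wall_watts": 180},
}


def percent_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


def krps_per_watt(result: dict) -> float:
    """Thousand requests per second delivered per Watt of wall power."""
    return result["krps"] / result["wall_watts"]


if __name__ == "__main__":
    xgene, xeon = WRK_RESULTS["xgene2_sled"], WRK_RESULTS["xeon_sled"]
    print(f"Throughput gain: {percent_gain(xgene['krps'], xeon['krps']):.1f}%")
    print(f"Bandwidth gain:  {percent_gain(xgene['gbps'], xeon['gbps']):.1f}%")
    print(f"KRPS/Watt: X-Gene 2 {krps_per_watt(xgene):.2f}, Xeon {krps_per_watt(xeon):.2f}")
```

Under these assumptions both the throughput and bandwidth gains come out just under 35%, while wall power is roughly equal, so the per-Watt advantage is somewhat smaller than the raw throughput advantage.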

Page 9: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

MongoDB Performance with YCSB Real World In-Memory Database Workload

Hardware Topology

1U / 2P Rack Server, Intel™ Xeon® E5-2630v3

• 16C/32T 2.4GHz Turbo/HT

• 64GB DDR4-1866

• 2-port 10GbE Mellanox NIC

• CentOS 7.1

24-port 10GbE Netgear™ Switch

Client, Intel™ Xeon® 2P E5-2630v2

• 12C/24T 2.6GHz Turbo/HT

• 64GB DDR3-1600

• 2-port 10GbE Mellanox NIC

• CentOS 7.1

HP Moonshot m400, AppliedMicro X-Gene® CPU

• 8-Core 2.4GHz

• 64GB DDR3-1333

• 10GbE Integrated Ethernet

• RHELSA 7.1

• Server Command

– mongoDB version 2.4.9

– Compile options: scons mongodb --usev8=false

– Run options: ./mongod --dbpath /root/mongod/data4 --nojournal --logpath=/root/tuan/mongod4_log --port 27021 --logappend

• Benchmark Command*

./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27017 -threads $threads

./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27018 -threads $threads

./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27019 -threads $threads

./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27020 -threads $threads

./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://10.66.12.206:27021 -threads $threads

*YCSB from GitHub at commit 5ab241
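The five benchmark commands differ only in the target port, so they can be generated rather than typed by hand. The sketch below is a minimal illustration, not the harness used for the published numbers. It assumes one mongod instance per port from 27017 to 27021 and sweeps the thread counts shown on the results charts; the helper names are hypothetical.

```python
import shlex
from typing import List

YCSB_BIN = "./bin/ycsb"
SERVER = "10.66.12.206"            # server address taken from the slide
PORTS = range(27017, 27022)        # assumption: one mongod instance per port
THREAD_COUNTS = (1, 2, 5, 10)      # thread counts shown on the results charts


def ycsb_command(port: int, threads: int) -> List[str]:
    """Build the argument list for one YCSB workload-B run against one mongod."""
    return [
        YCSB_BIN, "run", "mongodb", "-s",
        "-P", "workloads/workloadb",
        "-p", f"mongodb.url=mongodb://{SERVER}:{port}",
        "-threads", str(threads),
    ]


def all_commands(threads: int) -> List[List[str]]:
    """One command per mongod instance for a given client thread count."""
    return [ycsb_command(port, threads) for port in PORTS]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch has no side effects.
    for threads in THREAD_COUNTS:
        for cmd in all_commands(threads):
            print(shlex.join(cmd))
```

Each argument list could be handed to `subprocess.run` to launch the clients; whether the original runs were concurrent or sequential is not stated on the slide.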

[Topology diagram link labels: 1x10GbE (three links), 1GbE]

Page 10: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

MongoDB Performance with YCSB Real World In-Memory Database Workload

Single Moonshot m400 Cartridge = ~50% performance of a full 1U 2P Intel Xeon® E5 Haswell server

Throughput (thousand ops/sec)

Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 36 | 55

2 | 48 | 100

5 | 89 | 188

10 | 109 | 252

CPU Utilization (%)

Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 25 | 7

2 | 48 | 14

5 | 90 | 30

10 | 100 | 38

Page 11: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

MongoDB Performance with YCSB Real World In-Memory Database Workload

42U rack: 9 Moonshot m400 chassis/rack versus 40 2P E5-2630v3 Haswell 1U servers/rack

Rack-level Scalability = 5x the performance versus a full rack of 1U 2P Intel Xeon® E5 Haswell servers

Rack Level Throughput (million ops/sec)

Threads | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 15 | 2

2 | 19 | 4

5 | 36 | 8

10 | 44 | 10
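The rack-level figures look like a linear extrapolation of the single-node results. The sketch below reproduces them under two assumptions that are not stated on the slide: each Moonshot chassis is fully populated with 45 m400 cartridges, and throughput scales linearly with node count. The per-node values are the single-node YCSB throughput figures (thousand ops/sec) read off the charts.

```python
# Single-node YCSB throughput in thousand ops/sec, keyed by client thread count.
PER_NODE_KOPS = {
    1: {"m400": 36, "xeon_2p": 55},
    2: {"m400": 48, "xeon_2p": 100},
    5: {"m400": 89, "xeon_2p": 188},
    10: {"m400": 109, "xeon_2p": 252},
}

CARTRIDGES_PER_CHASSIS = 45  # assumption: fully populated chassis (not on the slide)
NODES_PER_RACK = {
    "m400": 9 * CARTRIDGES_PER_CHASSIS,  # 9 chassis per 42U rack
    "xeon_2p": 40,                       # 40 1U servers per 42U rack
}


def rack_mops(platform: str, threads: int) -> float:
    """Rack throughput in million ops/sec, assuming perfectly linear scaling."""
    return PER_NODE_KOPS[threads][platform] * NODES_PER_RACK[platform] / 1000.0


if __name__ == "__main__":
    for threads in PER_NODE_KOPS:
        m400 = rack_mops("m400", threads)
        xeon = rack_mops("xeon_2p", threads)
        print(f"{threads:>2} threads: m400 {m400:5.1f} M, Xeon {xeon:5.1f} M, ratio {m400 / xeon:.1f}x")
```

With these inputs the extrapolated values round to the rack-level numbers on this slide, and the m400-to-Xeon ratio lands between roughly 4x and 7x depending on thread count.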

Page 12: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload

Hardware Topology

1U / 2P Rack Server, Intel™ Xeon® E5-2630v3

• 16C/32T 2.4GHz Turbo/HT

• 64GB DDR4-1866

• 2-port 10GbE Mellanox NIC

• CentOS 7.1

24-port 10GbE Netgear™ Switch

Client, Intel™ Xeon® 2P E5-2630v2

• 12C/24T 2.6GHz Turbo/HT

• 64GB DDR3-1600

• 2-port 10GbE Mellanox NIC

• CentOS 7.1

HP Moonshot m400, AppliedMicro X-Gene® CPU

• 8-Core 2.4GHz

• 64GB DDR3-1333

• 10GbE Integrated Ethernet

• RHELSA 7.1

• Server Command

– PostgreSQL version 9.4.4

– Compile options: ./configure

– Run options: su postgres -c '/usr/local/pgsql/bin/postgres -F -D /home/postgres/data -p 5432 &'

PostgreSQL Database Contents

• 32 warehouses with 100,000 parts inventory/warehouse

• 10 districts/warehouse

• 3000 customers/district

• 1 terminal/district = 1 operator/district to serve 3000 customers

• Total customers = 32*10*3000 = 960,000

PostgreSQL Database Operations

• New Order = 45%

• Payment = 43%

• Order Status = 4%

• Delivery = 4%

• Stock Level = 4%

• Benchmark Command

– BenchmarkSQL version 4.1.0

– cd ~/benchmarksql-4.1.0/run

– ./runBenchmark.sh props.pg

– props.pg file content:

driver=org.postgresql.Driver

conn=jdbc:postgresql://10.76.191.182:5432/postgres

user=benchmarksql

password=amcc1234

warehouses=32

terminals=8

runTxnsPerTerminal=0

runMins=2

limitTxnsPerMin=0

newOrderWeight=45

paymentWeight=43

orderStatusWeight=4

deliveryWeight=4

stockLevelWeight=4
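The database sizing and the transaction mix above can be sanity-checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the constant and function names are illustrative and are not part of BenchmarkSQL.

```python
# Database dimensions and transaction weights copied from the slide.
WAREHOUSES = 32
DISTRICTS_PER_WAREHOUSE = 10
CUSTOMERS_PER_DISTRICT = 3000
PARTS_PER_WAREHOUSE = 100_000

TXN_WEIGHTS = {
    "newOrderWeight": 45,
    "paymentWeight": 43,
    "orderStatusWeight": 4,
    "deliveryWeight": 4,
    "stockLevelWeight": 4,
}


def total_customers() -> int:
    """Customers across all warehouses and districts."""
    return WAREHOUSES * DISTRICTS_PER_WAREHOUSE * CUSTOMERS_PER_DISTRICT


def total_inventory_parts() -> int:
    """Parts inventory entries across all warehouses."""
    return WAREHOUSES * PARTS_PER_WAREHOUSE


def weights_are_valid(weights: dict) -> bool:
    """The transaction mix is expressed in percent, so it should sum to 100."""
    return sum(weights.values()) == 100


if __name__ == "__main__":
    print("Total customers:", total_customers())
    print("Total inventory parts:", total_inventory_parts())
    print("Transaction mix sums to 100%:", weights_are_valid(TXN_WEIGHTS))
```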

[Topology diagram link labels: 1x10GbE (three links), 1GbE]

Page 13: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload

Single Moonshot m400 Cartridge = ~90% performance of a full 1U 2P Intel Xeon® E5 Haswell server

Throughput (thousand transactions per minute)

Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 8 | 4

2 | 15 | 6

4 | 25 | 10

8 | 38 | 18

16 | 52 | 36

32 | 70 | 76

64 | 69 | 120

CPU Utilization (%)

Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 8% | 1%

2 | 17% | 2%

4 | 30% | 3%

8 | 49% | 5%

16 | 70% | 10%

32 | 97% | 18%

64 | 100% | 28%

Terminals: 1 terminal = 1 operator serving 3000 customers/district

Page 14: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Postgres Performance with BenchmarkSQL Real World In-Memory Database Workload

Rack Level Scalability = 8X-9X the performance versus a full rack of 1U 2P Intel Xeon® E5 Haswell servers

42U rack: 9 Moonshot m400 chassis/rack versus 40 2P E5-2630v3 Haswell 1U servers/rack

Rack Level Throughput (thousand transactions per minute)

Terminals | HP Moonshot m400 | 2P E5-2630v3 Haswell

1 | 3170 | 146

2 | 6066 | 222

4 | 10106 | 381

8 | 15255 | 724

16 | 21171 | 1458

32 | 28329 | 3023

64 | 28143 | 4783

Terminals: 1 terminal = 1 operator serving 3000 customers/district

Page 15: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® Technology in the Data Center

Page 16: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Moor Insights and Strategy Paper “The First Enterprise Class 64-Bit ARMv8 Server: HP Moonshot System’s HP ProLiant m400 Server Cartridge”

• HP ProLiant™ m400 “Moonshot” Cartridge

– Production shipments as of 4Q, 2014

• AppliedMicro X-Gene® Processor

– First production 64-bit ARMv8-based server SoC

– Server-class performance with mobile efficiency

• Web Tier/Caching Solution

– Service providers and commercial internet providers

– Early adopters of ARM servers

– ARM server/mobile software development community

Copyright ©2014 Moor Insights & Strategy All Rights Reserved

Moor Insights and Strategy Whitepaper outlining product details and TCO analysis: http://www.moorinsightsstrategy.com/?p=4753

35% Lower TCO for Scale-Out Web-Tier / Caching Environments

Page 17: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® Technology in the Public Cloud

“This is probably the first example of Moonshot AArch64 running in Europe outside of HP’s development labs, and certainly the first example of generally available Moonshot backed AArch64 instances in an OpenStack public cloud anywhere in the world.”

Dr. Mike Kelly, CEO and founder of DataCentred

Page 18: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

[Diagram: X-Gene® 1U server with eight X-Gene processors, each with four DIMMs, connected to an Ethernet Switch; Xeon® 1U server with two Xeon E5-2660 v3 processors, an IO Chipset, a 10G NIC, and 16 DIMMs]

Metric | Xeon® 1U Server | X-Gene® 2 1U Server

No of vCPUs | 32 | 64

General Purpose Instances (m3.medium, m3.large, m3.xlarge) | 1x | 2x

Memory Optimized Instances (r3.large, r3.xlarge, r3.2xlarge) | 1x | 2x

X-Gene® 2 vCPU: 2x density per 1U server

Page 19: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® 2 vs. Intel Xeon®: Rack Level TCO

2x density; 45-50% Instance Cost Reduction

General Purpose Instances

Metric | Xeon® Rack | X-Gene® 2 Rack

Rack Power | 13KW | 13KW

Rack Cost | $145K | $145K

No of vCPUs | 1280 | 2560

Instance Cost | 1x | 0.5x

Memory Optimized Instances

Metric | Xeon® Rack | X-Gene® 2 Rack

Rack Power | 14KW | 16KW

Rack Cost | $170K | $190K

No of vCPUs | 1280 | 2560

Instance Cost | 1x | 0.55x

X-Gene® 2 ARM Instances at 45 – 50% lower prices than Xeon® Instances
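The relative instance costs follow from dividing rack cost by vCPU count. The sketch below is a simplified check that uses only the rack cost and vCPU figures from this slide; it ignores power and other operating costs, which a full TCO model would include, and the names are illustrative.

```python
# Rack cost (USD) and vCPU counts taken from the slide.
RACKS = {
    "general_purpose": {
        "xeon": {"rack_cost_usd": 145_000, "vcpus": 1280},
        "xgene2": {"rack_cost_usd": 145_000, "vcpus": 2560},
    },
    "memory_optimized": {
        "xeon": {"rack_cost_usd": 170_000, "vcpus": 1280},
        "xgene2": {"rack_cost_usd": 190_000, "vcpus": 2560},
    },
}


def cost_per_vcpu(rack: dict) -> float:
    """Capital cost attributed to one vCPU in the rack."""
    return rack["rack_cost_usd"] / rack["vcpus"]


def relative_instance_cost(instance_class: str) -> float:
    """X-Gene 2 cost per vCPU as a fraction of the Xeon cost per vCPU."""
    racks = RACKS[instance_class]
    return cost_per_vcpu(racks["xgene2"]) / cost_per_vcpu(racks["xeon"])


if __name__ == "__main__":
    for instance_class in RACKS:
        ratio = relative_instance_cost(instance_class)
        print(f"{instance_class}: {ratio:.2f}x the Xeon cost per vCPU")
```

On these inputs the general purpose ratio is exactly 0.5x and the memory optimized ratio is about 0.56x, in line with the 0.5x and 0.55x figures quoted above.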

Page 20: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® Platform Ecosystem

Page 22: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

NFV Update

[Stack diagram: Evolved Packet Core (EPC) with SAE, PGW, and SGW; Tieto TIP Stack with GTP and IP; OVS; Open Source Linux; X-Gene 1 Evaluation Platform; APM Software; Traffic Generator and Traffic Generator GUI]

• AppliedMicro Working with Tieto to Port TIP Stack to X-Gene

• Demo of EPC and TIP using OVS with Traffic Generator

• Multiple Networking Functions Virtualized

– SAE Core

– Packet Data Network Gateway

– Serving Gateway

• Schedule

– Completion December, 2015

Page 24: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® 1 Platform Ecosystem Multiple SKUs from Leading OEM and ODM Partners

Page 25: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

AppliedMicro Genome™ Platform New Platform for X-Gene® 2 Technology

• Memory coherent framework for scale-out computing

– Enables the positives of scale-up in a scale-out platform

– Utilizes a PCIe and 10 GbE interconnect fabric

– Bare metal hypervisor software approach

• Avoids hardware SMP complexity and cost

• Benefits

– More performance per node to address more workloads

• More cores & more memory

• 4 or more X-Gene processors per node

– Single IP address across all processors in a node

– Single Operating System image across all processors in a node

– No software modification required

Page 26: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Genome™ Coherent Solution

[Diagram: Application Software running on 32-core SMP Linux over the Genome™ RDMA / PCIe® Low Latency Coherent Fabric; four 8-core X-Gene 2 nodes with 32 GB DDR3 each; PCIe Gen3 Fabric with PCIe Switch; 10 Gigabit Ethernet Fabric with 10 GbE Switch; 2 x 10GbE to TOR Switch; BMC; 1TB]

Scalable performance across multiple sockets

Customizable for each application – configurable nodes based on need

Lowest power and price for the target application

½ width 1U, 4 node reference implementation

Options: more nodes, more memory

Page 27: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

X-Gene® 2 Genome™ Development Platform Gryphon

• Platform

– 4 X-Gene 2 processor nodes

– Up to 2.8 GHz

– 2 DDR3 channels / node

• Up to 64 Gbytes / node (2 DIMMs)

• DDR3-1866

• Form Factor

– 1U ½ width

– Dimensions:

• Depth: 27.5” (698mm)

• Width: 17.64” (448mm)

• Height: 1.75” (44.5mm)

• Features

– 1 SATA HDD/SSD / node

– 16-port 10GbE switch / sled

• 2x10G XFI to top-of-rack switch

– 1GbE management port / sled

– 8-port PCI-Express® Gen 3 switch

– BMC

• IPMI v2.0

– On board thermal sensors

• Power

– 220 Watts

Page 28: [OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene

Thank You!