IEEE EDS/SSCS Bangalore Chapter Talk: Performance Beyond Moore's Law: OpenPOWER Anand Haridass Senior Technical Staff Member Chief Engineer – POWER Integrated Solutions (BD&A) India Systems Development Lab [email protected]


Page 1: Performance beyond moore's law

IEEE EDS/SSCS Bangalore Chapter Talk:

Performance Beyond Moore's Law: OpenPOWER

Anand Haridass

Senior Technical Staff Member

Chief Engineer – POWER Integrated Solutions (BD&A)

India Systems Development Lab

Page 2: Performance beyond moore's law

Agenda

• Technology Challenges – Classical CMOS Scaling
• ‘Open Source’
• Open Compute Project
• OpenPOWER
• Collaborative Innovation

Page 3: Performance beyond moore's law

Processor Trends – Transistors

IEEE Micro 2010

Moore’s Law Alive & Kicking

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”
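The doubling rule quoted on this slide can be turned into a small projection. The sketch below is illustrative only: the starting count of 1 billion transistors and the function name `projected_transistors` are assumptions made for the example, not figures or names from the talk.

```python
def projected_transistors(start_count: float, years: float,
                          doubling_period_years: float = 2.0) -> float:
    """Project a transistor count forward, assuming a fixed doubling period."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return start_count * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Hypothetical starting point (not taken from the slides): 1 billion transistors.
    start = 1e9
    for years in (0, 2, 4, 10):
        count = projected_transistors(start, years)
        print(f"after {years:>2} years: {count:.3g} transistors")
```

With a two-year doubling period, ten years corresponds to five doublings, that is a 32x increase over the starting count.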

Page 4: Performance beyond moore's law

Processor Trends – Transistors, Power

IEEE Micro 2010

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”

‘Affordable’ Air Cooled Limit ~120-190W

Page 5: Performance beyond moore's law

CMOS Power-Performance Scaling

[Figure: relative performance metric (constant power density) vs. feature pitch (microns); annotations: “When scaling was good…” and “14nm Regime”]

Where this curve is flat, can only improve chip frequency by:

a) Pushing core/chip to higher power density (tough these days…)

b) Design power efficiency improvements (low-hanging fruit all gone)

Page 6: Performance beyond moore's law

CMOS Supply Voltage Scaling Difficulties

[Figure: voltage (V) vs. feature pitch (microns); curves: scaled voltage and high-performance voltage, separated by a voltage “gap”; annotations: “Classical Dennard Scaling Regime” and “14nm Regime”]

• Voltage scaling for high-performance designs is limited
  - Limited by leakage issues: can’t reduce threshold voltages
  - Limited by variability, esp. VT variability
  - Limited by gate oxide thickness (high-K relief)
• Limited voltage scaling + decreasing feature sizes → increasing electric fields
  - New device structures needed (short channel control)
  - Reliability challenges (devices and wires)

Dennard Scaling: Performance per watt would grow at roughly the same rate as transistor density, doubling every 2 years. Power requirements are proportional to area (both voltage and current being proportional to length) for transistors. Transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%.
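The per-generation figures above can be checked with a few lines of arithmetic. This is a minimal sketch under the usual textbook assumption (not stated on the slide) that switched capacitance scales with the linear dimension and that switching energy goes as C·V²; the function name `dennard_generation` is invented for the example.

```python
def dennard_generation(k: float = 0.7) -> dict:
    """Relative change after one ideal Dennard generation with linear scale factor k."""
    area = k * k                  # both dimensions shrink by k
    delay = k                     # gate delay scales with dimension
    frequency = 1.0 / delay       # so frequency rises by 1/k
    voltage = k                   # voltage scaled to keep the electric field constant
    capacitance = k               # assumption: capacitance scales with dimension
    energy = capacitance * voltage ** 2   # switching energy ~ C * V^2
    power = energy * frequency            # power per transistor
    power_density = power / area          # stays constant under ideal scaling
    return {
        "area": area,
        "frequency": frequency,
        "energy": energy,
        "power": power,
        "power_density": power_density,
    }


if __name__ == "__main__":
    for name, value in dennard_generation(0.7).items():
        print(f"{name:>13}: {value:.3f}x")
```

For k = 0.7 this gives roughly 0.49x area, 1.43x frequency, 0.34x energy and 0.49x power, which matches the 50%, 40%, 65% and 50% figures quoted above, and a power density of 1.0x. Once voltage can no longer scale with k, the energy term stops shrinking as fast and power density rises, which is the difficulty this slide describes.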

Page 7: Performance beyond moore's law

Processor Trends – Transistors, Frequency, Power

IEEE Micro 2010

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”

Page 8: Performance beyond moore's law

Processor Trends – Transistors, Perf., Freq., Power

IEEE Micro 2010

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”

Strongly Correlated

Page 9: Performance beyond moore's law

Processor Trends – Transistors, Perf., Freq., Power, Cores

IEEE Micro 2010

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”

Multi-Cores: Packing them in

Page 10: Performance beyond moore's law

Processor Trends – Transistors, Perf., Freq., Power, Cores

IEEE Micro 2010

Moore’s Law (1965): “Number of transistors in a dense integrated circuit doubles approximately every two years”

High End uP, 2015:

• 4 – 6 Billion transistors
• 2 – 5 GHz
• 130 – 190 W
• 8 – 24 Cores

Page 11: Performance beyond moore's law

Materials Innovations : Increased Complexity

Elements Employed in Silicon Technology

Since the 90’s

Beyond 2006

Before 90’s

Global Foundries projects that a computer chip manufacturing plant in NY would cost $14.7 billion to build

Page 12: Performance beyond moore's law

Continued Technology Push …

• New materials and devices to extend core logic, memory, & I/O technology roadmaps
• Continue silicon scaling

Scaling: 22, 14, 10, 7, 5 nm Nodes

III / V Devices

Carbon Devices

MRAM

Silicon Photonics

3D Integration

Phase Change Memory

Page 13: Performance beyond moore's law

Moore’s Law 2015+

Physics is not permitting performance gains by technology scaling; however, it is still enabling more transistors on a node-to-node basis

Relative % of Improvement

• # of Cores
• # of Threads
• Wider Execution Pipelines
• Integrated L1/L2/L3 Caches
• Memory Controllers
• CAPI
• Crypto Units
• Integrated accelerators …

Cadence: 1.5 yrs to 2 yrs to 2.5 yrs?

[Figure: relative % improvement (0% to 100%) per technology node (180nm, 130nm, 90nm, 65nm, 45nm, 32nm), split into “Gain by Traditional Scaling” and “Gain by Innovation”]

Page 14: Performance beyond moore's law

End user doesn’t care about Frequency / ST performance & other ‘processor’ metrics

→ Cost/Performance is the metric

Processors

Semiconductor Technology

Page 15: Performance beyond moore's law

Driving Innovation Beyond The Chip

Processors

Semiconductor Technology

System stack innovations are required to drive Cost/Performance

Applications and Services

Firmware, Operating System and Hypervisor

System Stack

Systems Management & Cloud Deployment

Systems Acceleration & HW/SW Optimization

Workload Acceleration

Services Delivery Model

Advanced Memory Tech

Network & I/O Acceleration

Use Cases

Microprocessors alone no longer drive sufficient Cost/Performance improvements

Processors

Semiconductor Technology

Page 16: Performance beyond moore's law

Open Source : Software

Page 17: Performance beyond moore's law

Open Source : Hardware

Open Compute Project (2011) http://opencompute.org/

Page 18: Performance beyond moore's law

Open Source : Hardware

Page 19: Performance beyond moore's law

OCP – 150+ Members

Collaborate Contribute Consume

Page 20: Performance beyond moore's law

• Moore’s law no longer satisfies performance gain

• Growing workload demands

• Numerous IT consumption models

• Mature Open software ecosystem

OpenPOWER

• Rich software ecosystem

• Spectrum of power servers

• Multiple hardware options

• Derivative POWER chips

The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise, investment, and server-class intellectual property to serve the evolving needs of customers.

Performance of POWER architecture: amplified capability

Open Development: open software, open hardware

Collaboration of thought leaders: simultaneous innovation, multiple disciplines

Feeds back … resulting in client choice

New Open Innovation

Market Shifts

Founding Members

Page 21: Performance beyond moore's law

OpenPOWER Development Community

Page 22: Performance beyond moore's law

August 2013 Announced Intent; December 2013 Incorporated with 5 members

OpenPOWER Community [160+ Members in 24+ Countries]

Page 23: Performance beyond moore's law

New Chips & Components

Components & Systems

New Systems & Platforms

Integrated Solutions

First Open server specification and motherboard combining OpenPOWER, Open Compute and OpenStack

First GPU-accelerated OpenPOWER developer platform

Prototype of a new high-performance server on the path to exascale

First commercially available OpenPOWER server

RedPower, the first China OpenPOWER 2-socket system coming in 2015

Inspur 2-socket POWER8 Server

ChuangHe China-branded OpenPOWER system with POWER8

Data Engine for NoSQL with 40TB CAPI-attached flash

Open Source Redis

Clustering 192 Vcores+ CAPI

40TB in 2U

First China “local” POWER derivative chip, CP1

Convey’s CAPI developer kit based on the company’s Xilinx-based co-processors

DMI connection between an Altera Stratix V FPGA accelerator and a POWER8 CPU

First commercially available OpenPOWER third-party server

New CAPI-based solution: the ConnectX-4 adapter card by Mellanox

Nallatech’s OpenPOWER CAPI Developer Kit

24:1 Server consolidation for 3x lower cost per user

Page 24: Performance beyond moore's law

What is Coherent Accelerator Processor Interface (CAPI)?

[Diagram: POWER8 Processor (CAPP, PCIe) connected over CAPI to an FPGA containing the IBM Supplied POWER Service Layer and accelerator Functions 0, 1, 2, … n]

Typical I/O Model Flow (instructions per step): 300, 10,000, 3,000, 1,000, 1,000

Flow with a Coherent Model (instructions per step): 400, 100

Advantages of Coherent Attachment Over I/O Attachment

• Virtual Addressing & Data Caching (significant latency reduction)
• Easier, Natural Programming Model (avoid application restructuring, focus on workload rather than I/O)
• Enables Apps Not Possible on I/O (pointer chasing, shared mem semaphores, …)
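The instruction counts on this slide can be summed to compare the software overhead of the two flows. The sketch below assumes the first five counts (300, 10,000, 3,000, 1,000, 1,000) belong to the typical I/O model and the last two (400, 100) to the coherent model, which is how they appear in order on the slide; the step names are not recoverable from the transcript, so the lists are left unlabeled, and the constant and function names are invented for the example.

```python
# Assumed grouping of the per-step instruction counts shown on the slide.
IO_MODEL_STEPS = [300, 10_000, 3_000, 1_000, 1_000]
COHERENT_MODEL_STEPS = [400, 100]


def total_instructions(steps: list[int]) -> int:
    """Total instruction overhead for one accelerator call."""
    return sum(steps)


def overhead_ratio() -> float:
    """How many times more instructions the I/O model spends than the coherent model."""
    return total_instructions(IO_MODEL_STEPS) / total_instructions(COHERENT_MODEL_STEPS)


if __name__ == "__main__":
    print("I/O model:     ", total_instructions(IO_MODEL_STEPS), "instructions")
    print("Coherent model:", total_instructions(COHERENT_MODEL_STEPS), "instructions")
    print(f"Ratio:          {overhead_ratio():.1f}x")
```

Under that grouping the I/O path costs 15,300 instructions per call against 500 for the coherent path, roughly a 30x difference in per-call software overhead. This counts instructions only; it says nothing about the time spent in the accelerator itself.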

Page 25: Performance beyond moore's law

Collaborative Innovation Driving Price/Performance

Load Balancer

500GB Cache Node

10Gb Uplink

POWER8 Server

Flash Array w/ up to 40TB

After: NoSQL POWER8 + CAPI Flash

WWW

10Gb Uplink

WWW

Backup Nodes

500GB Cache Node

500GB Cache Node

500GB Cache Node

500GB Cache Node

Before: NoSQL in memory

24U

4U

Less is More

24:1 physical server consolidation =

6x less rack space

Infrastructure Requirements

- Large Distributed (Scale out)

- Large Memory per node

- Networking Bandwidth Needs

- Load Balancing

Acceptable

latency

CAPI

Memory

Conventional PCIe I/O

network

network

network

Page 26: Performance beyond moore's law

NVLink Interconnect

Current GPU Attach: CPU (System Memory) connected to GPU (Graphics Memory) over a PCIe connection, 16+16 GB/s

Future NVLink GPU Attachment: POWER CPU (System Memory) connected to GPUs (Graphics Memory) over NVLink, 40+40 GB/s
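To put the two link speeds on this slide side by side, the sketch below estimates the one-direction transfer time for a buffer at 16 GB/s (the PCIe attach) and 40 GB/s (the NVLink attach). It treats both figures as peak per-direction bandwidth and ignores latency and protocol overhead; the 12 GB buffer size and the function name `transfer_seconds` are assumptions made for the example.

```python
PCIE_GB_PER_S = 16.0     # per direction, from "16+16 GB/s"
NVLINK_GB_PER_S = 40.0   # per direction, from "40+40 GB/s"


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move size_gb in one direction at the given peak bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    size_gb = 12.0  # hypothetical buffer size, not from the slides
    pcie = transfer_seconds(size_gb, PCIE_GB_PER_S)
    nvlink = transfer_seconds(size_gb, NVLINK_GB_PER_S)
    print(f"PCIe:   {pcie:.3f} s")
    print(f"NVLink: {nvlink:.3f} s")
    print(f"Speedup: {pcie / nvlink:.2f}x")
```

The ratio is 2.5x regardless of buffer size, since it depends only on the two bandwidth figures.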

Page 27: Performance beyond moore's law

Technical Computing OpenPOWER Roadmap

Row | 2015 | 2016 | 2017
IBM CPUs | POWER8, CAPI Interface | POWER8+, NVLink | POWER9, Enhanced CAPI & NVLink
Mellanox Interconnect Technology | Connect-IB, FDR Infiniband, PCIe Gen3 | ConnectX-4, EDR Infiniband, CAPI over PCIe Gen3 | ConnectX-5, Next-Gen Infiniband, Enhanced CAPI over PCIe Gen4
NVIDIA GPUs | Kepler, PCIe Gen3 | Pascal, NVLink | Volta, Enhanced NVLink

OpenPower Systems

Page 28: Performance beyond moore's law

‘Open Sourced’ Cloud : Rackspace

Aaron Sullivan

Rackspace DE

Page 29: Performance beyond moore's law

Linux Distro Support For POWER

• RHEL 7
  - Available for existing RHEL customers
  - POWER8 (native mode) and POWER 7/7+ at GA
  - LE support in RHEL 7.1
  - Baremetal support in RHEL 7.2
• RHEL 6
  - POWER8 supported with U5 (P7-compatibility mode)
  - Full support of POWER6 and POWER7 (native mode)
• Fedora
  - Fedora supports POWER, actively developed
  - Fedora 20 has POWER8 support
• Supported add-ons
  - JBoss
  - High Performance Network Add-on
  - More SW in future

Built from the same source as x86

Delivered on the same schedule as x86

Supported at the same time as x86

Close development relationship with IBM

• Ubuntu 15.04
  - Docker container support
  - Baremetal host support
  - KVM support
• Ubuntu 14.04
  - POWER8 enabled (native mode)
  - No official support for POWER7+ and older systems
  - No support for 32-bit applications. 64-bit only.
  - Supported in KVM only at this time
  - Baremetal / host supported as tech preview, official support in near future
• Supported add-ons
  - Ubuntu OpenStack
  - Juju Charms
  - MAAS (Metal as a Service)
  - Landscape
• Debian
  - Community enablement, officially supported architecture
  - Consensus distribution for OpenPOWER

• SLES 12
  - Baremetal and LE support
• SLES 11
  - POWER8 supported with SP3 (P7-compatibility mode)
  - P7+ encryption, RNG accelerators
  - Full support of POWER6 and POWER7 (native mode)
• openSUSE
  - openSUSE 12.2 re-launched with IBM POWER
  - openSUSE 13.2 includes POWER8 support
• Supported add-ons
  - SUSE Linux Enterprise High Availability Extension
  - More SW in future

Page 30: Performance beyond moore's law

Over 1,600 Linux ISVs developing on POWER

Big Data & Machine Learning

Cloud

Mobile

Enterprise

Major Linux Distros

HPC

miniDFT

CTH

BLAST

Bowtie

BWA

FASTA

HMMER

GATK

SOAP3

STAC-A2

SHOC

Graph500

Ilog

CHARMM

GROMACS

NAMD

AMBER

RTM

GAMESS

WRF

HYCOM

HOMME

LES

MiniGhost

AMG2013

OpenFOAM

Page 31: Performance beyond moore's law

OpenPOWER Open Software and University Cloud Environments

• UNICAMP, Brazil, SA: http://openpower.ic.unicamp.br/minicloud/index.html
• Oregon State, North America: http://osuosl.org/services/powerdev
• Brno University / RedHat, Czech Republic: https://fit-rhlab.rhcloud.com
• SuperVessel, Beijing, China: www.ptopenlab.com
• IIT Bombay, India: 1Q, 2016
• HPC Center
• University of Texas - TACC: 3Q, 2015

• OpenPOWER Platforms
• Open Stack Software
• University research
• Open Development & Ecosystem Support

Page 32: Performance beyond moore's law

The Road Ahead

Collaborative

Innovation

Page 33: Performance beyond moore's law

Acknowledgments

• Rahul M. Rao
• James D Warnock
• Aditya Bansal
• Calista Flockhart
• Dipankar Sarma
• Mani Srinivasan

Page 34: Performance beyond moore's law

Thank You

Page 35: Performance beyond moore's law

Membership Options

Membership Level | Annual Fee (USD) | FTEs | Technical Steering Committee | Board / Voting position
Platinum | $100k | 10 | One seat per member not otherwise represented | Includes board position; includes TSC position
Gold | $60k | 3 | May be on TSC if Work group lead | Gold members may elect one board representative per three gold members
Silver | $20k ($5k if <300 employees) | 0 | May be on TSC if Work group lead | Silver members may elect one board representative for all silver members
Associate & Academic | $0 | 0 | May be on TSC if Work group lead | May be elected to one community observer, non-voting Board seat

• The OpenPOWER Foundation is a not-for-profit entity with a Board of Directors and a Technical Steering Committee.
  - Membership levels provide either a default Board of Director position (Platinum) or an opportunity to be elected to the Board (Gold, Silver, and Assoc/Academic members). The Bylaws include additional governance detail.
  - Technical Steering Committee is formed from Work group Leads and Platinum members.
• Membership options include Platinum, Gold, Silver, and Associate / Academic memberships
  - Annual fee and dedicated full-time equivalents (FTEs); verification of FTEs on honor system
  - Contributors, committers, Work group leads and project leads influence Technical Steering Committee
  - Associate / Academic level is not available to corporations

Membership agreement, Bylaws, and IP Rights Policy available for review: www.openpowerfoundation.org

Anyone may participate in OpenPOWER. Membership levels are designed for those that are investing to grow and enhance the OpenPOWER community and its proliferation within the industry.

Page 36: Performance beyond moore's law

OpenPOWER Work Group Roadmap

Timeline: 2014, 2015, 2016

9 Work Groups

Work group labels: Developer Platform, System SW, HW Architecture, Accelerator, Compliance, 25g IO Compatibility, Memory, OpenPOWER I/O

Proposed Work Groups: Integrated Solutions, Pers Med

Milestones, one chart row per line, in the order they appear:

- Charter; Compliance Specification; Draft Review WG Spec; CompSTD
- Charter; OpenPOWER ISA Profile V1, IO Device Architecture V2, Coherent Accel Intf Arch; OpenPOWER ISA Profile V2, IO Device Architecture V3, Coherent Accel Intf Arch
- Charter; P8 SP010 Data; P8 2U2S Reference; P8+ 1U1S Reference; P8+ 2U2S Reference
- OPMB Intf. Spec V1; Charter
- Charter; CAPI AFU Intf Spec V1; OpenCL SDK; CAPI AFU Intf Spec V2
- Charter; CAPI Linux SDK; 64b ABI; Platform Ref
- Sys I/O Enablement Guide; Charter
- Charter; 25g IO Spec
- FSI Specification; FSI Spec; Charter

Abbreviations:

- SP010 – Tyan OpenPOWER Customer Reference System
- CAPI – Coherent Accelerator Processor Interface
- AFU – Accelerator Function Unit
- FSI – Field Replaceable Unit (FRU) Service Interface
- OPMB – OpenPOWER Memory Bus
- ABI – Application Binary Interface
- SDK – Software Developer Kit

Page 37: Performance beyond moore's law

POWER Licensing Models

• License to the Power ISA will require a Micro-Architecture license (MAL), and will give the licensee the right to design and sell their own 64-bit Power ISA-compliant processor. The ISA specification itself is available freely online.
• License to the POWER8 design, which is a specific VHDL implementation of the Power ISA, and includes VHDL definition of the actual processor as a deliverable, will require an Implementation License. We can also offer derivative rights to this design if that is of interest.
• IBM can also license the POWER processor design flow, EDA tools and provide access via an IBM design center, as well as provide training on this design flow and tools. We can also offer design consultation services to assist licensees with custom designs.
• IBM can also license system-level IP to enable the licensee to build complete POWER8-based systems if that is of interest.
• Typically, these licenses include common terms:
  - All products must be POWER compliant
  - Non-exclusive rights to the licensed information
  - Design and have designed rights
  - Make and have made rights
  - Rights to sell POWER CPU only in its own systems
  - Rights to market and sell POWER systems in clearly defined territory
  - Compliance with U.S. export regulations

© 2015 OpenPOWER Foundation