High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu, Consulting Applications Engineer, Chelsio Communications
tomreu@chelsio.com

Source: on-demand.gputechconf.com/gtc/2016/presentation/s6854...



Chelsio Corporate Snapshot: Leader in High Speed Converged Ethernet Adapters

• Leading 10/40GbE adapter solution provider for servers and storage systems
  • ~800K ports shipped
• High performance protocol engine
  • 80 MPPS
  • 1.5 μsec
  • ~5M+ IOPs
• Feature rich solution
  • Media streaming hardware/software
  • WAN Optimization, Security, etc.
• Company Facts
  • Founded in 2000
  • 150 strong staff
• R&D Offices
  • USA - Sunnyvale
  • India - Bangalore
  • China - Shanghai
• Market Coverage: Manufacturing, Oil and Gas, Finance, Service/Cloud, Storage, Media, HPC, Security

[Figure: OEM Snapshot]

RDMA Overview: Performance and efficiency in return for new communication paradigm

• Direct memory-to-memory transfer
• All protocol processing handled by the NIC
  • Must be in hardware
• Protection handled by the NIC
  • User space access requires both local and remote enforcement
• Asynchronous communication model
  • Reduced host involvement
• Performance
  • Latency - polling
  • Throughput
• Efficiency
  • Zero copy
  • Kernel bypass (user space I/O)
  • CPU bypass

[Figure: Chelsio T5 RNIC connected to Chelsio T5 RNIC]

iWARP: What is it?

• Provides the ability to do Remote Direct Memory Access over Ethernet using TCP/IP
• Uses well-known IB Verbs
  • Inboxed in OFED since 2008
  • Runs on top of TCP/IP
• Chelsio implements iWARP/TCP/IP stack in silicon
  • Cut-through send
  • Cut-through receive
• Benefits
  • Engineered to use "typical" Ethernet
    • No need for technologies like DCB or QCN
  • Natively routable
    • Multi-path support at Layer 3 (and Layer 2)
  • It runs on TCP/IP
    • Mature and proven
    • Goes where TCP/IP goes (everywhere)

iWARP

• iWARP updates and enhancements are done by the IETF STORM (Storage Maintenance) working group
• RFCs
  • RFC 5040: A Remote Direct Memory Access Protocol Specification
  • RFC 5041: Direct Data Placement over Reliable Transports
  • RFC 5044: Marker PDU Aligned Framing for TCP Specification
  • RFC 6580: IANA Registries for the RDDP Protocols
  • RFC 6581: Enhanced RDMA Connection Establishment
  • RFC 7306: Remote Direct Memory Access (RDMA) Protocol Extensions
• Support from several vendors: Chelsio, Intel, QLogic
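The first three RFCs in the list define the layers that carry RDMA traffic over TCP. As a rough illustration that is not taken from the slides, the sketch below records the usual layering (RDMAP over DDP over MPA over TCP) and prints it top-down. The name `IWARP_STACK` and the one-line role descriptions are editorial assumptions, not wording from the RFCs.

```python
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    rfc: str   # defining RFC from the list above ("" for plain TCP)
    role: str  # informal summary, editorial wording


# Ordered from the layer nearest the application down to the transport.
IWARP_STACK = [
    Layer("RDMAP", "RFC 5040", "RDMA read/write/send semantics"),
    Layer("DDP", "RFC 5041", "places payload directly into target buffers"),
    Layer("MPA", "RFC 5044", "frames DDP segments on the TCP byte stream"),
    Layer("TCP", "", "reliable, routable transport"),
]


def describe(stack):
    """Render the stack as an indented, top-down outline."""
    lines = []
    for depth, layer in enumerate(stack):
        ref = f" ({layer.rfc})" if layer.rfc else ""
        lines.append(f"{'  ' * depth}{layer.name}{ref}: {layer.role}")
    return "\n".join(lines)


print(describe(IWARP_STACK))
```

Because the bottom layer is ordinary TCP, the whole stack inherits TCP/IP routing and congestion handling, which is the basis for the "natively routable" claim made elsewhere in the deck.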

iWARP: Increasing Interest in iWARP as of late

• Some Use Cases
  • High Performance Computing
  • SMB Direct
  • GPUDirect RDMA
  • NFS over RDMA
  • FreeBSD iWARP
  • Hadoop RDMA
  • Lustre RDMA
  • NVMe over RDMA fabrics

iWARP: Advantages over Other RDMA Transports

• It's Ethernet
  • Well understood and administered
• Uses TCP/IP
  • Mature and proven
  • Supports rack, cluster, datacenter, LAN/MAN/WAN and wireless
  • Compatible with SSL/TLS
• Do not need to use any bolt-on technologies like
  • DCB
  • QCN
• Does not require a totally new network infrastructure
  • Reduces TCO and OpEx

iWARP vs RoCE

iWARP | RoCE
Native TCP/IP over Ethernet, no different from NFS or HTTP | Difficult to install and configure - "needs a team of experts" - Plug-and-Debug
Works with ANY Ethernet switches | Requires DCB - expensive equipment upgrade
Works with ALL Ethernet equipment | Poor interoperability - may not work with switches from different vendors
No need for special QoS or configuration - TRUE Plug-and-Play | Fixed QoS configuration - DCB must be setup identically across all switches
No need for special configuration, preserves network robustness | Easy to break - switch configuration can cause performance collapse
TCP/IP allows reach to Cloud scale | Does not scale - requires PFC, limited to single subnet
No distance limitations. Ideal for remote communication and HA | Short distance - PFC range is limited to few hundred meters maximum
WAN routable, uses any IP infrastructure | RoCEv1 not routable. RoCEv2 requires lossless IP infrastructure and restricts router configuration
Standard for whole stack has been stable for a decade | RoCEv2 incompatible with v1. More fixes to missing reliability and scalability layers required and expected
Transparent and open IETF standards process | Incomplete specification and opaque process

Chelsio's T5: Single ASIC does it all

• High Performance Purpose Built Protocol Processor
• Runs multiple protocols
  • TCP with Stateless Offload and Full Offload
  • UDP with Stateless Offload
  • iWARP
  • FCoE with Offload
  • iSCSI with Offload
• All of these protocols run on T5 with a SINGLE FIRMWARE IMAGE
  • No need to reinitialize the card for different uses
  • Future proof, e.g. support for NVMf yet preserves today's investment in iSCSI

T5 ASIC Architecture: High Performance Purpose Built Protocol Processor

▪ Single processor data-flow pipelined architecture
▪ Up to 1M connections
▪ Concurrent Multi-Protocol Operation
▪ Single connection at 40Gb. Low latency.

[Block diagram labels: 1G/10G/40G MAC (x2); 100M/1G/10G MAC (x2); Embedded Layer 2 Ethernet Switch; Lookup, filtering and Firewall; Cut-Through RX Memory; Cut-Through TX Memory; Data-flow Protocol Engine; Traffic Manager; Application Co-Processor TX; Application Co-Processor RX; DMA Engine; PCI-e, X8, Gen 3; General Purpose Processor; On-Chip DRAM Memory Controller; Optional external DDR3 memory]

Leading Unified Wire™ Architecture: Converged Network Architecture with all-in-one Adapter and Software

Networking
▪ 4x10GbE/2x40GbE NIC
▪ Full Protocol Offload
▪ Data Center Bridging
▪ Hardware firewall
▪ Wire Analytics
▪ DPDK/netmap

HFT
▪ WireDirect technology
▪ Ultra low latency
▪ Highest messages/sec
▪ Wire rate classification

Storage
▪ NVMe/Fabrics
▪ SMB Direct
▪ iSCSI and FCoE with T10-DIX
▪ iSER and NFS over RDMA
▪ pNFS (NFS 4.1) and Lustre
▪ NAS Offload
▪ Diskless boot
▪ Replication and failover

Virtualization & Cloud
▪ Hypervisor offload
▪ SR-IOV with embedded VEB
▪ VEPA, VN-TAGs
▪ VXLAN/NVGRE
▪ NFV and SDN
▪ OpenStack storage
▪ Hadoop RDMA

HPC
▪ iWARP RDMA over Ethernet
▪ GPUDirect RDMA
▪ Lustre RDMA
▪ pNFS (NFS 4.1)
▪ OpenMPI
▪ MVAPICH

Media Streaming
▪ Traffic Management
▪ Video segmentation Offload
▪ Large stream capacity

Single Qualification - Single SKU
Concurrent Multi-Protocol Operation

GPUDirect RDMA

• Introduced by NVIDIA with the Kepler Class GPUs. Available today on Tesla and Quadro GPUs as well.
• Enables multiple GPUs, 3rd party network adapters, SSDs and other devices to read and write CUDA host and device memory
• Avoids unnecessary system memory copies and associated CPU overhead by copying data directly to and from pinned GPU memory
• One hardware limitation
  • The GPU and the network device MUST share the same upstream PCIe root complex
• Available with InfiniBand, RoCE, and now iWARP
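The root-complex limitation can be checked before a cluster is deployed. The sketch below is a heuristic, not an NVIDIA or Chelsio tool: on Linux, the resolved sysfs path of a PCI device contains a `pciDDDD:BB` component naming its host bridge, and two devices under the same component sit below the same host bridge. The function names and the sample device addresses are hypothetical, and sharing a host bridge is used here only as an approximation of "same upstream root complex".

```python
import os
import re

# Matches the host-bridge component of a sysfs device path, e.g. "pci0000:00".
_ROOT_RE = re.compile(r"^pci[0-9a-fA-F]{4}:[0-9a-fA-F]{2}$")


def resolve(bdf):
    """Resolve a PCI address such as '0000:03:00.0' to its full sysfs path (Linux only)."""
    return os.path.realpath(os.path.join("/sys/bus/pci/devices", bdf))


def root_bus(sysfs_path):
    """Return the 'pciDDDD:BB' component of a resolved sysfs path, or None."""
    for part in sysfs_path.split("/"):
        if _ROOT_RE.match(part):
            return part
    return None


def share_root(path_a, path_b):
    """True when both resolved device paths hang off the same host bridge."""
    a, b = root_bus(path_a), root_bus(path_b)
    return a is not None and a == b


# Hypothetical resolved paths, as resolve() might return on a two-socket host.
gpu = "/sys/devices/pci0000:00/0000:00:02.0/0000:03:00.0"
nic_same = "/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0"
nic_other = "/sys/devices/pci0000:80/0000:80:02.0/0000:82:00.0"

print(share_root(gpu, nic_same))   # True
print(share_root(gpu, nic_other))  # False
```

On a real host, `share_root(resolve(gpu_bdf), resolve(nic_bdf))` applies the same test to live devices; the PCI addresses themselves can be read from `lspci`.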

GPUDirect RDMA: T5 iWARP RDMA over Ethernet certified with NVIDIA GPUDirect

• Read/write GPU memory directly from network adapter
  • Peer-to-peer PCIe communication
  • Bypass host CPU
  • Bypass host memory
• Zero copy
• Ultra low latency
• Very high performance
• Scalable GPU pooling
  • Any Ethernet networks

[Diagram: two hosts, each with CPU, memory, GPU and RNIC, connected through a LAN/Datacenter/WAN network; flow labels: Payload, Notifications, Packets]

Modules required for GPUDirect RDMA with iWARP

• Chelsio Modules
  • cxgb4 - Chelsio adapter driver
  • iw_cxgb4 - Chelsio iWARP driver
  • rdma_ucm - RDMA User Space Connection Manager
• NVIDIA Modules
  • nvidia - NVIDIA driver
  • nvidia_uvm - NVIDIA Unified Memory
  • nv_peer_mem - NVIDIA Peer Memory
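A quick way to confirm that all six modules are loaded is to compare the list against `/proc/modules`, whose first whitespace-separated field on each line is the module name. The following is a minimal sketch under that assumption; the helper names and the sample text are illustrative and not part of any Chelsio or NVIDIA package.

```python
# Module names exactly as listed on the slide.
REQUIRED_MODULES = (
    "cxgb4", "iw_cxgb4", "rdma_ucm",        # Chelsio / RDMA side
    "nvidia", "nvidia_uvm", "nv_peer_mem",  # NVIDIA side
)


def loaded_modules(proc_modules_text):
    """Extract module names from text in /proc/modules format."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that do not appear in the given text."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


# Made-up sample in /proc/modules format (sizes and addresses are placeholders).
sample = """\
cxgb4 446464 1 iw_cxgb4, Live 0x0000000000000000
iw_cxgb4 176128 0 - Live 0x0000000000000000
nvidia 12000000 2 nvidia_uvm, Live 0x0000000000000000
nvidia_uvm 700000 0 - Live 0x0000000000000000
"""

print(missing_modules(sample))  # ['rdma_ucm', 'nv_peer_mem']
```

On a live Linux system the same check would be `missing_modules(open("/proc/modules").read())`; an empty list means every module named above is present.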

Case Studies

HOOMD-blue

• General purpose particle simulation toolkit
• Stands for: Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
• Running on GPUDirect RDMA - WITH NO CHANGES TO THE CODE - AT ALL!
• More Info: www.codeblue.umich.edu/hoomd-blue

HOOMD-blue Test Configuration

• 4 Nodes
• [email protected]
• 64 GB RAM
• Chelsio T580-CR 40Gb Adapter
• NVIDIA Tesla K80 (2 GPUs per card)
• RHEL 6.5
• OpenMPI 1.10.0
• OFED 3.18
• CUDA Toolkit 6.5
• HOOMD-blue v1.3.1-9
• Chelsio-GDR-1.0.0.0
• Command Line (X is the number of MPI processes; --mode is either gpu or cpu):
  $MPI_HOME/bin/mpirun --allow-run-as-root -mca btl_openib_want_cuda_gdr 1 -np X -hostfile /root/hosts -mca btl openib,sm,self -mca btl_openib_if_include cxgb4_0:1 --mca btl_openib_cuda_rdma_limit 65538 -mca btl_openib_receive_queues P,131072,64 -x CUDA_VISIBLE_DEVICES=0,1 /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu
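Because the benchmark is run at several process counts and in both modes, it can help to generate the command rather than retype it. The sketch below assembles the same arguments as the command line above into a list; `build_mpirun` and the `/opt/openmpi` prefix are hypothetical names chosen for illustration. The environment variable is written `CUDA_VISIBLE_DEVICES`, the spelling the CUDA runtime reads.

```python
import shlex


def build_mpirun(np, mode, mpi_home="$MPI_HOME"):
    """Return the benchmark command as an argument list for the given rank count and mode."""
    if mode not in ("gpu", "cpu"):
        raise ValueError("mode must be 'gpu' or 'cpu'")
    return [
        f"{mpi_home}/bin/mpirun", "--allow-run-as-root",
        "-mca", "btl_openib_want_cuda_gdr", "1",
        "-np", str(np),
        "-hostfile", "/root/hosts",
        "-mca", "btl", "openib,sm,self",
        "-mca", "btl_openib_if_include", "cxgb4_0:1",
        "--mca", "btl_openib_cuda_rdma_limit", "65538",
        "-mca", "btl_openib_receive_queues", "P,131072,64",
        "-x", "CUDA_VISIBLE_DEVICES=0,1",
        "/root/hoomd-install/bin/hoomd", "./bmark.py", f"--mode={mode}",
    ]


# Example: 8 ranks in GPU mode, with an assumed Open MPI install prefix.
argv = build_mpirun(8, "gpu", mpi_home="/opt/openmpi")
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments as a list avoids shell quoting problems and lets the list be passed directly to `subprocess.run` on a machine where Open MPI and HOOMD-blue are installed at these paths.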

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark

• Classic benchmark for general purpose MD simulations.
• Representative of the performance HOOMD-blue achieves for straight pair potential simulations

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results

[Bar chart: Average Timesteps per Second (longer is better), axis 0 to 1800. Series: CPU; GPU w/o GPUDirect RDMA; GPU w/ GPUDirect RDMA. Categories: Test 1, Test 2, Test 3. Configurations: 2 CPU cores / 2 GPUs; 8 CPU cores / 4 GPUs; 40 CPU cores / 8 GPUs. Data labels as extracted: 1,771; 1,403; 1,230; 1,089; 503; 488; 214; 88; 26]

HOOMD-blue Lennard-Jones Liquid 64K Particles Benchmark Results

[Bar chart: Hours to complete 10e6 steps (shorter is better), axis 0 to 120. Series: CPU; GPU w/o GPUDirect RDMA; GPU w/ GPUDirect RDMA. Categories: Test 1, Test 2, Test 3. Configurations: 2 CPU cores / 2 GPUs; 8 CPU cores / 4 GPUs; 40 CPU cores / 8 GPUs. Data labels as extracted: 1.5; 1.7; 2.2; 2.5; 5.5; 6; 13; 32; 108]
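The two Lennard-Jones charts appear to be related by a simple conversion from rate to wall-clock time. The sketch below states that conversion under the assumption that "10e6 steps" is read literally as 1e7 timesteps; the function name is illustrative. With that reading, the smallest and largest rate labels (26 and 1,771 timesteps per second) give roughly 107 and 1.6 hours, close to the 108 and 1.5 shown on the hours chart, so small differences may come from rounding or from separate runs.

```python
STEPS = 10e6  # as written in the chart title; evaluates to 1e7 timesteps


def hours_to_complete(timesteps_per_second, steps=STEPS):
    """Wall-clock hours needed to run `steps` timesteps at a constant rate."""
    return steps / timesteps_per_second / 3600.0


print(round(hours_to_complete(1771), 2))  # 1.57
print(round(hours_to_complete(26), 1))    # 106.8
```

The same function applies to the quasicrystal charts, since their axis titles use the same units.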

HOOMD-blue Quasicrystal Benchmark

• Runs a system of particles with an oscillatory pair potential that forms an icosahedral quasicrystal
• This model is used in the research article: Engel M, et al. (2015) Computational self-assembly of a one-component icosahedral quasicrystal, Nature Materials 14 (January), p. 109-116.

HOOMD-blue Quasicrystal Results

[Bar chart: Average Timesteps per Second (longer is better), axis 0 to 1200. Series: CPU; GPU w/o GPUDirect RDMA; GPU w/ GPUDirect RDMA. Categories: Test 1, Test 2, Test 3. Configurations: 2 CPU cores / 2 GPUs; 8 CPU cores / 4 GPUs; 40 CPU cores / 8 GPUs. Data labels as extracted: 1,158; 728; 407; 915; 656; 308; 31; 43; 11]

HOOMD-blue Quasicrystal Results

[Bar chart: Hours to complete 10e6 steps (shorter is better), axis 0 to 300. Series: CPU; GPU w/o GPUDirect RDMA; GPU w/ GPUDirect RDMA. Categories: Test 1, Test 2, Test 3. Configurations: 2 CPU cores / 2 GPUs; 8 CPU cores / 4 GPUs; 40 CPU cores / 8 GPUs. Data labels as extracted: 2.4; 3.5; 7; 3; 4; 9; 86; 63; 264]

Caffe Deep Learning Framework

• Open source Deep Learning software from Berkeley Vision and Learning Center
• Updated to include CUDA support to utilize GPUs
• Standard version does NOT include MPI support
• MPI implementations
  • mpi-caffe
    • Used to train a large network across a cluster of machines
    • Model-parallel distributed approach
  • caffe-parallel
    • Faster framework for deep learning
    • Data-parallel via MPI, splits the training data across nodes
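To make the data-parallel idea concrete, the sketch below shows one simple way training samples could be divided among MPI ranks, with every rank holding a full copy of the model and a disjoint slice of the data. It is not code from caffe-parallel; the `shard` function and the round-robin policy are assumptions for illustration, and in a real MPI program the rank and world size would come from the MPI library rather than a loop.

```python
def shard(samples, rank, world_size):
    """Return the samples assigned to `rank` using round-robin assignment."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return samples[rank::world_size]


# Ten toy sample IDs split across four ranks.
samples = list(range(10))
shards = [shard(samples, rank, 4) for rank in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

A model-parallel scheme, by contrast, would give every rank the same samples and split the network's layers or parameters among ranks instead.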

Summary: GPUDirect RDMA over 40GbE iWARP

• iWARP provides RDMA capabilities to an Ethernet network
• iWARP uses tried and true TCP/IP as its underlying transport mechanism
• Using iWARP does not require a whole new network infrastructure and the management requirements that come along with it
• iWARP can be used with existing software running on GPUDirect RDMA with NO CHANGES required to the code
• Applications that use GPUDirect RDMA will see huge performance improvements
• Chelsio provides 10/40Gb iWARP TODAY with 25/50/100Gb on the horizon

More information: GPUDirect RDMA over 40GbE iWARP

• Visit our website, www.chelsio.com, for more White Papers, Benchmarks, etc.
• GPUDirect RDMA White Paper: http://www.chelsio.com/wp-content/uploads/resources/T5-40Gb-Linux-GPUDirect.pdf
• Webinar: https://www.brighttalk.com/webcast/13671/189427
• Beta code for GPUDirect RDMA is available TODAY from our download site at service.chelsio.com
• [email protected]
• [email protected]

Questions?

Thank You