
Designing HPC Solutions

Onur Celebioglu

Dell Inc

Agenda

• HPC Focus Areas
• Performance analysis of HPC components
  – Compute
  – Interconnect
  – Accelerators
  – And many more
• Best Practices
• Designing better HPC solutions
  – Domain-specific appliances

HPC at Dell

• Evaluate new HPC technologies and selectively adopt them for integration
• Share our findings with the broader HPC community
• Analyze decision points to obtain the optimal solution to the problem at hand
• Decision points include, but are not limited to:
  – Compute performance
  – Memory performance
  – Interconnect
  – Accelerators
  – Storage
  – Power / energy efficiency
  – Software stack
  – Middleware
• Focus areas
  – Define best practices by analyzing each and every component of an HPC cluster
  – Use these best practices to develop plug-and-play solutions targeted at specific HPC verticals such as life sciences, fluid dynamics, high-frequency trading, etc.

Compute, Memory & Energy Efficiency

12G – Optimal BIOS Settings

[Chart: Energy efficiency gains with Turbo disabled (relative to Turbo enabled), DAPC and Perf profiles, for HPL, Fluent truck_poly_14m, Fluent truck_111m, WRF conus_12k, NAMD stmv, MILC (Intel input file) and LU class D; y-axis range 0.80-1.20. Per the data labels, disabling Turbo costs roughly 4-15% in performance (unchanged in one case) while saving roughly 10-26% in power.]
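To make the chart's data labels concrete: a label such as "-5% Perf, +19% Power Saving" lands above 1.0 on the y-axis if the plotted quantity is performance per watt relative to the Turbo-enabled run. That interpretation is an assumption (the slide does not define the metric), but it is consistent with the 0.80-1.20 axis range; a minimal sketch of the conversion:

```python
# Minimal sketch: convert a data label such as "-5% Perf, +19% Power Saving" into a
# relative energy-efficiency value, assuming the y-axis is performance per watt
# normalized to the Turbo-enabled baseline (an interpretation, not stated on the slide).
def relative_energy_efficiency(perf_delta_pct: float, power_saving_pct: float) -> float:
    rel_perf = 1.0 + perf_delta_pct / 100.0     # -5% performance   -> 0.95
    rel_power = 1.0 - power_saving_pct / 100.0  # +19% power saving -> 0.81
    return rel_perf / rel_power                 # relative performance per watt

print(relative_energy_efficiency(-5, 19))   # ~1.17: about a 17% efficiency gain
print(relative_energy_efficiency(-15, 21))  # ~1.08
```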

System Profile comparison (four configurations):

Setting           | Balanced configuration                | Performance focused   | Energy Efficient configuration | Latency sensitive
System Profile    | Performance Per Watt Optimized (DAPC) | Performance Optimized | Custom                         | Custom
CPU Power Mgmt    | System DBPM                           | Max Performance       | System DBPM                    | Max Performance
Turbo Boost       | Enabled                               | Enabled               | Disabled                       | Disabled
C States & C1E    | Enabled                               | Disabled              | Enabled                        | Disabled
Monitor/Mwait     | Enabled                               | Enabled               | Enabled                        | Disabled
Logical Processor | Disabled                              | Disabled              | Disabled                       | Disabled
Node Interleaving | Disabled                              | Disabled              | Disabled                       | Disabled
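A practical companion to the table is a quick OS-level check that a node's effective settings match the intended profile after a BIOS change. The sketch below reads a few well-known Linux sysfs files; the exact paths depend on the kernel and the active cpufreq/cpuidle driver (e.g. intel_pstate), so treat them as assumptions rather than a definitive audit.

```python
"""Rough sanity check of node tuning from Linux sysfs.
Paths vary with kernel version and driver; adjust for the target system."""
from pathlib import Path

def read(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "n/a"

# Turbo Boost: intel_pstate exposes no_turbo (1 means Turbo is disabled).
print("turbo disabled :", read("/sys/devices/system/cpu/intel_pstate/no_turbo"))

# Scaling governor hints at whether the OS (DBPM-style) or BIOS drives P-states.
print("cpu0 governor  :", read("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"))

# Logical Processor (Hyper-Threading): newer kernels expose smt/active.
print("SMT active     :", read("/sys/devices/system/cpu/smt/active"))

# C-states visible to cpu0; more than one state usually means deep C-states are enabled.
cpuidle_dir = Path("/sys/devices/system/cpu/cpu0/cpuidle")
n_states = len(list(cpuidle_dir.glob("state*"))) if cpuidle_dir.exists() else 0
print("cpuidle states :", n_states)
```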

Ivy Bridge vs Sandy Bridge: Single Node

• E5-2670 8C 2.6 GHz (SB) vs E5-2697 v2 12C 2.7 GHz (IVB)

[Chart: Performance gain (%) with Ivy Bridge (12-core) over Sandy Bridge (8-core), single node. Data labels, matched to benchmarks in slide order: HPL 46, ANSYS Fluent 25, LS-DYNA 16, Simulia Abaqus 6.12 S4B 3, Simulia Abaqus 6.12 E6 12, LAMMPS 26, MUMPS 37.]

Decision: Processor selection. Criteria: Performance

• 2 x E5-2697 v2 2.7 GHz 12c 130W does the best in most cases
• All tests done on four fully subscribed servers with FDR interconnect

[Chart: Performance across four nodes using multiple IVB processors]

Decision: Processor selection. Criteria: Power

• 2 x E5-2697 v2 2.7 GHz 12c 130W does the best in most cases
• All tests done on four fully subscribed servers with FDR interconnect

[Chart: Energy efficiency across four nodes using multiple IVB processors]

Decision: Memory selection. Criteria: Performance

• Dual-rank memory modules give the best performance (a quick way to check the installed DIMM rank is sketched after this section)
• All tests done on four fully subscribed servers with FDR interconnect

[Chart: Performance drop when using single-rank memory modules on 4 nodes]
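As referenced above, single- vs dual-rank DIMMs are not always obvious once a system is racked; one quick way to check is to count the Rank field that dmidecode reports for each memory device. The sketch below assumes dmidecode is installed and run as root, and that the platform's SMBIOS tables populate the per-DIMM Rank field (most recent servers do).

```python
"""Count installed DIMMs by rank using dmidecode (run as root).
Assumes the platform's SMBIOS tables populate the per-DIMM Rank field."""
import subprocess
from collections import Counter

out = subprocess.run(["dmidecode", "-t", "memory"],
                     check=True, capture_output=True, text=True).stdout

ranks = Counter(line.split(":", 1)[1].strip()
                for line in out.splitlines()
                if line.strip().startswith("Rank:"))

# e.g. {'2': 16} for a node populated entirely with dual-rank DIMMs
print("DIMM count by rank:", dict(ranks))
```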

Interconnect Performance

OSU Latency and Bandwidth (FDR vs 40 GigE RoCE)

• How do benchmarks, synthetic kernels and micro-benchmarks behave at scale?
• Can micro-benchmark performance explain an application's performance at a larger scale?

[Charts: MPI OSU latency: FDR 1.37 us vs 40GigE RoCE 1.63 us. MPI OSU bandwidth: FDR 6262.73 MB/s vs 40GigE RoCE 4902.3 MB/s.]

MVAPICH2-2.0b and OMB v4.2
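The latency and bandwidth numbers above come from the OSU micro-benchmarks (OMB) run over MVAPICH2. A minimal sketch of scripting such a measurement is shown below; the host names, OMB install path and launcher arguments are assumptions about the test environment, not details taken from the slides.

```python
"""Run osu_latency between two hosts and report the 1-byte MPI latency.
Host names, the OMB path and launcher flags are placeholders for illustration."""
import subprocess

HOSTS = "node001,node002"                    # hypothetical compute nodes
OSU_LATENCY = "/opt/omb/pt2pt/osu_latency"   # hypothetical OMB install path

def small_message_latency() -> float:
    out = subprocess.run(
        ["mpirun", "-np", "2", "-hosts", HOSTS, OSU_LATENCY],
        check=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        cols = line.split()
        # OMB data rows are "<message size> <latency in us>"; take the 1-byte row.
        if len(cols) == 2 and cols[0] == "1":
            return float(cols[1])
    raise RuntimeError("1-byte latency not found in OMB output")

if __name__ == "__main__":
    print(f"1-byte MPI latency: {small_message_latency():.2f} us")
```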

RoCE vs IB vs TCP

[Chart: HPL performance relative to FDR (GFlops) with MVAPICH2-2.0b, from 1 node / 20 cores up to 32 nodes / 640 cores: 40GigE RoCE stays within about 2% of FDR at every scale, while 40GigE TCP drops to roughly 0.93-0.95 of FDR at 8-32 nodes.]

[Chart: NPB LU class D performance relative to FDR (rating), 1-32 nodes: 40GigE RoCE stays within about 2% of FDR, while 40GigE TCP degrades steadily beyond 4 nodes, down to about 0.29 of FDR at 32 nodes / 640 cores.]

RoCE vs IB vs TCP

[Chart: WRF Conus 12km performance relative to FDR (rating), 1-32 nodes: 40GigE RoCE tracks FDR up to 4 nodes and then falls to roughly 0.83-0.94 of FDR, while 40GigE TCP falls to about 0.38 of FDR at 32 nodes.]

[Chart: MILC (Intel data set) performance relative to FDR (rating), 1-32 nodes: both 40GigE RoCE and 40GigE TCP fall off beyond 4 nodes, with RoCE dropping to roughly 0.6-0.9 of FDR and TCP to roughly 0.3-0.5.]

MVAPICH2-2.0b, WRF 3.5, MILC 7.6.2

Interconnect Summary

• InfiniBand still performs better than the other network fabrics in this study for HPC workloads
• For some workloads RoCE performs similarly to InfiniBand and may be a viable alternative
  – Haven't seen wide adoption of RoCE in production yet
  – Mileage will vary based on the application's communication characteristics
  – Needs switches with DCB support for optimal lossless performance
• Ethernet with TCP/IP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance: K20 vs K40

[Charts: HPL performance on a single node; power & energy efficiency of an eight-node cluster]

MV2 performance with GPU Direct: OMB Device-to-Device Latency

Intel Sandy Bridge (E5-2670), NVIDIA Tesla K20m GPU, Mellanox ConnectX-3 FDR, CUDA 6.0, OFED 2.2-1.0.0 with GPU Direct RDMA (Beta)

[Chart: OMB device-to-device latency with MVAPICH2 GPU Direct; data labels on the slide: 6.2 and 2.6x]
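MVAPICH2's CUDA-aware path used for the device-to-device numbers is enabled at run time through environment variables. MV2_USE_CUDA and MV2_USE_GPUDIRECT are documented MVAPICH2(-GDR) runtime parameters; the launcher flags, paths and host names below are assumptions for illustration, and OMB must be built with CUDA support.

```python
"""Sketch: device-to-device osu_latency run with MVAPICH2 CUDA support enabled.
Paths, host names and launcher flags are placeholders."""
import os
import subprocess

env = dict(os.environ,
           MV2_USE_CUDA="1",       # turn on CUDA-aware communication in MVAPICH2
           MV2_USE_GPUDIRECT="1")  # use the GPUDirect RDMA path where available

subprocess.run(
    ["mpirun", "-np", "2", "-hosts", "gpu001,gpu002",  # hypothetical GPU nodes
     "/opt/omb/pt2pt/osu_latency", "D", "D"],          # D D: both buffers on the GPU
    env=env, check=True)
```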

Domain-Specific Solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (continued)

Parameter                                   | Results and Analysis
Time taken for analyzing 30 samples         | 19.5 hours
Energy consumption for analyzing 30 samples | 222.77 kWh
kWh/Genome                                  | 7.42 kWh/Genome
Genomes/day                                 | 37
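The per-genome and per-day figures in the table follow directly from the measured run; the arithmetic, assuming one sample corresponds to one genome, is:

```python
# Derive the domain-level metrics from the measured run
# (assumes 1 sample = 1 genome, as the table implies).
samples = 30
hours = 19.5          # wall-clock time to analyze 30 samples
energy_kwh = 222.77   # energy consumed by the same run

kwh_per_genome = energy_kwh / samples    # ~7.4 kWh per genome
genomes_per_day = samples / hours * 24   # ~37 genomes per day
print(f"{kwh_per_genome:.2f} kWh/genome, {genomes_per_day:.0f} genomes/day")
```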

Advantages

• Metrics relevant to the domain instead of GFLOPS
• Energy efficient
• Plug and play
• Scalability
• What used to take 2 weeks now takes less than 4 hours
• More to follow


Collateral

Future Work and Potential Areas of Research

• Deployment tools
• Use of virtualization and cloud (OpenStack) in HPC
  – Linux Containers and Docker
• Hadoop
• Lustre FS
• Accelerators

Storage Blogs

– HTSS + DX Object Storage
  › http://dell.to/zJqiTK
– Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration
  › http://dell.to/GYWU5x
– Dell support for XFS greater than 100 TB
  › http://dell.to/GUjXRq
– NSS-HA 12G Performance Intro
  › http://dell.to/NFUafG
– NSS4.5-HA Solution Configurations
  › http://dell.to/10xLxJV
– Dell Fluid Cache for DAS performance with NFS
  › http://dell.to/15KnsDc
– Achieving over 100,000 IOPS with NFS Async
  › http://dell.to/16yE3bP
– Dell | Terascala HPC Storage Solution Part I
  › http://www.delltechcenter.com/page/Dell+|+Terascala+HPC+Storage+Solution+Part+I
– Dell | Terascala HPC Storage Solution Part 2
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2336.aspx
– DT-HSS3 Performance and Scalability
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2300.aspx


Storage Blogs (continued)

– Dell | Terascala HPC Storage Solution - HSS5
  › http://dell.to/1gpVVyN
– NSS overview
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2338.aspx
– NSS-HA overview
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2298.aspx
– NSS-HA XL configuration
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2299.aspx
– Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations
  › http://dell.to/1eZU0xL


Coprocessor Acceleration Blogs

– GPUDirect Improves Communication Bandwidth Between GPUs on the C410X
  › http://dell.to/ApnLz5
– Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations
  › http://dell.to/JsWqWT
– Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720
  › http://dell.to/JT79KF
– Accelerating High Performance Linpack (HPL) with GPUs
  › http://dell.to/MrYw8q
– Faster Molecular Dynamics with GPUs
  › http://dell.to/PEaFaF
– Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution
  › http://dell.to/14GtFRv


Best Practices Blogs

– 12G HPC Solution with ROCKS+ from StackIQ
  › http://dell.to/xGmSHO
– HPC mode on Dell PowerEdge R815 with AMD 6200 Processors
  › http://dell.to/MMGG4s
– Optimal BIOS settings for HPC workloads
  › http://dell.to/PkkMG1
– CFD Primer
  › http://dell.to/UwJQum
– OpenFOAM
  › http://dell.to/Rga3hS
– PowerEdge M420 with single Force10 MXL Switch
  › http://dell.to/Zjnhjz
– Active Infrastructure for HPC Life Sciences
  › http://dell.to/18eaDSJ
– Dell HPC Solution Refresh: Intel Xeon Ivy Bridge-EP, 1866 DDR3 memory and RHEL 6.4
  › http://dell.to/18U3Aki


Performance Blogs

– HPC I/O performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers
  › http://dell.to/wzdV0x
– HPC performance on the 12th Generation (12G) PowerEdge Servers
  › http://dell.to/zozohn
– Unbalanced Memory Configuration Performance
  › http://dell.to/UQ1kQu
– Performance analysis of HPC workloads
  › http://dell.to/STbE8q



Questions


bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 8: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Decision Processor selection Criteria Power

bull 2 x E5-2697-v2 27 GHz 12c 130W does the best in most cases

bull All tests done on fully subscribed 4 servers with FDR interconnect

Energy efficiency across four nodes using multiple IVB processors

Decision Memory selection Criteria Performance

bull Dual Rank memory modules give best performance

bull All tests done on fully subscribed 4 servers with FDR interconnect

Performance drop when using single rank memory modules on 4 nodes

Interconnect Performance

OSU Latency and Bandwidth (FDR vs 40 GigE RoCE)

bull How do benchmarks synthetic kernels and micro benchmarks behave at scale

bull Can micro benchmark performance explain applicationrsquos performance at a larger scale

137

163

12

125

13

135

14

145

15

155

16

165

FDR 40GigE RoCE

Late

ncy

(u

s)

MPI OSU latency

626273

49023

0

1000

2000

3000

4000

5000

6000

7000

FDR 40GigE RoCE

Ban

dw

idth

(M

Bs

)

MPI OSU Bandwidth

MVAPICH2-20b and OMB v42

RoCE vs IB vs TCP

1 1 1 1 1 1

10

0

10

1

10

2

10

2

10

0

10

1

10

0

10

1

10

1

09

5

09

4

09

3

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

GFl

op

s)

Nodes - Cores

HPL

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b

10

0

10

0

10

0

10

0

10

0

10

0

09

9

10

0

10

1

09

9

10

0

09

8

09

9

09

7

08

8

07

5

06

3

02

9

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

NPB LU Class D

FDR 40GigE-ROCE 40GigE-TCP

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 9: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Decision Memory selection Criteria Performance

bull Dual Rank memory modules give best performance

bull All tests done on fully subscribed 4 servers with FDR interconnect

Performance drop when using single rank memory modules on 4 nodes

Interconnect Performance

OSU Latency and Bandwidth (FDR vs 40 GigE RoCE)

bull How do benchmarks synthetic kernels and micro benchmarks behave at scale

bull Can micro benchmark performance explain applicationrsquos performance at a larger scale

137

163

12

125

13

135

14

145

15

155

16

165

FDR 40GigE RoCE

Late

ncy

(u

s)

MPI OSU latency

626273

49023

0

1000

2000

3000

4000

5000

6000

7000

FDR 40GigE RoCE

Ban

dw

idth

(M

Bs

)

MPI OSU Bandwidth

MVAPICH2-20b and OMB v42

RoCE vs IB vs TCP

1 1 1 1 1 1

10

0

10

1

10

2

10

2

10

0

10

1

10

0

10

1

10

1

09

5

09

4

09

3

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

GFl

op

s)

Nodes - Cores

HPL

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b

10

0

10

0

10

0

10

0

10

0

10

0

09

9

10

0

10

1

09

9

10

0

09

8

09

9

09

7

08

8

07

5

06

3

02

9

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

NPB LU Class D

FDR 40GigE-ROCE 40GigE-TCP

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 10: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Interconnect Performance

OSU Latency and Bandwidth (FDR vs 40 GigE RoCE)

bull How do benchmarks synthetic kernels and micro benchmarks behave at scale

bull Can micro benchmark performance explain applicationrsquos performance at a larger scale

137

163

12

125

13

135

14

145

15

155

16

165

FDR 40GigE RoCE

Late

ncy

(u

s)

MPI OSU latency

626273

49023

0

1000

2000

3000

4000

5000

6000

7000

FDR 40GigE RoCE

Ban

dw

idth

(M

Bs

)

MPI OSU Bandwidth

MVAPICH2-20b and OMB v42

RoCE vs IB vs TCP

1 1 1 1 1 1

10

0

10

1

10

2

10

2

10

0

10

1

10

0

10

1

10

1

09

5

09

4

09

3

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

GFl

op

s)

Nodes - Cores

HPL

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b

10

0

10

0

10

0

10

0

10

0

10

0

09

9

10

0

10

1

09

9

10

0

09

8

09

9

09

7

08

8

07

5

06

3

02

9

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

NPB LU Class D

FDR 40GigE-ROCE 40GigE-TCP

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 11: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

OSU Latency and Bandwidth (FDR vs 40 GigE RoCE)

bull How do benchmarks synthetic kernels and micro benchmarks behave at scale

bull Can micro benchmark performance explain applicationrsquos performance at a larger scale

137

163

12

125

13

135

14

145

15

155

16

165

FDR 40GigE RoCE

Late

ncy

(u

s)

MPI OSU latency

626273

49023

0

1000

2000

3000

4000

5000

6000

7000

FDR 40GigE RoCE

Ban

dw

idth

(M

Bs

)

MPI OSU Bandwidth

MVAPICH2-20b and OMB v42

RoCE vs IB vs TCP

1 1 1 1 1 1

10

0

10

1

10

2

10

2

10

0

10

1

10

0

10

1

10

1

09

5

09

4

09

3

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

GFl

op

s)

Nodes - Cores

HPL

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b

10

0

10

0

10

0

10

0

10

0

10

0

09

9

10

0

10

1

09

9

10

0

09

8

09

9

09

7

08

8

07

5

06

3

02

9

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

NPB LU Class D

FDR 40GigE-ROCE 40GigE-TCP

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 12: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

RoCE vs IB vs TCP

1 1 1 1 1 1

10

0

10

1

10

2

10

2

10

0

10

1

10

0

10

1

10

1

09

5

09

4

09

3

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

GFl

op

s)

Nodes - Cores

HPL

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b

10

0

10

0

10

0

10

0

10

0

10

0

09

9

10

0

10

1

09

9

10

0

09

8

09

9

09

7

08

8

07

5

06

3

02

9

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

NPB LU Class D

FDR 40GigE-ROCE 40GigE-TCP

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 13: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

RoCE vs IB vs TCP

1 1 1 1 1 1

09

9

10

0

10

2

09

4

08

4

08

3

10

0

10

0

09

4

07

2

05

1

03

8

0

02

04

06

08

1

12

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

WRF Conus 12Km

FDR 40GigE-ROCE 40GigE-TCP

10

0

10

0

10

0

10

0

10

0

10

0

10

1

09

8

08

7

05

9

07

5

05

7

10

1

09

8

04

9

02

9

03

0 04

0

000

020

040

060

080

100

120

1 - 20 2 - 40 4 - 80 8 - 160 16 - 320 32 - 640

Per

form

ance

Rel

ativ

e to

FD

R (

Rat

ing)

Nodes - Cores

MILC Intel Data set

FDR 40GigE-ROCE 40GigE-TCP

MVAPICH2-20b WRF 35 MILC 762

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 14: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Interconnect Summary

bull InfiniBand is still performs higher than other network fabrics in this study for HPC workloads

bull For some workloads RoCE performs similar to InfiniBand and may be a viable alternative

ndash Havenrsquot seen wide adoption of RoCE in production yet

ndash Mileage will vary based on applicationrsquos communication characteristics

ndash Needs switches with DCB support for optimal lossless performance

bull Ethernet with TCPIP stops scaling after 4-8 nodes

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 15: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Accelerator Performance

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogsndash HTSS + DX Object Storage

rsaquo httpdelltozJqiTK

ndash Dell HPC NFS Storage Solution with High Availability -- Large Capacity Configuration

rsaquo httpdelltoGYWU5x

ndash Dell support for XFS greater than 100 TB

rsaquo httpdelltoGUjXRq

ndash NSS-HA 12G Performance Intro

rsaquo httpdelltoNFUafG

ndash NSS45-HA Solution Configurations

rsaquo httpdellto10xLxJV

ndash Dell Fluid Cache for DAS performance with NFS

rsaquo httpdellto15KnsDc

ndash Achieving over 100000 IOPs with NFS Async

rsaquo httpdellto16yE3bP

ndash Dell | Terascala HPC Storage Solution Part I

rsaquo httpwwwdelltechcentercompageDell+|+Terascala+HPC+Storage+Solution+Part+I

ndash Dell | Terascala HPC Storage Solution Part 2

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2336aspx

ndash DT-HSS3 Performance and Scalability

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2300aspx

23

Storage Blogs Continuedndash Dell | Terascala HPC Storage Solution - HSS5

rsaquo httpdellto1gpVVyN

ndash NSS overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2338aspx

ndash NSS-HA overview

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2298aspx

ndash NSS-HA XL configuration

rsaquo httpencommunitydellcomtechcenterhigh-performance-computingwwiki2299aspx

ndash Dell HPC NFS Storage Solution - High Availability Solution NSS5-HA configurations

rsaquo httpdellto1eZU0xL

24

Coprocessor Acceleration Blogsndash GPUDirect Improves Communication Bandwidth Between GPUs on the C410X

rsaquo httpdelltoApnLz5

ndash Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations

rsaquo httpdelltoJsWqWT

ndash Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720

rsaquo httpdelltoJT79KF

ndash Accelerating High Performance Linpack (HPL) with GPUs

rsaquo httpdelltoMrYw8q

ndash Faster Molecular Dynamics with GPUs

rsaquo httpdelltoPEaFaF

ndash Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution

rsaquo httpdellto14GtFRv

25

Best Practices Blogsndash 12G HPC Solution with ROCKS+ from StackIQ

rsaquo httpdelltoxGmSHO

ndash HPC mode on Dell PowerEdge R815 with AMD 6200 Processors

rsaquo httpdelltoMMGG4s

ndash Optimal BIOS settings for HPC workloads

rsaquo httpdelltoPkkMG1

ndash CFD Primer

rsaquo httpdelltoUwJQum

ndash OpenFOAM

rsaquo httpdelltoRga3hS

ndash PowerEdge M420 with single Force10 MXL Switch

rsaquo httpdelltoZjnhjz

ndash Active Infrastructure for HPC Life Sciences

rsaquo httpdellto18eaDSJ

ndash Dell HPC Solution Refresh Intel Xeon Ivy Bridge-EP 1866 DDR3 memory and RHEL 64

rsaquo httpdellto18U3Aki

26

Performance Blogsndash HPC IO performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltowzdV0x

ndash HPC performance on the 12th Generation (12G) PowerEdge Servers

rsaquo httpdelltozozohn

ndash Unbalanced Memory Configuration Performance

rsaquo httpdelltoUQ1kQu

ndash Performance analysis of HPC workloads

rsaquo httpdelltoSTbE8q

27

Dell - Restricted - Confidential

Questions

Page 16: Designing HPC Solutions - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Time taken for analyzing 30 samples 19.5 Hours Energy Consumption for analyzing 30 samples 222.77

Power and Performance K20 vs K40

HPL performance on single-node

Power amp energy efficiency of an eight node cluster

MV2 performance with GPU Direct OMBDevice to Device Latency

Intel SandyBridge (E5-2670) NVIDIA Telsa K20m GPU Mellanox ConnectX-3 FDR CUDA 60OFED 22-100 with GPU Direct RDMA Beta

62

26x

Domain specific solutions

Dell Genomic Analysis Platform

Dell Genomic Analysis Platform (Continued)

Parameter Results and Analysis

Time taken for analyzing 30 samples 195 Hours

Energy Consumption for analyzing 30

samples

22277 kWh

kWhGenome 742 kWh Genome

Genomesday 37

Advantages

bull Metrics relevant to the domain instead of GFLOPs

bull Energy Efficient

bull Plug and Play

bull Scalability

bull What used to take 2 weeks now takes less than 4 hours

bull More to follow

Dell - Restricted - Confidential

Collateral

Future Work and Potential Areas of Research

bull Deployment tools

bull Use of virtualization and cloud (Openstack) in HPC

ndash Linux Containers and Docker

bull Hadoop

bull Lustre FS

bull Accelerators

Storage Blogs
– HTSS + DX Object Storage
  › http://dell.to/zJqiTK
– Dell HPC NFS Storage Solution with High Availability - Large Capacity Configuration
  › http://dell.to/GYWU5x
– Dell support for XFS greater than 100 TB
  › http://dell.to/GUjXRq
– NSS-HA 12G Performance Intro
  › http://dell.to/NFUafG
– NSS4.5-HA Solution Configurations
  › http://dell.to/10xLxJV
– Dell Fluid Cache for DAS performance with NFS
  › http://dell.to/15KnsDc
– Achieving over 100,000 IOPs with NFS Async
  › http://dell.to/16yE3bP
– Dell | Terascala HPC Storage Solution Part I
  › http://www.delltechcenter.com/page/Dell+|+Terascala+HPC+Storage+Solution+Part+I
– Dell | Terascala HPC Storage Solution Part 2
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2336.aspx
– DT-HSS3 Performance and Scalability
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2300.aspx


Storage Blogs (continued)
– Dell | Terascala HPC Storage Solution - HSS5
  › http://dell.to/1gpVVyN
– NSS overview
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2338.aspx
– NSS-HA overview
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2298.aspx
– NSS-HA XL configuration
  › http://en.community.dell.com/techcenter/high-performance-computing/w/wiki/2299.aspx
– Dell HPC NFS Storage Solution - High Availability Solution: NSS5-HA configurations
  › http://dell.to/1eZU0xL


Coprocessor Acceleration Blogs
– GPUDirect Improves Communication Bandwidth Between GPUs on the C410X
  › http://dell.to/ApnLz5
– Comparing GPU-Direct Enabled Communication Patterns for Oil and Gas Simulations
  › http://dell.to/JsWqWT
– Accelerating ANSYS Mechanical Simulations with M2090 GPU on the R720
  › http://dell.to/JT79KF
– Accelerating High Performance Linpack (HPL) with GPUs
  › http://dell.to/MrYw8q
– Faster Molecular Dynamics with GPUs
  › http://dell.to/PEaFaF
– Deploying and Configuring Intel Xeon Phi Coprocessor with HPC Solution
  › http://dell.to/14GtFRv


Best Practices Blogs
– 12G HPC Solution with ROCKS+ from StackIQ
  › http://dell.to/xGmSHO
– HPC mode on Dell PowerEdge R815 with AMD 6200 Processors
  › http://dell.to/MMGG4s
– Optimal BIOS settings for HPC workloads
  › http://dell.to/PkkMG1
– CFD Primer
  › http://dell.to/UwJQum
– OpenFOAM
  › http://dell.to/Rga3hS
– PowerEdge M420 with single Force10 MXL Switch
  › http://dell.to/Zjnhjz
– Active Infrastructure for HPC Life Sciences
  › http://dell.to/18eaDSJ
– Dell HPC Solution Refresh: Intel Xeon Ivy Bridge-EP, 1866 DDR3 memory and RHEL 6.4
  › http://dell.to/18U3Aki


Performance Blogs
– HPC I/O performance using PCI-E Gen3 slots on the 12th Generation (12G) PowerEdge Servers
  › http://dell.to/wzdV0x
– HPC performance on the 12th Generation (12G) PowerEdge Servers
  › http://dell.to/zozohn
– Unbalanced Memory Configuration Performance
  › http://dell.to/UQ1kQu
– Performance analysis of HPC workloads
  › http://dell.to/STbE8q


Questions
