Supercomputing Challenges at the National Center for Atmospheric Research
Dr. Richard Loft, Computational Science Section, Scientific Computing Division
National Center for Atmospheric Research, Boulder, CO USA


Page 1: Supercomputing Challenges at the National Center for Atmospheric Research

Supercomputing Challenges at the National Center for Atmospheric Research

Dr. Richard Loft
Computational Science Section
Scientific Computing Division
National Center for Atmospheric Research
Boulder, CO USA

Page 2: Supercomputing Challenges at the National Center for Atmospheric Research

Talk Outline

• Supercomputing Trends and Constraints
• Observed NCAR Cluster Performance (Aggregate)
• Microprocessor efficiency: what is possible?
• Microprocessor efficiency: recent efforts to improve CAM2 performance
• Some RISC/Vector Cluster Comparisons
• Conclusions

Page 3: Supercomputing Challenges at the National Center for Atmospheric Research

The Demand: High Cost of Science Goals

• Climate scientists project a need for 150x more computing power over the next 5 years.
• T42 -> T85: doubling the horizontal resolution increases computational cost roughly eightfold (see the sketch below).
• Many additional constituents will be advected.
• New physics: the computational cost of CAM/CCM, holding resolution constant, has increased 4x since 1996. More is coming…
• Future: introducing super-parameterizations of moist processes would increase physics costs dramatically.
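The eightfold figure follows from a simple scaling argument: doubling horizontal resolution doubles the grid points in each of two horizontal directions and, for CFL-limited explicit dynamics, also halves the stable timestep. A minimal sketch of that arithmetic (the function name and the CFL assumption are illustrative, not from the talk):

```python
# Rough cost-scaling sketch. Assumption: explicit, CFL-limited dynamics,
# so refining the horizontal grid spacing also shortens the stable timestep.
def relative_cost(h_refine=2.0, v_refine=1.0):
    """Relative cost when horizontal resolution is refined by h_refine
    in each direction and vertical resolution by v_refine."""
    grid_points = (h_refine ** 2) * v_refine  # more columns (and levels)
    timesteps = h_refine                      # shorter dt -> more steps
    return grid_points * timesteps

print(relative_cost(2.0))  # ~8x for a horizontal doubling (T42 -> T85)
```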

Page 4: Supercomputing Challenges at the National Center for Atmospheric Research

Existing Infrastructure Limits at NCAR

Cooling capacity ~ 450 tons (1.58 megawatts)
– Most limiting
– One P690 node ~ 7.9 kW ~ 2.5 tons
– Balance cooling with power

Power ~ 1.2 MW without modifications
– Second most limiting
– The NCAR computer room currently draws 602 kW
– About 400 kW of that is from the IBM clusters

Space ~ 14,000 sq. ft.
– P690 ~ 196 W/sq. ft.
– Least limiting based on current trends (a back-of-envelope check of these limits follows)
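A quick, hedged check of how these limits interact, assuming the usual conversion of one ton of cooling to roughly 3.5 kW of heat removal (the variable names are illustrative):

```python
# Back-of-envelope check of the facility limits quoted above.
TON_KW = 3.517            # assumed: 1 ton of cooling ~ 3.517 kW removed

cooling_tons = 450        # installed cooling capacity
power_kw = 1200           # room power without modifications
current_kw = 602          # present draw
p690_kw, p690_tons = 7.9, 2.5   # one 32-way P690 node

print(f"cooling capacity ~ {cooling_tons * TON_KW / 1000:.2f} MW")        # ~1.58 MW
print(f"nodes supportable by cooling ~ {int(cooling_tons / p690_tons)}")  # ~180
print(f"additional nodes within power headroom ~ {int((power_kw - current_kw) / p690_kw)}")
```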

Page 5: Supercomputing Challenges at the National Center for Atmospheric Research

Mass Storage Growth

• 1.3 PBytes total; adding ~3 TBytes/day
• 5-year doubling times:
– Unique files: 2.1 years
– File size: 10.4 years
– Media performance (GB/$): 1.9 years

• Alarming trends:
– The MSS growth-rate doubling time has accelerated over the past year; it is now 18 months (see the projection sketch below).
– MSS costs are increasing…
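To make the 18-month doubling time concrete, a small projection sketch starting from the ~1.3 PB archive (purely illustrative; it ignores the separate ~3 TB/day figure and any further change in growth rate):

```python
# Exponential-growth projection with an 18-month doubling time.
start_pb = 1.3          # current archive size, PBytes
doubling_months = 18.0

def projected_pb(months):
    return start_pb * 2 ** (months / doubling_months)

for years in (1, 3, 5):
    print(f"after {years} year(s): ~{projected_pb(12 * years):.1f} PB")
```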

Page 6: Supercomputing Challenges at the National Center for Atmospheric Research

NCAR Mass Storage System Growth

[Figure: MSS holdings in MBytes (log scale, 1.0E+07 to 1.0E+10) vs. date from 01/2002 through 07/2007, with curves for unique_MB and total_MB.]

Page 7: Supercomputing Challenges at the National Center for Atmospheric Research
Page 8: Supercomputing Challenges at the National Center for Atmospheric Research
Page 9: Supercomputing Challenges at the National Center for Atmospheric Research

Observed Cluster Performance (Aggregate)

Page 10: Supercomputing Challenges at the National Center for Atmospheric Research

IBM Clusters at NCAR

Bluesky: 1024-processor IBM 1.3 GHz Power-4 cluster
– 32 P690/32 compute servers
– 736 processors in 92 8-way "nodes" (bluesky8)
– 288 processors in 9 32-way "nodes" (bluesky32)
– Peak: 5.3 TFlops (see the peak-rate sketch below)
– Dual "Colony" interconnect

Blackforest: IBM 375 MHz Power-3 cluster
– 283 "winterhawk" 4-way SMPs
– Peak: 1.698 TFlops
– TBMX interconnect
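The peak figures are processors times clock times flops per cycle; both Power-3 and Power-4 can retire up to four floating-point results per cycle via two fused multiply-add units. A sketch of that arithmetic (the function name is illustrative):

```python
# Peak-rate arithmetic for the two clusters.
def peak_tflops(processors, clock_ghz, flops_per_cycle=4):
    return processors * clock_ghz * flops_per_cycle / 1e3

print(f"bluesky:     ~{peak_tflops(1024, 1.3):.2f} TFlops")        # ~5.3
print(f"blackforest: ~{peak_tflops(283 * 4, 0.375):.3f} TFlops")   # ~1.698
```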

Page 11: Supercomputing Challenges at the National Center for Atmospheric Research

Observed IBM Cluster Efficiencies

System        Application efficiency (% of peak)
bluesky8      4.1%
bluesky32     4.5%
blackforest   5.7%

• Newer systems are less efficient.
• Larger nodes are more efficient.
• Max sustained performance: 320.3 GFlops

Page 12: Supercomputing Challenges at the National Center for Atmospheric Research

Why is workload efficiency low?

• Computational character of the workload average:
– L3 cache miss rate: 31%
– Computational intensity: 0.8

• Applications are memory-bandwidth limited.
– A simple bandwidth model predicts 5.5% of peak for bluesky32.

• A good metric of efficiency is Flop/cycle (see the sketch after this list).
– It factors out the dual FPUs.
– Bluesky32: 0.18 Flop/cycle
– Blackforest: 0.23 Flop/cycle
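The Flop/cycle figures are simply sustained flops divided by processor count and clock rate. A small sketch of how they fall out of the quoted efficiencies, taking per-processor peaks of 5.2 and 1.5 GFlops from the surrounding slides:

```python
# Flop/cycle = sustained GFlops / (processors * clock in GHz).
def flops_per_cycle(sustained_gflops, processors, clock_ghz):
    return sustained_gflops / (processors * clock_ghz)

# Per-processor sustained rates implied by the observed fractions of peak.
print(round(flops_per_cycle(0.045 * 5.2, 1, 1.3), 2))    # bluesky32   -> ~0.18
print(round(flops_per_cycle(0.057 * 1.5, 1, 0.375), 2))  # blackforest -> ~0.23
```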

Page 13: Supercomputing Challenges at the National Center for Atmospheric Research

RISC Cluster Network Comparison

• IBM Power-4 cluster with dual "Colony" network
• IBM Power-3 cluster with single TBMX network
• Compaq Alpha cluster with Quadrics network

• Bisection bandwidth
– Important for global communications
– The XPAIR benchmark initiates all-to-all communication.
– Dual Colony P690 local:global bandwidth ratio is 50:1

• Global reductions (see the sketch after this list)
– For P processors these should scale as log(P).
– They actually scale linearly.
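For context on the reduction scaling: a tree-structured reduction over P processors needs only about log2(P) communication steps, whereas the measured behavior grows roughly in proportion to P. A minimal illustration of the step counts (not a model of the actual hardware):

```python
import math

# Ideal tree reduction: ~log2(P) steps; observed behavior: ~P steps.
for p in (64, 256, 1024):
    tree = math.ceil(math.log2(p))
    print(f"P={p:5d}  tree-reduction steps ~{tree:2d}   linear scaling ~{p}")
```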

Page 14: Supercomputing Challenges at the National Center for Atmospheric Research

Cluster Network Performance

Page 15: Supercomputing Challenges at the National Center for Atmospheric Research
Page 16: Supercomputing Challenges at the National Center for Atmospheric Research

Microprocessor efficiency:

What is possible?

Page 17: Supercomputing Challenges at the National Center for Atmospheric Research

Example: 3-D FFT Performance

• Hand-tuned, multithreaded 3-D FFT (STK)
• Three 1-D FFTs, one along each axis, with transpositions
• FFTs are memory-bandwidth intensive
– Both loads and Flops scale like N*log(N)

• The FFT is not multiply-add dominated.
• The FFT butterfly is a non-local, strided calculation.
• It gets more non-local as the size of the FFT increases.
• 1024^3 transforms on a P690 (IBM Power-4); see the sketch below.
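A rough work estimate for the 1024^3 case, using the standard ~5 N log2 N flop count for a complex radix-2 1-D FFT (that constant is a textbook estimate, not a figure from the talk):

```python
import math

# 3-D FFT work estimate: 1-D transforms along each of the three axes.
def fft3d_gflop(n):
    per_line = 5 * n * math.log2(n)   # ~flops for one length-n 1-D FFT
    lines = 3 * n * n                 # n*n pencils along each of 3 axes
    return per_line * lines / 1e9

print(f"1024^3 transform: ~{fft3d_gflop(1024):.0f} GFlop")   # ~161 GFlop
```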

Page 18: Supercomputing Challenges at the National Center for Atmospheric Research
Page 19: Supercomputing Challenges at the National Center for Atmospheric Research
Page 20: Supercomputing Challenges at the National Center for Atmospheric Research
Page 21: Supercomputing Challenges at the National Center for Atmospheric Research

Microprocessor efficiency:

Recent efforts to improve CAM2 performance…

Page 22: Supercomputing Challenges at the National Center for Atmospheric Research

CCM Benchmark Performance on Existing Multiprocessor Clusters

Page 23: Supercomputing Challenges at the National Center for Atmospheric Research
Page 24: Supercomputing Challenges at the National Center for Atmospheric Research
Page 25: Supercomputing Challenges at the National Center for Atmospheric Research
Page 26: Supercomputing Challenges at the National Center for Atmospheric Research

Some RISC/Vector Cluster Comparisons…

Page 27: Supercomputing Challenges at the National Center for Atmospheric Research

Processor Comparison

                    Power 4 (2 cores)    Pentium 4    Itanium II       SX-6
Process             0.18 µ Cu / 7l       0.13 µ Cu    0.18 µ Al / 6l   0.15 µ Cu / 9l
Clock (MHz)         1300                 2800         1000             500/1000
Peak GFlops         5.2                  2.8/5.6      4.0              8.0
Die area            400 mm2              145 mm2      421 mm2          420 mm2
Transistors         170 M                55 M         221 M            57 M
On-chip cache       1.77 MB              512 KB       3.3 MB           n/a
On-chip bandwidth   41 GB/s (per core)   89.6 GB/s    64 GB/s          n/a
Memory bandwidth    5.8 GB/s             4.3 GB/s     6.4 GB/s         32 GB/s

Page 28: Supercomputing Challenges at the National Center for Atmospheric Research

IBM P690 Cluster

• 5.3 TFlops peak
• 1024 processors (32 32-way P690 nodes)
• 5.2 GFlops/processor
• Observed 4.1-4.5% of peak on NCAR codes
• Max sustained on workload: 213.5 GFlops
• Est. peak price performance: $2.6/MFlops
• Sustained price performance: $59/MFlops
• Sustained power performance: 0.7 GFlops/kW

Page 29: Supercomputing Challenges at the National Center for Atmospheric Research

Earth Simulator

• 40.96 TFlops peak
• 5120 processors (640 8-processor GS40 nodes)
• 8 GFlops/processor
• Estimated 30% of peak on NCAR codes
• Est. max sustained on workload: 12,200 GFlops
• Est. peak price performance: $8.5/MFlops
• Est. sustained price performance: $28/MFlops
• Est. sustained power performance: 1.525 GFlops/kW (see the comparison sketch below)
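Taking the estimated sustained price and power figures from this slide and the preceding P690 slide at face value, the ratios behind the roughly 2x claim in the conclusions work out as follows (a sketch, not an independent cost analysis):

```python
# Sustained price ($/MFlops) and power (GFlops/kW) figures from the two slides.
p690 = {"price_per_mflops": 59.0, "gflops_per_kw": 0.7}
es = {"price_per_mflops": 28.0, "gflops_per_kw": 1.525}

print(f"Earth Simulator price advantage: ~{p690['price_per_mflops'] / es['price_per_mflops']:.1f}x")  # ~2.1x
print(f"Earth Simulator power advantage: ~{es['gflops_per_kw'] / p690['gflops_per_kw']:.1f}x")         # ~2.2x
```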

Page 30: Supercomputing Challenges at the National Center for Atmospheric Research

Power 4 die floor plan

Page 31: Supercomputing Challenges at the National Center for Atmospheric Research

Power 4 cache/CPU area comparison

Page 32: Supercomputing Challenges at the National Center for Atmospheric Research

Conclusions

• Infrastructure limits (power, cooling, space) are becoming critical constraints.
• NCAR IBM clusters sustain 4.1%-4.5% of peak.
• The workload is memory-bandwidth limited.
• RISC cluster interconnects are not great.
• We are making steady progress in learning how to program around these limitations.
• At this point, vector systems appear to be about 2x more cost effective in both price performance and power performance.

Page 33: Supercomputing Challenges at the National Center for Atmospheric Research

Pentium-4 die floor plan

Page 34: Supercomputing Challenges at the National Center for Atmospheric Research

Pentium-4 cache/CPU comparison

Page 35: Supercomputing Challenges at the National Center for Atmospheric Research

Itanium II die floor plan

Page 36: Supercomputing Challenges at the National Center for Atmospheric Research

Itanium II CPU/cache area comparison