PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi Kyusyu University, ISIT, IST

Page 1: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi

Kyusyu University, ISIT, IST

Page 2: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Background

• “Peta” is tremendous!
  – Compared with “Giga or Tera” scale machines

How are you, Mr. Tera?

I am fine! How about you, Mr. Peta?

Page 3: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Background

• “Peta” is tremendous!
  – Compared with “Giga or Tera” scale machines
• To develop a “Peta-Scale” supercomputer, you need to:
  – Explore the design space of both the computation nodes and the interconnection network!
  – Verify the effective performance that can be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!

Page 4: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Our Goal!

• Problem…
  – Simulations are three orders of magnitude slower than real machines!
  – “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e. available machines)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate compute-node simulation from network simulation!
  – Abstract the target application program to accelerate simulation!

Page 5: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance-Evaluation Flow of PSI-SIM

[Flow diagram. Labels recovered from the figure, grouped by step:]

• Step 1: Generate a skeleton code
  – Parallelized Application (e.g. Peta-scale), BSIM-Parser, Skeleton Code
• Step 2: Execute on an existing machine
  – BSIM-Logger, DB for Processors (target machine), Comm. profile (w/o latency)
• Step 3: Simulate the interconnection network
  – NSIM, Interconnect Configuration, Interconnect Arch. (target machine), Comm. profile (w/ latency)
• Step 4: Visualize and analyze the results
  – ANA, Performance Info., Visualization, Hints for Optimization

Page 6: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance-Evaluation Flow of PSI-SIM

[Flow diagram repeated from Page 5.]

Page 7: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

What is the Skeleton Code?

Original code

    foo( ) {
      Inst. Block A
      for (i=0; i<n; i++) {
        Inst. Block B
        if (hoge) {
          Inst. Block C
        } else {
          Inst. Block D
        }
        Inst. Block E
      }
      MPI_Comm.
      Inst. Block F
      for (j=0; j<n; j++)
        for (k=0; k<n; k++)
          Func( );
    }

Skeleton code

    foo( ) {
      BSIM_ADD_TIME(10ms)
      MPI_Comm.
      BSIM_ADD_TIME(1ms)
      BSIM_ADD_TIME(15s)
    }

• Computation blocks are replaced by “Estimated” execution times!
• Other modifications (e.g. reducing required memory size)
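
The skeleton idea can be sketched in plain C as follows. This is only an illustration: `VirtualClock`, `bsim_add_time_us`, and `mpi_comm_placeholder` are hypothetical names standing in for whatever `BSIM_ADD_TIME` and the retained MPI call do in the real tool. Which estimate belongs to which computation block is also an assumption, read off the order in which the calls appear on the slide.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-process virtual clock (microseconds). */
typedef struct {
    uint64_t now_us;
} VirtualClock;

/* Stand-in for BSIM_ADD_TIME: instead of running a computation block,
 * advance the virtual clock by its estimated execution time. */
static void bsim_add_time_us(VirtualClock *clk, uint64_t estimated_us) {
    clk->now_us += estimated_us;
}

/* Placeholder for the communication call that the skeleton keeps.
 * A real skeleton would issue the MPI call here. */
static void mpi_comm_placeholder(const VirtualClock *clk) {
    printf("MPI_Comm. issued at virtual time %llu us\n",
           (unsigned long long)clk->now_us);
}

/* Skeleton version of foo(): no computation is executed. */
static uint64_t foo_skeleton(VirtualClock *clk) {
    bsim_add_time_us(clk, 10u * 1000u);          /* 10 ms (assumed: Block A and the i-loop) */
    mpi_comm_placeholder(clk);
    bsim_add_time_us(clk, 1u * 1000u);           /* 1 ms  (assumed: Block F) */
    bsim_add_time_us(clk, 15u * 1000u * 1000u);  /* 15 s  (assumed: the j/k loop nest) */
    return clk->now_us;
}
```

Calling `foo_skeleton` on a zero-initialized clock returns immediately on the host, while the virtual clock advances by the sum of the three estimates.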

Page 8: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance-Evaluation Flow of PSI-SIM

[Flow diagram repeated from Page 5.]

Page 9: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Generating Communication Profile

• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why Fast?
  – Abstracted computation blocks are NOT executed (just update virtual timers)
  – Mask real communications, but generate accurate logs
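
A minimal sketch of this kind of logging, assuming a simple in-memory event list. The types and functions (`Logger`, `CommEvent`, `logger_add_time`, `logger_send`) are hypothetical and do not reflect the actual BSIM-Logger interface or its profile format.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 1024

/* One record of the communication profile (hypothetical layout). */
typedef struct {
    uint64_t t_us;   /* virtual time at which the send was issued */
    int src;
    int dst;
    size_t bytes;
} CommEvent;

typedef struct {
    uint64_t clock_us;             /* virtual timer of this process */
    CommEvent events[MAX_EVENTS];
    size_t count;
} Logger;

/* An abstracted computation block only advances the virtual timer. */
static void logger_add_time(Logger *lg, uint64_t us) {
    lg->clock_us += us;
}

/* A masked send: no payload is transferred, only the event is logged.
 * Under the zero-latency assumption the clock is not advanced here.
 * Returns 0 on success, -1 if the log is full. */
static int logger_send(Logger *lg, int src, int dst, size_t bytes) {
    if (lg->count >= MAX_EVENTS) {
        return -1;
    }
    CommEvent ev = { lg->clock_us, src, dst, bytes };
    lg->events[lg->count++] = ev;
    return 0;
}
```

The cost on the host is one timer update per computation block and one record per message, regardless of how long the original computation would have run.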

Page 10: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

How Fast? How Accurate?

[Charts comparing Original and Skeleton code for two benchmarks. Each benchmark has two panels: “Time for logging (s)” and “Exe. Time Predicted (s)”.]

• ERI (Electron Repulsion Integral)
• NAS PARALLEL FT

Page 11: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance-Evaluation Flow of PSI-SIM

[Flow diagram repeated from Page 5.]

Page 12: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Fast, Flexible Interconnection Network Simulator

• NSIM
  – Inputs the communication profile and a network configuration file
  – Generates a communication profile with estimated interconnect latency
• Why Fast? Why Flexible?
  – Parallelized implementation
  – Supports a number of parameters
    • Topology, spec. of routers/switches, buffer size, and so on
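
To illustrate the input/output relationship, here is a deliberately simplified sketch. It assumes a ring topology, a fixed per-hop latency, and a fixed link bandwidth, and it ignores contention and buffering. None of this is NSIM's actual model; `NetConfig`, `Message`, `ring_hops`, and `annotate_latency` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical network configuration: a ring with uniform links. */
typedef struct {
    int nodes;                /* number of nodes on the ring */
    uint64_t hop_latency_ns;  /* latency added by each hop */
    uint64_t bytes_per_us;    /* link bandwidth */
} NetConfig;

/* One message from the zero-latency communication profile. */
typedef struct {
    uint64_t send_ns;    /* send time taken from the input profile */
    int src;
    int dst;
    uint64_t bytes;
    uint64_t arrive_ns;  /* filled in by annotate_latency() */
} Message;

/* Shortest distance between two nodes on a bidirectional ring. */
static int ring_hops(const NetConfig *cfg, int src, int dst) {
    int d = src > dst ? src - dst : dst - src;
    return d < cfg->nodes - d ? d : cfg->nodes - d;
}

/* Turn a profile without latency into one with estimated latency:
 * arrival = send + hops * hop latency + serialization time. */
static void annotate_latency(const NetConfig *cfg, Message *msgs, int n) {
    for (int i = 0; i < n; i++) {
        uint64_t hops = (uint64_t)ring_hops(cfg, msgs[i].src, msgs[i].dst);
        uint64_t transfer_ns = msgs[i].bytes * 1000u / cfg->bytes_per_us;
        msgs[i].arrive_ns = msgs[i].send_ns
                          + hops * cfg->hop_latency_ns
                          + transfer_ns;
    }
}
```

Changing the fields of `NetConfig` and re-running the annotation over the same profile mirrors, in miniature, how one communication profile can be reused to compare several interconnect configurations.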

Page 13: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance of BSIM + NSIM

• Performance prediction for HPL execution @ 16-node PC cluster
• <120 s (problem size = 5,000) @ 8 CPUs
• About 9,000 MPI-Comm./s @ 8 CPUs

[Chart: Execution Time (s), Measured vs. Predicted; Error = 5.3%]

Not skeleton execution
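
The slide gives only the resulting error figure, not the formula behind it. Assuming the conventional relative-error definition (an assumption, with $T_{\text{predicted}}$ and $T_{\text{measured}}$ denoting the two bars of the chart), the quoted error is consistent with the 94.7% accuracy reported on the backup slides:

```latex
\[
\text{Error} = \frac{\lvert T_{\text{predicted}} - T_{\text{measured}} \rvert}{T_{\text{measured}}} \times 100\%,
\qquad
\text{Accuracy} = 100\% - \text{Error} = 100\% - 5.3\% = 94.7\%
\]
```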

Page 14: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance-Evaluation Flow of PSI-SIM

[Flow diagram repeated from Page 5.]

Page 15: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

ANA GroupWork Viewer

Group Work
• Indicates load balance

Performance Indicator
• Execution time after load-balance optimization

Communication Indicator
• Amount of communication per second
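
The slide does not define how these indicators are computed. As one plausible reading, the sketch below derives them from per-process statistics: load imbalance as maximum over mean busy time, the mean busy time as an idealized "after load balancing" execution time, and messages per second over the elapsed time. All names (`ProcStats`, `load_imbalance`, `balanced_time_estimate`, `comm_rate`) and all three definitions are assumptions, not ANA's actual metrics.

```c
/* Hypothetical per-process summary extracted from a profile. */
typedef struct {
    double busy_s;             /* time spent computing, in seconds */
    unsigned long comm_count;  /* number of communication calls */
} ProcStats;

/* Mean busy time: the execution time if the same total work were
 * spread perfectly evenly (an idealized lower bound). */
static double balanced_time_estimate(const ProcStats *p, int n) {
    if (n <= 0) return 0.0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += p[i].busy_s;
    return sum / n;
}

/* Load imbalance: max busy time divided by mean busy time.
 * 1.0 means perfectly balanced; larger values mean worse balance. */
static double load_imbalance(const ProcStats *p, int n) {
    double mean = balanced_time_estimate(p, n);
    if (mean <= 0.0) return 0.0;
    double max = p[0].busy_s;
    for (int i = 1; i < n; i++) {
        if (p[i].busy_s > max) max = p[i].busy_s;
    }
    return max / mean;
}

/* Communication rate: total communication calls per elapsed second. */
static double comm_rate(const ProcStats *p, int n, double elapsed_s) {
    if (n <= 0 || elapsed_s <= 0.0) return 0.0;
    unsigned long total = 0;
    for (int i = 0; i < n; i++) total += p[i].comm_count;
    return (double)total / elapsed_s;
}
```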

Page 16: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Conclusions

• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  – a “Tera-Scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  – a “Peta-Scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

Page 17: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Backup Slides

Page 18: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Peta-scale Performance Prediction

• Assumption
  – HPL problem size: 3 million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: uses 32 CPUs (3 GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s @ 8 CPUs
• How long will it take?
  – BSIM: about 300 h (<2 weeks)
  – NSIM: about ??
    • still being estimated…
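
The "<2 weeks" remark is a unit conversion, and the open NSIM figure depends on how many MPI communications the run produces, which the slide does not give. The helpers below are a sketch of that arithmetic only; `hours_to_days` and `nsim_hours` are hypothetical names, and any event count passed to `nsim_hours` is the caller's own assumption.

```c
/* Convert a duration in hours to days. */
static double hours_to_days(double hours) {
    return hours / 24.0;
}

/* Estimated network-simulation time in hours, given a total number of
 * MPI communication events and a simulator throughput in events/s.
 * The event count is NOT given on the slide; callers must supply it. */
static double nsim_hours(double mpi_events, double events_per_second) {
    if (events_per_second <= 0.0) return 0.0;
    return mpi_events / events_per_second / 3600.0;
}
```

For the BSIM figure, `hours_to_days(300.0)` gives 12.5 days, which is under 14. For NSIM, the estimate scales linearly with the event count at the assumed 10,000 MPI-Comm./s.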

Page 19: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Predicted execution time (FT)

Error: -11.6%

Error: -11.3%

Target machine?: rscc
Used machine?: rscc

Page 20: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Communication profile time (FT)

86% reduction

19% reduction

Target machine?: rscc
Used machine?: rscc

Page 21: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Predicted execution time (ERI)

Error: -0.2%

Error: 1.5%

Error: -0.6%

Target machine?: rscc
Used machine?: rscc

Page 22: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Communication profile generation time (ERI)

91% reduction

96% reduction

97% reduction

Target machine?: rscc
Used machine?: rscc

Page 23: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Execution-time prediction performance

Communication latency

Larger evaluated application ⇒ better prediction accuracy

Prediction accuracy: 94.7%

Page 24: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

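Simulation time (fixed problem size: 2000)

More processes in the evaluated application ⇒ higher parallel processing efficiency

Recent results (speedup)

[Chart: simulation time in minutes for each case]

• 16 processes
• 256 processes
• 1,024 processes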

Page 25: PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers

Performance of NSIM

Accuracy: 94.7%

7.92, 8.36, 8.04

114 s

Target machine?: PSI-hexa
Used machine?: PSI-hexa