PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers


K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi

Kyushu University, ISIT, IST

Background

• “Peta” is tremendous!
  – Compared with “Giga”- or “Tera”-scale machines
  – (Cartoon) “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”
• To develop a “peta-scale” supercomputer, we must
  – Explore the design space of both the computation nodes and the interconnection network
  – Verify the effective performance that can be achieved
• So we need a performance evaluation environment for peta-scale supercomputers!

Our Goal!

• Problem
  – Simulations are 3 orders of magnitude slower than real machines!
  – “Peta-scale” is 3 orders of magnitude larger than “Tera-scale” (i.e., the machines available today)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate the compute-node simulation from the network simulation
  – Abstract the target application program to accelerate the simulation

Performance-Evaluation Flow of PSI-SIM

[Figure: flow diagram. Tools: BSIM-Parser, BSIM-Logger, NSIM, ANA. Inputs: parallelized application (e.g. peta-scale), DB for processors, interconnect architecture, interconnect configuration, target machine. Intermediate data: skeleton code, comm. profile (w/o latency), comm. profile (w/ latency). Outputs: performance info., visualization, hints for optimization.]

• Step 1: Generate a skeleton code
• Step 2: Execute on an existing machine
• Step 3: Simulate the interconnection network
• Step 4: Visualize and analyze the results



What is the Skeleton Code?

Original code:

    foo( ) {
      Inst. Block A
      for (i=0; i<n; i++) {
        Inst. Block B
        if (hoge) { Inst. Block C }
        else { Inst. Block D }
        Inst. Block E
      }
      MPI_Comm.
      Inst. Block F
      for (j=0; j<n; j++)
        for (k=0; k<n; k++)
          Func( );
    }

Skeleton code:

    foo( ) {
      BSIM_ADD_TIME(10ms)
      MPI_Comm.
      BSIM_ADD_TIME(1ms)
      BSIM_ADD_TIME(15s)
    }

• Computation blocks are replaced by “estimated” execution times!
• Other modifications (e.g., reducing the required memory size)
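The slides do not show how BSIM_ADD_TIME is implemented. As a minimal sketch, assuming it simply advances a per-process virtual clock, the skeleton of foo( ) could behave as follows. The names `VirtualClock`, `vclock_init`, `vclock_add_time`, and `skeleton_foo` are hypothetical and are not part of PSI-SIM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process virtual clock, in nanoseconds. */
typedef struct {
    uint64_t now_ns;
} VirtualClock;

static void vclock_init(VirtualClock *c) {
    assert(c != NULL);
    c->now_ns = 0;
}

/* Assumed stand-in for BSIM_ADD_TIME: charge an estimated cost
 * to the virtual clock instead of executing the computation. */
static void vclock_add_time(VirtualClock *c, uint64_t ns) {
    assert(c != NULL);
    c->now_ns += ns;
}

/* Skeleton of foo( ) from the slide: three computation regions are
 * replaced by their estimated times (10 ms, 1 ms, 15 s). */
static uint64_t skeleton_foo(VirtualClock *c) {
    vclock_add_time(c, 10ULL * 1000 * 1000);        /* Blocks A-E: 10 ms */
    /* The MPI communication of the original code would be issued here. */
    vclock_add_time(c, 1ULL * 1000 * 1000);         /* Block F: 1 ms */
    vclock_add_time(c, 15ULL * 1000 * 1000 * 1000); /* nested loops: 15 s */
    return c->now_ns;
}
```

With these estimates the virtual clock reads 15.011 s after one call, while the host only performs three additions.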



Generating Communication Profile

• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why fast?
  – Abstracted computation blocks are NOT executed (only the virtual timers are updated)
  – Real communications are masked, but accurate logs are still generated
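The logging idea above can be sketched as follows. This is an illustration only, assuming a communication call is turned into a log record stamped with the current virtual time, with no data transferred and no latency charged. The record layout and the names `CommEvent`, `Logger`, `logger_add_time`, and `logger_send` are hypothetical, not BSIM-Logger's actual format or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EVENTS 1024 /* arbitrary capacity for this sketch */

/* One hypothetical profile record: who sent how much to whom, and when. */
typedef struct {
    int src;
    int dst;
    size_t bytes;
    uint64_t t_ns; /* virtual time at which the message is issued */
} CommEvent;

typedef struct {
    uint64_t now_ns; /* virtual clock */
    CommEvent events[MAX_EVENTS];
    size_t count;
} Logger;

static void logger_init(Logger *lg) {
    assert(lg != NULL);
    lg->now_ns = 0;
    lg->count = 0;
}

/* Abstracted computation: only the virtual timer is updated. */
static void logger_add_time(Logger *lg, uint64_t ns) {
    assert(lg != NULL);
    lg->now_ns += ns;
}

/* Masked communication: nothing is transferred, the event is only logged.
 * Zero-latency assumption: the virtual clock is NOT advanced here.
 * Returns 0 on success, -1 if the log is full. */
static int logger_send(Logger *lg, int src, int dst, size_t bytes) {
    assert(lg != NULL);
    if (lg->count >= MAX_EVENTS) {
        return -1;
    }
    lg->events[lg->count++] = (CommEvent){src, dst, bytes, lg->now_ns};
    return 0;
}
```

Because the clock is untouched by `logger_send`, the resulting profile contains no network delay; a later network simulation step is free to add it.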

How Fast? How Accurate?

[Figure: bar charts for ERI (Electron Repulsion Integral) and NAS Parallel FT, each comparing the original code with the skeleton code. Panels: “Time for logging (s)” and “Exe. time predicted (s)”.]



Fast, Flexible Interconnection Network Simulator

• NSIM
  – Takes the communication profile and a network configuration file as input
  – Generates a communication profile with the estimated interconnect latency
• Why fast? Why flexible?
  – Parallelized implementation
  – Supports a number of parameters: topology, specification of routers/switches, buffer size, and so on
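To make the role of the network configuration concrete, here is a deliberately simple analytic stand-in for the latency that such a simulator adds to each logged message. It is not NSIM's method: NSIM is described as a parallelized simulator, whereas this sketch assumes a 2-D mesh with dimension-order routing, a fixed per-hop router delay, and no contention. `NetConfig`, `mesh_hops`, and `message_latency_ns` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical network configuration for a 2-D mesh. */
typedef struct {
    int dim_x;                /* number of nodes per mesh row */
    uint64_t router_delay_ns; /* assumed fixed delay per hop */
    double bytes_per_ns;      /* assumed link bandwidth */
} NetConfig;

/* Hop count between two node ids under dimension-order routing. */
static int mesh_hops(const NetConfig *cfg, int src, int dst) {
    assert(cfg != NULL && cfg->dim_x > 0);
    int dx = abs(src % cfg->dim_x - dst % cfg->dim_x);
    int dy = abs(src / cfg->dim_x - dst / cfg->dim_x);
    return dx + dy;
}

/* Contention-free latency estimate: per-hop delay plus serialization time. */
static uint64_t message_latency_ns(const NetConfig *cfg, int src, int dst,
                                   size_t bytes) {
    assert(cfg != NULL && cfg->bytes_per_ns > 0.0);
    uint64_t hop_ns = (uint64_t)mesh_hops(cfg, src, dst) * cfg->router_delay_ns;
    uint64_t ser_ns = (uint64_t)((double)bytes / cfg->bytes_per_ns);
    return hop_ns + ser_ns;
}
```

Adding this estimate to the time stamp of each record in a zero-latency profile gives a profile with latency. A real simulator would also model buffers and contention, which this sketch ignores.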

Performance of BSIM + NSIM

• Performance prediction for HPL execution on a 16-node PC cluster
• <120 s (problem size = 5,000) @8CPU
• About 9,000 MPI-Comm./s @8CPU
• Not skeleton execution

[Figure: execution time (s), measured vs. predicted. Error = 5.3%]
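The slides report a prediction error of 5.3% here and an accuracy of 94.7% in the backup slides, but do not define either quantity. A plausible reading, stated as an assumption, is a signed relative error against the measured time, with accuracy taken as 100% minus its magnitude. The function names below are hypothetical.

```c
#include <assert.h>

/* Assumed definition: signed relative error of the prediction, in percent.
 * Negative values mean the predicted time is shorter than the measured one. */
static double prediction_error_percent(double measured_s, double predicted_s) {
    assert(measured_s > 0.0);
    return (predicted_s - measured_s) / measured_s * 100.0;
}

/* Assumed definition: accuracy = 100% minus the magnitude of the error. */
static double prediction_accuracy_percent(double error_percent) {
    double magnitude = error_percent < 0.0 ? -error_percent : error_percent;
    return 100.0 - magnitude;
}
```

Under this reading, an error of 5.3% corresponds to the 94.7% accuracy quoted later, and the signed form matches the negative errors shown for FT and ERI.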



ANA GroupWork Viewer

• Group Work: indicates the load balance
• Performance Indicator: execution time after load-balance optimization
• Communication Indicator: amount of communication per second

Conclusions

• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  – A “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  – A “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine

Backup Slides

Peta-scale Performance Prediction

• Assumptions
  – HPL problem size: 3 million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: uses 32 CPUs (3 GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s @8CPU
• How long do we need to spend?
  – BSIM: about 300 h (<2 weeks)
  – NSIM: about ??
• Still being estimated…

Predicted execution time (FT)

Error: -11.6%

Error: -11.3%

Target machine?: rscc
Used machine?: rscc

Communication profile time (FT)

86% reduction

19% reduction


Predicted execution time (ERI)

Error: -0.2%

Error: 1.5%

Error: -0.6%


Communication profile generation time (ERI)

91% reduction

96% reduction

97% reduction


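Execution-time prediction performance

Communication latency

Larger evaluated application ⇒ higher prediction accuracy

Prediction accuracy: 94.7%

Simulation time (problem size fixed at 2000)

More processes in the evaluated application ⇒ higher parallel-processing efficiency

Recent results (speedup)

[Figure: axis unit is minutes; series: 16 processes, 256 processes, 1,024 processes]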

Performance of NSIM

Accuracy: 94.7%

7.92, 8.36, 8.04

114 s

Target machine?: PSI-hexa
Used machine?: PSI-hexa
