PSI-SIM: System Performance Evaluation Environment for Next-Generation Supercomputers
K. Inoue, H. Shibamura, R. Susukita, Y. Inadomi, H. Honda, Y. Yu, and M. Aoyagi
Kyushu University, ISIT, IST
Background
• “Peta” is tremendous!
  – Compared with “Giga”- or “Tera”-scale machines
  “How are you, Mr. Tera?” “I am fine! How about you, Mr. Peta?”
• To develop a “Peta-scale” supercomputer, we need to…
  – Explore the design space of both the computation nodes and the interconnection network!
  – Verify the effective performance that will be achieved!
• So, we need a performance evaluation environment for peta-scale supercomputers!
Our Goal!
• Problem…
  – Simulations are three orders of magnitude slower than real machines!
  – “Peta-scale” is three orders of magnitude larger than “Tera-scale” (i.e., available machines)!
  – How can we bridge the gap?
• Develop an efficient performance evaluation environment: PSI-SIM
  – Separate compute-node simulation from network simulation!
  – Abstract the target application program to accelerate simulation!
Performance-Evaluation Flow of PSI-SIM
[Figure: flow diagram. Parallelized Application (e.g. Peta-scale) → BSIM-Parser → Skeleton Code → BSIM-Logger → Comm. profile (w/o latency) → NSIM → Comm. profile (w/ latency) → ANA → Visualization, Hints for Optimization. Other boxes: Performance Info., DB for Processors, Interconnect Configuration, Interconnect Arch., Target machine.]
Step 1: Generate a skeleton code
Step 2: Execute on an existing machine
Step 3: Simulate the interconnection network
Step 4: Visualize and analyze the results
What is the Skeleton Code?
Original code
    foo( ) {
      Inst. Block A
      for (i=0; i<n; i++) {
        Inst. Block B
        if (hoge) {
          Inst. Block C
        } else {
          Inst. Block D
        }
        Inst. Block E
      }
      MPI_Comm.
      Inst. Block F
      for (j=0; j<n; j++)
        for (k=0; k<n; k++)
          Func( );
    }
Skeleton code
    foo( ) {
      BSIM_ADD_TIME(10ms)
      MPI_Comm.
      BSIM_ADD_TIME(1ms)
      BSIM_ADD_TIME(15s)
    }
• Computation blocks are replaced by “estimated” execution times!
• Other modifications (e.g. reducing required memory size)
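The skeleton idea can be sketched in a few lines of Python. This is only an illustration, not the BSIM implementation: `VirtualClock`, `add_time`, and `foo_skeleton` are hypothetical names, and the mapping of each time value to a particular instruction block is inferred from its position on the slide.

```python
class VirtualClock:
    """Accumulates estimated execution time instead of running real computation."""

    def __init__(self):
        self.now = 0.0  # virtual time in seconds

    def add_time(self, seconds):
        # Hypothetical analogue of BSIM_ADD_TIME: only the timer advances.
        self.now += seconds


def foo_skeleton(clock, comm_log):
    clock.add_time(0.010)                     # code before the communication (10 ms)
    comm_log.append(("MPI_Comm", clock.now))  # the communication call is kept and time-stamped
    clock.add_time(0.001)                     # code right after the communication (1 ms)
    clock.add_time(15.0)                      # the trailing loop nest (15 s)


clock = VirtualClock()
comm_log = []
foo_skeleton(clock, comm_log)
print(f"virtual time: {clock.now:.3f} s, events: {comm_log}")
```

Running the skeleton takes microseconds of real time while the virtual clock advances by about 15 seconds, which is where the speedup over executing the original code comes from.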
Generating Communication Profile
• BSIM-Logger
  – Executes the skeleton code on an existing machine
  – Emulates the behavior of the target machine
  – Generates a communication profile under the assumption of a ZERO-latency ideal network
• Why Fast?
  – Abstracted computation blocks are NOT executed (only the virtual timers are updated)
  – Real communications are masked, but accurate logs are still generated
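A minimal sketch of zero-latency logging follows, assuming a simplified model that is not taken from BSIM-Logger itself: each rank runs a script of `compute`, `send`, and `recv` steps, sends never block, and a receive completes at the later of the receiver's virtual time and the sender's send time. The function name and the record layout are hypothetical.

```python
def log_zero_latency(scripts):
    """scripts: {rank: [("compute", seconds) | ("send", dst, nbytes) | ("recv", src)]}
    Returns (profile, clocks); profile holds (send_time, src, dst, nbytes) records."""
    clocks = {r: 0.0 for r in scripts}  # one virtual timer per rank
    pos = {r: 0 for r in scripts}       # next step to run on each rank
    in_flight = {}                      # (src, dst) -> FIFO of (send_time, nbytes)
    profile = []
    progressed = True
    while progressed:
        progressed = False
        for r, ops in scripts.items():
            while pos[r] < len(ops):
                op = ops[pos[r]]
                if op[0] == "compute":
                    clocks[r] += op[1]  # computation only advances the virtual timer
                elif op[0] == "send":
                    _, dst, nbytes = op
                    in_flight.setdefault((r, dst), []).append((clocks[r], nbytes))
                    profile.append((clocks[r], r, dst, nbytes))
                else:  # "recv"
                    queue = in_flight.get((op[1], r))
                    if not queue:
                        break  # matching send not issued yet; try this rank again later
                    send_time, _ = queue.pop(0)
                    clocks[r] = max(clocks[r], send_time)  # zero latency: arrival == send time
                pos[r] += 1
                progressed = True
    return sorted(profile), clocks


scripts = {
    0: [("compute", 2.0), ("send", 1, 1024), ("recv", 1)],
    1: [("compute", 0.5), ("recv", 0), ("compute", 1.0), ("send", 0, 8)],
}
profile, clocks = log_zero_latency(scripts)
print(profile, clocks)
```

No data is actually transferred in this sketch; only the timing and size of each message are recorded, which mirrors the "mask real communications, but generate accurate logs" point above.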
How Fast? How Accurate?
[Charts: for ERI (Electron Repulsion Integral) and NAS Parallel FT, bar charts compare the original and skeleton codes on Time for logging (s) and Exe. Time Predicted (s).]
Fast, Flexible Interconnection Network Simulator
• NSIM
  – Takes the communication profile and a network configuration file as input
  – Generates a communication profile with estimated interconnect latency
• Why Fast? Why Flexible?
  – Parallelized implementation
  – Supports a number of parameters
    • Topology, specifications of routers/switches, buffer size, and so on
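The step from a zero-latency profile to a profile with latency can be illustrated with a deliberately simple model. The sketch below is not NSIM's model: it assumes a ring topology, a fixed per-hop delay, and a single link bandwidth, and it ignores contention, buffering, and the knock-on delay of later events. All names and parameters are hypothetical.

```python
def ring_hops(src, dst, n_nodes):
    """Shortest hop count between two nodes on a bidirectional ring."""
    d = abs(src - dst)
    return min(d, n_nodes - d)


def add_latency(profile, n_nodes, per_hop_s, bandwidth_bytes_per_s):
    """profile: list of (send_time, src, dst, nbytes) recorded with zero latency.
    Returns the same records with an estimated arrival time appended."""
    out = []
    for send_time, src, dst, nbytes in profile:
        latency = ring_hops(src, dst, n_nodes) * per_hop_s + nbytes / bandwidth_bytes_per_s
        out.append((send_time, src, dst, nbytes, send_time + latency))
    return out


zero_latency_profile = [(0.0, 0, 3, 1000), (1.0, 1, 2, 500)]
with_latency = add_latency(zero_latency_profile, n_nodes=4,
                           per_hop_s=1e-6, bandwidth_bytes_per_s=1e9)
for record in with_latency:
    print(record)
```

Changing `n_nodes`, `per_hop_s`, or the hop function corresponds to trying a different interconnect configuration against the same profile, which is why the profile only has to be generated once.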
Performance of BSIM + NSIM
• Performance prediction for HPL execution on a 16-node PC cluster
• <120 s (problem size = 5,000) on 8 CPUs
• About 9,000 MPI-Comm./s on 8 CPUs
[Chart: Execution Time (s), Measured vs. Predicted; Error = 5.3%; not skeleton execution]
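For reference, a prediction error of this kind is usually computed as a relative difference. The sketch below assumes the error is taken relative to the measured time, which the slides do not state, and the two input times are hypothetical values chosen only to reproduce a 5.3% figure.

```python
def prediction_error_percent(measured_s, predicted_s):
    """Signed relative error of the prediction, in percent of the measured time."""
    return 100.0 * (predicted_s - measured_s) / measured_s


measured, predicted = 100.0, 105.3  # hypothetical times in seconds, not from the slides
err = prediction_error_percent(measured, predicted)
print(f"error = {err:+.1f}%, accuracy = {100.0 - abs(err):.1f}%")
```

Under this reading, an error of 5.3% corresponds to the 94.7% accuracy quoted in the backup slides.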
ANA GroupWork Viewer
• Group Work: indicates load balance
• Performance Indicator: execution time after load-balance optimization
• Communication Indicator: amount of communications per second
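How such indicators might be derived from a profile can be sketched as follows. This is an assumption about what the indicators mean, not ANA's actual computation: the balanced time is taken to be the total work spread evenly over all processes, and both function names are hypothetical.

```python
def load_balance_indicators(busy_s):
    """busy_s: per-process computation time in seconds."""
    current = max(busy_s)                 # the slowest process determines the finish time
    balanced = sum(busy_s) / len(busy_s)  # same total work, spread evenly
    return {"current_s": current, "balanced_s": balanced, "imbalance": current / balanced}


def comms_per_second(event_times_s):
    """Average number of communication events per second over the logged span."""
    span = max(event_times_s) - min(event_times_s)
    return len(event_times_s) / span if span > 0 else float("inf")


indicators = load_balance_indicators([4.0, 2.0, 3.0, 3.0])
rate = comms_per_second([0.0, 0.5, 1.0, 2.0])
print(indicators, rate)
```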
Conclusions
• PSI-SIM
  – Performance evaluation environment for supercomputers
  – BSIM + NSIM + ANA
• Ongoing work: performance prediction for
  – A “Tera-scale” machine (1K CPU cores) by using a “Giga-scale” machine (e.g. 32 CPU cores)
  – A “Peta-scale” machine (4K PSI-SIMD CPUs) by using a “Giga-scale” machine
Backup Slides
Peta-scale Performance Prediction
• Assumption
  – HPL problem size: 3 million
  – # of nodes: 4K (PSI-SIMD)
  – BSIM: uses 32 CPUs (3 GHz Xeon)
  – NSIM: 10,000 MPI-Comm./s on 8 CPUs
• How long do we need to spend?
  – BSIM: about 300 h (<2 weeks)
  – NSIM: about ??
    • still being estimated…
Predicted Execution Time (FT)
Error: -11.6%
Error: -11.3%
Target machine?: rscc
Used machine?: rscc
Communication Profile Time (FT)
86% reduction
19% reduction
Target machine?: rscc
Used machine?: rscc
Predicted Execution Time (ERI)
Error: -0.2%
Error: 1.5%
Error: -0.6%
Target machine?: rscc
Used machine?: rscc
Communication Profile Generation Time (ERI)
91% reduction
96% reduction
97% reduction
Target machine?: rscc
Used machine?: rscc
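Execution-Time Prediction Performance
[Chart label: communication latency]
Larger evaluated application ⇒ higher prediction accuracy
Prediction accuracy: 94.7%
Simulation Time (problem size fixed at 2000)
More processes in the evaluated application ⇒ higher parallel processing efficiency
Recent Results (Speedup)
[Chart: time in minutes for 16 processes, 256 processes, and 1,024 processes]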
Performance of NSIM
Accuracy: 94.7%
7.92, 8.36, 8.04
114 s
Target machine?: PSI-hexa
Used machine?: PSI-hexa