Mining Supercomputer Jobs' I/O Behavior from System Logs
Xiaosong Ma
OLCF Architecture Overview
[Figure: Rhea (512-node development cluster) and Eos (736-node Cray XC30 cluster) connect through the Scalable IO Network (SION, InfiniBand) to Atlas1 and Atlas2, each drawn with 144 OSS servers and 1008 OSTs (LUNs). A monitoring tool records per-OST I/O throughput in a MySQL database, producing the server-side I/O throughput logs.]
Server-side I/O Throughput Logs
RAID controller: coarse-granule logging
I/O throughput logs:
• Zero overhead
• No user effort
• No impact on user I/O
• Mixed I/O traffic
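To illustrate why server-side logs carry mixed traffic, here is a minimal sketch that sums per-OST throughput records into a single server-side series. The record layout `(timestamp_s, ost_id, write_mb_per_s)` and the function name are assumptions made for illustration; the slides do not describe the actual log schema.

```python
from collections import defaultdict

def aggregate_throughput(records):
    """Sum per-OST write throughput into one server-side time series.

    records: iterable of (timestamp_s, ost_id, write_mb_per_s) tuples.
    This layout is hypothetical; the real log schema is not given.
    """
    totals = defaultdict(float)
    for timestamp, _ost_id, write_mbps in records:
        # Every job writing at this moment contributes to the same total,
        # so the aggregated series cannot tell the jobs apart.
        totals[timestamp] += write_mbps
    return sorted(totals.items())

records = [(0, 1, 10.0), (0, 2, 5.0), (60, 1, 0.0), (60, 2, 20.0)]
series = aggregate_throughput(records)
```

The aggregated value at each timestamp is the sum over all OSTs, regardless of which job produced the traffic, which is the "mixed I/O traffic" limitation listed above.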
Prior Work: IOSI Workflow
IOSI input:
• Target app (User ID + App ID)
• Throughput logs
• Job scheduler logs:

Start_time        End_time
2011-10-16 00:00  2011-10-16 02:01
2011-10-17 01:00  2011-10-17 04:00
2011-10-18 05:10  2011-10-18 07:20

IOSI steps:
• Sample set
• Data preprocessing
• Per-sample wavelet transform
• Cross-sample I/O burst identification
IOSI output
[Figure: plots of write throughput (GB/s) versus time (s) for the throughput logs and samples (0 to 1000 s, 0 to 5 GB/s) and for the IOSI output (0 to 500 s, 0 to 3.5 GB/s).]
IOSI paper: “Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces”, FAST '14
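The sample set can be pictured as one slice of the throughput log per run of the target app, cut at the start and end times from the job scheduler log. The sketch below is a minimal illustration under assumptions: the function name, the in-memory `(datetime, gb_per_s)` log layout, and the time format string are hypothetical and not taken from IOSI itself.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # matches the Start_time/End_time columns above

def extract_samples(throughput, job_runs):
    """Cut one sample per job run out of a server-side throughput log.

    throughput: list of (datetime, gb_per_s) pairs (hypothetical layout).
    job_runs:   list of (start_time, end_time) strings from the scheduler log.
    Returns one list per run of (seconds_since_job_start, gb_per_s) pairs.
    """
    samples = []
    for start_str, end_str in job_runs:
        start = datetime.strptime(start_str, TIME_FORMAT)
        end = datetime.strptime(end_str, TIME_FORMAT)
        samples.append([
            ((t - start).total_seconds(), value)
            for t, value in throughput
            if start <= t <= end
        ])
    return samples
```

Each sample still contains traffic from every other job that ran in the same window, which is why the later steps compare samples against each other.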
Strong assumption: identical runs of app.
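The following sketch shows why the identical-runs assumption matters. It is a simplified stand-in, not the IOSI algorithm: a Haar approximation (block averages, which is what remains when all Haar detail coefficients are dropped) replaces the per-sample wavelet transform, and a point-wise minimum across aligned samples replaces the cross-sample burst identification. All function names and the threshold are hypothetical.

```python
def haar_smooth(signal, levels=2):
    """Keep only the level-`levels` Haar approximation of a signal.

    Dropping all Haar detail coefficients leaves block means over
    2**levels points; a short trailing block is averaged on its own.
    """
    block = 2 ** levels
    smoothed = []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed

def common_bursts(samples, threshold, levels=2):
    """Return (start, end) index ranges where every sample shows a burst.

    Interference from other jobs only adds throughput, so the point-wise
    minimum across samples keeps what all runs have in common. This only
    works if the runs are identical and aligned at job start.
    """
    smoothed = [haar_smooth(s, levels) for s in samples]
    length = min(len(s) for s in smoothed)
    floor = [min(s[t] for s in smoothed) for t in range(length)]
    bursts, start = [], None
    for t, value in enumerate(floor):
        if value >= threshold and start is None:
            start = t
        elif value < threshold and start is not None:
            bursts.append((start, t))
            start = None
    if start is not None:
        bursts.append((start, length))
    return bursts

run_a = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 3, 3, 3, 3]  # noise at the end
run_b = [2, 2, 2, 2, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0]  # noise at the start
bursts = common_bursts([run_a, run_b], threshold=2.0)
```

If the runs differ, for example in input size or output frequency, their bursts no longer line up and the point-wise comparison discards real application I/O along with the noise.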
AID: Automatic I/O Diverter
Automatically identifying I/O-heavy apps (no prior knowledge, no user involvement)
[Figure: job timelines over time for App1 to App6 (Job 1, Job 2, ...), combined with job scheduler logs, yield a scheduling suggestion.]
Job scheduler logs:

Start_time        End_time
2015-10-16 00:00  2015-10-16 02:01
2015-10-17 01:00  2015-10-17 04:00
2015-10-18 05:10  2015-10-18 07:20
SC|16 Tech paper presentation: Thursday 2pm, 355D
Application I/O Characterization Results

Name                                  Value
Total number of logged jobs           181,969
Unique applications identified        9,998
Initial I/O-intensive candidates      95
Candidates passing scope checking     67
Candidates passing minimum support    42
User-verified candidates              8

Results from 5 months of Titan I/O traffic and job logs (user verification by email)
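As a worked example of the arithmetic behind this table, the sketch below computes how much of each stage survives into the next. The counts are copied from the table; the function name and output layout are illustrative only.

```python
funnel = [
    ("Unique applications identified", 9998),
    ("Initial I/O-intensive candidates", 95),
    ("Candidates passing scope checking", 67),
    ("Candidates passing minimum support", 42),
    ("User-verified candidates", 8),
]

def stage_ratios(stages):
    """Return (name, count, share_of_previous, share_of_first) per stage."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / previous, count / first))
        previous = count
    return rows

for name, count, of_prev, of_all in stage_ratios(funnel):
    print(f"{name:<36} {count:>5}  {of_prev:6.1%} of previous  {of_all:7.3%} of all apps")
```

Under 1% of the identified applications are flagged as initial I/O-intensive candidates, and each later filter narrows that set further.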
ID  Nodes  Time (min)  OSTs  App. domain
1   8192   1440        64    Geo-sciences
2   250    6-60        1008  Combustion
3   2048   30-185      1008  Astrophysics
4   1760   720         180   Combustion
5   1024   110-230     1008  Systems research
6   200    30-190      1008  Combustion
7   1008   13-17       1008  Computer Science
8   16388  43-310      800   Environmental

User-verified I/O-intensive applications
Applications not using parallel I/O systems well!
• Similar finding to Huong's 2015 HPDC work (Darshan)
• Motivates better I/O performance data analysis
• Connecting programs to systems
Questions?
Xiaosong Ma
Qatar Computing Research Institute, Hamad Bin Khalifa University
I/O Contention on Large-Scale HPC Systems
• 27.1 PF peak performance
• 18,688 compute nodes
  • 16-core AMD Opteron
  • Nvidia Tesla GPU
  • 32 + 6 GB memory
• 3-D torus interconnect
ORNL's Titan (World's #3 Supercomputer)
Performance variance on HPC
• Shared parallel file system
• Collisions between I/O-heavy jobs -> I/O performance degradation
I/O performance variance on Titan with IOR [6]
CDF of per-OST I/O throughput
• 88.4% of the time: < 1% of capacity (5 MB/s)
• 98.5% of the time: < 5% of capacity (25 MB/s)
• 99.6% of the time: < 20% of capacity (100 MB/s)
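Figures like these can be read off an empirical CDF of per-OST throughput samples. The sketch below is a minimal illustration: the 500 MB/s capacity is inferred from the slide (1% of capacity equals 5 MB/s), and the function name and sample values are made up for the example.

```python
from bisect import bisect_left

OST_CAPACITY_MBPS = 500.0  # inferred from the slide: 1% of capacity = 5 MB/s

def fraction_below(samples_mbps, fraction_of_capacity, capacity=OST_CAPACITY_MBPS):
    """Share of throughput samples strictly below a fraction of OST capacity."""
    ordered = sorted(samples_mbps)
    cutoff = fraction_of_capacity * capacity
    # bisect_left returns how many sorted values are strictly below the cutoff.
    return bisect_left(ordered, cutoff) / len(ordered)

# Made-up samples in MB/s, for illustration only.
samples = [0.5, 1, 2, 3, 4, 10, 20, 30, 90, 400]
for fraction in (0.01, 0.05, 0.20):
    share = fraction_below(samples, fraction)
    print(f"{share:.1%} of samples below {fraction:.0%} of capacity")
```

Evaluating the function at 1%, 5%, and 20% of capacity on the real per-OST logs would give the three points quoted on the slide.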