My Research Experiences on Computer Performance Optimization
Shih-Hao Hung, Ph.D.
Sun Microsystems Inc.
Confidential Information – Do Not Forward
112/04/22 Confidential - Do Not Forward 2
Computer Performance
• Demands for high performance:
– Getting jobs done faster
– Getting more jobs done at the same time
– Getting complex jobs done in time
• Price-performance trade-offs:
– Getting jobs done efficiently
– Getting jobs done with limited resources
– Capacity planning
Performance Optimization
[Figure: performance optimization cycle. Labels: application, machine, performance profiling, performance analysis, performance tuning, modifying the application, adjusting the machine.]
My Research Background
• High-Performance/Technical Computing
Parallel Performance Project, University of Michigan, 1993-2000
– Parallelization for high-performance applications
– Performance characterization tools
– Performance optimization methodologies
• Commercial Applications Optimization
Performance and Availability Engineering, Sun Microsystems, Inc., 2000-present
– Database server optimization
– Network stack optimization
– Webserver optimization
– Security Infrastructure optimization
Parallel Performance Project
• Started by Prof. Edward S. Davidson of U of Michigan in 1990, funded by NSF, Ford Motor Co., UM Center of Parallel Computing, IBM, DoD, etc.
• Produced 11 Ph.D.’s in 10 years.
• Work on state-of-the-art parallel supercomputers and realistic applications
• Covers many aspects of computer architecture, from CPU pipelines to clustered systems.
• Optimization by all means: instruction scheduling, memory locality, parallelization, etc. via compiler techniques and hand-tuning.
Parallel Computing
• Very hot in the 1990s:
– People rushed to build large MPPs.
– Looked good in theory, but practical tools and experience were lacking.
– Most existing apps are difficult to parallelize.
– Failed to keep pace with Moore’s Law: the R&D cycle was too expensive and too long to keep up with CPU clock rates climbing from MHz to GHz.
• Looking ahead:
– Throughput computing and commercial workloads drive MP.
– Chip density and area favor SMT & CMP designs.
– Vendors are struggling to sustain the same growth in GHz.
– Multi-core processors and multiprocessor systems will become the norm in the coming years.
Optimizing Parallel Applications
• Very complex, difficult problems:
– Program parallelization
– Load balance
– Scheduling
– Minimize interprocessor communications
– Architecture-dependent optimization
• Today:
– Still lots of open problems.
– Parallelizing compilers are far from automatic solutions.
• Tomorrow:
– Further research and practical solutions will be in high demand as MP systems become popular at all levels.
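Load balance, one of the problems listed on this slide, can be quantified with a simple ratio. The Python sketch below is illustrative only: the function name `load_imbalance` and the sample busy times are assumptions, not taken from the slides. It treats a parallel phase as finished only when its slowest processor finishes.

```python
def load_imbalance(work_per_processor):
    """Return (parallel_time, ideal_time, imbalance) for one parallel phase.

    work_per_processor: busy time of each processor, in any consistent unit.
    """
    if not work_per_processor:
        raise ValueError("need at least one processor")
    total = sum(work_per_processor)
    if total <= 0:
        raise ValueError("total work must be positive")
    # The phase ends only when the most heavily loaded processor is done.
    parallel_time = max(work_per_processor)
    # With a perfectly even split, every processor would take the mean.
    ideal_time = total / len(work_per_processor)
    return parallel_time, ideal_time, parallel_time / ideal_time


if __name__ == "__main__":
    # Hypothetical busy times for four processors.
    t_par, t_ideal, ratio = load_imbalance([10.0, 12.0, 8.0, 18.0])
    print(f"parallel={t_par} ideal={t_ideal} imbalance={ratio:.2f}")
```

An imbalance of 1.0 means the work is evenly spread; larger values show how much longer the phase runs than an even split would allow.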
Hierarchical Performance Bounds
[Figure: hierarchy of performance bounds, listed in order of increasing run time. Each performance gap between successive bounds corresponds to one constraint on run time.]
Uniprocessor bounds:
– Machine (M) bound: machine peak performance
– M-Application (MA) bound: A-gap, mismatched application workload
– MA-Compiler (MAC) bound: C-gap, compiler-inserted instructions
– MAC-Schedule (MACS) bound: S-gap, data dependency, branches, pipeline bubbles
– MACS-Cache (MACS$) bound: $-gap, finite cache effect
Parallel bounds:
– Ideal-parallel (I) bound
– I-Parallelization (IP) bound*: P-gap, partial parallelization
– IP-Communication (IPC) bound*: C’-gap, interprocessor communication
– IPC-Load balance (IPCL) bound*: L-gap, overall load imbalance
– IPCL-Multiphase (IPCLM) bound*: M’-gap, multiple program phases
– IPCLM-Dynamic (IPCLMD) bound: D-gap, dynamic load behavior
– Actual run time: X-gap, behaviors not modeled
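The hierarchy can be read as a chain of run-time bounds in which each gap is the time added by one more constraint. The Python sketch below computes the gaps for the parallel part of the chain. The bound values are made-up placeholders, not measurements from the talk, and `gap_report` is a hypothetical helper name.

```python
# Parallel bounds in increasing order of run time, each paired with the gap
# that separates it from the previous bound (labels taken from the figure).
PARALLEL_CHAIN = [
    ("I", None),       # ideal-parallel bound, start of the chain
    ("IP", "P"),       # partial parallelization
    ("IPC", "C'"),     # interprocessor communication
    ("IPCL", "L"),     # overall load imbalance
    ("IPCLM", "M'"),   # multiple program phases
    ("IPCLMD", "D"),   # dynamic load behavior
    ("actual", "X"),   # behaviors not modeled
]


def gap_report(times):
    """Map each gap label to the time between two successive bounds."""
    gaps = {}
    previous = None
    for bound, gap in PARALLEL_CHAIN:
        current = times[bound]
        if previous is not None:
            if current < previous:
                raise ValueError(f"{bound} bound is below the previous bound")
            gaps[gap] = current - previous
        previous = current
    return gaps


if __name__ == "__main__":
    # Hypothetical bound values in seconds.
    times = {"I": 10, "IP": 12, "IPC": 15, "IPCL": 19,
             "IPCLM": 20, "IPCLMD": 22, "actual": 25}
    gaps = gap_report(times)
    print(gaps, "largest gap:", max(gaps, key=gaps.get))
```

Ranking the gaps by size suggests where tuning effort is likely to pay off first.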
Example: FCRASH
• Vehicle crash simulation at Ford.
• Finite-element code contains over 10,000 Fortran lines and 14 parallel loops.
• Profiled on a NUMA system (HP/Convex SPP-1000).
• P-gap: imperfect parallelization
• C’-gap: inter-cluster communications
• L & M’-gaps: Load balancing issues
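The P-gap reflects code that stays serial. As a rough illustration, Amdahl's law (a standard textbook model swapped in here; the slides do not say how the P-gap was computed) gives the best possible speedup when only a fraction of the run time is parallelized. The fraction and processor counts below are arbitrary examples, not FCRASH data.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of the run time is parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # Serial time is unchanged; parallel time shrinks with processor count.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical case: 90% of the run time is inside parallel loops.
    for p in (2, 8, 32):
        print(p, round(amdahl_speedup(0.9, p), 2))
```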
Goal-directed Optimization
[Figure: seven-step flow (Step 1 to Step 7) from the serial program to the performance-tuned parallel program. Step labels: Decompose the Problem Domain; Tune the Communication Performance; Optimize Processor Performance; Balance the Load per Phase; Reduce Synchronization Overhead; Balance the Combined Load of Multiple Phases; Balance Dynamic Load. Gaps addressed: A, C, S, $-gap; C’-gap; L-gap; M’-gap; D-gap; L, M’, D-gap.]
Performance Tuning
Modeling a Parallel Application
[Figure: building the application model. Sources of application information: programmer, algorithm, program source, input data, machine, profile, trace. Steps: algorithm analysis, source analysis, profile analysis, trace analysis, domain decomposition. Application model components: data layout, data dependence, control flow, weight distribution, domain decomposition. Legend: primary path, auxiliary path.]
Model-Driven Simulation
[Figure: model-driven simulation (MDS). Inputs: machine model and application model (data layout, data dependence, control flow, weight distribution, domain decomposition). Analyses: communication analysis, workload analysis, cache performance analysis, data-flow analysis, CXbound analysis. Outputs: load imbalance, timing, synchronization cost, scheduling cost, data flow, communication pattern, cache misses, performance bounds.]
Performance Tuning Results
• SP - initial parallel version
• SD - changing domain decomposition to reduce load imbalance (L-gap) and communications (C’-gap)
• SD2 - SD + array padding to reduce false-sharing communications (Unmodeled-gap)
• SD3 - SD2 + eliminating thread migration to reduce communications (Unmodeled-gap)
• SD4 - SD3 + eliminating unnecessary synchronization barriers (S’-gap)
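Array padding (the SD2 step) spaces each thread's data so that no two threads write to the same cache line. The Python sketch below shows only the index arithmetic; a Fortran code such as FCRASH would apply the same idea to array dimensions. The 64-byte line size, 8-byte elements, a line-aligned array base, and both function names are assumptions for illustration.

```python
def padded_stride(elems_per_thread, elem_size=8, line_size=64):
    """Smallest stride (in elements) that covers elems_per_thread and
    occupies a whole number of cache lines."""
    if elems_per_thread < 1:
        raise ValueError("elems_per_thread must be >= 1")
    if line_size % elem_size != 0:
        raise ValueError("line_size must be a multiple of elem_size")
    elems_per_line = line_size // elem_size
    lines_needed = -(-elems_per_thread // elems_per_line)  # ceiling division
    return lines_needed * elems_per_line


def cache_line_of(thread, index, stride, elem_size=8, line_size=64):
    """Cache line holding element `index` of `thread`'s slice, assuming the
    array starts on a cache-line boundary."""
    return ((thread * stride + index) * elem_size) // line_size


if __name__ == "__main__":
    n = 3  # hypothetical: each thread updates 3 elements
    # Unpadded: thread 0's last element and thread 1's first share a line.
    print(cache_line_of(0, n - 1, n), cache_line_of(1, 0, n))
    # Padded: each thread's slice starts on its own line.
    s = padded_stride(n)
    print(s, cache_line_of(0, n - 1, s), cache_line_of(1, 0, s))
```

The cost of padding is wasted memory (5 of every 8 elements in this example), which is usually acceptable for small per-thread data.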
Sun Microsystems
• Proud of its vision and innovative technologies.
• Faces fierce competition in the server business
– OS: Microsoft, Linux
– CPU: Intel, IBM
– High-end market: IBM, HP
– Low-end market: Dell and other x86 vendors
• Still going for the next big thing
– Network computing (Java, JDS, JES, GridEngine)
– Throughput computing (Niagara 1 & 2, Rock)
– Solaris 10 & x86 support
Performance Engineering
• Performance problems everywhere…
• Deal with important commercial applications:
– Database
– Network infrastructure & applications
– Throughput computing
– Security Infrastructure
• Solve problems by:
– Identifying issues
– Improving products
– Influencing future development
Networking Infrastructure
• Gigabit Ethernet driver optimization
• TCP/IP stack optimization
• Multi-data transmission and Jumbo Frames
• TCP Offloading Engine (TOE)
• Infiniband vs 10GE
• On-chip high-speed Ethernet support
Networking Applications
• Optimizing SunOne servers
– Webserver
– Directory server
– Application server
– Portal server
• Tweaking benchmarks
– SPECweb99 & 2004
– SPECweb99_SSL
– TPC-W (W = Web commerce)
Security Infrastructure
• Crypto accelerators
• On-chip crypto support
• Secure Socket Layer (SSL) & HTTPS acceleration
• IPsec & VPN acceleration
• Crypto optimization
• Solaris Cryptographic Framework
Crypto Acceleration
[Figure: two charts comparing SCA1000 with software crypto for 0 to 10 processors. First chart: RSA ops/sec (axis 0 to 5000). Second chart: 3DES Mbps (axis 0 to 600).]
HTTP/SSL Performance
A. HTTP, 100% Keep Alive
B. HTTP, 0% Keep Alive
C. HTTPS, 100% Keep Alive, no encryption, SHA1 hashing
D. HTTPS, 100% Keep Alive, RC4 encryption, SHA1 hashing
E. HTTPS, 0% Keep Alive, 100% session creation (RSA), RC4, SHA1
F. HTTPS, 0% Keep Alive, 100% session resumption (RSA-reuse), RC4, SHA1
[Figure: cost of each scenario, stacked by component. Component stacks shown: http; http tcp; http sha1; http sha1 rc4; http tcp sha1 rc4 rsa; http tcp sha1 rc4 rsa_reuse.]
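Scenarios A to F add one cost component at a time, so subtracting the per-request time of one scenario from another isolates each component. The Python sketch below assumes the costs are additive and pairs scenarios with component stacks by their contents (A = http, B = http + tcp, C = http + sha1, D = http + sha1 + rc4, E = D + tcp + rsa, F = D + tcp + rsa_reuse); that pairing is an inference from the labels. The timings are invented placeholders, not results from the talk, and `component_costs` is a hypothetical name.

```python
def component_costs(t):
    """Derive per-component cost from per-request times of scenarios A..F.

    t: dict mapping scenario letter to measured time per request.
    Assumes the component costs simply add up.
    """
    tcp = t["B"] - t["A"]              # connection setup when Keep-Alive is off
    sha1 = t["C"] - t["A"]             # SSL layer with SHA1 hashing, no cipher
    rc4 = t["D"] - t["C"]              # RC4 bulk encryption
    rsa = t["E"] - t["D"] - tcp        # full handshake for a new session
    rsa_reuse = t["F"] - t["D"] - tcp  # handshake for a resumed session
    return {"http": t["A"], "tcp": tcp, "sha1": sha1,
            "rc4": rc4, "rsa": rsa, "rsa_reuse": rsa_reuse}


if __name__ == "__main__":
    # Hypothetical per-request times in microseconds.
    measured = {"A": 100, "B": 150, "C": 130, "D": 170, "E": 1220, "F": 260}
    for name, cost in component_costs(measured).items():
        print(f"{name:10s} {cost}")
```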
HTTPS Pages Cost Breakdown for SPECweb99_SSL
[Figure: stacked bar chart of cost in seconds (axis 0 to 0.0025) for Apache 1.3.26 w/Deimos, iWS6 w/Deimos, NCAS w/Deimos, and Zeus 4.0r1. Components: Keep-Alive HTTP, Resumed SSL Handshake, Full SSL Handshake, RC4 Bulk Encryption, SHA1+SSL_Layer Bulk, HTTP Transfer.]
IPsec Performance
[Figure: IPsec-3DES RX throughput on a 2-way 900 MHz E280R, in Mbps (axis 0 to 180), for 1, 2, 4, 8, and 16 connections. Series: Venus-HW, Venus-Sw crypto, Venus-SWkcl, kEF-Sync, kEF-Async, kEF-Deimos, Solaris9.]
Solaris Cryptographic Framework
Throughput Computing - Niagara
Niagara-2 4-Core Server Competition – Nov. 2007
System | Sun | Dell PEdge 1750 | IBM p605-G3 | IBM 325+ Opteron | HP rx1640
Processor Name | Niagara-2 | Xeon DP | POWER5+ lite | K9+ (Opteron) | Itanium LV
# of Chips | 1 | 2 | 2 | 2 | 2
Cores / Chip | 4 | 2 | 1 | 2 | 2
Threads / Core | 8 | 2 | 2 | 1 | 2
CPU Clock Speed | 1.2–1.4 GHz | 5.9 GHz | 3.6 GHz | 4.1 GHz | 2.8 GHz
GbE | 2 10GbE On-chip | 2 GbE | 2 GbE | 2 GbE | 2 GbE
SSL | On-chip | Opt PCI-Express Card | Opt PCI-Express Card | Opt PCI-Express Card | Opt PCI-Express Card
Entry Config | 1 chip @ 1.4 GHz/4GB/2x73GB | 2 chips @ 5.9 GHz/4GB/2x73GB | 2 chips @ 3.6 GHz/4GB/2x73GB | 2 chips @ 4.1 GHz/4GB/2x73GB | 2 chips @ 2.8 GHz/4GB/2x73GB
List Price | $2,995 | $5,107 | $6,890 | $7,691 | $5,393
SPECWeb_SSL (values in extraction order; column assignment unclear): 7100–8000 conn, 6300–7100 conn, 4800–6300 conn, 8300–11000 conn, 7100–7900 conn
Rock
Conclusion
• We will see radical changes in computer systems in the near future, and system-wide hardware-software co-optimization is key to unleashing their potential.
– High density chips
– Multi-core CPUs
– Highly scalable systems
– Enormous network & I/O capacity
– Built-in security support
• Performance is an expertise best acquired through experience.
• Methodology and collaboration are our formulas for success.