My Research Experiences on Computer Performance Optimization


Shih-Hao Hung, Ph.D.
Sun Microsystems Inc.

Confidential Information – Do Not Forward


Computer Performance
• Demands for high performance:
– Getting jobs done faster
– Getting more jobs done at the same time
– Getting complex jobs done in time
• Price-performance trade-offs:
– Getting jobs done efficiently
– Getting jobs done with limited resources
– Capacity planning


Performance Optimization

[Figure: the performance optimization cycle. Performance profiling of the application on the machine feeds performance analysis, which drives performance tuning by modifying the application or adjusting the machine.]
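The profiling step of this cycle can be sketched with Python's standard `cProfile` and `pstats` modules. This is only an illustration of the idea, not the tooling used in the work described here; the workload `slow_sum_of_squares` and the helper `profile_call` are hypothetical names introduced for the example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive workload: builds a full list before summing it."""
    squares = [i * i for i in range(n)]
    return sum(squares)


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report is the input to the analysis step: it shows where time goes, which suggests whether to modify the application or adjust the machine.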


My Research Background
• High-Performance/Technical Computing
Parallel Performance Project, University of Michigan, 1993-2000
– Parallelization for high-performance applications
– Performance characterization tools
– Performance optimization methodologies
• Commercial Applications Optimization
Performance and Availability Engineering, Sun Microsystems, Inc., 2000-present
– Database server optimization
– Network stack optimization
– Webserver optimization
– Security infrastructure optimization


Parallel Performance Project
• Started by Prof. Edward S. Davidson of the University of Michigan in 1990; funded by NSF, Ford Motor Co., UM Center of Parallel Computing, IBM, DoD, etc.
• Produced 11 Ph.D.s in 10 years.
• Worked on state-of-the-art parallel supercomputers and realistic applications.
• Covers many aspects of computer architecture, from CPU pipelines to clustered systems.
• Optimization by all means: instruction scheduling, memory locality, parallelization, etc., via compiler techniques and hand-tuning.


Parallel Computing
• Very hot in the 90’s:
– People rushed to build large MPPs.
– Looked good in theory, but lacked practical tools and experience.
– Most existing apps were difficult to parallelize.
– Failed to keep pace with Moore’s Law: the R&D cycle was too expensive and too long to catch up with the increase in CPU clock rates from MHz to GHz.
• Looking ahead:
– Throughput computing and commercial workloads drive MP.
– Chip density and area favor SMT & CMP designs.
– Struggling to find ways to keep the same growth in GHz.
– Multi-core processors and multiprocessor systems are becoming the norm in the coming years.


Optimizing Parallel Applications
• Very complex, difficult problems:
– Program parallelization
– Load balance
– Scheduling
– Minimizing interprocessor communications
– Architecture-dependent optimization
• Today:
– Still lots of open problems.
– Parallelizing compilers are far from automatic solutions.
• Tomorrow:
– Further research and practical solutions will be in high demand as MP systems become popular at all levels.
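As an illustration of the load balance problem listed above, the sketch below compares a contiguous block assignment of tasks with a greedy longest-task-first assignment, and measures imbalance as the busiest processor's load divided by the average load. The task costs and function names are hypothetical, and the greedy heuristic is a textbook technique chosen for the example, not a method taken from this talk.

```python
import heapq


def block_partition(costs, num_procs):
    """Assign contiguous chunks of tasks to processors; return per-processor load."""
    chunk = -(-len(costs) // num_procs)  # ceiling division
    return [sum(costs[p * chunk:(p + 1) * chunk]) for p in range(num_procs)]


def greedy_partition(costs, num_procs):
    """Longest task first: give each task to the currently least-loaded processor."""
    heap = [(0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    for cost in sorted(costs, reverse=True):
        load, p = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, p))
    # Report loads in processor order.
    return [load for load, _ in sorted(heap, key=lambda item: item[1])]


def imbalance(loads):
    """Busiest processor's load over the average load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


task_costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical per-task work units
block = block_partition(task_costs, 4)
greedy = greedy_partition(task_costs, 4)
print("block :", block, round(imbalance(block), 3))
print("greedy:", greedy, round(imbalance(greedy), 3))
```

With these made-up costs the block assignment leaves one processor with far more work than the others, while the greedy assignment evens the loads out. Real applications add constraints (data locality, communication) that this sketch ignores.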


Hierarchical Performance Bounds

[Figure: hierarchy of performance bounds. Each bound adds a constraint on run time, so the bounds are ordered by increasing time; the performance gap between adjacent bounds is labeled with its cause.]
Uniprocessor bounds:
– Machine (M) bound: machine peak performance
– M-Application (MA) bound: A-gap, mismatched application workload
– MA-Compiler (MAC) bound: C-gap, compiler-inserted instructions
– MAC-Schedule (MACS) bound: S-gap, data dependency, branches, pipeline bubbles
– MACS-Cache (MACS$) bound: $-gap, finite cache effect
Parallel bounds:
– Ideal-parallel (I) bound
– I-Parallelization (IP) bound*: P-gap, partial parallelization
– IP-Communication (IPC) bound*: C’-gap, interprocessor communication
– IPC-Load balance (IPCL) bound*: L-gap, overall load imbalance
– IPCL-Multiphase (IPCLM) bound*: M’-gap, multiple program phases
– IPCLM-Dynamic (IPCLMD) bound: D-gap, dynamic load behavior
– Actual run time: X-gap, behaviors not modeled
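A minimal sketch of how the parallel part of such a hierarchy can be read: given a run-time estimate for each bound, every gap is the difference between two adjacent bounds, and the largest gap points at the most promising optimization target. The bound and gap names follow the figure; the times are hypothetical numbers chosen for illustration, and `gap_report` is a name introduced here.

```python
# Parallel bounds from loosest to tightest, and the gap between each adjacent pair.
PARALLEL_BOUNDS = ["I", "IP", "IPC", "IPCL", "IPCLM", "IPCLMD", "actual"]
GAP_NAMES = ["P", "C'", "L", "M'", "D", "X"]


def gap_report(times):
    """Map each gap name to the time (seconds) between its two adjacent bounds."""
    report = {}
    for gap, lower, upper in zip(GAP_NAMES, PARALLEL_BOUNDS, PARALLEL_BOUNDS[1:]):
        delta = times[upper] - times[lower]
        if delta < 0:
            raise ValueError(f"bound {upper} is below bound {lower}")
        report[gap] = delta
    return report


# Hypothetical run-time estimates in seconds (not measurements from the talk).
times = {
    "I": 10.0, "IP": 12.0, "IPC": 17.0, "IPCL": 20.0,
    "IPCLM": 21.0, "IPCLMD": 21.5, "actual": 23.0,
}
gaps = gap_report(times)
largest = max(gaps, key=gaps.get)
print(gaps)
print("largest gap:", largest)
```

In this made-up case the C’-gap dominates, so communication would be the first thing to tune.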


Example: FCRASH
• Vehicle crash simulation at Ford.
• Finite-element code contains over 10,000 lines of Fortran and 14 parallel loops.
• Profiled on a NUMA system (HP/Convex SPP-1000).
• P-gap: imperfect parallelization
• C’-gap: inter-cluster communications
• L & M’-gaps: load balancing issues


Goal-directed Optimization

[Figure: seven-step flow (Step 1 to Step 7) from a serial program to a performance-tuned parallel program. Actions shown: Decompose the Problem Domain; Tune the Communication Performance; Optimize Processor Performance; Balance the Load per Phase; Reduce Synchronization Overhead; Balance the Combined Load of Multiple Phases; Balance Dynamic Load. Gap labels on the figure: A, C, S, $-gap; C’-gap; L-gap; M’-gap; D-gap; L, M’, D-gap.]


Performance Tuning


Modeling a Parallel Application

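[Figure: building an application model. Sources of application information (program source, trace, profile, machine, input data, programmer) pass through source analysis, trace analysis, profile analysis, domain analysis/decomposition, and algorithm analysis to produce the Application Model: data layout, data dependence, control flow, weight distribution, and domain decomposition. Legend: primary path, auxiliary path.]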


Model-Driven Simulation

[Figure: model-driven simulation (MDS). Inputs: the machine model and the application model (data layout, data dependence, control flow, weight distribution, domain decomposition). Analyses shown: communication analysis, workload analysis, cache performance analysis, data-flow analysis, generic analysis, and CX bound analysis. Outputs shown: load imbalance, timing, synchronization cost, scheduling cost, data flow, communication pattern, cache misses, and performance bounds.]


Performance Tuning Results
• SP - initial parallel version
• SD - changing domain decomposition to reduce load imbalance (L-gap) and communications (C’-gap)
• SD2 - SD + array padding to reduce false-sharing communications (Unmodeled-gap)
• SD3 - SD2 + eliminating thread migration to reduce communications (Unmodeled-gap)
• SD4 - SD3 + eliminating unnecessary synchronization barriers (S’-gap)


Sun Microsystems
• Proud of its vision and innovative technologies.
• Faces fierce competition in the server business:
– OS: Microsoft, Linux
– CPU: Intel, IBM
– High-end market: IBM, HP
– Low-end market: Dell and other x86 vendors
• Still going for the next big thing:
– Network computing (Java, JDS, JES, GridEngine)
– Throughput computing (Niagara 1 & 2, Rock)
– Solaris 10 & x86 support


Performance Engineering
• Performance problems everywhere…
• Deal with important commercial applications:
– Database
– Network infrastructure & applications
– Throughput computing
– Security infrastructure
• Solve problems by:
– Identifying issues
– Improving products
– Influencing future development


Networking Infrastructure

• Gigabit Ethernet driver optimization
• TCP/IP stack optimization
• Multi-data transmission and Jumbo Frames
• TCP Offload Engine (TOE)
• InfiniBand vs 10GE
• On-chip high-speed Ethernet support


Networking Applications
• Optimizing SunOne servers:
– Webserver
– Directory server
– Application server
– Portal server
• Tweaking benchmarks:
– SPECweb99 & 2004
– SPECweb99_SSL
– TPC-W (W = Web commerce)


Security Infrastructure
• Crypto accelerators
• On-chip crypto support
• Secure Sockets Layer (SSL) & HTTPS acceleration
• IPsec & VPN acceleration
• Crypto optimization
• Solaris Cryptographic Framework


Crypto Acceleration

[Figure: two charts comparing the SCA1000 with software crypto as the number of processors grows from 0 to 10. Left: RSA ops/sec (axis 0 to 5000). Right: 3DES Mbps (axis 0 to 600).]


HTTP/SSL Performance

A. HTTP, 100% Keep Alive (components: http)
B. HTTP, 0% Keep Alive (components: http, tcp)
C. HTTPS, 100% Keep Alive, no encryption, SHA1 hashing (components: http, sha1)
D. HTTPS, 100% Keep Alive, RC4 encryption, SHA1 hashing (components: http, sha1, rc4)
E. HTTPS, 0% Keep Alive, 100% session creation (RSA), RC4, SHA1 (components: http, tcp, sha1, rc4, rsa)
F. HTTPS, 0% Keep Alive, 100% session resumption (RSA-reuse), RC4, SHA1 (components: http, tcp, sha1, rc4, rsa_reuse)
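Because each configuration adds components on top of another, per-component costs can be estimated by differencing the measurements. The sketch below assumes the costs are additive and independent, which is a simplification; the per-request times are hypothetical numbers in microseconds, not results from this talk, and `component_costs` is a name introduced for the example.

```python
def component_costs(measured):
    """Estimate per-component cost from per-request times of configurations A-F.

    Assumes component costs simply add up:
      A = http                      B = http + tcp
      C = http + sha1               D = http + sha1 + rc4
      E = D + tcp + rsa             F = D + tcp + rsa_reuse
    """
    http = measured["A"]
    tcp = measured["B"] - measured["A"]
    sha1 = measured["C"] - measured["A"]
    rc4 = measured["D"] - measured["C"]
    rsa = measured["E"] - measured["D"] - tcp
    rsa_reuse = measured["F"] - measured["D"] - tcp
    return {"http": http, "tcp": tcp, "sha1": sha1,
            "rc4": rc4, "rsa": rsa, "rsa_reuse": rsa_reuse}


# Hypothetical per-request times in microseconds for configurations A to F.
measured = {"A": 100, "B": 150, "C": 160, "D": 200, "E": 2250, "F": 300}
costs = component_costs(measured)
print(costs)
```

With these made-up inputs the full RSA handshake dominates every other component, which is the kind of conclusion such a breakdown is meant to expose.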


HTTPS Pages Cost Breakdown for SPECweb99_SSL

[Figure: stacked bar chart of per-page cost in seconds (axis 0 to 0.0025) for Apache 1.3.26 w/Deimos, iWS6 w/Deimos, NCAS w/Deimos, and Zeus 4.0r1. Cost components: Keep-Alive HTTP, Resumed SSL Handshake, Full SSL Handshake, RC4 Bulk Encryption, SHA1+SSL_Layer Bulk, HTTP Transfer.]


IPsec Performance
[Figure: IPsec-3DES RX throughput on a 2-way 900 MHz E280R, in Mbps (axis 0 to 180), for 1, 2, 4, 8, and 16 connections. Series: Venus-HW, Venus-SW crypto, Venus-SWkcl, kEF-Sync, kEF-Async, kEF-Deimos, Solaris9.]


Solaris Cryptographic Framework


Throughput Computing - Niagara


Niagara-2 4-Core Server Competition – Nov. 2007
Systems, in column order: Sun; Dell PEdge 1750; IBM p605-G3; IBM 325+ Opteron; HP rx1640
Processor Name: Niagara-2; Xeon DP; POWER5+ lite; K9+ (Opteron); Itanium LV
# of Chips: 1; 2; 2; 2; 2
Cores / Chip: 4; 2; 1; 2; 2
Threads / core: 8; 2; 2; 1; 2
CPU Clock Speed: 1.2–1.4 GHz; 5.9 GHz; 3.6 GHz; 4.1 GHz; 2.8 GHz
GbE: 2 10GbE On-chip; 2 GbE; 2 GbE; 2 GbE; 2 GbE
SSL: On-chip; Opt PCI-Express Card
Entry Config: 1 chip @ 1.4 GHz/4GB/2x73GB; 2 chips @ 5.9 GHz/4GB/2x73GB; 2 chips @ 3.6 GHz/4GB/2x73GB; 2 chips @ 4.1 GHz/4GB/2x73GB; 2 chips @ 2.8 GHz/4GB/2x73GB
List Price: $2,995; $5,107; $6,890; $7,691; $5,393
SPECWeb_SSL results shown: 7100–8000 conn, 6300–7100 conn, 4800–6300 conn, 8300–11000 conn, 7100–7900 conn
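The throughput-computing argument in this comparison can be made concrete with a little arithmetic: hardware threads per system are chips × cores per chip × threads per core, and dividing list price by that count gives a rough price per thread. The sketch below uses the chip, core, thread, and list-price figures from the comparison, assuming they are listed in the same column order as the processor names; price per hardware thread is a metric introduced here for illustration and says nothing about per-thread performance.

```python
# (system, chips, cores per chip, threads per core, list price in USD)
# Column order is assumed to match the order of the processor names.
systems = [
    ("Sun (Niagara-2)", 1, 4, 8, 2995),
    ("Dell PEdge 1750 (Xeon DP)", 2, 2, 2, 5107),
    ("IBM p605-G3 (POWER5+ lite)", 2, 1, 2, 6890),
    ("IBM 325+ (Opteron)", 2, 2, 1, 7691),
    ("HP rx1640 (Itanium LV)", 2, 2, 2, 5393),
]


def hardware_threads(chips, cores_per_chip, threads_per_core):
    """Total hardware threads the system exposes."""
    return chips * cores_per_chip * threads_per_core


summary = {}
for name, chips, cores, threads, price in systems:
    total = hardware_threads(chips, cores, threads)
    summary[name] = (total, price / total)
    print(f"{name}: {total} threads, ${price / total:,.2f} per thread")
```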


Rock


Conclusion
• We will see radical changes in computer systems in the near future, and system-wide hardware-software co-optimization is key to unleashing their potential:
– High-density chips
– Multi-core CPUs
– Highly scalable systems
– Enormous network & I/O capacity
– Built-in security support
• Performance engineering is an expertise best acquired through experience.
• Methodology and collaboration are our formulas for success.
