Making the “Box” Transparent: System Call Performance as a First-class Result
Yaoping Ruan, Vivek Pai, Princeton University


Page 1: Making the “Box” Transparent

Making the “Box” Transparent: System Call Performance as a First-class Result

Yaoping Ruan, Vivek Pai Princeton University

Page 2

Outline

• Motivation

• Design & implementation

• Case study

• More results

Page 3

Beginning of the Story

• Flash Web Server on the SPECWeb99 benchmark yields a very low score: 200
  • 220: standard Apache
  • 400: customized Apache
  • 575: best (Tux) on comparable hardware

• However, the Flash Web Server is known to be fast

Motivation

Page 4

Performance Debugging of OS-Intensive Applications

Web servers (+ full SPECWeb99):
• High throughput
• Large working sets
• Dynamic content
• QoS requirements
• Workload scaling

How do we debug such a system?

Challenges: CPU/network and disk activity, multiple programs, latency-sensitive, overhead-sensitive.

Motivation

Page 5

✓ Online: measurement calls such as getrusage() and gettimeofday()
  ✗ Guesswork, inaccuracy

✓ Detailed: call-graph profiling such as gprof
  ✗ High overhead (> 40%)

✓ Fast: statistical sampling such as DCPI, VTune, OProfile
  ✗ Incompleteness

Current Tradeoffs

Motivation

Page 6

Example of Current Tradeoffs

gettimeofday( ) reports the wall-clock time of a non-blocking system call, not the in-kernel measurement of what the call actually did.

Motivation
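The measurement-call approach criticized above can be sketched in a few lines (a Python illustration with time.monotonic standing in for gettimeofday; the deck itself targets C on FreeBSD). Note what it measures: elapsed wall-clock time, which lumps kernel work, blocking, and preemption together.

```python
import os, time

def timed_read(fd, count):
    """Time a single read() with the measurement-call technique.

    This is the gettimeofday()-style approach from the slide: it cannot
    tell kernel time from time spent preempted or blocked, which is
    exactly the inaccuracy DeBox is designed to remove.
    """
    start = time.monotonic()          # stands in for gettimeofday()
    data = os.read(fd, count)
    elapsed = time.monotonic() - start
    return data, elapsed

# Example: time a read from a pipe that already has data waiting
r, w = os.pipe()
os.write(w, b"hello")
data, elapsed = timed_read(r, 5)
print(data, elapsed)
```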

Page 7

Design Goals

• Correlate kernel information with application-level information

• Low overhead on “useless” data
• High detail on useful data
• Allow application to control profiling and programmatically react to information

Design

Page 8

DeBox: Splitting Measurement Policy & Mechanism

• Add profiling primitives into kernel
• Return feedback with each syscall
  • First-class result, like errno

• Allow programs to interactively profile
  • Profile by call site and invocation
  • Pick which processes to profile
  • Selectively store/discard/utilize data

Design

Page 9

DeBox Architecture

App:
  … res = read (fd, buffer, count) …

Kernel:
  read( ) { start profiling; … end profiling; return; }

The kernel fills the buffer, errno, and the DeBox info together; the application can consume the DeBox info online or store it for offline analysis.

Design

Page 10

DeBox Data Structure

DeBoxControl( DeBoxInfo *resultBuf, int maxSleeps, int maxTrace )

• Basic DeBox Info: performance summary
• Sleep Info: blocking information of the call
• Trace Info: full call trace & timing

Implementation
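The pattern of returning profiling data as a first-class result alongside each call's normal return value can be sketched at user level (Python for illustration; the real DeBox fills resultBuf inside the FreeBSD kernel, and the field names below are illustrative assumptions, not the actual DeBoxInfo layout):

```python
import os, time

def debox_read(fd, count):
    """Illustrative stand-in for a DeBox-instrumented read().

    Mirrors the DeBoxControl pattern: the caller gets the ordinary
    result plus a per-invocation info record, and decides itself
    whether to store, discard, or act on that record.
    """
    start = time.monotonic()
    data = os.read(fd, count)
    info = {                       # hypothetical fields, loosely
        "syscall": "read",         # modeled on Basic DeBox Info
        "call_time_usec": int((time.monotonic() - start) * 1e6),
        "nbytes": len(data),
    }
    return data, info

r, w = os.pipe()
os.write(w, b"debox")
data, info = debox_read(r, 5)
print(info["syscall"], info["nbytes"])   # the app reacts programmatically
```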

Page 11

Sample Output: Copying a 10MB Mapped File

Basic DeBox Info:
  System call # (write( ))   4
  Call time (usec)           3591064
  # of page faults           989
  # of PerSleepInfo used     2
  # of CallTrace used        0

PerSleepInfo[0]:
  # occurrences              1270
  Time blocked (usec)        723903
  Resource label             biowr
  File where blocked         kern/vfs_bio.c
  Line where blocked         2727
  # processes on entry       1
  # processes on exit        0

CallTrace (depth / function / time in usec):
  0  write( )         3591064
  1  holdfp( )        25
  2  fhold( )         5
  1  dofilewrite( )   3590326
  2  fo_write( )      3586472

Implementation

Page 12

In-kernel Implementation

• 600 lines of code in FreeBSD 4.6.2
• Monitor trap handler
  • System call entry, exit, page fault, etc.

• Instrument scheduler
  • Time, location, and reason for blocking
  • Identify dynamic resource contention

• Full call path tracing + timings
  • More detail than gprof’s call arc counts

Implementation

Page 13

General DeBox Overheads

Implementation

Page 14

Case Study: Using DeBox on the Flash Web server

• Experimental setup
  • Mostly 933MHz PIII
  • 1GB memory
  • Netgear GA621 gigabit NIC
  • Ten PII 300MHz clients

[Bar chart: SPECWeb99 results, 0 to 900, for orig, VM Patch, and sendfile.
Dataset size (GB): 0.12 0.75 1.38 2.01 2.64]

Case Study

Page 15

Example 1: Finding Blocking System Calls

Blocking system calls, the resources they blocked on, and their counts:

  open( )/190      biord: 162, inode: 28
  readlink( )/84   inode: 84
  read( )/12       biord: 3, inode: 9
  stat( )/6        inode: 6
  unlink( )/1      biord: 1

Case Study

Page 16

Example 1 Results & Solution

• Mostly metadata locking

• Direct all metadata calls to name-convert helpers
  • Pass open FDs using sendmsg( )

[Bar chart: SPECWeb99 results, 0 to 900, for orig, VM Patch, sendfile, and FD Passing.
Dataset size (GB): 0.12 0.75 1.38 2.01 2.64]

Case Study
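The fix above, handing open file descriptors to the server over sendmsg( ), uses the standard SCM_RIGHTS ancillary-data mechanism. A minimal sketch via Python's socket.send_fds/recv_fds wrappers (Python 3.9+; the helper and server would be separate processes in Flash, collapsed here into one for brevity):

```python
import os, socket, tempfile

# A name-conversion helper performs the potentially blocking open(),
# then ships the open descriptor to the server over a Unix socket with
# sendmsg(SCM_RIGHTS), so the server itself never blocks on metadata.
helper, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"cached file body")
tmp.close()

fd = os.open(tmp.name, os.O_RDONLY)         # helper side: blocking open()
socket.send_fds(helper, [b"ok"], [fd])      # FD travels as ancillary data

msg, fds, flags, addr = socket.recv_fds(server, 16, 1)
body = os.read(fds[0], 64)                  # server side: read via passed FD
print(msg, body)
os.unlink(tmp.name)
```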

Page 17

Example 2: Capturing Rare Anomaly Paths

FunctionX( ) { … open( ); … }

FunctionY( ) { open( ); … open( ); }

When blocking happens: abort( ) (or fork + abort) and record the call path, which shows:

1. Which call caused the problem
2. Why this path was executed (app + kernel call trace)

Case Study
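The fork + abort variant above can be sketched as follows (Python for illustration; Flash is C). The forked child aborts immediately, leaving a core dump that a debugger can later walk to recover the exact call path, while the parent keeps serving requests:

```python
import os, signal

def snapshot_call_path():
    """On an unexpected blocking event, fork a child that aborts at once.

    The child dies with SIGABRT (dumping core if core limits allow),
    preserving the call path that led here; the parent continues.
    """
    pid = os.fork()
    if pid == 0:
        os.abort()                  # child: die with SIGABRT, dump core
    _, status = os.waitpid(pid, 0)  # parent: reap the child and carry on
    return status

status = snapshot_call_path()
print(os.WIFSIGNALED(status), os.WTERMSIG(status))
```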

Page 18

Example 2 Results & Solution

• Cold cache-miss path
• Allow read helpers to open cache-missed files

• More benefit in latency

[Bar chart: SPECWeb99 results, 0 to 900, for orig, VM Patch, sendfile, and FD Passing.
Dataset size (GB): 0.12 0.75 1.38 2.01 2.64]

Case Study

Page 19

Example 3: Tracking Call History & Call Trace

[Plot: call time of fork( ) as a function of invocation]

Call trace indicates:
•  File descriptor copy: fd_copy( )
•  VM map entry copy: vm_copy( )

Case Study

Page 20

Example 3 Results & Solution

• Similar phenomenon on mmap( )

• Fork helper
• Eliminate mmap( )

[Bar chart: SPECWeb99 results, 0 to 900, for orig, VM Patch, sendfile, FD Passing, Fork helper, and No mmap().
Dataset size (GB): 0.12 0.75 1.38 2.01 2.64]

Case Study

Page 21

Sendfile Modifications (Now Adopted by FreeBSD)

• Cache pmap/sfbuf entries
• Return a special error for cache misses
• Pack header + data into one packet

Case Study

Page 22

SPECWeb99 Scores

[Bar chart: SPECWeb99 scores, 0 to 900, for orig, VM Patch, sendfile, FD Passing, Fork helper, No mmap(), New CGI Interface, and New sendfile( ).
Dataset size (GB): 0.43 1.06 1.69 2.32 2.95]

Case Study

Page 23

New Flash Summary

• Application-level changes
  • FD-passing helpers
  • Move fork into helper process
  • Eliminate mmap cache
  • New CGI interface

• Kernel sendfile changes
  • Reduce pmap/TLB operations
  • New flag to return if data missing
  • Send fewer network packets for small files

Case Study

Page 24

Throughput on SPECWeb Static Workload

[Bar chart: throughput (Mb/s), 0 to 700, for Apache, Flash, and New-Flash at dataset sizes 500 MB, 1.5 GB, and 3.3 GB]

Other Results

Page 25

Latencies on 3.3GB Static Workload

[Latency CDF: improvements of 44x and 47x at the marked points; mean latency improvement: 12x]

Other Results

Page 26

Throughput Portability (on Linux)

(Server: 3.0GHz P4, 1GB memory)

[Bar chart: throughput (Mb/s), 0 to 1400, for Haboob, Apache, Flash, and New-Flash at dataset sizes 500 MB, 1.5 GB, and 3.3 GB]

Other Results

Page 27

Latency Portability (3.3GB Static Workload)

[Latency CDF: improvements of 3.7x and 112x at the marked points; mean latency improvement: 3.6x]

Other Results

Page 28

Summary

• DeBox is effective on OS-intensive applications and complex workloads
  • Low overhead on real workloads
  • Fine detail on real bottlenecks
  • Flexibility for application programmers

• Case study
  • SPECWeb99 score quadrupled
    • Even with a dataset 3x the size of physical memory
  • Up to 36% throughput gain on static workload
  • Up to 112x latency improvement
  • Results are portable

Page 29

www.cs.princeton.edu/~yruan/debox [email protected] [email protected]

Thank you

Page 30

SPECWeb99 Scores

• Standard Flash: 200
• Standard Apache: 220
• Apache + special module: 420
• Highest 1GB/1GHz score: 575
• Improved Flash: 820
• Flash + dynamic request module: 1050

Page 31

SPECWeb99 on Linux

• Standard Flash: 600
• Improved Flash: 1000
• Flash + dynamic request module: 1350

• 3.0GHz P4 with 1GB memory

Page 32

New Flash Architecture