Extensible Distributed Tracing from Kernels to Clusters (Fay)
Úlfar Erlingsson, Google Inc. · Marcus Peinado, Microsoft Research · Simon Peter, Systems Group, ETH Zurich · Mihai Budiu, Microsoft Research




Wouldn’t it be nice if…

• We could know what our clusters were doing?

• We could ask any question, easily, using one simple-to-use system.

• We could collect answers so efficiently, so cheaply, that we might even ask continuously.


Let’s imagine...

• Applying data mining to cluster tracing
• Bag-of-words technique
  – Compare documents w/o structural knowledge
  – N-dimensional feature vectors
  – K-means clustering

• Can apply to clusters, too!
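As a concrete illustration of the technique (not Fay's own code, which appears later in C#), here is a minimal Python sketch: each machine/interval becomes a syscall-frequency feature vector, and k-means groups intervals with similar syscall mixes. The function names and toy data are hypothetical.

```python
# Bag-of-words over system calls, then k-means (illustrative sketch).

def nearest(pt, centers):
    # Index of the center closest to pt (squared Euclidean distance).
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: dist2(pt, centers[i]))

def kmeans_step(vectors, centers):
    # One Lloyd iteration: assign each vector, then recompute the means.
    groups = [[] for _ in centers]
    for v in vectors:
        groups[nearest(v, centers)].append(v)
    return [
        [sum(col) / len(g) for col in zip(*g)] if g else c
        for g, c in zip(groups, centers)
    ]

def kmeans(vectors, centers, steps):
    for _ in range(steps):
        centers = kmeans_step(vectors, centers)
    return centers

# Toy feature vectors: per-interval counts of two system calls.
vs = [[9, 1], [8, 2], [1, 9], [2, 8]]
cs = kmeans(vs, [[9, 1], [1, 9]], steps=5)
```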


Cluster-mining with Fay

• Automatically categorize cluster behavior, based on system call activity


Cluster-mining with Fay

• Automatically categorize cluster behavior, based on system call activity
  – Without measurable overhead on the execution
  – Without any special Fay data-mining support


Vector Nearest(Vector pt, Vectors centers) {
    var near = centers.First();
    foreach (var c in centers)
        if (Norm(pt - c) < Norm(pt - near))
            near = c;
    return near;
}

var kernelFunctionFrequencyVectors =
    cluster.Function(kernel, "syscalls!*")
           .Where(evt => evt.time < Now.AddMinutes(3))
           .Select(evt => new { Machine  = fay.MachineID(),
                                Interval = evt.Cycles / CPS,
                                Function = evt.CallerAddr })
           .GroupBy(evt => evt, (k, g) => new { key = k, count = g.Count() });

Vectors OneKMeansStep(Vectors vs, Vectors cs) {
    return vs.GroupBy(v => Nearest(v, cs))
             .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Vectors KMeans(Vectors vs, Vectors cs, int K) {
    for (int i = 0; i < K; ++i)
        cs = OneKMeansStep(vs, cs);
    return cs;
}

Fay K-Means Behavior-Analysis Code



Fay vs. Specialized Tracing

• Could have built a specialized tool for this
  – Automatic categorization of behavior (Fmeter)

• Fay is general, but can efficiently do
  – Tracing across abstractions, systems (Magpie)
  – Predicated and windowed tracing (Streams)
  – Probabilistic tracing (Chopstix)
  – Flight recorders, performance counters, …


Key Takeaways

Fay: Flexible monitoring of distributed executions
  – Can be applied to existing, live Windows servers

1. Single query specifies both tracing & analysis
   – Easy to write & enables automatic optimizations

2. Pervasively data-parallel, scalable processing
   – Same model within machines & across clusters

3. Inline, safe machine-code at tracepoints
   – Allows us to do computation right at data source


K-Means: Single, Unified Fay Query


Fay is Data-Parallel on Cluster

• View trace query as distributed computation
• Use cluster for analysis


Fay is Data-Parallel on Cluster

System call trace events
• Fay does early aggregation & data reduction
• Fay knows what's needed for later analysis


Fay is Data-Parallel on Cluster

System call trace events
• Fay does early aggregation & data reduction

K-Means analysis
• Fay builds an efficient processing plan from query


Fay is Data-Parallel within Machines

• Early aggregation
• Inline, in OS kernel
• Reduce dataflow & kernel/user transitions

• Data-parallel, per core/thread
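A sketch of the idea in Python (hypothetical names, not Fay's API): each core or thread reduces its own event stream to a small count map at the source, so only aggregates, not raw events, cross the kernel/user and machine/cluster boundaries.

```python
# Early, per-core aggregation followed by a merge (illustrative sketch).
from collections import Counter

def local_aggregate(events):
    # Runs "at the data source": reduce a stream of (function, interval)
    # events to a small frequency map.
    return Counter(events)

def merge(partials):
    # Runs one level up: combine per-core partial counts.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Hypothetical per-core event streams (function name, time interval).
per_core_events = [
    [("NtReadFile", 0), ("NtReadFile", 0), ("NtClose", 0)],  # core 0
    [("NtReadFile", 0), ("NtWriteFile", 1)],                 # core 1
]
totals = merge(local_aggregate(ev) for ev in per_core_events)
```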


Processing w/o Fay Optimizations

• Collect data first (on disk)
• Reduce later
• Inefficient, can suffer data overload



Traditional Trace Processing

• First log all data (a deluge)
• Process later (centrally)
• Compose tools via scripting



Takeaways so far

Fay: Flexible monitoring of distributed executions

1. Single query specifies both tracing & analysis

2. Pervasively data-parallel, scalable processing


Safety of Fay Tracing Probes

• A variant of XFI used for safety [OSDI’06]

– Works well in the kernel or any address space
– Can safely use existing stacks, etc.
– Instead of a language interpreter (DTrace)
– Arbitrary, efficient, stateful computation

• Probes can access thread-local/global state
• Probes can try to read any address
  – I/O registers are protected


Key Takeaways, Again

Fay: Flexible monitoring of distributed executions

1. Single query specifies both tracing & analysis

2. Pervasively data-parallel, scalable processing

3. Inline, safe machine-code at tracepoints



Installing and Executing Fay Tracing

• Fay runtime on each machine
• Fay module in each traced address space
• Tracepoints at hotpatched function boundary

[Diagram: per-machine Fay tracing runtime; a query creates XFI-verified probes in the target (user-space or kernel) via hotpatching; trace events flow out through ETW (~200 cycles)]


Low-level Code Instrumentation

Module with a traced function Foo:

    Caller:
        ...
        e8ab62ffff     call Foo
        ...

        ff1508e70600   call [Dispatcher]
    Foo:
        ebf8           jmp Foo-6
        cccccc         (padding)
    Foo2:
        57             push rdi
        ...
        c3             ret

• Replace 1st opcode of functions


Low-level Code Instrumentation

Fay platform module:

    Dispatcher:
        t = lookup(return_addr)
        ...
        call t.entry_probes
        ...
        call t.Foo2_trampoline
        ...
        call t.return_probes
        ...
        return  /* to after call Foo */

• Replace 1st opcode of functions
• Fay dispatcher called via trampoline


Low-level Code Instrumentation

[Diagram: the dispatcher invokes XFI-protected Fay probes at function entry and return]

• Replace 1st opcode of functions
• Fay dispatcher called via trampoline
• Fay calls the function, and entry & exit probes
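The control flow above can be sketched with ordinary Python function wrapping (hypothetical names; real Fay does this with hotpatching and XFI-verified machine code, not interpreted callbacks): the patched entry point routes every call through a dispatcher, which runs the entry probes, invokes the original function body, then runs the return probes.

```python
# Dispatcher-style tracing sketch: probes around an original function.

def make_traced(original, entry_probes, return_probes):
    # Stand-in for the hotpatched entry: route calls via a dispatcher.
    def dispatcher(*args):
        for p in entry_probes:      # call t.entry_probes
            p(args)
        ret = original(*args)       # call t.Foo2_trampoline
        for p in return_probes:     # call t.return_probes
            p(ret)
        return ret                  # back to after "call Foo"
    return dispatcher

calls = []

def foo(x):                         # the traced function ("Foo")
    return x * 2

traced_foo = make_traced(
    foo,
    entry_probes=[lambda a: calls.append(("enter", a))],
    return_probes=[lambda r: calls.append(("ret", r))],
)
result = traced_foo(21)
```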


• Fay adds 220 to 430 cycles per traced function
• Fay adds 180% CPU to trace all kernel functions
• Both approx. 10x faster than DTrace, SystemTap

What’s Fay’s Performance & Scalability?

[Charts: null-probe overhead in cycles, and slowdown (x): Fay 2.8x, Solaris DTrace 17.2x, OS X DTrace 26.7x; Linux SystemTap crashed]


Fay Scalability on a Cluster

• Fay tracing memory allocations, in a loop:
  – Ran workload on a 128-node, 1024-core cluster
  – Spread work over 128 to 1,280,000 threads
  – 100% CPU utilization

• Fay overhead was 1% to 11% (mean 7.8%)


More Fay Implementation Details

• Details of query-plan optimizations
• Case studies of different tracing strategies
• Examples of using Fay for performance analysis

• Fay is based on LINQ and Windows specifics
  – Could build on Linux using Ftrace, Hadoop, etc.

• Some restrictions apply currently
  – E.g., skew towards batch processing due to Dryad


Conclusion

• Fay: Flexible tracing of distributed executions

• Both expressive and efficient
  – Unified trace queries
  – Pervasive data-parallelism
  – Safe machine-code probe processing

• Often as efficient as purpose-built tools


Backup


A Fay Trace Query

from io in cluster.Function("iolib!Read")
where io.time < Now.AddMinutes(5)
let size = io.Arg(2)  // request size in bytes
group io by size/1024 into g
select new { sizeInKilobytes = g.Key,
             countOfReadIOs  = g.Count() };

• Aggregates read activity in iolib module
• Across cluster, both user-mode & kernel
• Over 5 minutes
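The aggregation this query performs can be mimicked in a few lines of Python (a stand-in for illustration only; the sizes below are made up):

```python
# Bucket read-request sizes into KB bins and count reads per bin,
# mirroring "group io by size/1024 into g ... g.Count()".
from collections import Counter

def read_size_histogram(read_sizes_bytes):
    return Counter(size // 1024 for size in read_sizes_bytes)

# Hypothetical values of the 2nd argument to iolib!Read, in bytes.
hist = read_size_histogram([512, 1500, 2048, 3000, 4096, 5000])
```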


A Fay Trace Query


• Specifies what to trace
  – 2nd argument of read function in iolib

• And how to aggregate
  – Group into KB-size buckets and count

[Chart: counts of read I/Os per size bucket: 1024, 2048, 4096, 8192 bytes]