28
GGF NetLogger Distributed Application Analysis and Debugging using NetLogger v2 Lawrence Berkeley National Laboratory Brian L. Tierney

Distributed Application Analysis and Debugging using NetLogger v2

  • Upload
    marvin

  • View
    48

  • Download
    0

Embed Size (px)

DESCRIPTION

Distributed Application Analysis and Debugging using NetLogger v2. Brian L. Tierney. Lawrence Berkeley National Laboratory. The Problem. The Problem When building distributed systems, we often observe unexpectedly low performance the reasons for which are usually not obvious - PowerPoint PPT Presentation

Citation preview

Page 1: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Distributed Application Analysis and Debugging using NetLogger v2

Lawrence Berkeley National Laboratory

Brian L. Tierney

Page 2: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

The Problem

• The Problem– When building distributed systems, we often observe

unexpectedly low performance • the reasons for which are usually not obvious

– The bottlenecks can be in any of the following components:

• the applications• the operating systems• the disks, network adapters, bus, memory, etc. on either the

sending or receiving host• the network switches and routers, and so on

Page 3: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Solution

• A complete End-to-End monitoring framework that includes the following components: – instrumentation tools (application, middleware, and OS

monitoring)– host and network sensors (host and network

monitoring)– sensor management tools (sensor control system)– event publication service– event archive service– event analysis and visualization tools– a common set of formats and protocols for describing,

exchanging and locating monitoring data

Page 4: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger Toolkit

• We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:

– tools to make it easy for distributed applications to log interesting events at every critical point

• NetLogger client library (C, C++, Java, Perl, Python)

– tools for host and network monitoring

– event visualization tools that allow one to correlate application events with host/network events

– NetLogger event archive and retrieval tools (new)

• NetLogger combines network, host, and application-level monitoring to provide a complete view of the entire system.

• Open Source, available at http://www-didc.lbl.gov/NetLogger/

Page 5: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger Analysis: Key Concepts

• NetLogger visualization tools are based on time correlated and object correlated events.– precision timestamps (default = microsecond)

• If applications specify an “object ID” for related events, this allows the NetLogger visualization tools to generate an object “lifeline”

• In order to associate a group of events into a “lifeline”, you must assign an “Event ID” to each NetLogger event– Sample Event ID: file name, block ID, frame ID, etc.

Page 6: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Example: Combined Host and Application Monitoring

Page 7: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger Tuning Results

• I/O followed by processing

• overlapped I/O and processing

almost a 2:1 speedup

Next IO starts when processing ends

remote IO

process previous block

Page 8: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger Analysis of GridFTP: multi-stream bug

Page 9: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Sample NetLogger Use

lp = NetLoggerOpen( progname, x-netlog://loghost.lbl.gov, NL_MEM);

while (!done){

NetLoggerWrite(lp, "EVENT_START", "TEST.SIZE=%d",

size); /* perform the task to be monitored */ done = do_something(data, size); NetLoggerWrite(lp, "EVENT_END");

}NetLoggerClose(lp);

Page 10: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLoggerOpen() shell environment variables

• Enable/Disable logging:

setenv NETLOGGER_ON {true, on, yes, 1}: do logging

setenv NETLOGGER_ON {false, off, no, 0}: do not do logging

• Log Destination: setenv NETLOGGER_DEST logging destination

Examples:

setenv NETLOGGER_DEST file://tmp/netlog.log

write log messages to file /tmp/netlog.log

setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov

send log messages to netlogd on host loghost.lbl.gov, default port setenv NETLOGGER_DEST x-netlog://loghost.lbl.gov:6006 send log messages to netlogd on host loghost.lbl.gov, port 6006

• NETLOGGER_DEST overrides the URL passed in via the

NetLoggerOpen() call.

Page 11: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger Write Call

• Creates and Writes the log event:

NetLoggerWrite(nl, “EVENT_NAME”, “EVENTID=%d F2=%d F3=%s F4=%.2f”, id, user_data, user_string, user_float);

– timestamps are automatically done by library

– the “event name” field is required, all other fields are optional

– this call is thread-safe: automatically does a mutex lock around write call (compile time option)

• Example:

NetLoggerWrite(nl, “HTTPD.START_DISK_READ”, “HTTPD.FILENAME=%s HTTPD.HOST=%s”, filename, hostname);

Page 12: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Event ID

• In order to associate a group of events into a “lifeline”, you must assign an event ID to each NetLogger event

• Sample Event Ids– file name– block ID– frame ID– Socket ID– user name– host name– combination of the above– etc.

Page 13: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Sample NetLogger Use with Event IDs

lp = NetLoggerOpen(progname, NULL, NL_ENV);for (i=0; i< num_blocks; i++) {

NetLoggerWrite(lp, “START_READ”, “BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);

read_block(i);NetLoggerWrite(lp, “END_READ”,

“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);NetLoggerWrite(lp, “START_PROCESS”,

“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);process_block(i); NetLoggerWrite(lp, “END_PROCESS”,

“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);NetLoggerWrite(lp, “START_SEND”,

“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);send_block(i); NetLoggerWrite(lp, “END_SEND”,

“BLOCK_ID=%d BLOCK_SIZE=%d”, i, size);}NetLoggerClose(lp);

Page 14: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

How to Instrument Your Application

• You’ll probably want to add a NetLogger event to the following places in your distributed application:– before and after all disk I/O– before and after all network I/O– entering and leaving each distributed component– before and after any significant computation

• e.g.: an FFT operation

– before and after any significant graphics call• e.g.: certain CPU intensive OpenGL calls

• This is usually an iterative process– add more NetLogger events as you zero in on the

bottleneck

Page 15: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NLV Analysis Tool: Plots Time vs. Event Name

Menu bar

Scale for load-line/pointsEvents

Legend

Zoom windowcontrols

Zoom box

Playback controls

Window sizeMax window size

Zoom-box actions

Playback speed

Summaryline

Time axis

You arehere

Title

Page 16: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Use Case: Instrumented GridFTP server

• An Efficient Data Channel Protocol is extremely important

• For example, consider the following:

– FTP server is instrumented to log the start and end times for all network and disk read and writes, which are in blocks of 64 KBytes

– FTP Server has a fast RAID disk array and 1000-BT network

– 10 simultaneous clients (parallel GridFTP)

– Each client is transferring data at 10 Mbytes/second

– Total server throughput = 100 Mbytes/sec

• This will generate roughly 6250 events / second of monitoring data

• Assuming each monitoring event is 50 Bytes, this equates to 313 KBytes/second, or 1.1 GBytes per hour, of monitoring data.

Page 17: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger v2 Improvements

• Rewrite of client library– Multiple log formats allowed with same API

• ASCII (ULM; name = value)—e.g.: DATE=19980430133038.055784 HOST=foo.lbl.gov PROG=testprog NL.EVNT=SEND_DATA SEND_SIZE=49332

• Binary log format— much better performance, and requires less bandwidth, than

ASCII—Crucial for GridFTP use-case: ASCII NetLogger format not fast

enough

– Other language APIs automatically generated with SWIG

• Much faster than “100% native” implementations, esp. for script languages such as Perl, Python, and TCL

• Changes and bug fixes in core automatically propagated to all APIs

Page 18: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger v2 Improvements

• Reliability– Provide fault-tolerance in WAN environment– Recovers gracefully from network problems– Also recovers from restart/reboot of the log receiver

• Trigger API– Dynamically start, stop, change logging levels

Page 19: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

New NetLogger Feature: Binary Format

• Data is sent in native format: “receiver-makes-right”• IEEE floating point only, simple data types supported• Lots of buffering (64K) to reduce system call overhead

0

100

200

300

400

500

600

700

c java

Language

1000 e

ven

ts/s

ec

binary

ascii

pbio

log4j

Page 20: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

New NetLogger Feature:Reliability API

• Reliability API is used to provide fault tolerance– if remote NetLogger receiver become unavailable, will automatically

failover to alternate location (e.g: local disk or 2nd receiver)– Option to periodically try to reconnect to the remote server

• send local disk file on reconnect

• NetLoggerSetBackup( handle, char *backupURL, int SendonReconnect )– Set the backup URL– SendonReconnect flag: if backup URL is local disk, send these results

after a successful reconnect• NetLoggerSetReconnect( handle, int sec )

– Set the number of seconds between reconnect attempts when the backup URL is in use

Page 21: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

New NetLogger Feature:Trigger API

• Trigger API is used to activate monitoring from an external “activation service”

• NetLoggerSetTrigger( handle, char *filename. int sec)– Check the configuration file specified in “filename”,

every “sec” seconds• int NetLoggerSetTriggerFunc( handle, fn_t *fnptr, int sec )

– Call a user-defined trigger function every “sec” seconds• NetLoggerCheckTrigger( handle)

– Check the trigger function immediately (in addition to the periodic check from NetLoggerSetTrigger).

Page 22: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Monitoring Event Archive

• Archived event data is required for – performance analysis and tuning

• compare current performance to previous results

– accounting• The archive must be extremely high performance and scalable to

ensure that it does not become a bottleneck.– heavily loaded FTP server could generate > 6000 monitoring

event per second• Must use pipelining to guarantee that applications and sensors

never block when writing to the archive • buffer event data on disk

• SQL capability desirable– ability to do complex queries, find averages, standard

deviations, etc.

Page 23: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Monitoring Event Archive

EventProducer

event databuffer

reads events from diskbuffer and sends them toarchive consumer

Data Server (e.g. FTP)

Reads events fromproducer and writesto disk buffer

ArchiveConsumer

monitoringDB

event databuffer

ArchiveFeeder

reads events from diskbuffer and feeds them toSQL database at acontrolled rate

Event Archive

Instrumentedapplication writes

events to disk buffer

network

Requirements for FTP Server monitoring:

• 6200 events/sec • 313 KBytes/second, (1.1 GBytes / hour)

Page 24: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Current Work

Page 25: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

NetLogger + GMA

• NetLogger + Grid Monitoring Architecture (GMA)– GMA is a publish / subscribe event system– Using SOAP for control channel– Using NetLogger binary for data channel

• Sender: NetLoggerWrite()• Receiver: NetLoggerRead()

• NetLogger binary format is a very efficient transport protocol for simple event data

Page 26: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Activation Service

event databuffer

Activation Serviceadds entries to“trigger file” to startlogging certainevents for a givenclient

Trigger File: event type A: log host X event type B: log host Y event type C: log host Z …..

NetLogger calls inapplication periodicallychecks “trigger file” fornew entry

Consumer(sends

activationrequest)

NetLoggerInstrumentedApplication

Create output bufferfile on disk

Send events backto consumer

Event Filter /BufferManager

TriggerManager

Event Producer

1.

2.

4.

5.

6.

7.

3.

Activation Service

• We are using the GMA producer/consumer interfaces to build an activation service for NetLogger– Creates the “trigger file” used by the NetLogger trigger API

Page 27: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

Timestamps:Clock Synchronization Issues

• To correlate events from multiple systems requires synchronized clocks– NTP (Network Time Protocol) is required to synchronize the clocks

of all systems • How accurate does this synchronization need to be?

– We have found that to analyze systems from the “user perspective” requires:

• microsecond resolution between events on a single host (gettimeofday() system call)

• millisecond resolution between WAN hosts—fairly easy to achieve this with NTP

• somewhere in between for LAN hosts

• Recommendation: everyone use IETF timestamp standard– example: 2002-01-18T21:20:07.401662Z– YYYY-MM-DDTHH:MM:SS.SZ (T=date-time separator, Z = GMT)– http://www.ietf.org/internet-drafts/draft-ietf-impp-datetime-05.txt

Page 28: Distributed Application Analysis and Debugging using NetLogger v2

GGFNetLogger

For More Information

http://www-didc.lbl.gov/NetLogger

email: [email protected], [email protected]