Swift Distributed Tracing Method and Tools by Zhang Hua (Edward) Standards Team/ETI/CDL/IBM

Swift distributed tracing method and tools v2


A proposal for a Swift session at the OpenStack Atlanta design summit. http://junodesignsummit.sched.org/event/0f185cd5bcc2c9b58c639bba25bc0025#.U3SZRa1dXd4 http://summit.openstack.org/cfp/details/354


Page 1: Swift distributed tracing method and tools v2


Page 2: Swift distributed tracing method and tools v2

Agenda

• Background
• Tracing Proposal
• Tracing Architecture
• Tracing Data Model
• Tracing Analysis Tools
• Reference

Page 3: Swift distributed tracing method and tools v2

Background

• Swift is a large-scale distributed object store that can span thousands of nodes across multiple zones and different regions.

– End-to-end performance is critical to the success of Swift.

– Tools that aid in understanding system behavior and reasoning about performance issues are invaluable.

• Motivation

– For a particular client request X, what is the actual route as it is served by different services? Does the actual route differ from the expected route, even when we know the access patterns?

– What is the performance behavior of the server components and third-party services? Which part is slower than expected?

– How can we quickly diagnose the problem when a request breaks at some point?

e.g. PUT request X:
Client (1) → X → Proxy-Server (1)
Proxy-Server (1) → X’ → Container-Server (1), (2), (3)
Container-Server (1) → X1” → Account-Server (1)
Container-Server (2) → X2” → Account-Server (2)
Container-Server (3) → X3” → Account-Server (3)

Page 4: Swift distributed tracing method and tools v2

Which part is slow? Looking at your logs?

When a request is made to Swift, it is given a unique transaction id. This id should appear in every log line related to that request, which is useful when looking at all the services hit by a single request. But is it efficient or handy to do?

Page 5: Swift distributed tracing method and tools v2

Correlate the logs

Proxy server log @ node-P

Container server log @ node-C

Account server log @ node-A

Object server log @ node-O

Correlate the pieces of information by transaction id and client IP across the logs of all the nodes the request was hashed to!
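To make the cost of manual correlation concrete, here is a minimal Python sketch that groups log lines from several nodes by transaction id. The regular expression, the `correlate` helper and the sample log lines are all assumptions made up for illustration; real Swift log formats depend on the deployment's logging configuration.

```python
import re
from collections import defaultdict

# Assumption: transaction ids look like "tx" followed by hex digits,
# optionally with a "-<hex>" suffix. Adjust the pattern to your log format.
TXN_RE = re.compile(r"\btx[0-9a-f]{16,}(?:-[0-9a-f]+)?")


def correlate(logs):
    """Group log lines from several nodes by transaction id.

    `logs` maps a node label to an iterable of log lines.
    Returns {txn_id: [(node, line), ...]} in input order.
    """
    by_txn = defaultdict(list)
    for node, lines in logs.items():
        for line in lines:
            match = TXN_RE.search(line)
            if match:
                by_txn[match.group(0)].append((node, line.rstrip("\n")))
    return dict(by_txn)


# Made-up sample lines, not real Swift log output.
TXN = "txabcdef0123456789abcde-0053749a11"
sample_logs = {
    "node-P": ["proxy-server: PUT /v1/AUTH_test/c 201 " + TXN],
    "node-C": ["container-server: PUT /sdb1/347/AUTH_test/c 201 " + TXN,
               "container-server: replication pass done"],
}
correlated = correlate(sample_logs)
```

Even in this toy form the logs of every involved node must first be gathered in one place, and the result carries no parent/child relationship between the calls.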

Page 6: Swift distributed tracing method and tools v2

StatsD Metrics

• Counters + Counter_rate (sampling)
– Proxy-Server.{ACO}.{METHOD}.{CODE}
– {ACO}-server.{METHOD}.{CODE}

• Timers + Timer_data
– {ACO}-{DAEMON}.timing
– {ACO}-{DAEMON}.error.timing
– {ACO}-server.{METHOD}.timing

StatsD logging options:
# access_log_statsd_host = localhost
# access_log_statsd_port = 8125
# access_log_statsd_default_sample_rate = 1.0
# access_log_statsd_sample_rate_factor = 1.0
# access_log_statsd_metric_prefix =
# access_log_headers = false
# log_statsd_valid_http_methods = GET,HEAD,POST,PUT,DELETE,COPY,OPTIONS
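As a rough illustration of what counters, timers and sample rates look like on the wire, the sketch below builds StatsD datagrams in the plain-text `name:value|type|@rate` form and sends them over UDP. The helper names and the metric names are hypothetical examples that follow the patterns above; this is not Swift's own StatsD client.

```python
import random
import socket


def format_metric(name, value, kind, sample_rate=1.0):
    """Build a StatsD datagram; kind is "c" for counters or "ms" for timers."""
    payload = f"{name}:{value}|{kind}"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    return payload


def send_metric(name, value, kind, sample_rate=1.0,
                host="localhost", port=8125, rng=random.random):
    """Send one metric over UDP, honoring the sample rate.

    Returns True if a datagram was sent, False if it was sampled out.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # sampled out; nothing is sent
    data = format_metric(name, value, kind, sample_rate).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, (host, port))
    return True


# Example metric names following the {ACO}/{METHOD}/{CODE} patterns.
counter = format_metric("proxy-server.object.PUT.201", 1, "c", sample_rate=0.5)
timer = format_metric("object-server.PUT.timing", 12.5, "ms")
```

Each datagram is an independent aggregate sample, so nothing in it ties the measurement back to a particular request.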

Page 7: Swift distributed tracing method and tools v2

Pros and cons of the current implementation

StatsD
Pros
• Real-time performance metrics to monitor the health of the Swift cluster
• Low performance impact: metrics are sent via UDP, with no hit on local disk I/O
• Supported by different backends for reporting and visualization
Cons
• Designed for cluster-level health, not for end-to-end performance
• Cannot provide metrics data for a specific set of requests
• No relationship between different sets of metrics for specific transactions or requests

Logging
Pros
• Lightweight
• Simple to use
• Rich logging tools
Cons
• Not designed for real time
• Requires more effort to collect and analyze
• No representation for an individual span
• Message size limitation

• Rethink it: can we provide a real-time, end-to-end performance tracing/tracking tool in the Swift infrastructure for developers and users, to facilitate their analysis in development and operation environments?

Page 8: Swift distributed tracing method and tools v2

Our Proposal

• Goal

– For researchers, developers and admins, provide a method of traceability to understand end-to-end performance issues and identify the bottlenecks.

• Scope

– Add WSGI middleware and hooks into Swift components to collect trace data
- The middleware controls the activation and generation of traces
- Generate trace and span ids, collect the data and tie them together
- Send trace data to an aggregator and save it into a repository

– Minor fixes to the current Swift implementation to allow the path to include complete hops
- Similar to trans-id, the trace-id and span-id need to be propagated correctly through HTTP headers between services and components.

– Analysis tools for reporting and visualization
- Query the trace data by tiered trace ids
- Reconstruct the span tree for each trace

Page 9: Swift distributed tracing method and tools v2

Swift Messaging Route

[Diagram: message route for creating a new container, PUT /account/container. Nodes: SwiftClient, ProxyServer, Auth, three ContainerServers, three AccountServers. Messages: Request-X / Response-X (PUT), Request-X’ / Response-X’ (GET), Request-X’’ / Response-X’’ (PUT), Request-X’’’ / Response-X’’’ (PUT).]

• Swift components talk via HTTP request and response messages.

• It is easy to use HTTP headers as the clue to trace the route.

Page 10: Swift distributed tracing method and tools v2

Span Tree of Trace

[Diagram: the same route for creating a new container, PUT /account/container, with tracing headers. Nodes: SwiftClient, ProxyServer, Auth, three ContainerServers, three AccountServers. Request-X (PUT) carries X-Trace-Id: 1234; Request-X’’ (PUT) carries X-Trace-Id: 1234 and X-Span-Id: 1; Request-X’’’ (PUT) carries X-Trace-Id: 1234 and X-Span-Id: 2. Request-X’ / Response-X’ is a GET.]

• X-Trace-Id: identification of each trace

- Use X-Trans-Id to support different clusters?

- Or generate a new id for this purpose?

• X-Span-Id: identification of each span, representing an individual HTTP RESTful call or WSGI call.

- Generate a new span id for this purpose

(note: a UUID can be used for the implementation)
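One way the two headers could be generated and carried forward is sketched below in Python. The function names are hypothetical, the ids are UUID4 hex strings as the note suggests, and a plain case-sensitive dict stands in for real HTTP header handling.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"


def new_id():
    """Return a random 32-character hex id (UUID4)."""
    return uuid.uuid4().hex


def headers_for_next_hop(incoming_headers):
    """Return the tracing headers to attach to an outgoing request.

    The trace id is reused when the incoming request carries one and
    generated otherwise; a fresh span id is generated for every hop.
    """
    trace_id = incoming_headers.get(TRACE_HEADER) or new_id()
    return {TRACE_HEADER: trace_id, SPAN_HEADER: new_id()}


first_hop = headers_for_next_hop({})          # the client starts a new trace
second_hop = headers_for_next_hop(first_hop)  # the proxy calls a backend server
```

Both hops share one trace id while each call gets its own span id, which is what later allows the spans of a trace to be queried together.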

Page 11: Swift distributed tracing method and tools v2

X-trace Middleware Architecture

1. Generate trace ids based on configuration.
2. Create spans and collect trace data.
3. Propagate trace ids to the next hop.
4. Send trace data into a repository via a separate transport protocol/channel.

[Diagram: SwiftClient, ProxyServer, Auth, three ContainerServers and three AccountServers, with x-trace middleware attached to the servers and sending data to the trace data repository.]
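The four responsibilities of the middleware can be sketched as a small WSGI wrapper. This is an illustrative outline under stated assumptions, not the proposed implementation: the class name, the `sink` callable standing in for the repository channel, and the span field names are all hypothetical.

```python
import time
import uuid


class XTraceMiddleware:
    """Hypothetical WSGI middleware illustrating the four steps."""

    def __init__(self, app, service_name, sink, enabled=True):
        self.app = app
        self.service_name = service_name
        self.sink = sink        # callable that receives finished span dicts
        self.enabled = enabled  # step 1: activation comes from configuration

    def __call__(self, environ, start_response):
        if not self.enabled:
            return self.app(environ, start_response)
        # Step 1: reuse the caller's trace id or start a new trace.
        trace_id = environ.get("HTTP_X_TRACE_ID") or uuid.uuid4().hex
        parent = environ.get("HTTP_X_SPAN_ID")
        span_id = uuid.uuid4().hex
        # Step 3: expose the ids so downstream code can copy them to the next hop.
        environ["HTTP_X_TRACE_ID"] = trace_id
        environ["HTTP_X_SPAN_ID"] = span_id
        span = {"trace_id": trace_id, "_id": span_id, "parent": parent,
                "name": environ.get("REQUEST_METHOD"),
                "endpoint": self.service_name, "start_time": time.time()}

        def traced_start_response(status, headers, exc_info=None):
            span["status"] = status
            return start_response(status, headers, exc_info)

        try:
            # Step 2: the span covers the wrapped application call. It ends
            # when the app returns, not when the body is fully consumed.
            return self.app(environ, traced_start_response)
        finally:
            span["end_time"] = time.time()
            self.sink(span)  # step 4: hand off to the repository channel


def demo_app(environ, start_response):
    start_response("201 Created", [("Content-Length", "0")])
    return [b""]


collected = []
traced_app = XTraceMiddleware(demo_app, "proxy.server", collected.append)
body = traced_app({"REQUEST_METHOD": "PUT", "HTTP_X_TRACE_ID": "1234"},
                  lambda status, headers, exc_info=None: None)
```

Here `collected.append` plays the role of the separate transport; in practice the sink would more likely be a non-blocking sender so that tracing does not slow down the request path.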

Page 12: Swift distributed tracing method and tools v2

Patches to fix the request path

• The trace id is passed along by the proxy server in HTTP headers, but it is lost at some points because a new request is created for the next hop.

• Patches are needed to fix this problem and form a complete tracing path for the container server, object server, etc.

[Diagram: SwiftClient, ProxyServer, Auth, three ContainerServers and three AccountServers, with x-trace middleware sending data to the trace data repository. Annotation on the server-to-server hop: propagate the trace id in the next new request.]

Page 13: Swift distributed tracing method and tools v2

Tie together tracing data

Reconstruct the causal and temporal relationship view for a PUT container call

[Timeline figure: spans plotted on a 0 ms to 200 ms axis; every span returns 201.]

Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Container-Server.PUT parent-span-id=1, span-id=2
Container-Server.PUT parent-span-id=1, span-id=3
Container-Server.PUT parent-span-id=1, span-id=4
Account-Server.PUT parent-span-id=2, span-id=5
Account-Server.PUT parent-span-id=3, span-id=6
Account-Server.PUT parent-span-id=4, span-id=7
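The causal view can be rebuilt from the parent-span-id and span-id pairs alone. The Python sketch below does this for the seven spans of the PUT container example plus the client span; the tuple layout and helper names are assumptions chosen for brevity.

```python
from collections import defaultdict

# Spans from the PUT container example: (name, parent-span-id, span-id).
SPANS = [
    ("Swift-Client.PUT", None, 0),
    ("Proxy-Server.PUT", 0, 1),
    ("Container-Server.PUT", 1, 2),
    ("Container-Server.PUT", 1, 3),
    ("Container-Server.PUT", 1, 4),
    ("Account-Server.PUT", 2, 5),
    ("Account-Server.PUT", 3, 6),
    ("Account-Server.PUT", 4, 7),
]


def build_tree(spans):
    """Index spans by id and collect the children of every parent."""
    children = defaultdict(list)
    names = {}
    roots = []
    for name, parent, span_id in spans:
        names[span_id] = name
        if parent is None:
            roots.append(span_id)
        else:
            children[parent].append(span_id)
    return names, children, roots


def render(names, children, roots):
    """Return the tree as indented text lines, depth-first."""
    lines = []

    def walk(span_id, depth):
        lines.append("  " * depth + f"{names[span_id]} (span-id={span_id})")
        for child in sorted(children[span_id]):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


tree_lines = render(*build_tree(SPANS))
```

Printing `tree_lines` shows the client span at the root, the proxy span below it, and each container span followed by the account span it triggered. Adding start and end times to each tuple would give the temporal view as well.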

Page 14: Swift distributed tracing method and tools v2

Another example: upload an object

[Timeline figure: spans plotted on a 0 ms to 200 ms axis; every span returns 201.]

Swift-Client.PUT parent-span-id=none, span-id=0
Proxy-Server.PUT parent-span-id=0, span-id=1
Object-Server.PUT parent-span-id=1, span-id=2
Object-Server.PUT parent-span-id=1, span-id=3
Object-Server.PUT parent-span-id=1, span-id=4
Container-Server.PUT parent-span-id=2, span-id=5
Container-Server.PUT parent-span-id=3, span-id=6
Container-Server.PUT parent-span-id=4, span-id=7

Page 15: Swift distributed tracing method and tools v2

Trace into middleware of the pipeline

• Expand the trace path into the WSGI calls between middleware to get more complete trace data.

• Possible choices

– Decorators for __call__

@trace_here()

def __call__(self, environ, start_response)

– Hack the paste deployment package

– Profile with filters

[Diagram: pipeline:main. SwiftClient calls ProxyServer; x-trace covers the pipeline middleware (tempauth, cache, tempurl, dlo, slo) and sends data to the trace data repository.]

Pipeline = catch_errors gatekeeper healthcheck proxy-logging cache container_sync bulk slo dlo ratelimit crossdomain tempauth tempurl formpost staticweb container-quotas account-quotas proxy-logging proxy-serve
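The decorator option might look roughly like the Python sketch below. Only the name `trace_here` and the `__call__` signature come from the slide; the decorator body, the in-memory `TRACE_LOG` list standing in for the trace data repository, and the demo classes are assumptions for illustration.

```python
import functools
import time

TRACE_LOG = []  # stand-in for the trace data repository (assumption)


def trace_here(name=None):
    """Decorator factory that times a WSGI __call__ and records a span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, environ, start_response):
            label = name or type(self).__name__
            start = time.time()
            try:
                return func(self, environ, start_response)
            finally:
                TRACE_LOG.append({"name": label,
                                  "trace_id": environ.get("HTTP_X_TRACE_ID"),
                                  "start_time": start,
                                  "end_time": time.time()})
        return wrapper
    return decorator


class DemoMiddleware:
    """Hypothetical pipeline filter wrapping the next WSGI app."""

    def __init__(self, app):
        self.app = app

    @trace_here()
    def __call__(self, environ, start_response):
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    start_response("200 OK", [])
    return [b"ok"]


result = DemoMiddleware(demo_app)({"HTTP_X_TRACE_ID": "1234"},
                                  lambda status, headers: None)
```

A decorator keeps the change local to each middleware but has to be applied to every filter in the pipeline, which is presumably why hooking the paste deployment loader is listed as an alternative.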

Page 16: Swift distributed tracing method and tools v2

Backend trace data model

{
    "_id" : "14a467a402904aee87de4028a8595493",
    "endpoint" : {
        "port" : "6031",
        "type" : "server",
        "name" : "container.server",
        "ipv4" : "127.0.0.1"
    },
    "name" : "GET",
    "parent" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
    "start_time" : 1400146616.554865,
    "trace_id" : "d7ff028674c5471e94b964ec37d35546",
    "end_time" : 1400146616.559608,
    "annotations" : [
        {
            "type" : "string",
            "value" : "/sdb1/347/TEMPAUTH_test/summit",
            "key" : "request_path",
            "event" : "sr"
        },
        {
            "type" : "string",
            "value" : "200 OK",
            "key" : "return_code",
            "event" : "ss"
        }
    ]
}

{
    "_id" : "57fbd3ec12fe4912ba89e7a8eb97f2e7",
    "endpoint" : {
        "port" : "8080",
        "type" : "server",
        "name" : "proxy.server",
        "ipv4" : "127.0.0.1"
    },
    "name" : "GET",
    "parent" : "5602ca4010fe420c9fa56528faf711ab",
    "start_time" : 1400146616.490691,
    "trace_id" : "d7ff028674c5471e94b964ec37d35546",
    "end_time" : 1400146616.58012,
    "annotations" : [
        {
            "type" : "string",
            "value" : "/v1/TEMPAUTH_test/summit",
            "key" : "request_path",
            "event" : "sr"
        },
        {
            "type" : "string",
            "value" : "200 OK",
            "key" : "return_code",
            "event" : "ss"
        }
    ]
}
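As a small worked example on the two documents above, the Python sketch below computes each span's duration and the time a span spends outside its direct children. The `endpoint` field is flattened to its name for brevity, and the helper names are hypothetical. Subtracting child durations assumes the children do not overlap in time, which holds here because there is a single child.

```python
# The two example spans, reduced to the fields needed for timing.
SPANS = [
    {"_id": "14a467a402904aee87de4028a8595493",
     "endpoint": "container.server",
     "parent": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
     "start_time": 1400146616.554865, "end_time": 1400146616.559608},
    {"_id": "57fbd3ec12fe4912ba89e7a8eb97f2e7",
     "endpoint": "proxy.server",
     "parent": "5602ca4010fe420c9fa56528faf711ab",
     "start_time": 1400146616.490691, "end_time": 1400146616.58012},
]


def duration_ms(span):
    """Wall-clock duration of one span in milliseconds."""
    return (span["end_time"] - span["start_time"]) * 1000.0


def self_time_ms(span, spans):
    """Duration of `span` minus the time spent in its direct children.

    Assumes the children ran one after another (no overlap).
    """
    child_total = sum(duration_ms(s) for s in spans
                      if s["parent"] == span["_id"])
    return duration_ms(span) - child_total


container_ms = duration_ms(SPANS[0])
proxy_ms = duration_ms(SPANS[1])
proxy_self_ms = self_time_ms(SPANS[1], SPANS)
```

For this GET the container server accounts for roughly 4.7 ms of the proxy's roughly 89.4 ms, so most of the time is spent in the proxy span itself or in calls not shown in these two documents.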

Page 17: Swift distributed tracing method and tools v2

Query and analysis tools

• Query

– Query trace data by trace_id, span_id, order or range by time, group by nodes, annotation keys

• Trace timeline

– Plot the spans on the timeline with causal relationships

• Diagnose

– Analyze the critical path for a successful response

– Identify the failure point in the path

• Simulation

– Replay the recorded processing of the requests

• Data Mining
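The query and diagnosis items could start from helpers as simple as the Python sketch below, which works on span documents shaped like the backend data model. The function names and the sample spans are made up for illustration, and an in-memory list stands in for whatever repository backs the tools.

```python
def query(spans, trace_id):
    """Return the spans of one trace ordered by start time."""
    return sorted((s for s in spans if s["trace_id"] == trace_id),
                  key=lambda s: s["start_time"])


def annotation(span, key):
    """Return the value of the first annotation with the given key, if any."""
    for item in span.get("annotations", []):
        if item["key"] == key:
            return item["value"]
    return None


def failure_points(spans, trace_id):
    """Spans of the trace whose return_code is present and not 2xx."""
    failed = []
    for span in query(spans, trace_id):
        code = annotation(span, "return_code")
        if code is not None and not code.startswith("2"):
            failed.append(span)
    return failed


# Made-up sample data: one trace with a failing container call.
SAMPLE = [
    {"_id": "b", "trace_id": "t1", "start_time": 2.0,
     "annotations": [{"key": "return_code", "value": "503 Service Unavailable"}]},
    {"_id": "a", "trace_id": "t1", "start_time": 1.0,
     "annotations": [{"key": "return_code", "value": "201 Created"}]},
    {"_id": "c", "trace_id": "t2", "start_time": 1.5, "annotations": []},
]
ordered_ids = [s["_id"] for s in query(SAMPLE, "t1")]
failed_ids = [s["_id"] for s in failure_points(SAMPLE, "t1")]
```

Grouping by node or filtering on other annotation keys follows the same pattern; a real tool would push these filters down into the repository's query language rather than scan spans in memory.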

Page 18: Swift distributed tracing method and tools v2

Reference

• Google Dapper: a large-scale distributed systems tracing infrastructure

• Twitter Zipkin: a distributed tracing system that helps gather timing data for all the disparate services at Twitter

• Berkeley X-Trace: a pervasive network tracing framework

Page 19: Swift distributed tracing method and tools v2

Demo

Page 20: Swift distributed tracing method and tools v2

Q&A