Instrumentation Strategies for Response Time Management of Distributed Systems Greg Rogers MACP Consulting


Personal Introduction

• Architect for three response time measurement & analysis system projects (project manager for one)

• MACP Consulting, 2007 (Measurement, Analysis, Capacity & Performance)

• ~20 years commercial field, operating system internals; measurement & performance analysis; capacity planning/analytical & statistical modeling; databases

• Digital Equipment Corp (Digital, DEC); Compaq Computer Corp; Hewlett-Packard; early career @ Grumman Aerospace

• B.S. Statistics, Minor Computer Science, California Polytechnic State University

9/24/2008 MACP Consulting

Time

• We all have had an intuitive feel for what time is since we were children
  – “Are we there yet?”

• Next, our parents and teachers made sure our intuitive notions graduated to the quantitative
  – Early measurement: learning to read the clock and “tell time”; calculating elapsed time

• Time is a sequence of events, one after the other (Einstein?)


Response Time (Rt): Definition; Terminology; Viewpoint

• Time measured from the initiation of some action, event or request until completion of the action or event, or initial receipt of the response
  – Terminals
  – GUIs
  – Service time (St) is a subset of Response time (Rt)
  – Queue analysis

• Specifics depend on viewpoint – Where and What
  – Not philosophical
  – Viewpoint: The system or component(s) of interest
  – Where, what part of “the system” is to be examined, to be measured?
    • In fact, what is the system?
  – Rt vs. Residence time “in the literature”
  – System vs. component of system
  – Lazowska (1984); Gunther (2000, 2005); Menasce (1993)
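The relationship between service time and response time can be sketched in a few lines of Python. The first function simply restates "St is a subset of Rt" as waiting time plus service time. The second assumes a single M/M/1 queue, which is an illustrative modeling assumption and not something the slides specify; the function names and sample numbers are hypothetical.

```python
def response_time(wait_time_s: float, service_time_s: float) -> float:
    """Rt seen by a request: time spent queued plus time being served (St)."""
    return wait_time_s + service_time_s


def mm1_residence_time(service_time_s: float, arrival_rate_per_s: float) -> float:
    """Residence time of one M/M/1 queue (assumed model): R = S / (1 - rho)."""
    utilization = arrival_rate_per_s * service_time_s  # rho = lambda * S
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1.0 - utilization)


# Hypothetical numbers: 20 ms service time, 25 requests per second.
rt = mm1_residence_time(0.020, 25.0)
print(f"St = 20.0 ms, Rt = {rt * 1000:.1f} ms")
```

At 50% utilization the modeled Rt is twice St, which is why the viewpoint (component service time versus system response time) has to be stated before numbers are compared.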


Why Measure Response Time?

• THE Quality Measurement
• Primary business perception of IT
• SLAs
• Management bragging rights?


Original Flavor, a.k.a. The Good Old Days: Measurement on Monolithic Systems (do they even exist anymore?)

• Users connected through ye olde character cell terminals
  – a.k.a. green screens

• User’s transaction normally executed within context of a single system
  – 2008: Single system = operating system instance = “image”


[Figure: Monolithic. Labels: user serial terminals; host; instrumented terminal driver & interactive process.]

Distributed Architecture Response Time

• Client/Server 2-tier, to 3-, 4-tier, multi-tier distributed systems

• End-to-end Rt
  – Normally viewed from (business) user perspective
  – Rt of user’s web form entry; click corresponding to some business transaction; etc.
  – Total time from click or carriage return (initial request) to first character, packet, data item of the response
  – In other words, sum of time for all visits across architectural tiers by the user’s transaction
  – Can be measured at client, or just before or at the first tier of the infrastructure
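The "sum of time for all visits across architectural tiers" definition can be illustrated with a minimal Python sketch. The visit records, tier names and timings below are hypothetical, and each visit's time is assumed to be the time spent on that tier only (not including downstream calls).

```python
from collections import defaultdict

# Hypothetical visits made by one business transaction: (tier, seconds on that visit).
visits = [
    ("web", 0.012), ("app", 0.030), ("db", 0.045),
    ("app", 0.008), ("db", 0.020), ("web", 0.005),
]


def end_to_end_rt(visits):
    """End-to-end Rt as the sum of the time of every visit to every tier."""
    return sum(seconds for _tier, seconds in visits)


def rt_by_tier(visits):
    """Total time per tier, which shows where the end-to-end Rt is being spent."""
    totals = defaultdict(float)
    for tier, seconds in visits:
        totals[tier] += seconds
    return dict(totals)


print(f"end-to-end Rt: {end_to_end_rt(visits):.3f} s")
for tier, seconds in rt_by_tier(visits).items():
    print(f"  {tier}: {seconds:.3f} s")
```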


Two-tier Client/Server: Hint of the Explosion (and Troubleshooting Difficulty) To Come…

[Figure: Client sends Request to Server; Server returns Response to Client.]

Four-Tier Web Architecture (Three-Tier Measured Rt)

[Figure: Client, Web, App and DB tiers, with End-to-End Rt marked.]

The Explosion Is Here!

[Figure: Multi-tier Web. Client, Web, App and DB tiers plus External System(s), with End-to-End Rt marked.]

Issues With Multi-Tier Distributed Systems

• “Sum of time for all visits across architectural tiers by the transaction” (previous slide)
• If a transaction is slow, where is the slowdown occurring?
• Distributed systems not instrumented in an integrated fashion; i.e., no standards “easily implemented” (development impact) to provide this data to operations/performance/capacity planning

• Mythology pervades troubleshooting distributed environments due to lack of essential cross-tier Rt data

• Typical Approach: “Guilt by Correlation”
  – Look at each server (or all servers within a functional tier if one is lucky)
  – Visually correlate in time, high activity on server(s) in one tier with high activity on server(s) in the next tier


Clocks and Time Measurement for Multiple-tier Distributed Architectures

• Time synchronization across systems is critical
• Standard: Network Time Protocol (NTP)
  • One-second to sub-second accuracy across systems on a LAN
  • Storage subsystems often do not support NTP

• Specialized, high accuracy, non-distributed server-attached clocks
  – Cellular telephone tower clock signals
  – Global Positioning System (GPS) clock signals
  – Accuracy to ~tens of microseconds
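Why synchronization matters for cross-tier timing can be shown with the standard NTP offset and round-trip delay calculation from four timestamps. This is a minimal Python sketch; the timestamps are hypothetical, and `corrected_tier_gap` is an illustrative helper, not part of NTP.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation.

    t0: request sent (client clock)     t1: request received (server clock)
    t2: reply sent (server clock)       t3: reply received (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)           # round-trip network delay
    return offset, delay


def corrected_tier_gap(ts_tier_a, ts_tier_b, offset_b_minus_a):
    """Elapsed time from an event on tier A to an event on tier B,
    after removing tier B's clock offset relative to tier A."""
    return (ts_tier_b - offset_b_minus_a) - ts_tier_a


# Hypothetical case: the server clock runs 250 ms ahead of the client clock.
offset, delay = ntp_offset_and_delay(100.000, 100.260, 100.262, 100.022)
print(f"offset = {offset:.3f} s, delay = {delay:.3f} s")

raw_gap = 100.260 - 100.000                            # looks like 260 ms
true_gap = corrected_tier_gap(100.000, 100.260, offset)  # actually about 10 ms
print(f"raw gap = {raw_gap:.3f} s, corrected gap = {true_gap:.3f} s")
```

An uncorrected offset of this size would swamp most per-tier response times, which is the practical reason clock accuracy bounds the precision of any cross-tier Rt breakdown.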


Categories of Rt Instrumentation

• Active
  • Host-based, on systems executing business applications

• Passive
  • No software on host systems

• Hybrid
  • Uses both techniques

• These definitions tend to be from a server-centric viewpoint
  • A network-centric viewpoint of active vs. passive might be whether or not traffic is injected onto the network – Krishnamurthy (2001)
  • Server-centric, since our goal is to provide a breakdown of a large end-to-end response time into each individual tier’s response time
    • The fact that network and server response time components are measured is incidental to this goal


Active Rt Monitoring Techniques

• Host-based - Most common & familiar
  • Synchronous sampling of event-driven Rt accumulators
  • Asynchronous (event-driven; i.e., when specific event occurs)

• Web server Logging
  • Rich data source often mined and written to multi-dimensional data warehouses for customer behavior pattern analysis
  • Can be great source of distributed Rt data but requires custom development to process into usable data

• Middleware Logging
  • Transaction processing, transaction reformat/redirect systems
  • Also a very rich source, also needs custom development to process

• Application-level Logging
  • Custom routines or standard Application Programming Interfaces (APIs)
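The "custom development" needed to turn web server logs into Rt data might look like the following minimal Python sketch. It assumes a Common Log Format line with the request time in microseconds appended as the last field (as Apache's `%D` directive produces); the sample log lines, paths and function name are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed layout: Common Log Format plus a trailing time-taken field in microseconds.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ (?P<micros>\d+)$'
)

SAMPLE_LOG = """\
10.0.0.5 - - [24/Sep/2008:10:15:01 -0700] "GET /order/new HTTP/1.1" 200 5120 184000
10.0.0.7 - - [24/Sep/2008:10:15:02 -0700] "POST /order/submit HTTP/1.1" 200 812 950000
10.0.0.5 - - [24/Sep/2008:10:15:04 -0700] "GET /order/new HTTP/1.1" 200 5120 216000
"""


def mean_rt_by_path(log_text):
    """Mean server-side response time in seconds for each request path."""
    samples = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        samples[match.group("path")].append(int(match.group("micros")) / 1_000_000)
    return {path: mean(values) for path, values in samples.items()}


for path, rt in mean_rt_by_path(SAMPLE_LOG).items():
    print(f"{path}: {rt:.3f} s")
```

Note that this yields the Rt as seen by the web tier only; it includes time spent waiting on downstream tiers but says nothing about where that time went.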


Active Rt Monitoring Techniques, cont’d

• Application Response time Measurement (ARM) API
  • Standardization effort initiated by HP & Tivoli, adopted by Open Group 1999
  • CMG has a dated Q&A still useful as introductory info (ignore links)
    • http://regions.cmg.org/regions/cmgarmw/armfaq.html
  • Callable routines in C & Java for developers to instrument their code for collecting Rt data
  • Current version ARM 4.0 v2
    • http://www.opengroup.org/management/arm/
  • Brief history:
    • http://findarticles.com/p/articles/mi_m0EIN/is_1999_Jan_26/ai_53640469
  • ARM is a moderately successful standard
    • SAS implements ARM in their products and exposes it through macros
      • http://support.sas.com/rnd/scalability/tools/arm/armapi.html
    • Siebel CRM implements ARM in its “SARM” logging levels, some of which can impact the server and hence are oriented toward debugging, not routine data collection
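The general idea of instrumenting code with start/stop calls around a unit of work can be sketched in Python as below. This is not the ARM binding (ARM defines its routines for C and Java); the `measured_transaction` helper, the `measurements` list and the transaction name are hypothetical and only illustrate the pattern.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink: (transaction_name, status, elapsed_seconds).
measurements = []


@contextmanager
def measured_transaction(name):
    """Record the elapsed time of the wrapped block, in the spirit of a start/stop pair."""
    start = time.perf_counter()
    status = "good"
    try:
        yield
    except Exception:
        status = "failed"
        raise
    finally:
        measurements.append((name, status, time.perf_counter() - start))


with measured_transaction("submit_order"):
    time.sleep(0.01)  # stand-in for the real business logic

name, status, elapsed = measurements[-1]
print(f"{name}: {status}, {elapsed * 1000:.1f} ms")
```

Recording a status alongside the elapsed time keeps failed transactions from being averaged in with successful ones.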



Active Rt Monitoring Techniques, cont’d

• Middleware-dependent APIs
  • Java Management Extensions (JMX)
    • http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/

• Java Virtual Machine (JVM) Bytecode Instrumentation
  • Commercial products used to profile or analyze Java app performance
    • The best way to peer into that little JVM black box, but I digress…
  • Dynamically loaded at run-time
  • Useful source of distributed Rt data
    • Java method calls (“methods”) to remote systems
      – In this case, the method Rt is a distributed Rt!
    • Methods making remote calls can be discovered via sorting method names by Rt
      – Not a bad way to go in situations where little is known about deeper levels of the application
    • Make a map of remote call methods
      – Might be a way to filter further and characterize different transaction classes by a given remote method if the data regularly show high Rt variability
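Sorting methods by Rt to surface likely remote calls, and flagging those with high Rt variability, can be sketched as follows. The method names and timings are hypothetical stand-ins for profiler output, and the 0.5 coefficient-of-variation threshold is an arbitrary illustrative choice.

```python
from statistics import mean, pstdev

# Hypothetical per-method Rt samples (seconds), as a bytecode profiler might export them.
method_samples = {
    "OrderService.validate": [0.002, 0.003, 0.002],
    "InventoryClient.reserve": [0.120, 0.480, 0.090],
    "PricingCache.lookup": [0.001, 0.001, 0.002],
    "PaymentGateway.authorize": [0.300, 0.310, 0.290],
}


def rank_by_mean_rt(samples):
    """Rows of (method, mean Rt, standard deviation), slowest first."""
    rows = [(name, mean(values), pstdev(values)) for name, values in samples.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def high_variability(samples, cv_threshold=0.5):
    """Methods whose coefficient of variation (stdev / mean) exceeds the threshold."""
    return [name for name, avg, sd in rank_by_mean_rt(samples) if sd / avg > cv_threshold]


ranked = rank_by_mean_rt(method_samples)
for name, avg, sd in ranked:
    print(f"{name:28s} mean={avg:.4f}s stdev={sd:.4f}s")
print("high variability:", high_variability(method_samples))
```

In this made-up data the two slowest methods are the ones that call remote systems, and only one of them shows the high variability that would justify splitting it into separate transaction classes.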




• Synthetic Sampling
  – Injecting synthetic requests onto the network from PC-based “robots” and measuring response time of “representative” user transactions
    • Similar idea to how Keynote Systems measures the top Internet web sites
    • Has been a popular technique implemented by a number of large vendors for products that sample end-to-end Rt measurements on corporate LANs/WANs
  – More and more customers are demanding that all their business transactions be measured, not just a small sample of synthetic requests
  – Applications are typically intolerant of destructive (write) synthetic transactions (e.g., accounting). Workarounds usually undermine quality of measurement (e.g., hitting the same “dummy” accounts)
  – Used in isolation, synthetic sampling can and will miss causes of long response times
  – If already in place, well-implemented synthetic sampling is a “known load” for passive monitoring until the latter can be used to more fully characterize Rt of the real business transaction load
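A synthetic sampling "robot" reduces to timing a representative transaction repeatedly. The sketch below is a minimal Python version; `sample_rt` and `fake_lookup` are hypothetical names, and the sleep stands in for a real read-only request.

```python
import time
from statistics import mean


def sample_rt(transaction, samples=5, pause_s=0.0):
    """Run a synthetic transaction several times and return each elapsed time in seconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        transaction()  # one synthetic request, start to finish
        timings.append(time.perf_counter() - start)
        time.sleep(pause_s)  # spacing between samples
    return timings


def fake_lookup():
    """Stand-in for a representative, non-destructive user transaction."""
    time.sleep(0.005)


timings = sample_rt(fake_lookup, samples=3)
print(f"samples={len(timings)} mean Rt={mean(timings) * 1000:.1f} ms")
```

In practice the callable could wrap a read-only HTTP request (for example via `urllib.request.urlopen`). Whatever happens between samples goes unobserved, which is the sampling limitation noted above.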



• Insertion of tags, markers, IDs into protocol headers
  • Relatively recent on the commercial landscape
  • Custom implementations exist in end-business development organizations
    • Strong, forward-looking architects and management team seeing business benefit

• Commercial:
  • Agent on host inserts tag into outbound protocol/message header
  • Agent at next tier reads tag
    • Logs either locally or centrally for post-processing
  • Tracks transactions across tiers, calculates time spent on each tier
    • “The bouncing ball”
  • May measure all traffic but only some of the time, not always “on”

• Custom:
  • Application instrumented to insert tags in its own requests and responses
  • Local logging of raw data, post-processing on business system or central system, insertion into central DB with custom-built visualization and reporting software
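The tag-and-log approach can be sketched in Python as two pieces: inserting a transaction ID into an outbound header, and post-processing the per-tier log records that carry that ID. The header name `X-Txn-Id`, the record layout and the timestamps are all hypothetical, and the tiers' clocks are assumed to be synchronized.

```python
import uuid
from collections import defaultdict

TAG_HEADER = "X-Txn-Id"  # hypothetical header name, not a standard


def tag_request(headers):
    """Return a copy of the headers carrying a transaction tag (kept if already present)."""
    tagged = dict(headers)
    tagged.setdefault(TAG_HEADER, uuid.uuid4().hex)
    return tagged


# Hypothetical records logged by each tier: (txn_id, tier, event, timestamp_s).
log = [
    ("t1", "web", "in", 10.000), ("t1", "app", "in", 10.004),
    ("t1", "db", "in", 10.010), ("t1", "db", "out", 10.050),
    ("t1", "app", "out", 10.062), ("t1", "web", "out", 10.070),
]


def tier_times(records, txn_id):
    """Time between 'in' and 'out' on each tier for one tagged transaction."""
    entered = {}
    totals = defaultdict(float)
    for rec_id, tier, event, ts in sorted(records, key=lambda r: r[3]):
        if rec_id != txn_id:
            continue
        if event == "in":
            entered[tier] = ts
        elif event == "out" and tier in entered:
            totals[tier] += ts - entered.pop(tier)
    return dict(totals)


print(tag_request({"Host": "example.internal"}))
print(tier_times(log, "t1"))
```

Each tier's figure here includes the time spent in the tiers below it; subtracting the downstream figure gives the time attributable to that tier alone.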


Difficulties With Some Active Techniques

• Logging and log processing require development resources
  – Developers may view instrumentation as another potential source of bugs

• Can be viewed as a delay factor in time-to-market
• Custom implementation requires strong architects and management to make the case and see it through in each development release

• Perceived to impact another limited resource: Testing cycles
• Logging levels (degree of detail, types of data logged) can be implemented to limit impact, but sometimes the logging level needed for “useful” data imposes significant resource utilization overhead on over-taxed servers, or worse, increases application service time (execution time) overhead, affecting throughput scalability


Passive Rt Monitoring Techniques

• Recent technology-driven innovation & economics make passive instrumentation possible
  – Processors; NICs; PCI-express I/O; Serial Attached SCSI (SAS) disks; open source software

• Deeper innovation makes passive instrumentation a reality
  – Multi-threaded, efficient software design in particular

• Passive monitoring may be commonly referred to as network sniffing, but this is misleading – the technology is far more capable and sophisticated than a “network sniffer” implies
  – This is real time processing of the complete traffic stream

• Widely deployed in network security monitoring
  – Though most solutions do not process the entire packet


Passive Rt Monitoring Techniques, cont’d

• Passive Rt measurement techniques read all network packets at strategic points in the network
  • Either part or all of each packet is processed
    – Answers the question, “The response time of what?” [component]
      • Is it a technical item or a business transaction a manager would care about the response time of?

• The more of each packet processed:
  • The more business value can be delivered – The business context of the transaction is typically at the deepest layer (see slides)
    • Measured Rt of business-critical transactions (not only technical IT items); time series counts (throughput); per-transaction or per-transaction class resource profiles (network)
  • The heavier the load on the probe – f(λ) (packet arrival rate)
    • Some points in a network of multi-tier distributed systems can be real fire hoses!
    • Major challenge for passive monitoring vendors who try to add value beyond IP, TCP or HTTP headers – do they report dropped packets?
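The load on a probe as a function of packet arrival rate λ can be made concrete with some rough arithmetic. The Python sketch below uses hypothetical link figures and ignores framing overhead such as preamble and inter-frame gap, so treat the numbers as order-of-magnitude only.

```python
def packet_arrival_rate(link_bits_per_s, utilization, avg_packet_bytes):
    """Approximate packets per second (lambda) arriving at a probe on one link."""
    return link_bits_per_s * utilization / (8 * avg_packet_bytes)


def per_packet_budget_s(arrival_rate_pps):
    """Average time available to process each packet without falling behind."""
    return 1.0 / arrival_rate_pps


# Hypothetical measurement point: 1 Gb/s link, 50% utilized, 500-byte average packets.
rate = packet_arrival_rate(1_000_000_000, 0.50, 500)
budget = per_packet_budget_s(rate)
print(f"lambda = {rate:,.0f} packets/s, budget = {budget * 1e6:.1f} microseconds/packet")
```

Deep inspection of every packet has to fit inside that per-packet budget on average, or the probe drops packets, which is why asking whether drops are reported is a fair question.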


Passive Rt Monitoring Techniques, cont’d

• Beware of vendor-speak
  • Know your terms and exactly what the “client” and “server” are at any point in a logical infrastructure when capabilities are being discussed
  • Use diagrams and take your time to first understand what is being measured in your infrastructure by the vendor’s solution
    • Reports, graphs, etc. come afterward
    • Reporting adds tremendous value but understanding how the fundamental measurements relate to your infrastructure is crucial for determining whether it can solve your business & IT challenges
  • Assumptions are often unspoken. Ask more than enough questions and get the answers necessary for everyone to be clearly on the same page
    • Is all of the Rt data acquired passively? Are there points in the infrastructure where the response time measurement solution is active, not passive? At what logging level; i.e., at what level of impact to the [business-critical] server? Trust but verify…
    • Does the solution measure all traffic, all the time, or only part of the time? Anything less than all traffic is sampling, and can miss
    • At what part of the packet does the passive solution stop reading? Does it stop at the TCP header and call the rest of it “the application”?
  • Some are skilled at leaving people with the impression or belief that their solution can do things that it in fact cannot do


Passive Rt Monitoring Across Tiers (Logical Measurement Points In Between Tiers)

[Figure: Client, Web, App and Database tiers. The user’s end-to-end Rt spans all tiers; the logical measurement points between tiers yield Web tier Rt, Web-App tier Rt and App-DB tier Rt.]

Complete end-to-end Rt measurement for remote clients may require a single passive measurement at each client location or active client measurement software
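How measurements taken at the logical points between tiers break an end-to-end Rt into per-tier components can be sketched in Python. The measurement point names, timestamps and the one-downstream-call-per-request assumption are all hypothetical simplifications.

```python
# Hypothetical probe observations for one transaction:
# point -> (time request seen, time first response packet seen), in seconds.
observations = {
    "web": (0.000, 0.250),      # in front of the web tier
    "web-app": (0.010, 0.235),  # between web and app tiers
    "app-db": (0.030, 0.200),   # between app and database tiers
}


def measured_rt(obs):
    """Rt at each measurement point: response seen minus request seen."""
    return {point: response - request for point, (request, response) in obs.items()}


def time_per_tier(rt):
    """Time attributable to each tier, assuming one downstream call per request."""
    return {
        "web tier": rt["web"] - rt["web-app"],
        "app tier": rt["web-app"] - rt["app-db"],
        "db tier": rt["app-db"],
    }


rt = measured_rt(observations)
breakdown = time_per_tier(rt)
for tier, seconds in breakdown.items():
    print(f"{tier}: {seconds * 1000:.0f} ms")
```

The per-tier figures sum to the Rt measured in front of the web tier, so a slow transaction can be attributed to a specific tier from the measurements themselves.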

Network-Centric View of Packet

[Figure: Packet layout: Ethernet Frame | IP Packet Header | TCP Message Header | “Application” (Telnet; FTP; SMTP; DNS; NNTP; HTTP…) | Ethernet Frame CRC]

Business/Performance/CP/Application-Centric View of Packet (Business context is often deep inside message body of last protocol)

[Figure: Packet layout, example 1: Ethernet Frame | IP Packet Header | TCP Message Header | HTTP Header | XML Message Header | XML Message Body | Ethernet Frame CRC]

[Figure: Packet layout, example 2: Ethernet Frame | IP Packet Header | TCP Message Header | Proprietary Middleware Message Header | Proprietary Middleware Message Body | Ethernet Frame CRC]
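Reading past the TCP header to reach the business context can be sketched in Python as below. The frame is synthetic and built in the same snippet; the XML element names are hypothetical, and the sketch assumes untagged Ethernet, IPv4, and a message that fits in a single packet (no TCP reassembly, no checksum validation).

```python
import struct
import xml.etree.ElementTree as ET


def application_payload(frame: bytes) -> bytes:
    """Skip the Ethernet, IPv4 and TCP headers and return the application bytes."""
    if struct.unpack("!H", frame[12:14])[0] != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip_start = 14                                    # Ethernet header is 14 bytes
    ip_header_len = (frame[ip_start] & 0x0F) * 4     # IHL field, in 32-bit words
    total_len = struct.unpack("!H", frame[ip_start + 2:ip_start + 4])[0]
    if frame[ip_start + 9] != 6:
        raise ValueError("not a TCP segment")
    tcp_start = ip_start + ip_header_len
    tcp_header_len = (frame[tcp_start + 12] >> 4) * 4  # data offset, in 32-bit words
    return frame[tcp_start + tcp_header_len:ip_start + total_len]


def build_test_frame(payload: bytes) -> bytes:
    """Build a synthetic frame for the example (checksums left at zero)."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64, 6, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
    return eth + ip + tcp + payload


def business_transaction_type(payload: bytes) -> str:
    """Pull a (hypothetical) transaction type out of the XML body of an HTTP request."""
    _http_header, _, body = payload.partition(b"\r\n\r\n")
    return ET.fromstring(body).findtext("type")


http = (b"POST /orders HTTP/1.1\r\nHost: example\r\nContent-Type: text/xml\r\n\r\n"
        b"<order><type>NewOrder</type></order>")
frame = build_test_frame(http)
print(business_transaction_type(application_payload(frame)))
```

A probe that stops at the TCP header sees only `application_payload`'s input as opaque bytes; the transaction type a business manager cares about appears only after the HTTP header and XML body are parsed.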

Passive Benefits

• Development of passive measurement solutions easily proceeds without impact to application development & test cycles
  – One-time scheduled downtime, connect taps or configure span ports
  – Everything afterward is, well,… Passive!
  – Taps electrically prevent probes from inadvertently writing into the production network path

• Quality measurements of any and all business transactions
  – Much additional business data can be filtered, persisted and reported

• Aid to testing & development: Visibility
• Aid to production: Visibility (knowledge), myth-busting quantitative performance data, improves time to problem resolution
  – And delicious capacity planning data…

• Detailed traces and very precise timing

