
Compatibility Testing via Patterns-Based Trace Comparison

Venkatesh-Prasad Ranganath Pradip Vallathol Pankaj Gupta

Created: 03/15/2012
Compiled: March 16, 2013

Technical Report
MSR-TR-2012-87

Microsoft Research
Microsoft Corporation

One Microsoft Way
Redmond, WA 98052

http://www.research.microsoft.com


1 Introduction

In its simplest form, compatibility testing checks if software component A can use or serve another software component B via published interfaces.1

The most common form of compatibility (testing) is backward compatibility (testing), e.g. can Microsoft Excel 2010 be used to open and manipulate worksheets created using Microsoft Excel 2007? As software gets more modular and extensible, it is necessary to ensure that the components constituting a software program are mutually compatible. To this end, most software takes proactive measures to allow only the use of compatible components.

A good example scenario is one where users upgrade their favourite web browser. Upon upgrading their browsers from version V1 to V2, users are notified of any installed browser plug-ins that need to be disabled due to incompatibilities with version V2 of the browser or that need to be upgraded to newer versions. These incompatibilities can stem from syntactic or semantic changes to the published interfaces of the browser.

Of these changes, incompatible syntactic changes to a published interface (e.g. changes in function signatures) can be easily detected by compiling plug-ins against version V2 of the browser.

As for semantic changes to a published interface, very few of them (e.g. changes in the values of an enumeration type) can be easily detected by compiling plug-ins against version V2 of the browser. Most semantic changes will require plug-ins to be executed with version V2 of the browser and the change to be exercised during the execution.

For example, suppose version V2 of the browser removed the value Monday from an enumeration type for weekdays. If none of the decisions in plug-in A depend on the value Monday, then the behavior of plug-in A will be unaffected with version V2 of the browser. On the other hand, if a decision in plug-in B depends on Monday, then the behavior of plug-in B with version V2 of the browser will most likely be altered, possibly leading to malfunction of both plug-in B and the browser.

1The notion of published interfaces was introduced by Martin Fowler [13].

Beyond such semantic changes, programs can depend on any behaviors observable at published interfaces of other programs.

As an example, consider a C language library L that exposes two functions f and g that consume a pointer to a structure with a field q as their first argument and require field q to contain value v upon invocation. In version V1 of L, suppose the effect of f on field q is unspecified and q is unmodified upon returning from f. This observation can be exploited to optimize clients of version V1 of L: when invoking f and g in sequence, assign v to q and then invoke f and g without any intervening assignment of v to q. However, in a subsequent version V2 of L, if the implementation of f modifies q, then clients optimized against version V1 of L will fail when operating with version V2 of L.

We encountered a real-world instance of this scenario during the development of Windows 8. To support the USB 3.0 protocol in Windows 8, the USB team built a new USB 3.0 driver stack from scratch. Since the USB 3.0 protocol is backward compatible with the USB 2.0 protocol, the USB 3.0 driver stack needed to support USB 2.0 devices along with their device drivers that were built against the existing USB 2.0 driver stack. Hence, when servicing USB 2.0 devices, the USB 3.0 driver stack was required to mimic the observable behavior of the USB 2.0 driver stack. In this context, an incompatibility issue (deviation) could be that the USB 3.0 driver stack completes isochronous transfer requests at PASSIVE_LEVEL interrupt request level while the USB 2.0 driver stack always completes such requests at DISPATCH_LEVEL interrupt request level. Further, such deviations may not affect the observable behaviors of the USB 2.0 devices used to test the USB 3.0 driver stack. However, they could possibly affect untested USB 2.0 devices, which could be numerous given the number of unique USB devices in the world. So, we needed a way to test for and uncover such subtle incompatibilities between these USB driver stacks.

To discover such incompatibilities due to dependences on observable behaviors (and ensure compatibility), many software shops employ field testing by providing customers with pre-release versions of their


software (e.g. Windows 8 by Microsoft, Firefox by Mozilla). The success of field testing in uncovering compatibility issues depends both on the number of customers participating in such testing and the extent to which customers use and exercise various behaviors of the software. When pre-release versions of software have low adoption and usage, compatibility issues can go undiscovered until the release of the software. Hence, upon release and wider adoption of the software, latent compatibility issues can surface, causing reliability issues for users and maintenance costs for software vendors.

Proposed Approach

Inspired by the above problem, we devised a simple data-driven differential approach to test for compatibility. Given two programs with identical published interfaces, the approach relies on clients interacting identically with these programs via their published interfaces, i.e. the clients submit the same requests in the same order. These interactions (observable behaviors) are traced and the resulting traces are compared to detect possible compatibility issues resulting from both the presence of new unobserved behaviors and the absence of previously observed behaviors. Consequently, our approach can uncover issues that do not affect the observed (current) executions but could affect yet unobserved (future) executions.

For the purpose of comparison, traces are abstracted as sets of structural and temporal patterns (based on existing notions of patterns that can be mined using existing algorithms) and these sets of patterns are compared using simple set operations, i.e. set union, intersection, and difference. This is the key characteristic of our approach.

In terms of guarantees, our approach is unsound: not every detected deviation/difference (compatibility issue) is a bug; hence, it entails manual effort to examine detected deviations and classify them as either benign deviations or bugs. On the other hand, our approach is complete: all deviations that can be represented by the pre-defined class of patterns used to abstract traces will be detected.

As evaluation, during the development of Windows 8, we used our approach to test compatibility between the USB 2.0 and USB 3.0 bus drivers in Windows 8. When used within an appropriate work flow, the approach uncovered 25 compatibility bugs in the USB 3.0 bus driver by analyzing pairs of traces from only 14 USB 2.0 devices that were functioning without errors with both USB bus drivers.

From this effort, we make the following key contributions in this manuscript.

1. We propose a differential approach to compatibility testing based on patterns-based comparison of execution traces. This approach can uncover compatibility issues stemming from both the presence of unobserved behaviors and the absence of observed behaviors.

2. We demonstrate the effectiveness of the proposed approach in an industrial setting by using it to test compatibility between the USB 2.0 and USB 3.0 bus drivers during the Windows 8 development cycle. We also demonstrate that the approach can detect compatibility issues from succeeding executions.

3. We illustrate that sets of structural and temporal patterns observed in traces can serve as effective trace abstractions to enable software engineering and maintenance tasks, e.g. compatibility testing.

The rest of this manuscript is organized as follows. Sections 2 and 3 describe compatibility testing and patterns-based trace comparison. Section 4 provides a detailed exposition about our experience using patterns-based trace comparison to test compatibility between the USB 2.0 and USB 3.0 bus drivers in Windows 8. Section 5 discusses related efforts. Section 6 presents future possibilities.

2 Compatibility Testing

As in all forms of testing, compatibility testing is used to check if a program Pt consumes a given input x and produces the expected output y, i.e. Pt(x) = y.


The distinction of compatibility testing is that the expected output is the output produced by another program Pr upon consuming x, i.e. Pt(x) = y = Pr(x).

The above simple view of compatibility testing suffices when we are testing if the observed output is identical to the expected output, e.g. for argument -2, does function abs1 return the same value as function abs2? In many cases, we are interested in testing if the observed value is similar (identical modulo certain differences) to the expected output, e.g. upon consuming input x, do programs P1 and P2 output log files that contain the same messages but not necessarily in the same order? To admit such notions of similarity into compatibility testing, we define compatibility testing as follows.

Definition 1 Given an input x and two programs Pr and Pt, Pt and Pr are (α,ψ)-compatible (denoted as Pt ∼α,ψ Pr) if ψ(α(Pt(x)), α(Pr(x))) holds, where α is a transformation function over program outputs and ψ is a binary test predicate over transformation values.2

With this definition, different forms of compatibility testing can be described using appropriate combinations of transformation functions and test predicates. For example, the simplest form of compatibility testing based on equality of output (i.e. Pt(x) = Pr(x)) can be described by ∼id,= with the identity function as α and the equality predicate as ψ.
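
To make the definition concrete, the following is a minimal Python sketch (our illustration, not part of the original report) of an (α,ψ)-compatibility check; the programs abs1 and abs2 are hypothetical stand-ins.

    def compatible(alpha, psi, P_t, P_r, x):
        # alpha transforms program outputs; psi is a binary test predicate.
        return psi(alpha(P_t(x)), alpha(P_r(x)))

    # The simplest form, ~(id,=): identity transformation and equality predicate.
    identity = lambda output: output
    equal = lambda a, b: a == b

    # Hypothetical programs under test.
    abs1 = abs
    abs2 = lambda n: n if n >= 0 else -n

    assert compatible(identity, equal, abs1, abs2, -2)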

In this vein, we can describe compatibility testing of programs that output traces (sequences).3 Consider programs that consume an input x and produce a trace π as output. By the above definition, given two programs Pt and Pr that output traces πt and πr upon consuming test input x, Pt is (α,ψ)-compatible to Pr if ψ(α(πt), α(πr)) holds (as Pt(x) = πt and Pr(x) = πr). Hence, the problem of compatibility testing based on traces reduces to the problem of trace comparison under α and ψ.

In many situations, we want to test programs that consume input sequences and produce output

2Regression testing can be viewed as a form of compatibility testing where Pr and Pt are two consecutive versions Pi and Pi+1 of the same program P. Further, this definition can be generalized to other forms of testing.

3We will use the terms trace and sequence interchangeably.

traces. The above definition of compatibility testing can be applied in such situations by conditioning the output traces produced by programs as follows: when a program P consumes an input sequence X = x1, x2, . . . , xn and produces an output trace π, X should be a subsequence of π and, for every input xk ∈ X, if the corresponding output yk of P exists in π, then yk follows xk in π, i.e. π[j] = yk =⇒ (∃i. π[i] = xk ∧ i < j).

The notion of trace similarity determined by α and ψ can range from simple equality to subsequence equivalence (similar to stuttering equivalence [16]). Further, the notion of similarity determines the kind of issues that can be uncovered via compatibility testing.

In summary, under a notion of similarity determined by α and ψ, the problem of testing compatibility between programs based on their output traces reduces to the problem of trace comparison.

3 Patterns-based Trace Comparison

In this section, we describe two notions of trace similarity. Both notions of similarity use set equality as the test predicate ψ. The transformation functions α in these notions are based on event abstractions and the binary linear temporal patterns proposed by Lo et al. in [17].4

From here on, an event e = {a1 = c1, a2 = c2, a3 = c3, . . . , an = cn} is a map from attributes to values and a trace t = (e1, e2, . . . , en) is a sequence of events.

3.1 Structural Patterns-based Similarity

The most common notion of trace similarity is based on the presence/absence of events in traces (while ignoring the order of events), i.e. consider traces as sets of events and compare these sets. We refer to this notion as event-based trace similarity.

While this notion is simple, it can be ineffective when different attributes of an event have different

4For more details about these patterns and the corresponding pattern mining algorithms, please refer to [17].


relevance in different scenarios. For example, when comparing two execution traces in terms of invoked functions, it might suffice to consider views of events limited to the attribute capturing function names, e.g. consider the {fun=“fopen”} view of the invocation event {fun=“fopen”, arg1=“passwd.txt”, arg2=“r”, return=0x21}.

From this observation, we propose a notion of trace similarity based on the presence/absence of (event) abstractions of events in traces, where any non-empty subset of an event e is an abstraction (view) of e. We refer to this notion as event abstraction based trace similarity.5

In many situations, it is useful to consider data constraints spanning multiple attributes. For this purpose, we propose using the notion of event abstraction with quantification: given an event abstraction e = {a1 = c1, a2 = c2, a3 = c3, . . . , an = cn}, an abstraction e′ = {a1 = v1, a2 = c2, a3 = v2, . . . , an = cn} is a quantified abstraction of e if there exists a substitution θ (a non-empty map from variables to values) such that ∀ai. e[ai] = e′[ai] ∨ e[ai] = θ(e′[ai]), where the vi are free variables. Consequently, an attribute is quantified if it is associated with a free variable (as opposed to a value).

These event abstractions are patterns of event structures observed in a trace; hence, we refer to these abstractions as structural patterns. Further, structural patterns with and without quantification are referred to as quantified structural patterns and unquantified structural patterns, respectively. Finally, we define the notion of structural patterns-based trace similarity as comparing the sets of structural patterns observed in traces via set equality.

In other words, compatibility testing based on traces can be realized with a transformation function that transforms a trace into a set of observed structural patterns and a test predicate that checks for set equality.
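
As a concrete illustration (ours, not the mining toolkit used in Section 4), a minimal Python sketch of unquantified structural patterns and the resulting trace similarity check could look as follows; the example traces and attribute names are hypothetical.

    from itertools import combinations

    def structural_patterns(trace):
        # The unquantified structural patterns of a trace: every non-empty
        # subset (view) of every event, with events given as attribute->value maps.
        patterns = set()
        for event in trace:
            items = sorted(event.items())
            for size in range(1, len(items) + 1):
                for view in combinations(items, size):
                    patterns.add(frozenset(view))
        return patterns

    def structurally_similar(trace_t, trace_r):
        # Structural patterns-based trace similarity: set equality of pattern sets.
        return structural_patterns(trace_t) == structural_patterns(trace_r)

    t1 = [{"fun": "fopen", "arg1": "passwd.txt", "return": "0x21"}]
    t2 = [{"fun": "fopen", "arg1": "passwd.txt", "return": "0x23"}]
    print(structurally_similar(t1, t2))  # False: the return values differ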

5In the presence of full domain knowledge, this notion is not required as the exact data fragments can be extracted from events; however, full domain knowledge is seldom available in real-world settings.

3.2 Temporal Patterns-based Similarity

Structural patterns-based trace similarity will be ineffective in situations where traces are identical in terms of the structural patterns but differ in the order of structural patterns. For example, traces t1 = (a, b, c) and t2 = (a, c, b) are identical under structural patterns-based trace similarity as both traces result in the same set of structural patterns {a, b, c}.

To address this drawback, we propose transforming a trace into a set of temporal patterns composed of structural patterns. Of the numerous forms of temporal patterns, we consider the following four binary linear temporal patterns defined in [17].

Given structural patterns A and B observed in events of a trace,

1. A →* B (B →* A) is observed in the trace when an event ei matching A is followed (preceded) by an event ej matching B.6 These are eventually patterns.

2. A →a B (B →a A) is observed in the trace when an event ei matching A is followed (preceded) by an event ej matching B and no event between ei and ej matches A. These are alternation patterns.

In the above patterns, either both A and B are quantified or neither is quantified. Further, when A and B are quantified, the associated substitutions resulting from matching events are identical, i.e. θA = θB. For example, the pair of event abstractions {fun=“fopen”, return=0x21} and {fun=“fclose”, arg1=0x21} match the temporal pattern {fun=“fopen”, return=v1} →* {fun=“fclose”, arg1=v1} under the substitution θ = {v1 ↦ 0x21}. However, the event abstractions {fun=“fopen”, return=0x21} and {fun=“fclose”, arg1=0x23} do not match the same temporal pattern as the event abstractions do not share a common substitution.
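
The following minimal Python sketch (our illustration; the actual mining is done with Tark, Section 4.3.2) checks whether an unquantified eventually or alternation pattern is observed in a trace; the trace and attribute names are hypothetical.

    def matches(pattern, event):
        # An event matches a structural pattern if the pattern is an abstraction
        # (i.e. a subset) of the event.
        return all(event.get(attr) == value for attr, value in pattern.items())

    def eventually_observed(trace, A, B):
        # A ->* B: some event matching A is eventually followed by an event matching B.
        return any(matches(A, trace[i]) and any(matches(B, e) for e in trace[i + 1:])
                   for i in range(len(trace)))

    def alternation_observed(trace, A, B):
        # A ->a B: some event matching A is followed by an event matching B
        # with no event matching A in between.
        for i in range(len(trace)):
            if not matches(A, trace[i]):
                continue
            for later in trace[i + 1:]:
                if matches(B, later):
                    return True
                if matches(A, later):
                    break
        return False

    trace = [{"fun": "fopen", "return": "0x21"},
             {"fun": "fread"},
             {"fun": "fclose", "arg1": "0x21"}]
    print(eventually_observed(trace, {"fun": "fopen"}, {"fun": "fclose"}))   # True
    print(alternation_observed(trace, {"fun": "fopen"}, {"fun": "fclose"}))  # True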

6Given a structural pattern C, an event e matches C if C is an abstraction of e.


We shall refer to temporal patterns with and without quantification as quantified temporal patterns and unquantified temporal patterns, respectively. Finally, we define the notion of temporal patterns-based trace similarity as comparing the sets of all temporal patterns observed in traces.

In other words, compatibility testing based on traces can also be realized with a transformation function that transforms a trace into a set of observed temporal patterns and a test predicate that checks for set equality.

3.3 Discussion

Statistical Similarity The above notions of trace similarity are insufficient when the traces being compared exhibit the same set of patterns but differ in terms of the statistical properties of the observed patterns. In such cases, we need to consider patterns along with their statistical properties and employ notions of similarity sensitive to statistical properties. For example, each pattern in a trace can be associated with its frequency in the trace. So, while comparing two traces, a pattern common to both traces is deemed a difference if its frequency is below a preset significance threshold in one trace and above the same threshold in the other trace.
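
A minimal sketch of such a frequency-threshold comparison (our illustration, with hypothetical pattern names and counts):

    from collections import Counter

    def statistical_differences(freq_t, freq_r, threshold):
        # Flag patterns common to both traces whose frequency crosses the preset
        # significance threshold in one trace but not in the other.
        common = set(freq_t) & set(freq_r)
        return {p for p in common
                if (freq_t[p] >= threshold) != (freq_r[p] >= threshold)}

    freq_t = Counter({"fopen ->* fclose": 12, "fread": 1})
    freq_r = Counter({"fopen ->* fclose": 12, "fread": 9})
    print(statistical_differences(freq_t, freq_r, threshold=5))  # {'fread'}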

Similarly, statistics about the distance between events matching temporal patterns can be used to enable patterns-based performance debugging.

We refer to this notion of similarity based on statistical properties of patterns as statistical similarity. While we have explored performance debugging based on this notion of similarity, we will not describe it in this manuscript.

Textual Comparison Given two traces, we can calculate the longest common subsequence (LCS) of these traces and identify events from the traces that are absent from the subsequence as the difference between the traces. This would be similar to serializing traces into text files (e.g. with one line per event) and using your favourite textual differencing tool. While this technique can detect deviations captured by missing events, it will be inaccurate in identifying deviations stemming from temporal orderings of events that are not part of the longest common subsequence. Further, it is unclear if and how event abstractions can be effectively and efficiently considered in LCS-based comparison.
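
One way to realize such a textual comparison, sketched below in Python (our illustration), is to serialize each event to one line and let difflib align the two serializations; the alignment is LCS-like rather than an exact LCS.

    import difflib

    def textual_diff(trace_t, trace_r):
        # Serialize traces one event per line and report the events that fall
        # outside the matched (common) blocks of the alignment.
        a = [repr(sorted(e.items())) for e in trace_t]
        b = [repr(sorted(e.items())) for e in trace_r]
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        only_in_t, only_in_r = [], []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("delete", "replace"):
                only_in_t.extend(a[i1:i2])
            if tag in ("insert", "replace"):
                only_in_r.extend(b[j1:j2])
        return only_in_t, only_in_r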

Graphical Comparison Given a trace, we can generate a succession graph in which nodes represent unique events in the trace and directed edges capture the immediate successor relation between events. Hence, given two traces, we can generate and compare their succession graphs to detect missing nodes and edges as the difference between these graphs. (This would be identical to diffing the sets of all bigrams in the given traces.) While this technique can detect issues captured by missing nodes (events) and edges (bigrams), it will fail to detect issues that can only be captured by event subsequences, e.g. the technique will detect the presence of (a, b) and (b, c) in trace (a, b, c) but will fail to detect the presence of the subsequence (a, c). However, this can be remedied by considering graph reachability information. Even so, as with textual comparison, it is unclear how to perform graph diffing with event abstractions.
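
A minimal sketch of this succession-graph (bigram) comparison in Python (our illustration):

    def succession_graph(trace):
        # Nodes are unique events; edges capture the immediate successor relation.
        nodes = [frozenset(e.items()) for e in trace]
        return set(nodes), set(zip(nodes, nodes[1:]))

    def graphical_diff(trace_t, trace_r):
        # Nodes and edges present in one succession graph but not the other.
        # Note: as discussed above, diffing bigrams misses subsequences such as
        # (a, c) in the trace (a, b, c).
        nodes_t, edges_t = succession_graph(trace_t)
        nodes_r, edges_r = succession_graph(trace_r)
        return nodes_t ^ nodes_r, edges_t ^ edges_r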

4 USB Driver Compatibility Testing in Windows 8: An Experiment

In this section, we describe our experience using patterns-based trace comparison to devise an approach to test compatibility between the USB 3.0 driver stack and the USB 2.0 driver stack in Windows 8. This was a joint effort with the USB team in the Windows organization within Microsoft.

We describe this experience in detail to present the nuances involved in using patterns-based trace comparison for compatibility testing. With these details, other researchers and practitioners should be able to easily reproduce our experience in other contexts.

4.1 Windows Driver Subsystem

Most of the kernel-mode device drivers on Windows Vista and Windows 7 conform to the Windows Driver


[Figure 1: An illustration of how various types of WDM drivers are stacked when a USB device is plugged into a USB 3.0 port on a Windows 8 PC. The stack consists of a function driver (e.g. a mouse driver), a filter driver, the USB 3.0 bus driver, the USB 3.0 controller, and the USB device (e.g. a mouse); the numbered points 3, 2, and 1 mark the function-driver/bus-driver, bus-driver/controller, and controller/device interaction points, respectively.]

Model (WDM). This model supports the following kinds of drivers.7 (Please refer to Figure 1 for an illustration of how various kinds of WDM drivers are connected.)

• A bus driver services devices that can have child devices, e.g. bus controllers, adapters, and bridges. As these are necessary drivers, Microsoft generally provides these drivers for each type of bus, e.g. USB and PCI. All communication to devices on a specific bus goes through the corresponding bus driver.

• A filter driver extends the functionality of a device or intercepts and possibly modifies I/O requests and responses from drivers.

• A function driver exposes the operational interface of a device to the system, e.g. the device driver provided with the Microsoft Comfort Curve 3000 keyboard.

In WDM, most communication with and between drivers is packet-based. Typically, an I/O request is dispatched to a driver by invoking the IoCallDriver

7We use the terms driver(s) and device driver(s) interchangeably.

routine with an I/O Request Packet (IRP) (a structure in the C language) embodying the request. The IRP is delivered to the I/O manager, which then forwards the request to the appropriate driver. Upon completing a request, the servicing driver modifies the corresponding IRP (e.g. updates status fields or copies data into buffers in the IRP) and signals the completion of the request to the I/O manager by invoking IoCompleteRequest. The I/O manager then signals the requesting driver about the completion by invoking the IoCompletion routine registered for the IRP. The fields of the IRP both define the type of requests and the data (both input and output) pertaining to requests.

4.2 Problem

As mentioned in the Building Windows 8 blog [2], Windows 8 supports the USB 3.0 protocol with a new USB 3.0 driver stack that provides a bus driver dedicated to the USB 3.0 controller. Since the USB 3.0 driver stack is a clean-room implementation, it does not borrow any code and, hence, any behavior from the existing USB 2.0 driver stack in Windows 8. Further, the USB 3.0 driver stack exclusively supports devices controlled by the USB 3.0 controller (connected to a USB 3.0 port) while the USB 2.0 driver stack exclusively supports devices controlled by the USB 2.0 controller (connected to a USB 2.0 port).

In the rest of this exposition, we focus on the USB bus drivers provided by the USB driver stacks as they control the underlying USB controller. Also, we shall refer to a USB bus driver as a USB driver.

Consider the situation where a user plugs a USB 2.0 device into a USB 3.0 port on a computer running Windows 8. Since the USB 3.0 protocol is backward compatible with the USB 2.0 protocol, the user expects the device to behave as if the device was plugged into a USB 2.0 port and serviced by the USB 2.0 driver. In other words, the observable behavior of a device plugged into a USB 3.0 port should be identical to the observable behavior of the same device plugged into a USB 2.0 port.

To enable the above scenario, the USB 3.0 driver needs to support the USB 2.0 protocol to guarantee behavioral equivalence with the USB 2.0 driver. However, it is possible


that existing function drivers could depend on unpublished yet observable behaviors of the USB 2.0 driver, e.g. the USB 2.0 driver zeroes out the PortStatus bits in the IRP upon failing to service the I/O control code IOCTL_INTERNAL_USB_GET_PORT_STATUS.8 Hence, for compatibility, the USB 3.0 driver should support any unpublished yet observable behaviors of the USB 2.0 driver, and we need to test these USB drivers for equivalence of such behaviors.

A naive approach to test for such equivalence is to exercise the USB 3.0 driver with every device and its function driver. However, this approach is prohibitive as there are over 10 billion USB devices in the world today.

To identify an alternative, we observed that every function driver exposes the functionality of a device to the system by interacting with the device via the bus driver. So, it is likely that any deviation in interactions between a function driver and the bus driver could lead to deviations in the observable behavior of the device. Hence, we tested compatibility between USB drivers by testing for equivalence of interactions between function drivers and the USB drivers (at point 3 in Figure 1), i.e. for every request from a function driver, is the response from the USB 3.0 driver similar to the response from the USB 2.0 driver?9

8Such behaviors can stem from decisions made while implementing weakly specified parts of the USB 2.0 protocol.

9Here are two alternative approaches to test compatibility of USB drivers.

Approach 1 Check equivalence of on-the-wire interactions between the USB controller and the USB device (at point 1 in Figure 1). While this form of checking can be highly accurate, it can be brittle due to controller-specific nuances stemming from weakly specified parts of the USB protocol. Also, since the USB driver does not have direct control over these interactions, it is unclear if this approach will help uncover compatibility issues directly stemming from the implementation of the USB driver.

Approach 2 Check for equivalence of command-level interactions between the USB driver and the USB controller (at point 2 in Figure 1). This form of checking can be brittle due to nuances stemming from the combination of the flexibility of the USB command language and the implementation of both the USB controller and the USB driver.

Independent of these details, similar interactions between the bus driver, the controller, and the device do not guarantee similar interactions between the function driver and the bus driver. Hence, we chose to test the interactions between the function driver and the bus driver, at the highest level in the stack.

For simplicity, we limited our focus to testing compatibility between USB drivers during the enumeration and rundown phases of the lifetime of a device on Windows, i.e. recognition of a device by Windows and cleanup following the ejection of a device, respectively.

4.3 Solution

Our solution to this problem uses patterns-based trace comparison (described in Section 3) with the workflow outlined in Figure 2. In the following sections, we describe various steps in this workflow.

4.3.1 Trace interactions between drivers

Given a USB 2.0 device, we enabled tracing, plugged the device into a USB 2.0 port, waited for the device to be recognized by Windows, ejected the device, waited for the device to be unavailable in Windows, and disabled tracing. We then repeated these steps with the same device plugged into a USB 3.0 port.

For tracing, we used a customized filter driver (developed by the USB team) to capture the interactions between function drivers and USB drivers and log these interactions into ETW traces via Event Tracing for Windows (ETW) [1].

4.3.2 Mine patterns from traces

The traces collected in the previous step contained 13 simple types of events. Of these simple event types, a few captured the invocations of driver routines (e.g. IoCallDriver) that enable inter-driver communication, a few captured the completions of driver routines, and others captured the arguments to these routines. Hence, we combined these 13 simple event types into 6 compound event types that represent the invocation of driver routines along with their arguments and the completion of driver routines with their return values. Consequently, we preprocessed the traces to coalesce simple events into compound events. During preprocessing, to ease further processing, we synthesized certain information (e.g. I/O control codes (IOCTLs)) and captured them in a few (< 5) synthesized event attributes. Since the events


in ETW are structured, we also flattened access paths to fields. After preprocessing and flattening of access paths, a total of 361 attributes existed (including a few synthesized attributes) across the 6 compound event types.

To curb the explosion of structural patterns during mining, we employed domain knowledge by consulting a developer from the USB team. Specifically, out of the 361 attributes, we identified 108 attributes that could be ignored. Of the remaining 253 attributes, we identified 29 attributes as necessary (i.e. they should occur in all structural patterns of an event) and 224 attributes as optional (i.e. not necessary). Also, we identified 75 attributes that should not be quantified; of these, 23 attributes were identified to be abstracted as either NULL or non-NULL.

In terms of quantification, we identified 150 attributes that should always be quantified. For example, since we were interested in checking if similar IRPs are processed similarly by both driver stacks, we did not have to mine patterns involving events spanning different IRPs; hence, the irpId attribute that uniquely identifies the source IRP of an event was always quantified. Further, when quantification of attributes is used to capture data flow between events participating in a temporal pattern, non-existent data flow can be captured due to representational equivalence as opposed to semantic equivalence. To curb this noise, based on domain knowledge and by way of configuration, we considered only data flows between 17 attribute pairs involving 26 different attributes, in addition to data flow between the same attribute occurring in different events.

With the above setup, we mined every structural and temporal pattern (of the forms described in Section 3) occurring in the collected traces. In unquantified form, we considered only patterns involving necessary attributes. In quantified form, we considered only patterns involving all necessary attributes and up to three optional attributes.

For mining, we used Tark, a toolkit to mine the patterns described in Section 3 [3]. Since Tark implements the pattern mining algorithms described in [17], please refer to [17] for details about these mining algorithms.

Note While using domain knowledge does incur additional cost, this is a one-time cost incurred when the approach is tailored and deployed for a specific context. Further, the benefits of using domain knowledge can be huge, e.g. knowledge about the fields that can be ignored reduced the ceiling on the number of structural patterns from 2^361 to 2^253.

4.3.3 Calculate unique patterns (deviations)

For each USB 3.0 trace, we calculated two sets of unique patterns (or deviations) based on all USB 2.0 traces in our corpus.10

The first set was composed of unique USB 2.0 patterns observed in every USB 2.0 trace but not observed in the given USB 3.0 trace, i.e. the difference between the intersection of the pattern sets of every USB 2.0 trace and the pattern set of the given USB 3.0 trace. These patterns identify (always) observable behaviors of the USB 2.0 driver that were not exhibited by the USB 3.0 driver.

The second set was composed of unique USB 3.0 patterns observed in the given USB 3.0 trace but in none of the USB 2.0 traces, i.e. the difference between the pattern set of the given USB 3.0 trace and the union of the pattern sets of every USB 2.0 trace. These patterns identify extraneous observable behaviors exhibited only by the USB 3.0 driver.

When executed with the USB 3.0 driver, a function driver can fail due to both kinds of patterns: a driver dependent on unique USB 2.0 patterns could fail due to the absence of a pattern while a driver not capable of handling unique USB 3.0 patterns could fail due to the presence of a pattern.
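
The set arithmetic of this step is simple; a minimal Python sketch (our illustration; the patterns themselves come from the mining step) follows.

    def deviations(usb3_patterns, usb2_pattern_sets):
        # usb3_patterns: patterns mined from one USB 3.0 trace.
        # usb2_pattern_sets: one pattern set per USB 2.0 trace in the corpus.
        always_in_usb2 = set.intersection(*usb2_pattern_sets)
        ever_in_usb2 = set.union(*usb2_pattern_sets)
        missing_usb2_behaviors = always_in_usb2 - usb3_patterns
        extraneous_usb3_behaviors = usb3_patterns - ever_in_usb2
        return missing_usb2_behaviors, extraneous_usb3_behaviors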

4.3.4 Shrink unique pattern sets

Given that the number of patterns mined from each trace was huge (as shown in Table 2), we employed the following techniques to shrink the sets of patterns by eliminating redundant patterns.

Partitioning When a trace contains a unique structural pattern, the trace will also contain numerous unique temporal patterns that involve this unique

10We did this to reduce the number of false positives.


structural pattern. From the perspective of detecting unique deviations, such temporal patterns do not identify deviations that are different from the deviations detected by the contained unique structural pattern. Hence, we removed such temporal patterns from the pattern sets.

Simplification By definition of the temporal patterns in Section 3, the presence of an alternation pattern (e.g. A →a B) in a trace implies the presence of the corresponding eventually pattern (e.g. A →* B) in the trace. Hence, we removed eventually temporal patterns from a pattern set if their alternation counterparts were present in the pattern set.

In a similar vein, by way of construction, the existence of a complex pattern (e.g. A ∧ B →* C) implies the existence of its simpler constituent patterns (e.g. A →* C and B →* C). Hence, either complex or simple patterns can be presented without any loss of information. Favoring simplicity, we removed complex patterns from a pattern set if all of their simpler constituent patterns were present in the pattern set.

Compaction If temporal patterns of the form A →a B and B →a A were present in a pattern set, we replaced them with a single temporal pattern of the form A ↔a B with the meaning “an event matching A will be followed by an event matching B with no intervening events that match A or B.”
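
A minimal Python sketch of the simplification and compaction steps (our illustration, assuming temporal patterns are represented as (kind, A, B) tuples with comparable A and B, which is not the report's actual representation):

    def shrink(patterns):
        # patterns: set of (kind, A, B) tuples with kind in {"eventually", "alternation"}.
        shrunk = set(patterns)
        # Simplification: drop an eventually pattern whose alternation counterpart exists.
        for kind, a, b in patterns:
            if kind == "eventually" and ("alternation", a, b) in patterns:
                shrunk.discard((kind, a, b))
        # Compaction: merge A ->a B and B ->a A into a single bidirectional pattern.
        for kind, a, b in list(shrunk):
            if kind == "alternation" and ("alternation", b, a) in shrunk and a <= b:
                shrunk.difference_update({("alternation", a, b), ("alternation", b, a)})
                shrunk.add(("bi-alternation", a, b))
        return shrunk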

4.3.5 Report deviations to the developer

As the final step, for each device, a developer from the USB team examined the resulting unique patterns and classified them as either benign deviations or bugs. All bugs were entered into the Windows bug repository for triaging purposes.

Reducing False Positives To curtail false positives (due to recurring benign deviations), we applied user-defined filters (e.g. ignore patterns in which the IOCTLType field is equal to URB_FUNCTION_SELECT_CONFIGURATION). These filters were often based on patterns observed while examining previous test results. The user-defined filters were saved and reused while examining results from subsequent tests. In addition, we suppressed patterns that were observed in previous tests, as they had already been classified as either benign deviations or bugs. This is depicted by the dashed line in Figure 2. We refer to these patterns as known patterns.
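
A minimal sketch of this suppression step (our illustration; the filter shown is hypothetical and assumes patterns are represented as sets of attribute-value pairs, as in the earlier sketches):

    def report_deviations(deviations, known_patterns, user_filters):
        # Drop known patterns and anything matched by a user-defined filter.
        fresh = deviations - known_patterns
        return {p for p in fresh if not any(f(p) for f in user_filters)}

    # Hypothetical user-defined filter: ignore patterns whose IOCTLType attribute
    # equals URB_FUNCTION_SELECT_CONFIGURATION.
    ignore_select_configuration = (
        lambda p: ("IOCTLType", "URB_FUNCTION_SELECT_CONFIGURATION") in p)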

Aiding Diagnosis For all unique patterns, we identified the events that matched unique patterns along with the events that did not match unique temporal patterns. The developer then started diagnosing an issue at these events.

4.4 Evaluation

For this evaluation, we collected a pair of traces, one with the USB 2.0 driver and another with the USB 3.0 driver, from each of 14 different USB 2.0 devices and compared these traces as described in the previous section as a test of compatibility.

Based on this data set, we evaluated the effectiveness, precision, and cost of our approach. Table 1 provides the breakdown of deviations and bugs uncovered in this evaluation. Table 2 provides the breakdown of mined patterns and costs of testing as observed in this evaluation.

4.4.1 Effectiveness

In this data set, a developer from the USB team identified 25 deviations as representing compatibility bugs between the USB 2.0 driver and the USB 3.0 driver. Of these 25 bugs, 14 bugs were based on unique structural patterns and 11 bugs were based on unique temporal patterns (involving only common structural patterns). Following is a description of a few of the detected bugs.

• When the USB 2.0 driver fails to service an IOCTL_INTERNAL_USB_GET_PORT_STATUS request, the driver zeroes out the bits of the PortStatus field. However, the USB 3.0 driver does not zero out these


bits. This bug was based on a unique structural pattern.

• Upon completing an isochronous transfer request, the USB 3.0 driver set the status field of the isochronous packet to 0xFFFFFFFF. However, this was not the case with the USB 2.0 driver. As in the previous case, this bug was based on a unique structural pattern.

• The USB 2.0 driver completed isochronous transfer requests at DISPATCH_LEVEL interrupt request level. However, the USB 3.0 driver completed similar requests at PASSIVE_LEVEL interrupt request level. This bug was based on a unique structural pattern.

• A USB device can have multiple operational configurations along with corresponding interfaces, and one of the configurations is selected while enumerating the device. When an I/O request to select a configuration for a device was submitted, the USB 3.0 driver failed to communicate the corresponding interface in its response. This deviation was based on a unique temporal pattern with the interfaceHandle field (attribute) remaining unchanged across the two events supporting the pattern.

• When a USB device is not in use, its function driver can notify the USB driver that the device is idle and can be suspended or put in a low-power state. Upon completing an I/O request corresponding to such a notification, the USB 3.0 driver did not change the PendingReturned field in the IRP. This deviation was based on a unique temporal pattern.

[Figure 2: Work flow to perform compatibility testing using patterns-based trace comparison. Traces are captured with the USB 2.0 and USB 3.0 drivers; patterns are mined from each trace (guided by domain knowledge); the two pattern sets are differenced; the difference is shrunk and filtered against known patterns and user-defined filters; and the filtered deviations are examined by the developer, who classifies them as bug patterns or benign patterns (which feed back as known patterns).]


4.4.2 Precision

Our solution started out with a high number of false positives: out of 478 deviations reported for device 1, 465 deviations were false positives (see the False +ve and Reported columns in Table 1). So, as we tested more devices, we collected and saved false positives to filter them out from subsequent test results (as described in Section 4.3.5). Consequently, the number of false positives dropped to less than 100 in the tests corresponding to devices 2 through 14 and to less than 10 in 8 out of those 13 tests. Hence, we conjecture that the false positives reported by our approach will decrease as the number of devices used for compatibility testing increases.

In addition, we collected a total of 7 (= 2 + 3 + 2) user-defined filters while testing with devices 2, 5, and 14. In retrospect, a few of these filters could have been injected as domain knowledge during pattern mining.

In terms of curtailing the number of deviations presented to the developer, simplification helped reduce the number of detected deviations by at least a factor of 10. Similarly, compaction helped reduce the number of simplified deviations by a factor of 2. (See the Simplified and Compacted columns in Table 1.)

Revisiting the issue of the number of false positives, let us consider the cost of compatibility testing. Observe that the bugs were detected from traces of devices that functioned without errors with both bus drivers. Instead, if we wanted to detect the same bugs by observing devices failing due to these bugs, then we would need to test both USB bus drivers with every USB device in the world. This would amount to testing with a fraction of the more than 10 billion USB devices in existence! In contrast, with our approach, the developer spent less than 2 hours in many cases to examine a non-empty set of deviations resulting from a test (device); in very few cases, the developer spent up to a day to examine a set of deviations. So, comparing the cost of testing with every unique USB device to the cost of the developer spending 2 hours to sift through less than 100 false positives per test, the number of false positives is insignificant.

4.4.3 Cost

While the cost of capturing a pair of traces for a device was on the order of a few minutes, the time to mine quantified patterns from these traces (ranging from 200K to 500K patterns per trace) varied from 10 minutes to 100 minutes depending on the length of the trace and the average number of attributes per event (see the Time and Patterns columns in Table 2). Considering only the time taken to difference pattern sets and to simplify the difference, the cost for automatically detecting deviations from a pair of traces was 2-3 minutes for unquantified patterns (see the Diff Time columns in Table 2). However, the cost of detecting deviations based on quantified patterns was 5-12 minutes for most traces with a few exceptions of 15, 20, and 45 minutes.

Given that our approach is automated and there are no alternative approaches/techniques to detect deviations that do not affect the device under test, we believe the cost of the approach is reasonable.


[Table 1: Breakdown of deviations detected by our approach (in order). For a device/test, each row provides the number of known, detected, simplified, compacted, and reported deviations, the number of false positives, and the numbers of structural and temporal bugs. X+Y denotes X structural patterns and Y temporal patterns. A/B denotes that A bugs were unique out of B bugs. Devices/tests at which new filters were collected are marked with (*). Known deviations are all simplified deviations identified as differences in previous tests.]


4.4.4 Limitations

As demonstrated, the approach can only detect deviations that can be represented as structural patterns or binary linear temporal patterns, in both unquantified and quantified forms. However, this limitation can be addressed by mining and using richer patterns, e.g. longer temporal patterns.

4.5 Threats To Validity

While the above results are promising, they are based on a single experiment. Further, we neither considered nor controlled the effects of various latent factors on the experiment. A few such latent factors are the size of the interface (e.g. number of functions), data flow properties at the interface, value domains of the data consumed/produced by the interface, and the (non-)existence of a well-defined protocol governing the observable behavior at the interface. Also, from the available data, it is hard to discern if and to what extent the effectiveness of the approach depended on the domain knowledge. More experimentation will help (in)validate these threats.

5 Related Work

The number of efforts in the space of software testing is huge (too many to fairly cite even a few here), to the extent that there are venues dedicated to software testing, e.g. ICST [4], ISSTA [5], TAP [6]. Most of these efforts rely on software tests that embody well-defined expected outcomes to automatically provide conclusive test verdicts. In comparison, our approach relies on execution traces of both the system under test and the reference implementation to automatically detect a class of behavioral (and possibly performance) deviations. In other words, our approach automatically hoists observable behaviors of the reference implementation as well-defined expected outcomes.

In terms of using event sequence patterns as a core idea to enable software testing, Dallmeier et al. [10] used method call sequences as trace features to predict and localize defects by comparing traces from passing and failing executions of test cases. However, they employed method call sequences based on n-grams [18] while we used patterns spanning across events. In terms of fault localization, Dallmeier et al. use differing method call sequences to identify and rank likely defective classes while we present events that match deviating patterns as likely symptoms.

Recently, Beschastnikh et al. [9] used unquantified temporal invariants/patterns satisfied by logs to automatically construct a graph model of a system. In comparison, we have considered both unquantified and quantified variants of temporal patterns to model behaviors captured in traces.

Beyond software testing, there have been numerous efforts [14, 15, 20, 21, 22, 23, 25] to mine various features from system logs and then monitor live systems for the absence of these features. Similar to the software testing efforts, a few of these efforts have used n-grams and state machines as trace features while other efforts have employed features based on statistical properties of traces such as relative frequency and correlation of events. Ignoring the specific classes of patterns and the corresponding mining techniques, our effort is similar to these efforts in terms of using patterns and pattern mining techniques as features and feature extraction techniques, respectively.

In a similar vein, Barringer et al. used human-specified quantified (parameterized) temporal patterns with logs to enable postmortem runtime verification of flight software for NASA's recent Mars rover mission [8]. In comparison, we use automatically mined quantified temporal patterns to detect both the presence of new behaviors and the absence of previously existing behaviors when testing (compatibility of) programs.

In terms of trace comparison, the data mining community has proposed and used various edit distances (e.g. Hamming, Levenshtein) and sequence-based patterns to compare sequences [11]. Miranskyy et al. [19] identified differences between traces by iteratively differencing various inter-event abstractions of traces (e.g. set of function calls, caller-callee relation, sequence of function calls). In comparison, our approach is similar to these efforts, with the difference being the choice of the class of structural and temporal patterns [17] used to abstract and compare traces.


While we were pursuing this effort, Yang and Evans [24] also explored the use of temporal patterns (properties) observed in program logs to identify behavioral differences between programs. Besides the operational differences in terms of the underlying mining algorithms and related cost-precision trade-offs, they mined nine types of unquantified patterns to characterize the traces in their evaluation while we used only four types of quantified patterns. In addition, we also devised a simple yet efficient feedback-based work flow to effectively deal with false positives.

6 Future Work

Traditional regression testing relies on the passing/failing of existing tests. It does not detect behavioral deviations that do not affect the outcome of tests. However, with execution traces of regression tests, our technique can be applied to detect such behavioral deviations. Also, in the case of failing executions, our approach can be used to detect deviations between passing and failing executions to aid fault localization and debugging of failures.

Given the effectiveness of the proposed approach to test compatibility between USB bus drivers, we conjecture that the approach can be used in the context of other kinds of software, e.g. desktop and web applications, and libraries (similar to [10]). Consequently, it will be interesting to explore the generality of the approach. Also, it will be interesting to identify the characteristics of the software under test that make the approach effective, e.g. will the approach be effective only when the interface-level interaction with the software under test can be described by a protocol?

For USB compatibility testing, we used a notion of trace similarity based on all temporal patterns satisfied by traces (while considering limited sorts of event abstractions). While this notion sufficed for short USB traces (most often less than 10,000 events), it might not suffice for longer traces, e.g. traces with more than 100,000 events. Depending on the application scenario, the cost of mining the patterns required by this notion of trace similarity can be prohibitively high. In such cases, the cost can be reduced by adopting various constraints (e.g. limiting the number of event abstractions) to identify a small number of patterns (e.g. only frequent patterns) as opposed to all patterns. Consequently, we will need new notions of trace similarity that are sensitive to these constraints and are helpful in reasoning about traces.

In this effort, an equality constraint was used to relate attributes and values while defining events and event abstractions. However, many scenarios may benefit from a richer set of constraints (e.g. <, ≤) in terms of characterizing the traces both succinctly and accurately using patterns. Such scenarios can be enabled by extending the definition of event abstractions described in Section 3 and [17] with richer constraints. One of the key challenges with such extensions will be dealing with the explosion of event abstractions.

The proposed approach relies on sets of binary linear temporal patterns. However, it could be trivially modified to use sets of longer linear temporal patterns or sets of small finite state machines that collectively describe the event patterns in traces.11 In such a setting, it would be interesting to explore state machine mining techniques (similar to [7]) that can admit event abstractions as described in Section 3.

As more and more systems embrace service-oriented architectures and operations engineers rely on log analysis as a primary tool to reason about the runtime behavior of such systems, the need for online techniques for anomaly detection will increase. While we have explored an approach to off-line anomaly detection by comparing sets of patterns mined from traces, it would be interesting to extend the proposed approach and explore new approaches for on-line anomaly detection.

Moving beyond anomaly detection, the sets of structural patterns and linear temporal patterns satisfied by traces can be used to detect similarity between traces and answer clustering and classification questions: Can a set of traces be partitioned into subsets of similar traces? Can a trace be classified as belonging to a specific set of a partition of traces? These questions can be trivially answered by using patterns observed in traces as features and employing existing clustering

11N-ary linear temporal patterns can be easily constructed by combining m-ary linear temporal patterns under certain rules of combination and then checked against traces to calculate their statistical properties.


algorithms such as hierarchical clustering or k-means clustering [11, 12]. At this time, we are exploring this direction to enrich recommendation systems by using logs and to improve test prioritization/selection; thus far, the initial results have been promising.

7 Summary & Conclusion

We have proposed an approach to test compatibility between two programs with identical published interfaces in terms of their observable behaviors at these interfaces: we test for both the absence of previously observed behaviors and the presence of previously unobserved behaviors. The approach abstracts traces of observable behaviors as sets of patterns, and these sets are compared for differences. Existing pattern mining techniques are used to abstract traces into sets of structural patterns and linear temporal patterns. We have demonstrated the effectiveness of the approach by using it to test compatibility between the USB 3.0 and USB 2.0 bus drivers in Windows 8 and identifying 25 compatibility bugs.

Based on this experience, we believe that structural and temporal patterns-based trace abstraction is an invaluable tool to tackle common yet relevant software development and maintenance problems involving trace/log data.

Acknowledgments

We thank Randy Aull, Robbie Harris, Jane Lawrence, and Eliyas Yakub from the Windows USB team in Microsoft for their help and invaluable feedback during this effort. We thank Tom Ball and Peter Shier for initiating this effort. Lastly, we thank Ankush Desai and Jithin Thomas for their feedback on the manuscript. During this effort, Pradip Vallathol was a developer at Microsoft Research, India.

References

[1] Event tracing for Windows. http://msdn.microsoft.com/en-us/library/windows/desktop/bb968803(v=vs.85).aspx, 2000.

[2] Building robust USB 3.0 support. http://blogs.msdn.com/b/b8/archive/2011/08/22/building-robust-usb-3-0-support.aspx, 2011.

[3] Tark: Mining linear temporal rules. http://research.microsoft.com/en-us/projects/tark/, 2011.

[4] International Conference on Software Testing, Verification and Validation. http://icst2012.soccerlab.polymtl.ca/, 2012.

[5] International Symposium on Software Testing and Analysis. http://crisys.cs.umn.edu/issta2012/, 2012.

[6] Tests and Proofs. http://lifc.univ-fcomte.fr/tap2012/, 2012.

[7] Glenn Ammons, Rastislav Bodík, and James R. Larus. Mining specifications. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2002.

[8] Howard Barringer, Alex Groce, Klaus Havelund, and Margaret Smith. Formal analysis of log files. Journal of Aerospace Computing, Information, and Communication, 7, 2010.

[9] Ivan Beschastnikh, Yuriy Brun, Sigurd Schneider, Michael Sloan, and Michael D. Ernst. Leveraging existing instrumentation to automatically infer invariant-constrained models. In ESEC/FSE'11: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011.

[10] Valentin Dallmeier, Christian Lindig, and Andreas Zeller. Lightweight defect localization for Java. In Proceedings of the 19th European Conference on Object-Oriented Programming, 2005.

[11] Guozhu Dong and Jian Pei. Sequence Data Mining. Springer Science+Business Media, LLC, 2007.

15

Page 17: Compatibility Testing via Patterns-Based Trace Comparison€¦ · 1The notion of published interfaces was introduced by Mar-tin Fowler [13]. Beyond such semantic changes, programs

[12] Richard O. Duda, Peter E. Hart, and David G.Stork. Pattern Classification. Wiley-Interscience, 2000.

[13] Martin Fowler. Public versus published inter-faces. IEEE Software, 19:18–19, 2002.

[14] Qiang Fu, Jian-Guang Lou, Yi Wang, and JiangLi. Execution anomaly detection in distributedsystems through unstructured log analysis. InProceedings of Ninth IEEE International Con-ference on Data Mining, 2009.

[15] Mark Gabel and Zhendong Su. Online inferenceand enforcement of temporal properties. In Pro-ceedings of ACM/IEEE 32nd International Con-ference on Software Engineering, 2010.

[16] Edmund M Clarke Jr., Orna Grumberg, andDoron A Peled. Model Checking. The MIT Press,1999.

[17] David Lo, Ganesan Ramalingam,Venkatesh Prasad Ranganath, and KapilVaswani. Mining quantified temporal rules:Formalism, algorithms, and evaluation. In Pro-ceedings of the 2009 16th Working Conferenceon Reverse Engineering, 2009.

[18] Christopher D. Manning and Hinrich Schuetze.Foundations of Statistical Natural LanguageProcessing. The MIT Press, 1999.

[19] A.V. Miranskyy, M. Davison, N.H. Madhavji,M. Wilding, M.S. Gittens, and D. Godwin. Aniterative, multi-level, and scalable approach tocomparing execution traces. In Proceedings ofthe the Sixth joint meeting of the European soft-ware engineering conference and the ACM SIG-SOFT symposium on The Foundations of Soft-ware Engineering (FSE), 2006.

[20] Adam J. Oliner, Alex Aiken, and Jon Stearley.Alert detection in system logs. In Proceedings ofEighth IEEE International Conference on DataMining (ICDM), 2008.

[21] Jiaqi Tan, Xinghao Pan, Soila Kavulya, RajeevGandhi, and Priya Narasimhan. Salsa: Ana-lyzing logs as state machines. In Proceedings of

First USENIX Workshop on the Analysis of Sys-tem Logs (WASL), 2008.

[22] Yi-Min Wang, Chad Verbowski, John Dunagan,Yu Chen, Helen J. Wang, Chun Yuan, and ZhengZhang. Strider: A black-box, state-based ap-proach to change and configuration managementand support. In Proceedings of the 17th USENIXconference on System administration, 2003.

[23] Wei Xu, Ling Huang, Armando Fox, David Pat-terson, and Michael Jordan. Detecting large-scale system problems by mining console logs.In Proceedings of the ACM SIGOPS 22nd sym-posium on Operating systems principles, 2009.

[24] Jinlin Yang and David Evans. Automatic infer-ence and effective application of temporal speci-fications. In David Lo, Siau-Cheng Khoo, JiaweiHan, and Chao Liu, editors, Mining SoftwareSpecifications: Methodologies and Applications,chapter 8. CRC Press, 2011.

[25] Chun Yuan, Ni Lao, Ji-Rong Wen, Jiwei Li,Zheng Zhang, Yi-Min Wang, and Wei-YingMa. Automated known problem diagnosiswith event traces. In Proceedings of the FirstACM SIGOPS/EuroSys European Conferenceon Computer Systems, 2006.

A Patterns and Log Analysis

While log analysis has been and continues to be a common tool used by system administrators and operations engineers, it is lately becoming a common tool for support engineers and even developers to diagnose errors in remote applications, e.g. a server running at a customer site.

Errors in server applications can be transient (e.g. they may depend on the workload), and the exact point in time at which an error occurred is hard to track. Further, the cause of an error may be temporally distant from the error itself; hence, the execution history of the application is required to diagnose such errors. Under these circumstances, traditional debugging would amount to running a good chunk of the application under the supervision of, and with interruptions from, a debugger. This can considerably hamper the performance of live systems (and possibly suppress the error).

As an alternative, support engineers work with customers to collect various logs from a failing system. Then, engineers analyze these logs to identify log entries corresponding to both symptoms and possible root causes of failures. Logs collected from live systems are often huge, and the task of analyzing such logs turns into “a search for the needle in the haystack”.12

All of the above observations, reasons, and approaches are equally applicable to diagnosing errors in cloud applications.

This situation can be addressed by leveraging machine learning techniques such as clustering and classification [12]. Specifically, given a partition of logs based on similar errors (causes), classification can be used to detect likely errors (causes) in a new log by classifying the log as belonging to a set of the partition. Given a set of unpartitioned logs, clustering can be used to identify a partition of the logs based on similar errors (causes). As with compatibility testing, trace comparison can be used to detect anomalies by comparing logs from successful executions to logs from failing executions.

To enable the above solutions, features need to be extracted from logs. Most often, events along with their statistical properties, both in isolation (e.g. presence/absence, frequency) and in combination with other events (e.g. relative frequency, correlation), can be used as features. In the same vein, structural patterns along with their statistical properties, as described in Section 3.1, can be used as features of logs.
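As a hedged illustration (the event names and the log layout below are hypothetical, not drawn from our traces), such order-independent features can be extracted from a log viewed as a sequence of event names:

    # Sketch: derive order-independent features from a log of event names.
    from collections import Counter
    from itertools import combinations

    log = ["OpenDevice", "ReadRequest", "ReadComplete",
           "ReadRequest", "ReadComplete", "CloseDevice"]

    counts = Counter(log)
    features = {}
    for event, n in counts.items():
        features["present(" + event + ")"] = 1           # presence/absence
        features["freq(" + event + ")"] = n / len(log)   # frequency

    # Relative frequency of one event with respect to another, a crude
    # stand-in for correlation between events.
    for a, b in combinations(sorted(counts), 2):
        features["relfreq(" + a + "," + b + ")"] = counts[a] / counts[b]

    print(features)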

While features independent of the ordering of events are cheaper to extract and easier to understand, they most often have lower discriminatory power than features involving the ordering of events. As the accuracy of machine learning techniques depends on the discriminatory power of the features, in the context of log analysis it would be better to consider features that are based on the ordering of events and are still easy to comprehend. In this vein, linear temporal patterns along with their statistical properties, as described in Section 3.2, can be used as features of logs.

12 Of course, we are assuming that the engineer knows the needle he/she is looking for. In many cases, this is not true.

By representing logs as sets of patterns, distance measures such as the Jaccard distance [12] can be used to measure distances between logs (and between sets of logs) for the purpose of clustering and classifying logs.
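For instance, assuming each log has already been abstracted into a set of patterns (the pattern names below are made up for illustration), the Jaccard distance between two logs reduces to elementary set operations:

    # Sketch: Jaccard distance between two logs abstracted as sets of patterns.
    def jaccard_distance(patterns_a, patterns_b):
        # distance = 1 - |intersection| / |union|;
        # 0 for identical sets, 1 for disjoint sets.
        union = patterns_a | patterns_b
        if not union:
            return 0.0
        return 1.0 - len(patterns_a & patterns_b) / len(union)

    log1 = {"open->close", "read->complete", "suspend->resume"}
    log2 = {"open->close", "read->complete", "reset->open"}

    print(jaccard_distance(log1, log2))  # 0.5: two shared patterns out of four overall

One simple way to extend the measure from individual logs to sets of logs is to take the union of the pattern sets of the logs in each set before computing the distance.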

In short, both the structural patterns and the linear temporal patterns used in this exposition can be used to enable automated log analysis via traditional machine learning techniques such as classification and clustering.


[Table 2: Data about mining and diffing traces in our experiment. Each row provides data for both a USB 2.0 trace and a USB 3.0 trace of a device. For each trace, its length (in events), mining times (in seconds), the number of mined patterns of various sorts, and pattern differencing times (in seconds) are provided. The largest number in each column is italicized.]