130919 jim cordy - when is a clone not a clone


Software clone, detection, empirical studies, validation


School of Computing

Kingston, Canada

Contextualized Analysis of Web Services

James R. Cordy

David B. Skillicorn

Douglas Martin

Scott Grant

When is a Clone not a Clone? (and vice-versa)

Motivation

• The Personal Web
  • Rapidly growing number of web services makes it increasingly difficult to find and choose the right ones
  • Need a quick and convenient way to find alternatives
  • Hand tagging impractical – automation is needed!

Motivation

• Automation
  • Similarity detection techniques offer solutions!
  • Code clone detection from software engineering research can find similar code fragments – why not similar services?
  • Topic models from data mining research can find text documents with similar semantics – why not similar services?

Web Service Similarity

• Web services are stored in service registries, containing WSDL service description files
• Could apply clone detection to entire service descriptions
• But what we really want are similar service operations

Let’s try it!

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<complexType name="Stock">
  <sequence>
    <element name="Supplier" type="xsd:string"/>
    <element name="Warehouse" type="xsd:string"/>
    <element name="OnHand" type="xsd:string"/>
    <element name="OnOrder" type="xsd:string"/>
    <element name="Demand" type="xsd:string"/>
  </sequence>
</complexType>

<complexType name="Stock">
  <sequence>
    <element name="date" type="xsd:string"/>
    <element name="open" type="xsd:float"/>
    <element name="high" type="xsd:float"/>
    <element name="low" type="xsd:float"/>
    <element name="close" type="xsd:float"/>
    <element name="volume" type="xsd:float"/>
  </sequence>
</complexType>

How about these?

<operation name="DrawRateChartCustom">
  <input message="DrawRateChartCustomIn"/>
  <output message="DrawRateChartCustomOut"/>
</operation>

<operation name="GetTopicBinaryChartCustom">
  <input message="GetTopicBinaryChartCustomSoapIn"/>
  <output message="GetTopicBinaryChartCustomSoapOut"/>
</operation>

So what went wrong?

• At this point we thought maybe our idea wasn’t going to work
• Maybe clone detection can’t help with web service discovery?
• But why? What’s so special about WSDL?

Web Service Description Language (WSDL)

• A WSDL service description has 3 main parts:
  • a <portType> element where the operations are declared;
  • <message> elements corresponding to inputs, outputs and faults of the operations;
  • and a <types> element containing an XML Schema that defines the data and structure types used in the messages

Web Service Description Language (WSDL)

• This simple example service has two operations:
  • ReserveRoom
  • GetAvailableRooms

Web Service Description Language (WSDL)

• WSDL service description files contain descriptions of the operations that a web service has to offer
• But the pieces of each operation’s own description are scattered over different parts of the WSDL file
• Difficult to identify complete units to analyze and compare

The Problem

• This poses a problem for analysis techniques:
  • Operations cannot easily be compared for similarity using clone detectors, because there are no contiguous fragments to compare
  • And they cannot be analyzed using data mining topic models, because there are no separate complete documents to generate a model from

Our Solution

• Our solution is to contextualize the original <operation> elements, to create self-contained operation descriptions
  • We use source transformation to inline remote information from the context into the elements that reference or depend on them
• We call these contextualized WSDL operations Web Service Cells, or WSCells
  • The first example of a new kind of clone detection: contextual clones

Contextualizing WSDL Operations
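A minimal Python sketch of the inlining idea, not the authors' actual source transformation. It assumes a simplified, namespace-free WSDL in which message parts refer to schema types through a `type` attribute. The names `SAMPLE_WSDL`, `local_name` and `contextualize` are hypothetical, introduced only for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, simplified WSDL: no XML namespaces, no <binding> section.
SAMPLE_WSDL = """
<definitions name="StockService">
  <types>
    <schema>
      <complexType name="Stock">
        <sequence>
          <element name="date" type="xsd:string"/>
          <element name="close" type="xsd:float"/>
        </sequence>
      </complexType>
    </schema>
  </types>
  <message name="GetStockRequest">
    <part name="symbol" type="xsd:string"/>
  </message>
  <message name="GetStockResponse">
    <part name="result" type="tns:Stock"/>
  </message>
  <portType name="StockPort">
    <operation name="GetStock">
      <input message="tns:GetStockRequest"/>
      <output message="tns:GetStockResponse"/>
    </operation>
  </portType>
</definitions>
"""


def local_name(qname):
    """Strip a prefix such as 'tns:' from a qualified name."""
    return qname.split(":")[-1]


def contextualize(wsdl_text):
    """Return one self-contained copy of each <operation>, with the
    referenced <message> and <complexType> definitions inlined."""
    root = ET.fromstring(wsdl_text)
    messages = {m.get("name"): m for m in root.iter("message")}
    types = {t.get("name"): t for t in root.iter("complexType")}

    cells = []
    for operation in root.iter("operation"):
        cell = copy.deepcopy(operation)
        for ref in cell:  # <input>, <output> or <fault>
            message = messages.get(local_name(ref.get("message", "")))
            if message is None:
                continue
            inlined = copy.deepcopy(message)
            for part in inlined:
                type_def = types.get(local_name(part.get("type", "")))
                if type_def is not None:
                    part.append(copy.deepcopy(type_def))
            ref.append(inlined)
        cells.append(cell)
    return cells


cells = contextualize(SAMPLE_WSDL)
print(ET.tostring(cells[0], encoding="unicode"))
```

After the transformation, everything needed to compare the operation sits in one contiguous fragment, which is the property both clone detectors and topic models rely on.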

Contextual Clone Detection

An Experiment

• We have run an experiment to investigate the difference between clone detection on WSCells and original raw operations
• Two sets of WSDL service description files: 1,100 operations and 7,500 operations
• Compared NICAD clone detector results for each set at various near-miss difference thresholds
  • 0% = exact clone, 10% = 1 line in 10 different, and so on
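The threshold can be read as a line-level difference ratio. The sketch below is a simplification under that reading, not NICAD's actual implementation: it compares two fragments line by line using the longest common subsequence. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists of lines."""
    previous = [0] * (len(b) + 1)
    for line_a in a:
        current = [0]
        for j, line_b in enumerate(b, start=1):
            if line_a == line_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def difference(a, b):
    """Fraction of lines of the longer fragment that have no match."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return (longest - lcs_length(a, b)) / longest


def is_near_miss_clone(a, b, threshold):
    """True if the fragments differ by no more than the threshold."""
    return difference(a, b) <= threshold


# Ten lines, one of which differs: a 10% difference.
fragment_a = [f"line {i}" for i in range(10)]
fragment_b = fragment_a[:4] + ["changed"] + fragment_a[5:]
print(difference(fragment_a, fragment_b))
```

Under this measure the pair above is a clone at the 0.1 threshold but not at 0.0.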

An Experiment

• Number of clones decreases with WSCells

Difference   Clone Pairs in Set 1     Clone Pairs in Set 2
Threshold    Originals   WSCells      Originals   WSCells
0.0          852         705          1434        1066
0.1          852         734          1434        1228
0.2          879         775          1438        1637
0.3          884         813          1469        1637

• Reduction in false positives: the two GetStock operations shown earlier are identical as raw operations, but their WSCells differ because the Stock types they depend on differ

An Experiment

• Number of clone classes can increase with WSCells

Difference   Clone Classes in Set 1   Clone Classes in Set 2
Threshold    Originals   WSCells      Originals   WSCells
0.0          169         187          587         433
0.1          169         139          587         499
0.2          172         142          589         631
0.3          171         136          591         631

• Splits by deeper differences, giving more precision: once their Stock types are taken into account, the two GetStock operations shown earlier no longer belong to the same clone class
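To make the distinction between clone pairs and clone classes concrete, here is a small sketch that groups pairs into classes. It assumes a class is a connected component of the pair relation, which may not match exactly how the detector forms its classes. The function name and the fragment labels are hypothetical.

```python
def clone_classes(pairs):
    """Group clone pairs into classes using union-find."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for item in parent:
        classes.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in classes.values())


# Three pairs over five fragments form two classes.
example_pairs = [("a", "b"), ("b", "c"), ("d", "e")]
print(clone_classes(example_pairs))
```

This is why the two counts can move in opposite directions: removing a few pairs that were holding a large class together lowers the pair count while splitting one class into several.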

Clone Detection for Web Services

• Contextual clone detection with WSCells works!
• Not only finds similar web service operations, but uncovers similar operations we could not find in any other way

<operation name="DrawRateChartCustom">
  <input message="DrawRateChartCustomIn"/>
  <output message="DrawRateChartCustomOut"/>
</operation>

<operation name="GetRealChartCustom">
  <input message="GetRealChartCustomSoapIn"/>
  <output message="GetRealChartCustomSoapOut"/>
</operation>

<operation name="GetLastSaleChartCustom">
  <input message="GetLastSaleChartCustomSoapIn"/>
  <output message="GetLastSaleChartCustomSoapOut"/>
</operation>

<operation name="DrawYieldCurveCustom">
  <input message="DrawYieldCurveCustomIn"/>
  <output message="DrawYieldCurveCustomOut"/>
</operation>

<operation name="GetTopicChartCustom">
  <input message="GetTopicChartCustomSoapIn"/>
  <output message="GetTopicChartCustomSoapOut"/>
</operation>

<operation name="GetTopicBinaryChartCustom">
  <input message="GetTopicBinaryChartCustomSoapIn"/>
  <output message="GetTopicBinaryChartCustomSoapOut"/>
</operation>

Semantic Analysis of Web Services

• Contextualized WSCells also make it possible to use data mining topic models to do semantic analysis of web services
  • Because they provide self-contained documents of significant size
• Might topic models provide a different view of web service similarity?

Latent Dirichlet Allocation

• Latent Dirichlet Allocation (LDA):
  • A statistical model to uncover latent topics
  • Identifies the correlation between documents in terms of shared latent topics (sets of tokens)
• Accepts a set of documents (e.g., source files) as input, returns probability distributions over inferred topics (a topic model) as output
  • Each document has some probability of being related to topic 1, another probability for topic 2, and so on
• Similar documents should be related to similar topics
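The input and output described above can be illustrated with a tiny LDA fitted by collapsed Gibbs sampling. This is one common way to fit the model, and the slides do not say which implementation was used. The function name, the hyperparameters `alpha` and `beta`, and the token lists are all assumptions made for the sketch.

```python
import random


def fit_lda(documents, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Return one topic distribution per document (lists summing to 1)."""
    rng = random.Random(seed)
    vocabulary_size = len({token for doc in documents for token in doc})
    doc_topic = [[0] * num_topics for _ in documents]
    topic_token = [{} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token occurrence.
    assignments = []
    for d, doc in enumerate(documents):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for token, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_token[k][token] = topic_token[k].get(token, 0) + 1
            topic_total[k] += 1

    # Resample each assignment conditioned on all the others.
    for _ in range(iterations):
        for d, doc in enumerate(documents):
            for i, token in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_token[k][token] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_token[j].get(token, 0) + beta)
                    / (topic_total[j] + vocabulary_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_token[k][token] = topic_token[k].get(token, 0) + 1
                topic_total[k] += 1

    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(documents, doc_topic)
    ]


# Invented token lists standing in for tokenized operation descriptions.
DOCUMENTS = [
    ["get", "stock", "quote", "price", "stock"],
    ["get", "stock", "price", "volume"],
    ["reserve", "room", "hotel", "date"],
    ["get", "room", "hotel", "available"],
]
for distribution in fit_lda(DOCUMENTS, num_topics=2):
    print([round(p, 3) for p in distribution])
```

Each printed row is a document's probability distribution over the two topics; which topic index ends up associated with which theme depends on the random seed.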

Latent Dirichlet Allocation

• Documents are represented in the model in terms of probability distributions over topics
• Similarity between documents is found using the Hellinger Distance
  • A measure of how much agreement there is between the shared topics of two documents
  • Almost identical documents have a small Hellinger Distance since they will be related to the same topics
  • In terms of web services, small Hellinger Distances indicate highly related operations
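For two discrete topic distributions P and Q, the Hellinger distance is (1/√2) · sqrt(Σ (√p_i − √q_i)²), which ranges from 0 (identical) to 1 (no shared topics). A short sketch, with a hypothetical function name and invented example distributions:

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


# Invented topic distributions for three operations.
weather_a = [0.9, 0.1]
weather_b = [0.8, 0.2]
finance = [0.1, 0.9]
print(hellinger(weather_a, weather_b))  # small: related operations
print(hellinger(weather_a, finance))    # large: unrelated operations
```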

Evaluating WSCells

• To evaluate the use of WSCells with LDA, we:
  • Generate an LDA model for the original <operation> elements, and another for the contextualized WSCells
  • Explore the Global and Local Similarity between each pair of operations in the models
• Global Similarity: an overall view of the most closely related web service operations in the service set
• Local Similarity: a per-operation view of the other most related web service operations for each operation

Global Similarity

• We look at Global Similarity using a visualization called Bluevis
• Bluevis shows the global conceptual structure of a system by highlighting similar operations using an illuminated line from left-to-right
  • Plot some top fraction of similar operations (top 25,000 in our examples)
  • Use a consistently ordered list of web service operations for the LDA model to view the differences
  • If a display is noisy, it is often an indication that the model is not identifying meaningful data

Global Similarity

• For original raw operations:
  • Bluevis highlights the operations that LDA rates as most similar
  • Some clear structure
• However, most of this is due to shared keywords, like get and SOAP
• This uncontextualized model has very little value

Global Similarity

• For contextualized WSCells:
  • A clearer semantic structure, less noise overall
  • Operation similarity becomes meaningful
• Services with semantic similarity discovered
  • E.g., operations with similar parameters or faults, such as those that manipulate holiday dates or financial rates

Local Similarity

• We can also examine the local similarity for each individual operation
  • Identify the complete ordered list of similarity scores for an operation in the data set
• Using the top similarity scores, evaluate how meaningful the data is from a user's perspective
  • For example, how can I find the most similar web service operations to the one I am using now?
• We use a tool called POCO (Pairwise Observation of Concepts) to examine the most similar operations

Local Similarity

Operation                 Most similar WSCell            Most similar original raw WSDL operation
ListFinancials            GetFinancialServicesFromList   LanguagesList
ExportShipsAndCategories  ExportIteneraryAndSteps        Search
GetIssueData              GetFlightData                  word_cloud
GetWeatherReport          GetWeather                     GetIndices
GetAIDIBOR                GetTRLIBOR                     GetCarriers
searchByIdentifier        searchByNameAndAddress         GetLastSecurityHeadlines
ToolsAndHardwareBox       KitchenAndHousewareBox         ListRenditions
GetReservations           GetRoomAvailabilityForDay      GetSOFIBOR
GetOtherProductInfo       NextOtherProductPortion        GetParkingInfo
GetAllSplitsByExchange    GetAllCashDividendsByExchange  GetTeamLoyalties2
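The global and local views can both be derived from one table of pairwise distances. The sketch below assumes such a table is already available (for example, Hellinger distances between topic distributions). It is not how Bluevis or POCO are implemented, and the operation names and distance values are placeholders invented for illustration.

```python
def global_similarity(distances, top_n):
    """The top_n closest pairs across the whole service set."""
    return sorted(distances.items(), key=lambda item: item[1])[:top_n]


def local_similarity(distances, operation, top_n):
    """The top_n operations closest to one given operation."""
    neighbours = []
    for (a, b), distance in distances.items():
        if a == operation:
            neighbours.append((b, distance))
        elif b == operation:
            neighbours.append((a, distance))
    return sorted(neighbours, key=lambda item: item[1])[:top_n]


# Placeholder operations and made-up distances, one entry per unordered pair.
DISTANCES = {
    ("OpA", "OpB"): 0.05,
    ("OpA", "OpC"): 0.80,
    ("OpA", "OpD"): 0.60,
    ("OpB", "OpC"): 0.75,
    ("OpB", "OpD"): 0.65,
    ("OpC", "OpD"): 0.10,
}
print(global_similarity(DISTANCES, top_n=2))
print(local_similarity(DISTANCES, "OpD", top_n=2))
```

The local view answers the user's question from the previous slide directly: given the operation in use now, list its nearest neighbours in order.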

Summary

• Very-high-level domain-specific languages such as WSDL make poor targets for similarity analysis using clone detection and topic models
  • Lack of local context prevents meaningful results
• Contextualizing using WSCells exposes both cloning and semantic relationships between web operations
  • Clone detection of WSCells identifies similar web service operations
  • Topic models of WSCells expose both global system-wide semantic relationships and local individual relationships between operations

Current & Future

• Continue analysis of web services for the Personal Web using our results
• Apply contextualization to similarity analysis of other modeling and specification languages (currently Simulink, Stateflow and UML sequence diagrams)
• Experiment with effect of contextualization on clone and topic model analysis of traditional languages such as Java and C (“contextual clones”)

Questions?
