Transcript
Page 1: 130919   jim cordy - when is a clone not a clone

School of Computing

Kingston, Canada

Contextualized Analysis of Web Services

James R. Cordy

David B. Skillicorn

Douglas Martin

Scott Grant

When is a Clone not a Clone? (and vice-versa)

Page 2: 130919   jim cordy - when is a clone not a clone

Motivation

• The Personal Web

• Rapidly growing number of web services makes it increasingly difficult to find and choose the right ones

• Need a quick and convenient way to find alternatives

• Hand tagging impractical – automation is needed!

Page 3: 130919   jim cordy - when is a clone not a clone

Motivation

• Automation
    • Similarity detection techniques offer solutions!

• Code clone detection from software engineering research can find similar code fragments – why not similar services?

• Topic models from data mining research can find text documents with similar semantics – why not similar services?

Page 4: 130919   jim cordy - when is a clone not a clone

Web Service Similarity

• Web services are stored in service registries, containing WSDL service description files

• Could apply clone detection to entire service descriptions

• But what we really want are similar service operations

Page 5: 130919   jim cordy - when is a clone not a clone

Let’s try it!

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<complexType name="Stock">
  <sequence>
    <element name="Supplier" type="xsd:string"/>
    <element name="Warehouse" type="xsd:string"/>
    <element name="OnHand" type="xsd:string"/>
    <element name="OnOrder" type="xsd:string"/>
    <element name="Demand" type="xsd:string"/>
  </sequence>
</complexType>

<complexType name="Stock">
  <sequence>
    <element name="date" type="xsd:string"/>
    <element name="open" type="xsd:float"/>
    <element name="high" type="xsd:float"/>
    <element name="low" type="xsd:float"/>
    <element name="close" type="xsd:float"/>
    <element name="volume" type="xsd:float"/>
  </sequence>
</complexType>

Page 6: 130919   jim cordy - when is a clone not a clone

How about these?

<operation name="DrawRateChartCustom">
  <input message="DrawRateChartCustomIn"/>
  <output message="DrawRateChartCustomOut"/>
</operation>

<operation name="GetTopicBinaryChartCustom">
  <input message="GetTopicBinaryChartCustomSoapIn"/>
  <output message="GetTopicBinaryChartCustomSoapOut"/>
</operation>

Page 7: 130919   jim cordy - when is a clone not a clone

So what went wrong?

• At this point we thought maybe our idea wasn’t going to work

• Maybe clone detection can’t help with web service discovery?

• But why? What’s so special about WSDL?

Page 11: 130919   jim cordy - when is a clone not a clone

Web Service Description Language (WSDL)

• A WSDL service description has 3 main parts:

    • a <portType> element where the operations are declared;

    • <message> elements corresponding to the inputs, outputs and faults of the operations;

    • and a <types> element containing an XML Schema that defines the data and structure types used in the messages
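
The three parts can be located programmatically. The following is a minimal sketch, assuming WSDL 1.1 namespaces and Python's standard xml.etree.ElementTree module. The hotel service in SAMPLE_WSDL and the helper summarize_wsdl are hypothetical, written only for illustration, and are not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the searches below (WSDL 1.1 and XML Schema).
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "xsd": "http://www.w3.org/2001/XMLSchema",
}

# Hypothetical, heavily simplified service description (an assumption, not the authors' file).
SAMPLE_WSDL = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:tns="urn:example:hotel"
             targetNamespace="urn:example:hotel">
  <types>
    <xsd:schema targetNamespace="urn:example:hotel">
      <xsd:complexType name="Reservation">
        <xsd:sequence>
          <xsd:element name="roomNumber" type="xsd:int"/>
          <xsd:element name="guestName" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
  </types>
  <message name="ReserveRoomRequest">
    <part name="body" type="tns:Reservation"/>
  </message>
  <message name="ReserveRoomResponse">
    <part name="confirmed" type="xsd:boolean"/>
  </message>
  <portType name="HotelPortType">
    <operation name="ReserveRoom">
      <input message="tns:ReserveRoomRequest"/>
      <output message="tns:ReserveRoomResponse"/>
    </operation>
  </portType>
</definitions>
"""


def summarize_wsdl(wsdl_text):
    """Return the names found in each of the three main parts of a WSDL file."""
    root = ET.fromstring(wsdl_text)
    return {
        # Operations are declared inside <portType>.
        "operations": [op.get("name")
                       for op in root.findall("wsdl:portType/wsdl:operation", NS)],
        # Messages are top-level siblings of <portType>.
        "messages": [m.get("name") for m in root.findall("wsdl:message", NS)],
        # Data types live in an XML Schema nested inside <types>.
        "types": [t.get("name")
                  for t in root.findall("wsdl:types/xsd:schema/xsd:complexType", NS)],
    }


summary = summarize_wsdl(SAMPLE_WSDL)
print(summary)
```

Note that the operation only names its messages, and a message only names its type, so the three lists come from three different regions of the file.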

Page 14: 130919   jim cordy - when is a clone not a clone

Web Service Description Language (WSDL)

• This simple example service has two operations:

    • ReserveRoom

    • GetAvailableRooms

Page 15: 130919   jim cordy - when is a clone not a clone

Web Service Description Language (WSDL)

• WSDL service description files contain descriptions of the operations that a web service has to offer

• But the pieces of each operation’s own description are scattered over different parts of the WSDL file

• Difficult to identify complete units to analyze and compare

Page 16: 130919   jim cordy - when is a clone not a clone

The Problem

• This poses a problem for analysis techniques:

    • Operations cannot easily be compared for similarity using clone detectors, because there are no contiguous fragments to compare

    • And they cannot be analyzed using data mining topic models, because there are no separate complete documents to generate a model from

Page 17: 130919   jim cordy - when is a clone not a clone

Our Solution

• Our solution is to contextualize the original <operation> elements, to create self-contained operation descriptions
    • We use source transformation to inline remote information from the context into the elements that reference or depend on them

• We call these contextualized WSDL operations Web Service Cells, or WSCells
    • The first example of a new kind of clone detection: contextual clones
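
The inlining idea can be sketched in a few lines. This is an illustration only, assuming WSDL 1.1 namespaces and Python's xml.etree.ElementTree; it is not the authors' source transformation, and the sample service, the helper names local_name and contextualize, and the choice to match references by local name are all assumptions.

```python
import copy
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "xsd": "http://www.w3.org/2001/XMLSchema",
}

# Hypothetical service modeled on the GetStock example (an assumption).
SAMPLE_WSDL = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:tns="urn:example:stock">
  <types>
    <xsd:schema>
      <xsd:complexType name="Stock">
        <xsd:sequence>
          <xsd:element name="date" type="xsd:string"/>
          <xsd:element name="open" type="xsd:float"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
  </types>
  <message name="GetStockRequest">
    <part name="symbol" type="xsd:string"/>
  </message>
  <message name="GetStockResponse">
    <part name="result" type="tns:Stock"/>
  </message>
  <portType name="StockPortType">
    <operation name="GetStock">
      <input message="tns:GetStockRequest"/>
      <output message="tns:GetStockResponse"/>
    </operation>
  </portType>
</definitions>
"""


def local_name(qname):
    """Drop a namespace prefix such as 'tns:' from an attribute value."""
    return qname.split(":", 1)[-1]


def contextualize(wsdl_text):
    """Return {operation name: self-contained copy of the operation element}."""
    root = ET.fromstring(wsdl_text)
    messages = {m.get("name"): m for m in root.findall("wsdl:message", NS)}
    types = {t.get("name"): t
             for t in root.findall("wsdl:types/xsd:schema/xsd:complexType", NS)}
    cells = {}
    for op in root.findall("wsdl:portType/wsdl:operation", NS):
        cell = copy.deepcopy(op)
        for ref in cell:  # <input>, <output> and <fault> children
            message = messages.get(local_name(ref.get("message", "")))
            if message is None:
                continue
            for part in message:
                part_copy = copy.deepcopy(part)
                # Inline the referenced type definition, if it is a local one.
                type_def = types.get(local_name(part.get("type", "")))
                if type_def is not None:
                    part_copy.append(copy.deepcopy(type_def))
                ref.append(part_copy)
        cells[op.get("name")] = cell
    return cells


cells = contextualize(SAMPLE_WSDL)
print(ET.tostring(cells["GetStock"], encoding="unicode"))
```

After this step, two GetStock operations whose Stock types differ no longer look identical, because the differing fields are now part of each operation's own text.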

Page 18: 130919   jim cordy - when is a clone not a clone

Contextualizing WSDL Operations

Page 19: 130919   jim cordy - when is a clone not a clone

Contextual Clone Detection

Page 20: 130919   jim cordy - when is a clone not a clone

An Experiment

• We have run an experiment to investigate the difference between clone detection on WSCells and original raw operations

• Two sets of WSDL service description files: 1,100 operations and 7,500 operations

• Compared NICAD clone detector results for each set at various near-miss difference thresholds
    0% = exact clone, 10% = 1 line in 10 different, and so on
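
The threshold arithmetic can be made concrete with a small sketch. It assumes a line-based comparison built on Python's difflib.SequenceMatcher, which only approximates a longest-common-subsequence match; it is not NICAD's implementation, and the function names line_difference and is_clone_pair are hypothetical.

```python
from difflib import SequenceMatcher


def line_difference(fragment_a, fragment_b):
    """Fraction of lines that differ, relative to the longer fragment."""
    lines_a = [ln.strip() for ln in fragment_a.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in fragment_b.splitlines() if ln.strip()]
    longest = max(len(lines_a), len(lines_b))
    if longest == 0:
        return 0.0
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (longest - matched) / longest


def is_clone_pair(fragment_a, fragment_b, threshold):
    """A pair counts as a clone when its difference is within the threshold."""
    return line_difference(fragment_a, fragment_b) <= threshold


# Ten-line fragment and a copy with exactly one line changed.
base = "\n".join(f"line {i}" for i in range(10))
changed = base.replace("line 4", "something else")

difference = line_difference(base, changed)
print(difference)
print(is_clone_pair(base, changed, 0.0), is_clone_pair(base, changed, 0.1))
```

With one line in ten changed, the pair is rejected at the 0% threshold and accepted at 10%, matching the reading of the thresholds given above.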

Page 21: 130919   jim cordy - when is a clone not a clone

An Experiment

• Number of clones decreases with WSCells

Difference   Clone Pairs in Set 1      Clone Pairs in Set 2
Threshold    Originals   WSCells       Originals   WSCells
0.0          852         705           1434        1066
0.1          852         734           1434        1228
0.2          879         775           1438        1637
0.3          884         813           1469        1637

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<complexType name="Stock">
  <sequence>
    <element name="Supplier" type="xsd:string"/>
    <element name="Warehouse" type="xsd:string"/>
    <element name="OnHand" type="xsd:string"/>
    <element name="OnOrder" type="xsd:string"/>
    <element name="Demand" type="xsd:string"/>
  </sequence>
</complexType>

<complexType name="Stock">
  <sequence>
    <element name="date" type="xsd:string"/>
    <element name="open" type="xsd:float"/>
    <element name="high" type="xsd:float"/>
    <element name="low" type="xsd:float"/>
    <element name="close" type="xsd:float"/>
    <element name="volume" type="xsd:float"/>
  </sequence>
</complexType>

• Reduction in false positives

Page 22: 130919   jim cordy - when is a clone not a clone

An Experiment

• Number of clone classes can increase with WSCells

Difference   Clone Classes in Set 1    Clone Classes in Set 2
Threshold    Originals   WSCells       Originals   WSCells
0.0          169         187           587         433
0.1          169         139           587         499
0.2          172         142           589         631
0.3          171         136           591         631

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<operation name="GetStock">
  <input message="tns:GetStockRequest"/>
  <output message="tns:GetStockResponse"/>
</operation>

<complexType name="Stock">
  <sequence>
    <element name="Supplier" type="xsd:string"/>
    <element name="Warehouse" type="xsd:string"/>
    <element name="OnHand" type="xsd:string"/>
    <element name="OnOrder" type="xsd:string"/>
    <element name="Demand" type="xsd:string"/>
  </sequence>
</complexType>

<complexType name="Stock">
  <sequence>
    <element name="date" type="xsd:string"/>
    <element name="open" type="xsd:float"/>
    <element name="high" type="xsd:float"/>
    <element name="low" type="xsd:float"/>
    <element name="close" type="xsd:float"/>
    <element name="volume" type="xsd:float"/>
  </sequence>
</complexType>

• Splits by deeper differences – more precision
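
Why the number of classes can go up while the number of pairs goes down is easiest to see on a toy case. The sketch below assumes a clone class is a connected group of clone pairs, which is a simplification and not necessarily how NICAD forms its classes; the fragment labels A to D and the function clone_classes are hypothetical.

```python
from itertools import combinations


def clone_classes(pairs):
    """Group clone pairs into classes using a small union-find structure."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    classes = {}
    for item in list(parent):
        classes.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in classes.values())


# Four raw operations that all look alike: every pair is a clone pair.
raw_pairs = list(combinations(["A", "B", "C", "D"], 2))

# After contextualization, suppose only A/B and C/D still match.
contextual_pairs = [("A", "B"), ("C", "D")]

raw_classes = clone_classes(raw_pairs)
contextual_classes = clone_classes(contextual_pairs)
print(len(raw_pairs), raw_classes)
print(len(contextual_pairs), contextual_classes)
```

In this toy case one class of four fragments (six pairs) splits into two classes of two fragments (two pairs): fewer pairs, more classes.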

Page 23: 130919   jim cordy - when is a clone not a clone

Clone Detection for Web Services

• Contextual clone detection with WSCells works!

• Not only finds similar web service operations, but uncovers similar operations we could not find in any other way

<operation name="DrawRateChartCustom">
  <input message="DrawRateChartCustomIn"/>
  <output message="DrawRateChartCustomOut"/>
</operation>

<operation name="GetRealChartCustom">
  <input message="GetRealChartCustomSoapIn"/>
  <output message="GetRealChartCustomSoapOut"/>
</operation>

<operation name="GetLastSaleChartCustom">
  <input message="GetLastSaleChartCustomSoapIn"/>
  <output message="GetLastSaleChartCustomSoapOut"/>
</operation>

<operation name="DrawYieldCurveCustom">
  <input message="DrawYieldCurveCustomIn"/>
  <output message="DrawYieldCurveCustomOut"/>
</operation>

<operation name="GetTopicChartCustom">
  <input message="GetTopicChartCustomSoapIn"/>
  <output message="GetTopicChartCustomSoapOut"/>
</operation>

<operation name="GetTopicBinaryChartCustom">
  <input message="GetTopicBinaryChartCustomSoapIn"/>
  <output message="GetTopicBinaryChartCustomSoapOut"/>
</operation>

Page 24: 130919   jim cordy - when is a clone not a clone

Semantic Analysis of Web Services

• Contextualized WSCells also make it possible to use data mining topic models to do semantic analysis of web services
    • Because they provide self-contained documents of significant size

• Might topic models provide a different view of web service similarity?

Page 25: 130919   jim cordy - when is a clone not a clone

Latent Dirichlet Allocation

• Latent Dirichlet Allocation (LDA):

• A statistical model to uncover latent topics

• Identifies the correlation between documents in terms of shared latent topics (sets of tokens)

• Accepts a set of documents (e.g., source files) as input, returns probability distributions over inferred topics (a topic model) as output
    • Each document has some probability of being related to topic 1, another probability for topic 2, and so on

• Similar documents should be related to similar topics

Page 26: 130919   jim cordy - when is a clone not a clone

Latent Dirichlet Allocation

• Documents are represented in the model in terms of probability distributions over topics

• Similarity between documents is found using the Hellinger Distance
    • A measure of how much agreement there is between the shared topics of two documents
    • Almost identical documents have a small Hellinger Distance since they will be related to the same topics
    • In terms of web services, small Hellinger Distances indicate highly related operations
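
For reference, the standard Hellinger distance between two discrete distributions can be computed as below. This is a minimal sketch using only Python's math module; the 1/sqrt(2) normalization (which keeps the distance between 0 and 1) is the usual textbook form and is assumed here, and the three topic distributions are invented for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Hypothetical topic distributions for three documents (illustrative values only).
doc_a = [0.80, 0.10, 0.10]
doc_b = [0.70, 0.20, 0.10]   # close to doc_a
doc_c = [0.05, 0.05, 0.90]   # dominated by a different topic

near = hellinger(doc_a, doc_b)
far = hellinger(doc_a, doc_c)
print(near, far)
```

Identical distributions give 0, distributions with no topic in common give 1, and documents that share most of their topic mass land near the low end.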

Page 27: 130919   jim cordy - when is a clone not a clone

Evaluating WSCells

• To evaluate the use of WSCells with LDA, we:
    • Generate an LDA model for the original <operation> elements, and another for the contextualized WSCells
    • Explore the Global and Local Similarity between each pair of operations in the models

• Global Similarity: an overall view of the most closely related web service operations in the service set

• Local Similarity: a per-operation view of the other most related web service operations for each operation
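
The two views differ only in how the pairwise distances are sliced. The sketch below assumes each operation already has a topic distribution from a trained model and ranks by Hellinger distance; the operation names, the distributions, and the functions global_similarity and local_similarity are hypothetical and do not reproduce the authors' tools.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


# Hypothetical per-operation topic distributions (illustrative values only).
topic_mix = {
    "op_a": [0.80, 0.10, 0.10],
    "op_b": [0.70, 0.20, 0.10],
    "op_c": [0.40, 0.30, 0.30],
    "op_d": [0.05, 0.05, 0.90],
}


def global_similarity(topic_mix, top_k):
    """Global view: the top_k closest pairs across the whole service set."""
    pairs = [(hellinger(topic_mix[a], topic_mix[b]), a, b)
             for a, b in combinations(sorted(topic_mix), 2)]
    return sorted(pairs)[:top_k]


def local_similarity(topic_mix, target, top_n):
    """Local view: the top_n operations closest to one chosen operation."""
    ranked = sorted((hellinger(topic_mix[target], mix), name)
                    for name, mix in topic_mix.items() if name != target)
    return ranked[:top_n]


closest_pairs = global_similarity(topic_mix, top_k=2)
neighbours = local_similarity(topic_mix, "op_a", top_n=2)
print(closest_pairs)
print(neighbours)
```

The global list answers "which operations in the set are most alike", while the local list answers "what is most like the operation I am using now".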

Page 28: 130919   jim cordy - when is a clone not a clone

Global Similarity

• We look at Global Similarity using a visualization called Bluevis

• Bluevis shows the global conceptual structure of a system by highlighting similar operations using an illuminated line from left-to-right
    • Plot some top fraction of similar operations (top 25,000 in our examples)
    • Use a consistently ordered list of web service operations for the LDA model to view the differences
    • If a display is noisy, it is often an indication that the model is not identifying meaningful data

Page 29: 130919   jim cordy - when is a clone not a clone

Global Similarity

Page 30: 130919   jim cordy - when is a clone not a clone

Global Similarity

• For original raw operations:
    • Bluevis highlights the operations that LDA ranks as most similar
    • Some clear structure

• However, most of this is due to shared keywords, like get and SOAP

• This uncontextualized model has very little value

Page 31: 130919   jim cordy - when is a clone not a clone

Global Similarity

Page 32: 130919   jim cordy - when is a clone not a clone

Global Similarity

• For contextualized WSCells:
    • A clearer semantic structure, less noise overall
    • Operation similarity becomes meaningful

• Services with semantic similarity discovered
    • E.g., operations with similar parameters or faults, such as those that manipulate holiday dates or financial rates

Page 33: 130919   jim cordy - when is a clone not a clone

Local Similarity

• We can also examine the local similarity for each individual operation
    • Identify the complete ordered list of similarity scores for an operation in the data set

• Using the top similarity scores, evaluate how meaningful the data is from a user's perspective
    • For example, how can I find the most similar web service operations to the one I am using now?

• We use a tool called POCO (Pairwise Observation of Concepts) to examine the most similar operations

Page 34: 130919   jim cordy - when is a clone not a clone

Local Similarity

Page 35: 130919   jim cordy - when is a clone not a clone

Local Similarity

Operation                   Most similar WSCell              Most similar original raw WSDL operation
ListFinancials              GetFinancialServicesFromList     LanguagesList
ExportShipsAndCategories    ExportIteneraryAndSteps          Search
GetIssueData                GetFlightData                    word_cloud
GetWeatherReport            GetWeather                       GetIndices
GetAIDIBOR                  GetTRLIBOR                       GetCarriers
searchByIdentifier          searchByNameAndAddress           GetLastSecurityHeadlines
ToolsAndHardwareBox         KitchenAndHousewareBox           ListRenditions
GetReservations             GetRoomAvailabilityForDay        GetSOFIBOR
GetOtherProductInfo         NextOtherProductPortion          GetParkingInfo
GetAllSplitsByExchange      GetAllCashDividendsByExchange    GetTeamLoyalties2

Page 36: 130919   jim cordy - when is a clone not a clone

Summary

• Very-high-level domain-specific languages such as WSDL make poor targets for similarity analysis using clone detection and topic models
    • Lack of local context prevents meaningful results

• Contextualizing using WSCells exposes both cloning and semantic relationships between web operations
    • Clone detection of WSCells identifies similar web service operations
    • Topic models of WSCells expose both global system-wide semantic relationships and local individual relationships between operations

Page 37: 130919   jim cordy - when is a clone not a clone

Current & Future

• Continue analysis of web services for the Personal Web using our results

• Apply contextualization to similarity analysis of other modeling and specification languages (currently Simulink, Stateflow and UML sequence diagrams)

• Experiment with effect of contextualization on clone and topic model analysis of traditional languages such as Java and C (“contextual clones”)

Page 38: 130919   jim cordy - when is a clone not a clone

James R. Cordy

David B. Skillicorn

Douglas Martin

Scott Grant

Questions?

Contextualized Analysis of Web Services

When is a Clone not a Clone? (and vice-versa)