70
Data Science elevating Software Engineering Software Engineering elevating Data Science Miryung Kim University of California, Los Angeles

Data Science elevating Software Engineeringelevating Data ...web.cs.ucla.edu/~miryung/Huawei-Oct2018-MiryungKim-UCLA-DS4SEandSE4DS.… · Visualization Toolkit.2 It visualizes the

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

DataScienceelevating SoftwareEngineeringSoftwareEngineeringelevating DataScience

Miryung KimUniversityofCalifornia,LosAngeles

SoftwareEngineeringandAnalysisLabatUCLA

Interactive Code Review /Clone Search

Refactoring and Transformation

Test�Debugging�View

Atomic�Change�ViewExtended�Call�Graph View

Figure 2: Overview of FaultTracer’s front-end.

2. FAULTTRACER IMPLEMENTATIONFaultTracer is implemented as a toolkit, which includes

the front-end plug-in, and the back-end library. The follow-ing subsections present the details of the toolkit.

2.1 FaultTracer Plug-inThe front-end of FaultTracer is implemented as an Eclipse

IDE plugin. The plugin takes two program versions as inputand extracts atomic program changes from the two versionsbased on the abstract syntax tree (AST) analysis providedby the Eclipse JDT toolkit.1 It traverses the ASTs of twoversions to compare fields and methods by their fully quali-fied names to find atomic changes. For each pair of comparedmethods, FaultTracer filters out all comments and white-spaces before comparison. FaultTracer also finds call andaccess dependencies between atomic changes by tracing thedefinition and reference of each used method and field.

The FaultTracer plugin also includes three views to visu-alize internal and final outputs of FaultTracer (Figure 2):Atomic-change view is implemented using the Eclipse ZestVisualization Toolkit.2 It visualizes the all atomic changesbetween program versions and their dependencies, and sup-ports various user interaction (details shown in Appendix).Note that this view depends on the data produced by Step1 in Figure 1.Extended-call-graph view is also implemented using theEclipse Zest Visualization Toolkit. It visualizes the extendedcall graphs for individual tests. This view can help the userto better understand the behaviors of individual tests. Thisview depends on the data produced by Step 2 in Figure 1.Testing-debugging view lists the affected tests betweentwo compared versions, the affecting changes for each af-fected test, and the ranked list of affecting changes for eachfailed test. This view visualizes all the final outputs of Fault-

1http://www.eclipse.org/jdt/. Accessed in Aug. 2012.2http://www.eclipse.org/gef/zest/. Accessed in Aug.2012.

Tracer. When the user double-clicks an affected test in theview, the view immediately displays the affecting changes forthe selected test. The view would also display the ranked listof affecting changes for the test along with their suspicious-ness scores computed based on program spectra. When theuser double-clicks any affecting change in the view, Fault-Tracer extracts corresponding changed code fragments inthe Java Editor to facilitate manual inspection of relevantcode. Note that this view uses the data produced by Steps3, 4, and 5 respectively.

2.2 FaultTracer LibraryThe back-end of FaultTracer is implemented as Ant3

tasks, which fully automate the process of constructing ex-tended call graphs, selecting affected tests, determining af-fecting changes, and ranking affecting changes for failed tests.The back-end performs the ECG construction task on-the-fly through byte code instrumentation. It dynamically in-struments classes loaded into the JVM through a Java agentwithout any modification of the target program. For instru-mentation, FaultTracer uses the ASM4 bytecode manipu-lation and analysis framework. We extend visitor classesin ASM and override visit methods to trace method in-vocation relations, field access relations, and associated at-tributes (e.g., receiver object types, static target methodsfor virtual method invocations, and types of field accesses).

The back-end of FaultTracer also performs all the coreanalysis tasks: selection of affected tests, determination ofaffecting changes, and spectrum-based ranking of affecting-changes. The final results are then visualized by the front-end plugin.

3. DEMONSTRATIONThis section illustrates how to configure FaultTracer and

how to perform the five key steps for regression test execu-

3http://ant.apache.org/. Accessed in Aug. 2012.4http://asm.ow2.org/. Accessed in Aug. 2012.

DebuggingProgram Comprehension

DataScienceelevatingSoftwareEngineering

EmpiricalStudiesofSoftwareChanges

CodeRedundancyClonegenealogyCopyandpastepracticesLonglivedclonesSoftwareforkingandcodeporting

SoftwareRefactoringRefactoringFieldStudyQuantifyingRefactoringCostandBenefitsImpactonRegressionTestingRoleofAPIRefactoring

SoftwarePatchesSupplementarypatchesOmissionerrors

APIEvolutionRoleofAPIRefactoringAPIStability

DataScienceelevatingSoftwareEngineering

EmpiricalStudiesofSoftwareChanges

AutomatedandInteractive

SoftwareDevTools

CodeRedundancyClonegenealogyCopyandpastepracticesLonglivedclonesSoftwareforkingandcodeporting

SoftwareRefactoringRefactoringFieldStudyQuantifyingRefactoringCostandBenefitsImpactonRegressionTestingRoleofAPIRefactoring

SoftwarePatchesSupplementarypatchesOmissionerrors

TransplantationandTestReuseGrafter

APIEvolutionRoleofAPIRefactoringAPIStability

InteractiveCodeReviewCritics

APIUsageAdaptationLibSync6AURAAPIMatching

LogicalProgramDifferencingLSdiffVdiff forVHDL

RefactoringReconstructionRefFinder

CloneRemovalRefactoringRASE

DataScienceelevatingSoftwareEngineering

EmpiricalStudiesofSoftwareChanges

RecommendationSystems

AutomatedandInteractive

SoftwareDevTools

CodeRedundancyClonegenealogyCopyandpastepracticesLonglivedclonesSoftwareforkingandcodeporting

SoftwareRefactoringRefactoringFieldStudyQuantifyingRefactoringCostandBenefitsImpactonRegressionTestingRoleofAPIRefactoring

ProgramTransformationfromExamplesSyditLASECookbook

BugFindingRefactoringBugsCloningInconsistenciesFaultTracerModularityViolationsPrioritizingTestsforRefactoring

SoftwarePatchesSupplementarypatchesOmissionerrors

TransplantationandTestReuseGrafter

APIEvolutionRoleofAPIRefactoringAPIStability

InteractiveCodeReviewCritics

APIUsageAdaptationLibSync6AURAAPIMatching

LogicalProgramDifferencingLSdiffVdiff forVHDL

RefactoringReconstructionRefFinder

CloneRemovalRefactoringRASE

Part1:SoftwareEngineeringelevatingDataScience

SEToolsforBigDataAnalytics• InteractiveDebugger• DataProvenance• AutomatedDebugging

DataScientistsinSoftwareTeams• Background• WorkActivities• Challenges• BestPractices• QualityAssurance

TheEmergingRolesofDataScientistsonSoftwareTeams[ICSE2016]

Weareatatippingpoint wheretherearelargescaletelemetry,machine,processandqualitydata.

DatascientistsareemergingrolesinSWteamsduetoanincreasingdemandforexperimentingwithrealusersandreportingresultswithstatisticalrigor.

Wehaveconductedthefirstin-depthinterviewstudyandthelargestscalesurveyofprofessionaldatascientists tocharacterizeworkingstyles.

InsightProvider Specialists PlatformBuilder Polymath TeamLeader

BigDataDebugging

DataScientists

ChallengesinEnsuring“Correctness”BigDataDebugging

DataScientists

Validation isamajorchallenge.

“Honestly,wedon’thaveagoodmethodforthis.”[P457]“Justbecausethemathisright,doesn’tmeanthattheanswerisright.”[P307]Explainability isimportant.Participantswarnedaboutoverrelianceonaggregatemetrics— “togaininsights,youmustgooneleveldeeper.”

Developlocally Hopeitworks Runincloud Bug!

Guesswork

Map Reduce

BigDebug:DebuggingPrimitivesforInteractiveBigDataProcessinginSpark

MuhammadAliGulzar,MatteoInterlandi,Seunghyun Yoo,SaiDeepTetali,TysonCondie,ToddMillstein,Miryung Kim[ICSE2016,FSEToolDemo2016,SIGMODToolDemo2017]

BigDataDebugging

DataScientists

RunningaMapReduceJobonCluster

Ajobisdistributedtoworkersincluster

Eachworkerperformspipelinedtransformationsonapartitionwithmillionsofrecords

Ausersubmitsajob

ReduceFilter Map

MotivatingScenario:ElectionRecordAnalysis

• AlicewritesaSparkprogramthatrunscorrectlyonlocalmachine(100MBdata)butcrashesoncluster(1TB)

• Alicecannotseethecrash-inducingintermediateresult.

• Alicecannotidentifywhichinputfrom1TBcausingcrash

• Whencrashoccurs,allintermediateresultsarethrownaway.

1234567

VoterIDCandidateStateTime9213SandersTX1440023087

Task 31 failed 3 times; aborting jobERROR Executor: Exception in task 31 in stage 0 (TID 31)

java.lang.NumberFormatException

val log = "s3n://poll.log"val text_file = spark.textFile(log)val count = text_file.filter( line => line.split()[3].toInt > 1440012701).map(line = > (line.split()[1] , 1)).reduceByKey(_ + _).collect()

WhyTraditionalDebugPrimitivesDoNotWorkforApacheSpark?

Enablinginteractivedebuggingrequiresusto re-thinkthefeaturesoftraditionaldebuggersuchasGDB

• Pausingtheentirecomputationonthecloudcouldreducethroughput

• Itisclearlyinfeasibleforausertoinspectbillionofrecordsthrougharegularwatchpoint

• EvenlaunchingremoteJVMdebuggerstoindividualworkernodescannotscaleforbigdatacomputing

Stage3Stage2Stage1

1.SimulatedBreakpoint

Flatmap Map ReduceByKey Map ReduceByKey Filter

Breakpoint

Storeddatarecords

Simulatedbreakpointreplayscomputationfromthelatestmaterializationpointwheredataisstoredinmemory

1.SimulatedBreakpoint – Realtime CodeFix

Stage3Stage2Stage1

Flatmap Map ReduceByKey Map ReduceByKey Filter

Breakpoint

Allowausertofixcodeafterthebreakpoint

state.equals(“TX”)||state.equals(“CA”)

Stage2

2.On-DemandGuardedWatchpoint

ReduceByKey Map Watchpoint

Watchpoint capturesindividualdatarecordsmatchingauser-providedguard

Stage1

Flatmap Map

3.CrashCulprit Remediation

Ausercaneithercorrectthecrashedrecord,skipthecrashculprit,orsupplyacodefixtorepairthecrashculprit.

Stage3Stage2Stage1

Flatmap Map ReduceByKey Map ReduceByKey Filter

Task 31 failed 3 times; aborting jobERROR Executor: Exception in task 31 in stage 0 (TID 31)java.lang.NumberFormatException

4.BackwardandForwardTracing

Stage3Stage2Stage1

Flatmap Map ReduceByKey Map ReduceByKey Filter

Ausercanalsoissuetracingqueriesonintermediaterecordsatrealtime

Demo:BigDebug InteractiveDebugger[FSE2016Demo,SIGMOD2017Demo]

Q1:HowdoesBigDebug scaletomassivedata?

BigDebug retainsscaleuppropertyofSpark.ThispropertyiscriticalforBigDataprocessingframeworks

1

10

100

1000

10000

0.5 0.9 4 8 30 70 200 1000

Time(s)

DatasetSize(GB)

BigDebugScaleUp

BigDebug Spark

Q2:Whatistheperformanceoverheadofdebuggingprimitives?

Program Datasetsize(GB)

Max Maxw/oLatencyAlert

Watchpoint CrashCulprit

Tracing

WordCount 0.5 - 1000 2.5X 1.34X 1.09X 1.18X 1.22X

Grep 20- 90 1.76X 1.07X 1.05X 1.04X 1.05X

PigMix-L1 1- 200 1.38X 1.29X 1.03X 1.19X 1.24X

BigDebug posesatmost2.5Xoverheadwiththemaximuminstrumentationsetting.

Max:AllthefeaturesofBigDebug areenabled

Titian:DataProvenanceSupportinSpark

MatteoInterlandi,Kshitij Shah,SaiDeepTetali,MuhammadAliGulzar,Seunghyun Yoo,Miryung Kim,ToddMillstein,TysonCondie[42ndConferenceonVeryLargeDataBases,VLDB2016]

BigDataDebugging

DataScientists

Tuple-ID Time Sendor-ID Temperature

T1 11AM 1 34

T2 11AM 2 35

T3 11AM 3 35

T4 12PM 1 35

T5 12PM 2 35

T6 12PM 3 100

T7 1PM 1 35

T8 1PM 2 35

T9 1PM 3 80

SELECTtime,AVG(temp)FROMsensorsGROUPBYtime

Sensors

Result-ID

Time AVG(temp)

ID-1 11AM 34.6

ID-2 12PM 56.6

ID-3 1PM 50

OutlierOutlier

WhyID-2andID-3havethosehighvalues?

DataProvenance– ExampleinSQL

CombinerLineageRDD

ReducerLineageRDD

HadoopLineageRDD

counts

pairscodeserrorslines

Stage 1

Stage 2

reports StageLineageRDD

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Input ID

Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

[p1, p2] 400

[ p1 ] 4

Input ID Output ID

400 id1

4 id2

Step1:InstrumentedWorkflowinSpark

Worker1

Worker2

Worker3

Input ID Output ID

[p1, p2] 400

[ p1 ] 4

Input ID Output ID

400 id1

4 id2

Reducer Stage

Input ID Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Hadoop Combiner

Input ID Output ID

offset1 id1

… …

Input ID Output ID

{ id1, …} 400

Hadoop Combiner

Worker1

Worker2

Worker3

Input ID Output ID

[p1, p2] 400

[ p1 ] 4

Input ID Output ID

400 id1

4 id2

Reducer Stage

Input ID Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Hadoop Combiner

Input ID Output ID

offset1 id1

… …

Input ID Output ID

{ id1, …} 400

Hadoop Combiner

Input ID Output ID

p1 400

Input ID Output ID

p1 400

Stage.Input IDReducer.Output ID

Step2:ExampleBackwardTracing

Worker1

Worker2

Input ID Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Hadoop Combiner

Input ID Output ID

offset1 id1

… …

Input ID Output ID

{ id1, …} 400

Hadoop Combiner

Input ID Output ID

p1 400

Input ID Output ID

p1 400

Combiner.Output ID Reducer.Output ID

Combiner.Output ID Reducer.Output ID

Step2:ExampleBackwardTracing

Hadoop Combiner

Worker1

Worker2

Input ID Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Hadoop Combiner

Input ID Output ID

offset1 id1

… …

Input ID Output ID

{ id1, …} 400

Hadoop Combiner

Worker1

Worker2

Input ID Output ID

offset1 id1

offset2 id2

offset3 id3

Input ID Output ID

{ id1, id 3} 400

{ id2 } 4

Hadoop Combiner

Input ID Output ID

offset1 id1

… …

Input ID Output ID

{ id1, …} 400

Hadoop Combiner

Combiner.Input IDHadoop.Output ID

Combiner.Input IDHadoop.Output ID

Step2:ExampleBackwardTracing

AutomatedDebugginginDataIntensiveScalableComputing

MuhammadAliGulzar,MatteoInterlandi,Xueyuan Han,Mingda LiTysonCondie,MiryungKim[ACMSymposiumonCloudComputing,SoCC 2017]

BigDataDebugging

DataScientists

MotivatingExample

• AlicewritesaSparkprogramthatidentifies,foreachstateintheUS,thedeltabetweentheminimumandthemaximumsnowfallreadingforeachdayofanyyearandforanyparticularyear.

• Aninputdatarecordthatmeasures1footofsnowfallonJanuary1stofYear1992,inthe99504zipcode(Anchorage,AK)area,appearsas

99504 ,01/01/1992,1ft

ProblemDefinition

29

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

Givenatestfunction,thegoalistoidentifyaminimumsubsetoftheinputthatisabletoreproducethesametestfailure.

def test(key:String, delta: Float) : Boolean = {delta < 6000

}

• Usingatestfunction,ausercanspecifyincorrectresults

30

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

ExistingApproach1:DataProvenanceforSpark

Itover-approximatesthescopeoffailure-inducinginputsi.e.recordsinthefaultykey-groupareallmarkedasfaulty

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

31

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994,245mm99504,01/01/1993,85mm90031,02/01/1991,0mm

AK,01/01,304.8AK,1992 , 304.8AK,03/01 ,30.5AK,1992 ,30.5AK,01/01 ,21336AK,1993 , 21336AK,03/01 ,145AK,1993 ,145AK,01/01 ,245AK,1994 ,245

…. ….

AK ,01/01,[304.8,21336,245,85]AK ,03/01 ,[30.5,145]AK ,1992 ,[304.8,30.5]AK ,1993 ,[21336,145, 85]AK ,1994 ,[245]CA,02/01 ,[0]CA,1991 ,[0]

AK , 01/01 ,21251AK , 03/01 ,114.5AK , 1992 ,274.3AK , 1993 ,21251AK , 1994 ,0CA, 02/01 ,0CA, 1991 ,0

TextFile FlatMap GroupByKey Map Output

1

2

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

32

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01,304.8AK,1992 , 304.8AK,03/01 ,30.5AK,1992 ,30.5AK,01/01 ,21336AK,1993 , 21336

AK ,01/01,[304.8,21336]AK ,03/01 ,[30.5]AK ,1992 ,[304.8,30.5]AK ,1993 ,[21336]

AK, 01/01 ,21031AK , 03/01 , 0AK , 1992 ,274.3AK , 1993 , 0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

1

2

Run 2

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

33

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01,304.8AK,1992 , 304.8AK,03/01 ,30.5AK,1992 ,30.5

AK ,01/01,[304.8]AK ,03/01 ,[30.5]AK ,1992 ,[304.8,30.5]

AK , 01/01 ,0AK , 03/01 , 0AK , 1992 ,274.3

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 3

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

34

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01 ,21336AK,1993 , 21336

AK ,01/01,[21336]AK ,1993 ,[21336]

AK , 01/01 ,0AK , 1993 ,0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 4

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

35

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01,304.8AK,1992 , 304.8

AK ,01/01,[304.8]AK ,1992 ,[304.8]

AK , 01/01 ,0AK , 1992 ,0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 5

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

36

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,03/01 ,30.5AK,1992 ,30.5

AK ,03/01 ,[30.5]AK ,1992 ,[30.5]

AK , 03/01 , 0AK , 1992 ,0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 6

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

37

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01 ,21336AK,1993 , 21336

AK ,01/01,[21336]AK ,1993 ,[21336]

AK , 01/01 ,0AK , 1993 ,0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 7

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

38

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,03/01 ,30.5AK,1992 ,30.5AK,01/01 ,21336AK,1993 , 21336

AK ,01/01,[21336]AK ,03/01 ,[30.5]AK ,1992 ,[30.5]AK ,1993 ,[21336]

AK, 01/01 ,0AK , 03/01 , 0AK , 1992 ,0AK , 1993 , 0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 8

ExistingApproach2:DeltaDebugging

• DeltaDebuggingperformsasystematicbinarysearch-likeprocedureontheinputdatasetusingatestoraclefunction

39

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in

AK,01/01,304.8AK,1992 , 304.8AK,01/01 ,21336AK,1993 , 21336

AK ,01/01,[304.8,21336]AK ,1992 ,[304.8]AK ,1993 ,[21336]

AK, 01/01 ,21031AK , 1992 ,0AK , 1993 , 0

TextFile FlatMap GroupByKey Map Output

Itdoesnotpruneinputrecordsknowntobeirrelevantbecauseofthelackofsemanticunderstandingofdata-flowoperators

Run 9

AutomatedDebugginginDISCwithBigSift

40

Test PredicatePushdown

PrioritizingBackwardTraces

BitmapbasedTest

Memoization

Input:ASparkProgram,ATestFunction Output:MinimumFault-InducingInputRecords

DataProvenance+DeltaDebugging

41

Optimization1: TestPredicatePushdown

Ifapplicable,BigSift pushesdownthetestfunctiontotesttheoutputofcombinersinordertoisolatethefaultypartitions.

• Observation: Duringbackwardtracing,dataprovenancetracesthroughallthepartitionseventhoughonlyafewpartitionsarefaulty

Test

Test

Test

Test

Test

Test

Test

42

99504,01/01/1992,1ft99504,03/01/1992,0.1ft99504,01/01/1993, 70in99504,03/01/1993,145mm99504,01/01/1994 ,245mm99504,01/01/1993 ,85mm90031,02/01/1991 ,0mm

AK, 01/01 ,[304.8,21336,245,85]AK, 03/01 ,[30.5,145]AK, 1992 ,[304.8,30.5]AK, 1993 ,[21336,145, 85]AK, 1994 ,[245]CA, 02/01 ,[0]CA, 1991 ,[0]

TextFile FlatMap GroupByKey Map Output

AK ,01/01,304.8AK ,1992 , 304.8AK ,03/01 ,30.5AK ,1992 ,30.5AK ,01/01 ,21336AK ,1993 , 21336AK ,03/01 ,145AK ,1993 ,145AK ,01/01 ,245AK ,1994 ,245

…. ….

AK ,01/01,21251AK ,03/01,114.5AK ,1992 ,274.3AK ,1993 ,21251AK ,1994 ,0CA ,02/01,0CA ,1991 ,0

Optimization2:PrioritizingBackwardTraces

Incaseofmultiplefaultyoutputs,BigSift overlapstwobackwardtracestominimizethescopeoffault-inducinginputrecords

• Observation:Thesamefaultyinputrecordmaycontributetomultipleoutputrecordsfailingthetest.

43

Optimization3:BitmapBasedTestMemoization

Weuseabitmapbasedtestmemoization techniquetoavoidredundanttestingofthesameinputdataset.

• Observation:Deltadebuggingmaytryrunningaprogramonthesamesubsetofinputredundantly.

0

1

0

1

0

0

0

0

1

1

Input Data Bitmap

𝗫

Test Outcome

• BigSift leveragesbitmaptocompactlyencodetheoffsetsoforiginalinputtorefertoaninputsubset

RQ1:PerformanceImprovementoverDeltaDebugging

SubjectProgram RunningTime(sec) DebuggingTime(sec)

SubjectProgram Fault OriginalJob DD BigSift Improvement

Movie Histogram Code 56.2 232.8 17.3 13.5X

InvertedIndex Code 107.7 584.2 13.4 43.6X

RatingHistogram Code 40.3 263.4 16.6 15.9X

SequenceCount Code 356.0 13772.1 208.8 66.0X

Rating Frequency Code 77.5 437.9 14.9 29.5X

CollegeStudent Data 53.1 235.3 31.8 7.4X

WeatherAnalysis Data 238.5 999.1 89.9 11.1X

Transit Analysis Code 45.5 375.8 20.2 18.6X

BigSift providesuptoa66Xspeedupinisolatingtheprecisefault-inducinginputrecords,incomparisontothebaselineDD

RQ2:DebuggingTimevs.Originaljobtime

SubjectProgram RunningTime(sec) DebuggingTime(sec)

SubjectProgram Fault OriginalJob DD BigSift Improvement

Movie Histogram Code 56.2 232.8 17.3 13.5X

InvertedIndex Code 107.7 584.2 13.4 43.6X

RatingHistogram Code 40.3 263.4 16.6 15.9X

SequenceCount Code 356.0 13772.1 208.8 66.0X

Rating Frequency Code 77.5 437.9 14.9 29.5X

CollegeStudent Data 53.1 235.3 31.8 7.4X

WeatherAnalysis Data 238.5 999.1 89.9 11.1X

Transit Analysis Code 45.5 375.8 20.2 18.6X

Onaverage,BigSift takes62%lesstimetodebugasinglefaultyoutput thanthetimetakenforasinglerunontheentiredata.

RQ2:DebuggingTime

1

10

100

1000

10000

100000

1000000

10000000

1000000001E+09

0 2000 4000 6000 8000 10000 12000 14000

#offault-ind

ucinginpu

trecords

FaultLocalizationTime(s)

SequenceCount

DeltaDebugging BigSift

TestDrivenDataProvenance DataProvenance

Onaverage,BigSift takes62%lesstimetodebugasinglefaultyoutput thanthetimetakenforasinglerunontheentiredata.

RQ3:FaultLocalizabilityoverDataProvenance

143796

6487290

520904

234115800

15003060

2554788

350

2

1350

15 13

1 1 1 1 1 12

1

10

100

1000

10000

100000

1000000

10000000

100000000

MovieHistorgram

InvertedIndex

RatingHistogram

SequenceCount

RatingFrequency

CollegeStudents

WeatherAnalysis

#offault-ind

ucinginpu

trecords

DataProvenance TestDrivenDataProvenance BigSift

BigSift leveragesDDafterDPtocontinuefaultisolation,achievingseveralordersofmagnitude103 to107 betterprecision

Part2:DataScienceelevatingSoftwareEngineering

MiningAPIUsage

FindingDefects VisualizingCodeExamplesatScale

APIUsageMiningfromGitHub[ICSE2018]

Wecontrast SOsnippetswithAPIusagepatternsminedfrom380KGitHub projects.

49

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

Insight1:MiningaLargeCodeCorpus

Ourcodecorpusincludes380KGitHub projectswithatleast100revisionsand2contributors.

50

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

Dyer et al. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. ICSE 2013.

Insight2:RemovingIrrelevantStatementsviaProgramSlicing

Weperformbackwardandforwardslicingtoidentifydata- andcontrol-dependentstatementstoanAPImethodofinterest.

51

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

52

void initInterfaceProperties(String temp, File dDir) {if(!temp.equals("props.txt")) {

log.error("Wrong Template."); return;

} // load default properties FileInputStream in = new FileInputStream(temp); Properties prop = new Properties(); prop.load(in); ... init properties ...// write to the property file String fPath=dDir.getAbsolutePath()+"/interface.prop"; File file = new File(fPath); if(!file.exists()) {

file.createNewFile();} FileOutputStream out = new FileOutputStream(file); prop.store(out, null); in.close();

}

GitHub example of File.createNewFile

The focal API method

53

void initInterfaceProperties(String temp, File dDir) {if(!temp.equals("props.txt")) {

log.error("Wrong Template."); return;

}// load default properties FileInputStream in = new FileInputStream(temp); Properties prop = new Properties(); prop.load(in); ... init properties ...// write to the property file String fPath=dDir.getAbsolutePath()+"/interface.prop"; File file = new File(fPath);if(!file.exists()) {

file.createNewFile();}FileOutputStream out = new FileOutputStream(file);prop.store(out, null); in.close();

}

Data dependency up to one hop, i.e., direct dependency

The focal API method

data

control

54

void initInterfaceProperties(String temp, File dDir) {if(!temp.equals("props.txt")) {

log.error("Wrong Template."); return;

}// load default properties FileInputStream in = new FileInputStream(temp); Properties prop = new Properties(); prop.load(in); ... init properties ...// write to the property file String fPath=dDir.getAbsolutePath()+"/interface.prop"; File file = new File(fPath); if(!file.exists()) {

file.createNewFile();}FileOutputStream out = new FileOutputStream(file);prop.store(out, null); in.close();

}

Data dependency up to two hops

The focal API method

data

control

Insight3:CaptureSemanticsInfoinAPIUsage

Itisimportanttocapturethetemporalordering,enclosingcontrolstructures,andappropriateguardconditionsofAPIcalls.

55

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

Insight3:CaptureSemanticsInfoinAPIUsage

Itisimportanttocapturethetemporalordering,enclosingcontrolstructures,andappropriateguardconditionsofAPIcalls.

56

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

new File (String); try {; new FileInputStream(File)@arg0.exists(); } catch (IOException) {; }

Insight3:CaptureSemanticsInfoinAPIUsage

Itisimportanttocapturethetemporalordering,enclosingcontrolstructures,andappropriateguardconditionsofAPIcalls.

57

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

new File (String); try {; new FileInputStream(File)@arg0.exists(); } catch (IOException) {; }

Insight3:CaptureSemanticsInfoinAPIUsage

Itisimportanttocapturethetemporalordering,enclosingcontrolstructures,andappropriateguardconditionsofAPIcalls.

58

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

new File (String); try {; new FileInputStream(File)@arg0.exists(); } catch (IOException) {; }

Insight3:CaptureSemanticsInfoinAPIUsage

Itisimportanttocapturethetemporalordering,enclosingcontrolstructures,andappropriateguardconditionsofAPIcalls.

59

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

API usage patterns

380KJavaRepositoriesonGitHub

1 2

3

new File (String); try {; new FileInputStream(File)@arg0.exists(); } catch (IOException) {; }

Insight4:VariationsinGuardConditions

Guardconditionsarecanonicalized andgroupedbasedonlogicalequivalence.

60

Two equivalent guard conditions for String.substring:

arg0>=0 && arg0<=rcv.length() ⇔ arg0>-1 && arg0<rcv.length()+1

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

380KJavaRepositoriesonGitHub

1 2

3

Insight4:VariationsinGuardConditions

Guardconditionsarecanonicalized andgroupedbasedonlogicalequivalence.

61

Two equivalent guard conditions for String.substring:

arg0>=0 && arg0<=rcv.length() ⇔ arg0>-1 && arg0<rcv.length()+1

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

380KJavaRepositoriesonGitHub

1 2

3

Insight4:VariationsinGuardConditions

Guardconditionsarecanonicalized andgroupedbasedonlogicalequivalence.

62

Two equivalent guard conditions for String.substring:

arg0>=0 && arg0<=rcv.length() ⇔ arg0>-1 && arg0<rcv.length()+1

CodeSearch

ProgramSlicing

CallSequenceExtraction

Structured API call sequences

FrequentSequence Mining

SMT-basedGuardConditionMining

380KJavaRepositoriesonGitHub

1 2

3

APIMisuseDetectioninStackOverflow

Weexamine220KSOpostswith180confirmedpatterns.

=>31%ofSOpostscontainAPIusageviolations!

63

java

Stack Overflow snippets

SubsequenceCheckCallSequence

Extraction GuardConditionCheckStructured

API call sequences

180ValidPatterns

API usage violations

Dataset: http://web.cs.ucla.edu/~tianyi.zhang/examplecheck.html

query

pattern(s)

APIMisusesarePrevalentonStackOverflow

Highly-votedpostsarenotnecessarilymorereliableintermsofcorrectAPIusage.

Network,database,IO,crypto,stringmanipulationAPIsaremorelikelytobemisused.

64

ExampleCheck:AugmentingStackOverflowwithAPIUsagePatternsMinedfromGitHub

65Add Chrome Logo

https://chrome.google.com/webstore/detail/examplecheck/amliempebckaiaklimcpopomlnklkioe

https://chrome.google.com/webstore/detail/examplecheck/amliempebckaiaklimcpopomlnklkioe

Examplore:VisualizingCodeExamplesatScale[CHI2018]Examplore visualizeshundredsofcodeexamplesusingthesameAPI.

67

Examplore:VisualizingCodeExamplesatScale[CHI2018]

68http://examplore.cs.ucla.edu:3000

Thankstomycollaborators

UCLAonBigDataDebugging:MuhammadAliGulzar*,TysonCondie,MatteoInterlandi, Mingda Li,MichaelHan,SaiDeepTetali,ToddMillstein

MicrosoftResearchonDataScientistStudies:TomZimmermann,AndrewBegel,andRobDeLine

UCLA,UCBerkeley,IowaStateonAPIUsageMining:Tianyi Zhang*,ElenaGlassman,BjornHartmann,GaneshaUpadhyaya,AnastasiaReinhart,Hridesh Rajan

BigDataneeds awesomesoftwareengineering tools

✓ Debugging✓ Intelligentsampling

andtesting✓ Rootcauseanalysis

✓ Datacleaning ✓ Performanceanalytics

✓ Codeanalytics

Diagnose Fix Optimize