Re-Engineering Software Engineering in a Data-Centric World. Miryung Kim, University of California, Los Angeles. Keynote at ASE 2019, The 34th IEEE/ACM International Conference on Automated Software Engineering.

Re-Engineering Software Engineering in a Data-Centric World · web.cs.ucla.edu/~miryung/ASE2019-Keynote-MiryungKim-SE4DA.pdf · Shift to data-centric SW development · Debugging & testing


Page 2

Confluence

Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote

Page 3

Confluence: Interdisciplinary Thinking

Inflection Point

Page 4

Confluence: Impressionism

Inflection Point

Page 5

Confluence: Data Analytics and SE

Inflection Point: AI, Big Data, ML

Page 6

Takeaway Message: A Case for Software Engineering for Data Analytics (SE4DA)

Bug finding is a huge problem in data analytics.

SE4DA is underserved; somehow people have gravitated to applying data analytics to SE.

SE4DA requires re-thinking software engineering techniques.

Page 7

There is a huge opportunity for data analytics.

Page 8

Data analytics are in high demand, yet…

Page 9

Bugs are huge problems in data analytics.

Data analytics used by thousands of scientists produce misleading or wrong results [BBC News]

The widespread harm ranges from a wrong medical diagnosis to an incorrect interpretation of stock history [Dataversity]

Predictably inaccurate: The prevalence and perils of bad big data. [Deloitte]

Page 10

Growth of Data Analytics Papers in SE

Data Analytics (AI, Big Data, ML) Growth in ASE Papers

Year            2016  2017  2018  2019
Data Analytics    21    22    28    47
Rest              38    50    40    39

Page 11

SE4DA is under-investigated. (SE4DA: 13, DA4SE: 105)

Share of papers: SE4DA 4%, DA4SE 37%, Rest 59%

SE4DA (4%): Improving SE for data analytics

DA4SE (37%): Applying data analytics to SE
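The shares on this slide can be re-derived from the paper counts. The sketch below assumes the flattened chart values on Page 10 group as Data Analytics = 21, 22, 28, 47 and Rest = 38, 50, 40, 39 for 2016 to 2019. That grouping is an assumption about the chart, although it is consistent with 13 + 105 = 118 data analytics papers.

```python
# Paper counts per year (2016-2019). The split into two series is an
# assumption about how the flattened chart values group.
data_analytics = [21, 22, 28, 47]
rest = [38, 50, 40, 39]

se4da, da4se = 13, 105  # counts stated on the slide

total = sum(data_analytics) + sum(rest)


def share(count, total):
    """Return the percentage share of count in total."""
    return 100.0 * count / total


# The two categories should account for all data analytics papers.
assert se4da + da4se == sum(data_analytics)

print(f"SE4DA: {share(se4da, total):.1f}%")
print(f"DA4SE: {share(da4se, total):.1f}%")
print(f"Rest:  {share(sum(rest), total):.1f}%")
```

Under this reading the shares come out near 4.6%, 36.8%, and 58.6%, close to the 4%, 37%, and 59% shown on the slide.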

Page 12

Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)

Studies: Data Scientists

Tools

Shift to data-centric SW development

Debugging & testing for big data analytics

Differences between traditional SW vs. data-centric SW dev process

④ Open problems in SE4DA

Page 13

Part 1. Data Scientists in Software Teams: State of the Art and Challenges

Miryung Kim, Thomas Zimmermann, Rob DeLine, Andrew Begel

Page 14

The Emerging Roles of Data Scientists on Software Teams

We are at a tipping point where there are large scale telemetry, machine, quality, and user data.

Data scientists are emerging roles in SW teams.

To understand working styles and challenges, we conducted the first in-depth interview study and the largest scale survey of professional data scientists.

① Data Scientists · ② Difference · ③ Tools · ④ Challenges

Page 15

Methodology for Studying “Data Scientists”

In-Depth Interviews [ICSE ’16]:

• 5 women and 11 men from eight different Microsoft organizations

Survey [TSE 2018]: 793 responses

• demographics / self-perception
• skills and tool usage
• working styles
• time spent
• challenges and best practices

Computer Science, Physics, Math, BioInformatics, Statistics, Economics, Finance, CogSci, ML

Page 16

Time Spent on Activities

Hours spent on certain activities (self reported, survey, N=532)

Page 17

What is a “Data Scientist”?

Clustering 532 data scientists at Microsoft based on relative time spent in activities

9 Distinct Categories

Page 18

Category 1: Data Shaper

• Analyzing and preparing data
• Post-graduate degrees
• Algorithms, machine learning, and optimizations
• Less familiar with front-end programming

Page 19

Category 2: Platform Builder

• Instrument code to collect data
• Big data and distributed systems
• Back-end and front-end programming
• SQL, C, C++ and C#

Page 20

Category 3: Data Analyzer

• Familiar with statistics
• Not familiar with front-end programming
• Difficulty with data transformation
• RStudio or statistical analysis

Page 21

Common challenges: Data scientists find it difficult to ensure “correctness”

Validation is a major challenge.

“Honestly, we don’t have a good method for this.”

“Just because the math is right, doesn’t mean that the answer is right.”

Explainability is important: “to gain insights, you must go one level deeper.”

Page 22

Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)

Page 23

Part 2. How is Traditional Development Different from Big Data Analytics Development?

[Interactions ’12] [ICSE-SEIP ’19] [NIPS ’15] [TSE ’19] [ICSE ’16] [TSE ’18]

Page 24

Traditional vs. Big Data Analytics Development

Traditional:

1 Develop
2 Run
3 Test
4 Debug
5 Repeat

Big data analytics:

1 Develop locally
2 Test locally with Sample Data
3 Execute the job on the cloud hoping that it would work
4 Several hours later, the job crashes or produces wrong output
5 Repeat

Page 25

Traditional vs. Big Data Analytics Development

1 Develop locally
2 Test with Sample

1. Data is huge, remote, and distributed.

Page 26

Traditional vs. Big Data Analytics Development

2 Test with Sample

2. Writing tests is hard. Don’t even know the full input and don’t know the expected output.

4 The job crashes or produces wrong output

3. Failures are hard to define.

Page 27

Traditional vs. Big Data Analytics Development

3 Execute the job on the cloud

4. System stack is complex with little visibility.

Operators shown: Reduce, Filter, Map

Page 28

Traditional vs. Big Data Analytics Development

3 Execute the job on the cloud

5. Gap between logical vs. physical execution

Figure labels: Trips, Zipcode, Map, Map, Join (⨝), Map, ReduceByKey, Filter

Page 29

Traditional vs. Big Data Analytics Development

3 Execute the job on the cloud
4 The job crashes or produces wrong output
5 Repeat

6. Data tracing is hard.

Task 31 failed 3 times; aborting job
ERROR Executor: Exception in task 31 in stage 0 (TID 31)
java.lang.NumberFormatException
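The error above names the failing task but not the record that caused it. A minimal sketch of the underlying problem, in plain Python rather than Spark: the function name `parse_records`, the `zipcode,distance` layout, and the sample lines are all hypothetical. Python's `ValueError` from `int()` stands in for Java's `NumberFormatException`.

```python
def parse_records(lines):
    """Parse 'zipcode,distance' lines and return (parsed, rejected).

    rejected keeps the line index and raw text so a bad record can be
    traced, instead of surfacing only as an exception in some task.
    """
    parsed, rejected = [], []
    for index, line in enumerate(lines):
        fields = line.split(",")
        try:
            parsed.append((fields[0], int(fields[1])))
        except (ValueError, IndexError):
            # Malformed number or missing field: remember where it came from.
            rejected.append((index, line))
    return parsed, rejected


# The third line contains the letter O instead of a zero; the fourth
# line is missing its second field.
lines = ["90095,12", "90024,7", "90034,1O", "90210"]
parsed, rejected = parse_records(lines)
print(rejected)
```

Keeping the index next to the raw text is the simplest form of data tracing: it ties a failure back to an input position without rerunning the job.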

Page 30

Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)

Page 31

Part 3. Debugging and Testing for Big Data Analytics

Tyson Condie, Ari Ekmekji, Muhammad Ali Gulzar, Miryung Kim, Matteo Interlandi, Shaghayegh Mardani, Todd Millstein, Madanlal Musuvathi, Kshitij Shah, Sai Deep Tetali, Seunghyun Yoo

Page 32

Insights from Debugging and Testing for Apache Spark

• Designing interactive debug primitives requires deep understanding of internal execution model, job scheduling, and materialization.

• Providing traceability requires modifying a runtime.

• Abstraction is a powerful force in simplifying program paths.

Page 33

Enabling interactive debugging requires us to re-think a traditional debugger

• Pausing the entire computation on the cluster could reduce throughput

• It is clearly infeasible for a user to inspect billions of records through a regular watchpoint

Page 34

BigDebug: Interactive Debug Primitives for Big Data Analytics [ICSE 2016]

Figure: a program (DAG) of Filter, Map, and Reduce operators split into Stage 1 and Stage 2, annotated with four debug primitives:

① Simulated Breakpoint
② On Demand Watchpoint
③ Realtime Repair
④ Backward Tracing

Figure labels: age < 0, Stored Data, Records
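The idea of a watchpoint with a guard such as `age < 0` can be sketched in plain Python. This is not BigDebug's API: the `watch` function, its parameters, and the sample records are hypothetical. The sketch only shows capturing the matching records while the computation keeps running.

```python
def watch(records, guard, captured, limit=10):
    """Yield every record unchanged; copy up to `limit` guard matches."""
    for record in records:
        if guard(record) and len(captured) < limit:
            captured.append(record)
        yield record


people = [("ann", 34), ("bob", -1), ("cy", 52), ("dee", -7)]
captured = []

# Pipeline: watch the input, keep ages of at least 18, then sum them.
stream = watch(people, lambda r: r[1] < 0, captured)
total_age = sum(age for _, age in stream if age >= 18)

print(total_age, captured)
```

Because `watch` is a pass-through generator, the downstream sum is unaffected, and the user inspects only the few records that satisfy the guard instead of the whole dataset.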

Page 35

Titian: Data Provenance for Apache Spark [VLDB 2016]

Figure: a program (DAG) of Filter, Map, and Reduce operators across Stage 1 and Stage 2, with a lineage table on each of Worker 1, Worker 2, and Worker 3, linked by joins (⨝).
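Data provenance can be illustrated with a toy lineage table that maps each output key back to the inputs that produced it. The sketch below is plain Python on a single machine, not Titian's implementation, and the function names `run_with_lineage` and `trace_back` are hypothetical.

```python
from collections import defaultdict


def run_with_lineage(lines):
    """Word count that records, per output key, the contributing input ids."""
    counts = defaultdict(int)
    lineage = defaultdict(set)  # output key -> set of input line ids
    for line_id, line in enumerate(lines):
        for word in line.split():      # map stage
            counts[word] += 1          # reduce-by-key stage
            lineage[word].add(line_id)
    return dict(counts), dict(lineage)


def trace_back(lineage, key, lines):
    """Return the input lines that contributed to output `key`."""
    return [lines[i] for i in sorted(lineage.get(key, ()))]


lines = ["spark job failed", "job ok", "spark ok"]
counts, lineage = run_with_lineage(lines)
print(trace_back(lineage, "spark", lines))
```

A distributed system has to keep such tables per worker and per stage and combine them when tracing, which is why the slide shows lineage tables joined across workers.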

Page 36

BigSift: Automated Debugging of Big Data Analytics [SoCC 2017]

Input: A Program, A Test Function
Output: Faulty Records

Figure: Titian Data Provenance (lineage joins ⨝ across Worker 1, Worker 2, and Worker 3) combined with Delta Debugging.

• Test Predicate Pushdown
• Prioritizing Backward Traces
• Bitmap based Memoization
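The delta debugging half of this combination can be sketched on its own. The code below is a simplified version of the classic ddmin algorithm applied to input records; it does not include provenance or any of the three optimizations listed above, and the test function and sample data are hypothetical.

```python
def ddmin(records, fails):
    """Shrink `records` to a smaller list for which fails(...) is still True."""
    assert fails(records)
    n = 2  # current number of chunks
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if fails(subset):
                # The failure is reproducible on this chunk alone.
                records, n, reduced = subset, 2, True
                break
            if len(subsets) > 2 and fails(complement):
                # The failure survives removing this chunk.
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity
            n = min(len(records), n * 2)
    return records


ages = [34, 21, -5, 40, 18, 63, 29, 51]
faulty = ddmin(ages, lambda rs: any(r < 0 for r in rs))
print(faulty)
```

Each call to `fails` corresponds to rerunning the job on a subset of the input, which is why reducing the number of reruns matters at scale.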

Page 37

Results on Debugging of Big Data Analytics

• BigDebug enables interactive debugging and repair, while retaining the scale-up property. It imposes at most 34% overhead [ICSE 2016].

• Titian’s data provenance is orders of magnitude faster than alternatives [VLDB 2016].

• BigSift automatically finds bugs 66X faster than delta debugging. It takes 62% less time to debug than the original job’s run [SoCC 2017].

Page 38

Why is Testing Big Data Analytics Challenging?

Option 1: Sample Data

• random sampling,
• top n sampling
• top k% sample, etc.

Limitations:
• Low code coverage
• Or increased local testing time

Option 2: Traditional Testing

• 700 KLOC for Apache Spark

Limitations:
• Symbolic execution without abstraction would not scale.
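The low-coverage limitation of sampling can be shown with a toy example. The function `classify`, its thresholds, and the dataset are hypothetical; the point is only that a top n or top k% sample can exercise one branch while the rare records that trigger the other branches sit elsewhere in the data.

```python
def classify(trip_distance):
    """Toy user-defined function with three branches (hypothetical)."""
    if trip_distance < 0:
        return "invalid"
    if trip_distance > 100:
        return "long"
    return "normal"


def branches_covered(sample):
    """Return the set of classify() outcomes exercised by the sample."""
    return {classify(d) for d in sample}


# 10,000 ordinary records followed by two rare ones at the end.
dataset = [5] * 10_000 + [250, -3]

top_n = dataset[:100]                           # "top n" sample
top_k_percent = dataset[: len(dataset) // 100]  # top 1% sample

print(branches_covered(top_n))
print(branches_covered(dataset))
```

Random sampling has the same weakness in expectation when the interesting records are rare, and enlarging the sample to compensate increases local testing time.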

Page 39

BigTest: White-Box Testing of Big Data Analytics [ESEC/FSE 2019]

Figure labels: 700 KLOC Spark, relational skeleton, user defined func, string operations; Abstract, Extract, Model; logical specifications, symbolic execution, string constraints.

JOIN: ∃ tR, tL : cR ∈ CR ∧ cL ∈ CL ∧ cR(tR) ∧ tR.key = tL.key ∧ cL(tL)

Path Constraint / Effect: T.split(",").length ≥ 1 ∧ … ∧ V2 = "ERROR" …

String Constraints: Z.split(",")[1] = "Palms" ∧ Z.split(",").length > 1 ∧ T.split(",")[1] = Z.split(",")[0] ∧ T.split(",").length > 1 ∧ …

"\x00", "Palms"
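The four legible string constraints on this slide can be checked against candidate input rows. The sketch below only evaluates the constraints on rows supplied by hand; it does not solve them, which is the job of a constraint solver. The function name `satisfies` and the witness values are hypothetical.

```python
def satisfies(t_row, z_row):
    """Check the four string constraints that are legible on the slide."""
    t, z = t_row.split(","), z_row.split(",")
    return (
        len(z) > 1            # Z.split(",").length > 1
        and z[1] == "Palms"   # Z.split(",")[1] = "Palms"
        and len(t) > 1        # T.split(",").length > 1
        and t[1] == z[0]      # T.split(",")[1] = Z.split(",")[0]
    )


# Hand-built rows: one pair that meets every constraint, one that breaks
# the join-key equality.
witness = ("trip1,90034", "90034,Palms")
counterexample = ("trip1,90034", "90035,Palms")

print(satisfies(*witness), satisfies(*counterexample))
```

A pair of rows that satisfies one path's constraints forms a tiny test input for that path, which suggests how a handful of generated rows can stand in for a very large dataset.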

Page 40

Test Size Reduction

Figure: Test dataset size (# of rows, log scale from 1E+00 to 1E+10), BigTest vs. entire dataset, for IncomeAggregate, MovieRatings, AirportLayover, CommuteType, PigMixL2, GradeAnalysis, and WordCount. Entire datasets range from 5.21E+05 to 4.00E+09 rows.

BigTest reduces tests by 10^5X to 10^8X, achieving a 194X testing speedup.

Page 41

Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)

Page 42

Part 4. Roadmap for Accelerating Data-Centric Development

Timeline (DA4SE, SE4DA): 2004, 2008, 2014, 2019, 2022, 2025

Page 43

Insight 1: Debugging data analytics requires both data and code analysis.

How to define a bug based on the properties of both data and code?

How to repair both code and data errors?

• Data X-Ray [SIGMOD ’15]
• Data Wrangling [CHI ’11]
• Program Repair [ICSE ’09] [ICSE ’13], etc.
• Data Cleaning [VLDB ’01] [VLDB ’15] [SIGMOD ’15] [SIGMOD ’10]
• Data Repair [VLDB ’11] [SIGMOD ’14]
• Bug Patterns [SIGPLAN 2004], etc.

Page 44

Insight 2: Performance debugging is a pain point.

Manual inspection of top 200 Spark related posts from Stack Overflow

Chart 1 categories: Performance, Comprehension, Installation and Environment Setting, API Usage, Correctness
Chart 1 shares: 39.5%, 28.5%, 6.5%, 13.0%, 12.5%

Chart 2 categories: Comprehension-related issue, Configuration Tuning, Performance Scaling, Inefficient operator, Unbalanced task, IO-related issue, Memory-related issue
Chart 2 shares: 7.6%, 16.5%, 16.5%, 21.5%, 25.3%, 5.1%, 7.6%

Page 45

Insight 2: Performance debugging requires visibility of system stack, code, and data.

Stack layers shown: Storage, JVM, CPU, GPU, FPGA, Runtime, Dev Environment, Containers, ML/AI Lib

How to estimate performance based on data size?

How to optimize query performance using a cost model?

How to debug computation and data skews?

How to identify the cause of bottlenecks?

• Ernest [NSDI ’16]
• Neo [VLDB ’16]
• PerfDebug [SoCC ’19]
• Skewtune [SIGMOD ’12]
• Causal Profiling [SOSP ’15]
• Causal Monitoring [SOSP ’15]
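A first step toward the data skew question is simply measuring how unevenly records land across partitions. The sketch below is an illustration only, assuming integer keys and modulo partitioning; the function names and the sample keys are hypothetical and are not taken from any of the cited tools.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per partition under modulo partitioning of int keys."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition size divided by the mean partition size."""
    return max(sizes) / (sum(sizes) / len(sizes))


# One hot key (7) appearing 90 times, plus ten distinct keys.
keys = [7] * 90 + list(range(10))
sizes = partition_sizes(keys, 4)

print(sizes, skew_ratio(sizes))
```

A ratio near 1 means balanced partitions; here one partition holds most of the records, so the task that processes it would dominate the stage's running time.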

Page 46

Insight 3: We must relax the strict notion of an incorrect behavior and the root cause.

How to specify oracles for data-centric software? Metamorphic relations are simple or hard to define

How to quantify importance when debugging faulty inputs for data analytics?

• DeepTest [ICSE 2018]
• DeepConcolic [ASE 2018]
• DeepHunter [ISSTA 2019]
• Metamorphic Testing [1998]
• Lamp [ESEC/FSE 2017]
• MODE [ESEC/FSE ’18]
• Influence Function [ICML ’17]
• Training Set Debugging [AAAI ’18]
• LIME [KDD ’16]
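When the expected output is unknown, a metamorphic relation can serve as a partial oracle: it states how the output must change when the input is transformed. A minimal sketch with two simple relations for a word count, where the program under test and both relation names are hypothetical:

```python
from collections import Counter


def word_count(lines):
    """Program under test: count words across lines."""
    return Counter(word for line in lines for word in line.split())


def check_permutation_relation(lines):
    """Reordering the input must not change the output."""
    return word_count(lines) == word_count(list(reversed(lines)))


def check_duplication_relation(lines):
    """Feeding the input twice must exactly double every count."""
    doubled = word_count(lines + lines)
    return all(doubled[w] == 2 * c for w, c in word_count(lines).items())


lines = ["a b a", "b c"]
print(check_permutation_relation(lines), check_duplication_relation(lines))
```

Relations this simple are easy to state but catch few bugs, while relations strong enough to catch subtle errors are harder to formulate, which is the tension the slide points at.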

Page 47

Conclusion: Hope for Software Engineering for Data Analytics (SE4DA)

We are at an inflection point. SE4DA is underserved.

Progress has been made in SE4DA by re-thinking software engineering for big data analytics.

We can together work on open problems in SE4DA.

Page 48

SE4DA: AI, Big Data, and ML need awesome SE tools

• Debugging
• Intelligent sampling and testing
• Root cause analysis
• Data cleaning
• Performance analytics
• Code analytics

Diagnose, Fix, Optimize