Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Re-EngineeringSoftwareEngineeringin aData-CentricWorld
MiryungKimUniversityofCalifornia,LosAngeles
1
Keynote atASE2019,The34thIEEE/ACMInternationalConferenceonAutomatedSoftwareEngineering
Confluence
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote2
Confluence:InterdisciplinaryThinking
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote
Inflection Point
Confluence:Impressionism
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote
Inflection Point
Confluence:DataAnalyticsandSE
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote
Inflection Point
AI
BigData
ML
TakeawayMessage:ACaseforSoftwareEngineeringforDataAnalytics(SE4DA)
Bugfindingisahugeproblemindataanalytics.
SE4DA isunderserved;somehowpeoplehavegravitatedtoapplyingdataanalyticstoSE.
SE4DA requiresre-thinkingsoftwareengineeringtechniques.
6
Thereisahugeopportunity fordataanalytics.
7
Dataanalyticsareinhighdemand,yet…
8
Bugsarehugeproblemsindataanalytics.
9
Dataanalyticsusedbythousandsofscientistsproducemisleading orwrong results[BBCNews]
Thewidespreadharmincludesfromawrongmedicaldiagnosistoincorrectinterpretationofstockhistory[Dataversity]
Predictablyinaccurate:Theprevalenceandperilsofbadbigdata.[Deloitte]
GrowthofDataAnalyticsPapersinSE
21 22 28 4738 50 40
39
0
50
100
2016 2017 2018 2019
DataAnalytics(AI,BigData,ML)GrowthinASEPapers
DataAnalytics Rest
10
SE4DA isunder-investigated.(SE4DA:13,DA4SE:105)
SE4DA
4%
DA4SE
37% Rest
59%
11
SE4DA(4%):ImprovingSEfordataanalytics
DA4SE(37%):ApplyingdataanalyticstoSE
Outline:MakingaCaseforSoftwareEngineeringforDataAnalytics(SE4DA)
①
②
③
Studies:Data
Scientists
Tools
Shifttodata-centricSWdevelopment
Debugging& testingforbigdataanalytics
DifferencesbetweentraditionalSWvs.data-centricSWdevprocess
④ OpenproblemsinSE4DA
Part1.DataScientistsinSoftwareTeams:StateoftheArtandChallenges
MiryungKim,ThomasZimmermann,RobDeLine,AndrewBegel
TheEmergingRolesofDataScientistsonSoftwareTeams
Weareatatippingpoint wheretherearelargescaletelemetry,machine,quality,anduserdata.
DatascientistsareemergingrolesinSWteams.
Tounderstandworkingstyles andchallenges,weconductedthefirstin-depthinterviewstudyandthelargestscalesurveyofprofessionaldatascientists.
① DataScientists
④Challenges
②Difference
③Tools
14
MethodologyforStudying“DataScientists”In-DepthInterviews[ICSE’16]:
• 5womenand11menfromeightdifferentMicrosoftorganizations
Survey[TSE2018]793responses• demographics/self-
perception• skillsandtoolusage• workingstyles• timespent• challengesandbest
practicesComputerScience
Physics
Math
BioInformatics
Statistics
Economics
Finance
CogSci
ML15
① DataScientists
④Challenges
②Difference
③Tools
TimeSpentonActivities
Hoursspentoncertainactivities(selfreported,survey,N=532)
16
① DataScientists
④Challenges
②Difference
③Tools
!"#$#
##
$
$
$!
!
!
"
""
"#
#$!
!
$
#$$
"
"
#!!"
Clustering
532datascientistsatMicrosoft
basedonrelativetimespentinactivities
17
Whatisa“DataScientist”?
9DistinctCategories
…
① DataScientists
④Challenges
②Difference
③Tools
Category1:DataShaper
18
Analyzingandpreparingdata
Post-graduatedegrees
Algorithms,machinelearning,andoptimizations
Lessfamiliarwithfront-endprogramming
① DataScientists
④Challenges
②Difference
③Tools
Category2:PlatformBuilder
19
Instrumentcodetocollectdata
Bigdataanddistributedsystems
Back-endandfront-endprogramming
SQL,C,C++andC#
① DataScientists
④Challenges
②Difference
③Tools
Category3:DataAnalyzer
20
Familiarwithstatistics
Notfamiliarwithfront-endprogramming
Difficultywithdatatransformation
RStudioorstatisticalanalysis
① DataScientists
④Challenges
②Difference
③Tools
Validation isamajorchallenge.
“Honestly,wedon’thaveagoodmethodforthis.”“Justbecausethemathisright,doesn’tmeanthattheanswerisright.”
Explainability isimportant— “togaininsights,youmustgooneleveldeeper.”
Commonchallenges:Datascientistsfinditdifficulttoensure“correctness”
21
① DataScientists
④Challenges
②Difference
③Tools
22
①DataScientists
④Challenges
②Difference
③Tools
①
②
③
Studies:Data
Scientists
Tools
Shifttodata-centricSWdevelopment
Debugging& testingforbigdataanalytics
DifferencesbetweentraditionalSWvs.data-centricSWdevprocess
④ OpenproblemsinSE4DA
Outline:MakingaCaseforSoftwareEngineeringforDataAnalytics(SE4DA)
[Interactions’12] [ICSE-SEIP’19]
[NIPS’15] [TSE’19] [ICSE’16] [TSE’18]
①DataScientists
④Challenges
②Difference
③Tools
Part2.HowisTraditionalDevelopmentDifferentfromBigDataAnalyticsDevelopment?
Traditionalvs.BigDataAnalyticsDevelopment
1 Develop
2 Run
3 Test
4 Debug
5 Repeat
1 Developlocally
2 TestlocallywithSampleData
3 Executethejobonthecloudhopingthatitwouldwork
4 Severalhourslater,thejobcrashesorproduceswrongoutput
5 Repeat
24
①DataScientists
④Challenges
②Difference
③Tools
Traditionalvs.BigDataAnalyticsDevelopment
1 Developlocally
2 TestwithSample
1.Dataishuge,remote,anddistributed.
25
①DataScientists
④Challenges
②Difference
③Tools
2 TestwithSample
2.Writingtest ishard.Don’tevenknowthefullinputanddon’tknowtheexpectedoutput.
Traditionalvs.BigDataAnalyticsDevelopment
26
3.Failuresarehardtodefine.
4Thejobcrashesorproduceswrongoutput
①DataScientists
④Challenges
②Difference
③Tools
3Executethejobonthecloud
4.Systemstackis complexwith littlevisibility.
ReduceFilter Map
Traditionalvs.BigDataAnalyticsDevelopment
27
①DataScientists
④Challenges
②Difference
③Tools
5.Gapbetweenlogicalvs. physicalexecution
Trips Zipcode
MapMap
Join:⨝
Map ReduceByKey
Filter
3Executethejobonthecloud
Traditionalvs.BigDataAnalyticsDevelopment
28
①DataScientists
④Challenges
②Difference
③Tools
4Thejobcrashesorproduceswrongoutput
5 Repeat
6.Data tracingis hard.
�
3Executethejobonthecloud
Traditionalvs.BigDataAnalyticsDevelopment
Task 31 failed 3 times; aborting jobERROR Executor: Exception in task 31 in stage 0 (TID 31)java.lang.NumberFormatException
29
①DataScientists
④Challenges
②Difference
③Tools
30
①DataScientists
④Challenges
②Difference
③Tools
①
②
③ Tools
Shifttodata-centricSWdevelopment
Debugging& testingforbigdataanalytics
DifferencesbetweentraditionalSWvs.data-centricSWdevprocess
④ OpenproblemsinSE4DA
Outline:MakingaCaseforSoftwareEngineeringforDataAnalytics(SE4DA)
Studies:Data
Scientists
Part3.DebuggingandTestingforBigDataAnalytics
TysonCondie,AriEkmekji,MuhammadAliGulzar,MiryungKim,MatteoInterlandi,ShaghayeghMardani,ToddMillstein,MadanlalMusuvathi,KshitijShah,SaiDeepTetali,SeunghyunYoo
Insights from DebuggingandTestingforApacheSpark
• Designinginteractivedebugprimitivesrequiresdeepunderstandingofinternal executionmodel,jobscheduling,andmaterialization.
• Providingtraceabilityrequiresmodifyingaruntime.
• Abstraction isapowerfulforceinsimplifyingprogrampaths.
32
①DataScientists
④Challenges
②Difference
③Tools
• Pausingtheentirecomputationontheclustercouldreducethroughput
• Itisclearlyinfeasibleforausertoinspectbillionofrecordsthrougharegularwatchpoint
①DataScientists
④Challenges
②Difference
③Tools
Enablinginteractivedebuggingrequiresustore-thinkatraditionaldebugger
33
Stage2Stage1
BigDebug:InteractiveDebugPrimitivesforBigDataAnalytics[ICSE2016]
Filter
Program(DAG) MapMap Reduce MapMap
Reduce
①SimulatedBreakpoint
age < 0
StoredData
Records
②OnDemandWatchpoint
③ RealtimeRepair
Map
�
④BackwardTracing
34
①DataScientists
④Challenges
②Difference
③Tools
Titian:DataProvenanceforApacheSpark[VLDB2016]
Stage2Stage1Filter MapMap Reduce MapMap
Program(DAG)
LineageTable
⨝Worker3
Worker2
Worker1
⨝�
Worker3
Worker2
Worker1
�
35
①DataScientists
④Challenges
②Difference
③Tools
TitianDataProvenance
⨝Worker3
Worker2
Worker1
⨝ �⨝Worker3
Worker2
Worker1
⨝ ��DeltaDebugging
��
�
�
�
�
�
�
�
BigSift:AutomatedDebuggingofBigDataAnalytics[SoCC2017]
Input:AProgram,ATestFunction Output:FaultyRecords
36
①DataScientists
④Challenges
②Difference
③Tools
TestPredicatePushdown
PrioritizingBackwardTraces
BitmapbasedMemoization
• BigDebugenablesinteractivedebuggingandrepair,whileretainingthescale-up property.Itposesatmost34% overhead [ICSE2016].
• Titian’sdataprovenanceisordersofmagnitudefasterthanalternatives[VLDB2016].
• BigSift automatically findsbugs66Xfasterthandeltadebugging.Ittakes62%lesstimetodebugthantheoriginaljob’srun[SoCC2017].
ResultsonDebuggingofBigDataAnalytics
37
①DataScientists
④Challenges
②Difference
③Tools
WhyisTestingBigDataAnalyticsChallenging?
Option1:SampleData
• randomsampling,
• topnsampling
• topk%sample,etc.
Limitations:• Lowcodecoverage
• Orincreasedlocaltestingtime
Option2:TraditionalTesting• 700KLOCforApacheSpark
38
①DataScientists
④Challenges
②Difference
③Tools
Limitations:• Symbolicexecutionwithoutabstractionwouldnotscale.
BigTest:White-BoxTestingofBigDataAnalytics[ESEC/FSE2019]
Relationalskeleton700KLOCSpark
Userdefinedfunc
JOIN:�tR,tL: cR � CR � cL � CL �cR(tR) � tR,key = tL,key �cL(tL)
PathConstraint Effect
T.split(",").length ≥ 1 �…� V2 = ”ERROR" …
"\x00", "Palms"
LogicalSpecifications
SymbolicExecution
Abstract
Extract
Stringoperations
Model
39
Z.split(“,”)[1]=“Palms” �Z.split(“,”).length >1 �T.split(“,”)[1] = Z.split(“,”)[0] �T.split(“,”).length >1 � …
StringConstraints
①DataScientists
④Challenges
②Difference
③Tools
6 5 14 11 430
6
4.00E+09
5.21E+05
4.48E+08 3.20E+08 2.40E+08 4.00E+07 1.11E+08
1E+00 1E+02 1E+04 1E+06 1E+08 1E+10
IncomeAggregate
MovieRatings AirportLayover
CommuteType PigMixL2 GradeAnalysis WordCount
TestDatasetSize
BigTest EntireDataset
# o
f R
ow
s
BigTestreducestestsby105Xto108X,achieving194Xtestingspeedup.
40
TestSizeReduction
①DataScientists
④Challenges
②Difference
③Tools
41
①DataScientists
④Challenges
②Difference
③Tools
①
②
③ Tools
Shifttodata-centricSWdevelopment
Debugging& testingforbigdataanalytics
DifferencesbetweentraditionalSWvs.data-centricSWdevprocess
④ OpenproblemsinSE4DA
Outline:MakingaCaseforSoftwareEngineeringforDataAnalytics(SE4DA)
Studies:Data
Scientists
2004 2014 20192008 20252022
42
DA4SE SE4DA
Part4.RoadmapforAcceleratingData-CentricDevelopment
Insight1:Debuggingdataanalyticsrequiresbothdataandcodeanalysis.
Howtodefineabugbasedonthepropertiesofbothdataand code?
Howtorepair bothcode anddataerrors?
DataX-Ray[SIGMOD’15]
DataWrangling[CHI’11]
ProgramRepair[ICSE’09][ICSE’13],etc.
DataCleaning[VLDB’01][VLDB’15][SIGMOD‘15][SIGMOD’10]
DataRepair[VLDB’11][SIGMOD‘14]
BugPatterns[SIGPLAN2004],etc.
43
①DataScientists
④Challenges
②Difference
③Tools
39.5%
28.5%
6.5%
13.0%
12.5%
PerformanceComprehensionInstallation and Environment SettingAPI UsageCorrectness
Insight2:Performancedebuggingisapainpoint.
Manual inspection of top 200 Spark related posts
from Stack Overflow
44
①DataScientists
④Challenges
②Difference
③Tools
7.6%
16.5%
16.5%
21.5%
25.3%
5.1%
7.6%Comprehension-related issueConfiguration TuningPerformance ScalingInefficient operatorUnbalanced task IO-related issueMemory-related issue
Insight2:Performancedebuggingrequiresvisibilityofsystemstack,code,anddata.
Storage JVM
CPU GPU FPGA
Runtime
DevEnvironment
Containers
ML/AILib
45
Howtoestimateperformancebasedondatasize?
Howtooptimizequeryperformanceusingacostmodel?
Howtodebugcomputationanddataskews?
Howtoidentifythecauseofbottlenecks?
Ernest[NSDI’16]
Neo [VLDB’16]
PerfDebug[SoCC’19]
Skewtune[SIGMOD’12]
CausalProfiling[SOSP’15]
CausalMonitoring[SOSP’15]
① DataScientists
④Challenges
②Difference
③Tools
Insight3:Wemustrelaxthestrictnotionofanincorrectbehaviorandtherootcause.
Howtospecifyoraclesfordata-centricsoftware?Metamorphicrelationsaresimpleorhardtodefine
Howtoquantifyimportancewhendebuggingfaultyinputsfordataanalytics?
DeepTest[ICSE2018]
DeepConcolic[ASE2018]
DeepHunter[ISSTA2019]
MetamorphicTesting[1998]
Lamp[ESEC/FSE2017]
46
MODE[ESEC/FSE’18]
InfluenceFunction[ICML’17]
TrainingSetDebugging[AAAI’18]
LIME[KDD’16]
① DataScientists
④Challenges
②Difference
③Tools
Conclusion:HopeforSoftwareEngineeringforDataAnalytics(SE4DA)
Weareataninflectionpoint.SE4DAisunderserved.
ProgresshasbeenmadeinSE4DA byre-thinkingsoftwareengineeringforbigdataanalytics.
WecantogetherworkonopenproblemsinSE4DA.
47
SE4DA: AI,BigData,andMLneedawesomeSEtools
� Debugging� Intelligentsampling
andtesting� Rootcauseanalysis
� Datacleaning � Performanceanalytics
� Codeanalytics
Diagnose Fix Optimize