Applying Data Science to Your Business Problem

Paul Dulany - VP Data Science - CA Technologies

CA World '16

SCX31S

SECURITY

© 2016 CA. All rights reserved. @CAWORLD #CAWORLD

For Informational Purposes Only

Terms of this Presentation

© 2016 CA. All rights reserved. All trademarks referenced herein belong to their respective companies.

The content provided in this CA World 2016 presentation is intended for informational purposes only and does not form any type of warranty. The information provided by a CA partner and/or CA customer has not been reviewed for accuracy by CA.

Abstract

For a while now, a number of industries have been interested in data science and advanced analytics. But it isn’t always clear how best to use these within the business context. In this session, we’ll discuss how to turn a business problem into a data-science problem, and then back. We’ll use card-not-present payment fraud and login attempts as examples of how to identify the problem, determine if data science and advanced analytics can help (and if the situation warrants them), and then follow through on developing a solution to the problem.

Paul Dulany, PhD
CA Technologies
VP Data Science

Agenda

1 WHAT IS DATA SCIENCE?

2 DETERMINING A PROBLEM OF INTEREST

3 UNDERSTAND THE PRODUCTION ENVIRONMENT AND DEMANDS

4 MODEL CREATION AND EVALUATION

5 Q&A

Key Points for Applying Data Science

§ Identify a high-value Business Problem with High Quality Data

§ Determine the class of the problem to solve

§ Utilize business-domain knowledge
– Understand the "ecosystem"
– Define appropriate metrics
– Understand the data in full

§ Develop features and models / Evaluate / Iterate

§ Always keep the business problem in mind!

What is Data Science?

§ The application of analytical techniques to large and “big” data
– A wide field encompassing many different aspects of analytics, statistics, and data mining
– Fundamentally data driven
– Based upon the scientific method
– The goal is to use data and analytical techniques to solve problems

§ Requires knowledge in multiple domains
– Analytics
– Scientific computations
– Data formats
– Business domain
– Statistics

Steps in Applying Data Science: Business Problem

§ Identify a high-value business problem
– The business case is critical

§ Intelligent Mainframe Operations
– Need early detection of issues
  § Best is to predict and avoid issues
– Currently, false positives (false alarms) are too prevalent
– Expert-maintained systems of thresholds are hard to maintain

Steps in Applying Data Science: Business Problem

§ Payment Security:
– Fraud in eCommerce is a significant problem
  § 3-D Secure was developed to combat this
– Issuers incur the most pain from the current state
  § Fraud losses
  § Loss of income from interest and interchange fees
  § Customer experience and annoyance
  § Cost of inbound calls
– Merchants feel pain too…

Business Impact for a Client: 3 Year Impact

                                        Credit         Debit
Fraud savings (above current vendor)    £13,441,843    £15,174,365
Interchange fees (above current)        £466,462       £3,441,882
Interest income (above current)         £3,674,803     N/A
Operational savings (not calculated)    -              -
Total                                   £17,583,108    £18,616,247

Steps in Applying Data Science: Identify Data (Results Proportional to Quality!)

§ Identify data related to the business problem
– The more data, the better!
– Is it categorical, ordinal, or numerical? What cardinality?
– Is there a unique advantage over the competition?

§ Payment Security
– 3DS data: PAReq message, device information, …
– Wide mix of types of data
– Time series is important
– SaaS Deployment allows quality data to be gathered

Steps in Applying Data Science: Identify Data (Results Proportional to Quality!)

§ Intelligent Mainframe Operations
– Multiple possible data sources
  § VSAM, DB2, IMS DB, IDMS, DATACOM, SMF, Syslogs, Vtape, CICS, …
– Utilize CA SYSVIEW’s excellence
– Embed analytics to detect abnormal patterns

Steps in Applying Data Science

§ Determine the general class of problem
– Classification, regression, anomaly detection, etc.
– “Supervised” (teaching your children to read, teaching them manners)
– “Unsupervised” (university)
– “Semi-supervised” (school lunchroom)

Steps in Applying Data Science: Determine the general class of problem

§ Payment Security
– Supervised classification
– Fraud information for losses is likely in good shape
– Complexities happen once you have a system in place
  § Censored problem, both in marking and in changing fraudster behavior

§ Intelligent Mainframe Operations
– Unsupervised to begin
– Need to develop baselines of normal behavior
  § But must provide results from day 0
– Possibility of semi-supervised in the future

Steps in Applying Data Science

§ Understand the ecosystem
– What actions can be taken?
– Is getting the result time sensitive?

§ Intelligent Mainframe Operations
– Predictive analytics needed
– Multiple possible actions, key is to inform the operator’s actions
– Different time-scales for problems
– Reaction time is critical: real-time

Steps in Applying Data Science: Understand the Ecosystem

§ Payment Security
– Predictive and prescriptive analytics needed
– Multiple possible actions, at the transaction and the card level
– Timing is critical: real-time, i.e., <50 ms for vast majority
  § We must be able to take action now

Steps in Applying Data Science

§ Determine the appropriate metrics
– How do we measure success?
  § Well defined measures are critical

§ Intelligent Mainframe Operations
– High Availability
– Problem avoidance
– Reduced MTTR
– Reduce SME dependence for issue detection

Steps in Applying Data Science: Metrics

§ Payment Security
– A number of possibilities, based upon customer’s objectives
– Consider them all
  § Detection rates
  § “Outsort” rates
  § False-positive ratios
  § False-positive rates
  § Value-based / transaction based / card based

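$$\mathrm{TOR}(S) = \frac{\sum_{s_i \ge S} \left[ F(s_i) + N(s_i) \right]}{\sum_{\text{all } s_i} \left[ F(s_i) + N(s_i) \right]}$$

$$\mathrm{TDR}(S) = \frac{\sum_{s_i \ge S} F(s_i)}{\sum_{\text{all } s_i} F(s_i)}$$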
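As a rough illustration of the two ratios, the sketch below computes TOR and TDR at a score threshold S from a list of scored transactions. It assumes that F(s_i) and N(s_i) are counts of fraud and non-fraud transactions at score s_i, and that a transaction is outsorted when its score is at or above the threshold. The function name and the sample data are hypothetical, not taken from the presentation.

```python
def outsort_and_detection_rates(scores, is_fraud, threshold):
    """Return (TOR, TDR) at the given score threshold.

    TOR: share of all transactions scoring at or above the threshold.
    TDR: share of all fraud transactions scoring at or above the threshold.
    Transaction-count based; a value-based variant would sum amounts instead.
    """
    if len(scores) != len(is_fraud) or not scores:
        raise ValueError("scores and is_fraud must be non-empty and equal length")
    total_fraud = sum(1 for f in is_fraud if f)
    if total_fraud == 0:
        raise ValueError("TDR is undefined without any fraud cases")

    # Fraud flags of the transactions that would be outsorted at this threshold
    outsorted = [bool(f) for s, f in zip(scores, is_fraud) if s >= threshold]
    tor = len(outsorted) / len(scores)
    tdr = sum(outsorted) / total_fraud
    return tor, tdr


# Hypothetical scored transactions: higher score means more suspicious
scores = [0.9, 0.8, 0.4, 0.2, 0.1]
is_fraud = [True, False, True, False, False]
tor, tdr = outsort_and_detection_rates(scores, is_fraud, threshold=0.5)
```

Sweeping the threshold and plotting TDR against TOR gives one view of the trade-off between fraud caught and transactions interrupted.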

What’s Next?

§ Now we have a well defined problem

§ So we spend a lot of time with the data!
– “Browse” the data
– Run some descriptive statistics
– See if you can surprise yourself
– If supervised, view the tagging data and the production data separately, and then together.

Review the Tagged Data

§ Browse the data again

§ Run many statistical tests
– Beware of “Target Leaks”!!
– Begin getting a feel for the variations, correlations, idiosyncrasies

§ You never want perfectly clean data
– You want data that simulates production!
  § Beware of any changes to the data, especially non-causal changes
  § Model training is a numerical simulation of production

Data Partitioning: Beware What You Partition and How!

§ Partitioning
– Often need to use stratified sampling
– When using multiple entities for tracking behavior, interactions are tricky!
  § Look for irreducibility
  § Go to out-of-time if needed

[Figure: Historical Data partitioned into Training, Validate, and Holdout sets, each containing Fraud and Non-Fraud records]
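A minimal sketch of stratified partitioning follows, assuming a simple per-class random split into training, validation, and holdout sets. The function name, the 60/20/20 fractions, and the toy records are illustrative assumptions. The sketch does not address the entity-interaction or out-of-time concerns raised on the slide, which would need grouping by entity or splitting by date.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records into (train, validate, holdout), preserving class mix.

    label_of maps a record to its class label (e.g. fraud / non-fraud).
    Each class is shuffled and split separately, so the rare class appears
    in every partition in roughly the same proportion.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)

    train, validate, holdout = [], [], []
    for label in sorted(by_label):  # sorted for reproducibility
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_validate = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        validate.extend(group[n_train:n_train + n_validate])
        holdout.extend(group[n_train + n_validate:])
    return train, validate, holdout


# Hypothetical data: (transaction id, is_fraud) with a 10% fraud rate
records = [(i, i < 10) for i in range(100)]
train, validate, holdout = stratified_split(records, label_of=lambda r: r[1])
```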

Feature Development

§ “Features” pull the distinguishing characteristics from the data
– Time series analysis techniques
– Digital Signal Processing techniques
– Statistical measures of differences
– Bayesian approaches
– Linear discriminants
– Non-linear transformations
– …

Feature Development: Simple Example

§ Determine a peer group from the historical data
– All transactions where there were four previous transactions in the last week, at least two of which were on the same device, but in different countries

§ Determine the distributions of classes for a continuous variable
– Let’s say, the amount
– Use a discriminant calculation to determine likelihood of belonging to either class.
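One possible form of such a discriminant is sketched below: a log-likelihood ratio between the fraud and non-fraud amount distributions within a peer group. Modeling each class as a Gaussian is an assumption made here for brevity (real amounts are often closer to log-normal), and the function names and sample amounts are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_log_pdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std dev sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood_ratio(amount, fraud_amounts, legit_amounts):
    """Positive values favor the fraud class, negative the non-fraud class.

    Both class distributions are estimated from historical amounts observed
    within the same peer group.
    """
    fraud_ll = gaussian_log_pdf(amount, mean(fraud_amounts), stdev(fraud_amounts))
    legit_ll = gaussian_log_pdf(amount, mean(legit_amounts), stdev(legit_amounts))
    return fraud_ll - legit_ll


# Hypothetical historical amounts for one peer group
fraud_amounts = [450.0, 500.0, 550.0, 520.0, 480.0]
legit_amounts = [40.0, 50.0, 60.0, 55.0, 45.0]

high_amount_feature = log_likelihood_ratio(480.0, fraud_amounts, legit_amounts)
low_amount_feature = log_likelihood_ratio(55.0, fraud_amounts, legit_amounts)
```

The resulting number can be fed to a downstream model as a single feature, which keeps the peer-group logic out of the model itself.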

Feature Development: More Complexity

§ At the other end of the scale are online-learning models to determine behaviors
– Autocorrelation models, exponential weighting, KDE, etc.
– Many techniques
  § but must keep in mind the CPU constraints, I/O constraints, etc.

§ Conversion of high-cardinality categorical data into numerical inputs

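$$\hat{x}(t_n, t_{n-1}, t_0) = \alpha(t_n, t_0)\, x_n + \beta(t_n, t_{n-1}, t_0)\, \hat{x}_{n-1}$$

$$\alpha(t_n, t_0) = \frac{1 - \gamma}{1 - \gamma^{(t_n - t_0)}}$$

$$\beta(t_n, t_{n-1}, t_0) = \frac{\gamma^{(t_n - t_{n-1})} \left( 1 - \gamma^{(t_{n-1} - t_0)} \right)}{1 - \gamma^{(t_n - t_0)}}$$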
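The recursive update above can be sketched in a few lines. This is an illustrative reading of the formulas, assuming 0 < γ < 1, times measured in units where consecutive events are typically one unit apart, and t_0 marking the start of the history (one unit before the first event). The function name and sample values are hypothetical.

```python
def decayed_update(prev_estimate, x_n, t_n, t_prev, t_0, gamma):
    """One step of the time-decayed running average.

    prev_estimate: estimate after the previous event (ignored when t_prev == t_0)
    x_n:           new observation at time t_n
    t_prev:        time of the previous event (t_0 if there was none)
    gamma:         decay factor per unit time, 0 < gamma < 1
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    if not t_0 <= t_prev < t_n:
        raise ValueError("times must satisfy t_0 <= t_prev < t_n")

    denom = 1.0 - gamma ** (t_n - t_0)
    alpha = (1.0 - gamma) / denom
    beta = gamma ** (t_n - t_prev) * (1.0 - gamma ** (t_prev - t_0)) / denom
    return alpha * x_n + beta * prev_estimate


# Hypothetical stream: two quiet observations followed by a jump
estimate, t_prev = 0.0, 0.0
for t, x in [(1.0, 0.0), (2.0, 0.0), (3.0, 10.0)]:
    estimate = decayed_update(estimate, x, t, t_prev, t_0=0.0, gamma=0.5)
    t_prev = t
```

Only the previous estimate and two timestamps are stored per tracked entity, which is what makes this kind of feature affordable under tight CPU and I/O budgets.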

Example: Unsupervised Anomaly Detection

§ Utilize Historical data to define bands of different probabilities
– Map real time metric streams against system defined normal
– Multi-point alerts generated using industry-standard Western-Electric rules
– Make static thresholds optional!

[Figure: Metric plotted against Time, with bands labeled Most Likely, Less Likely, and Unlikely]
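A simplified sketch of band-based alerting follows. It assumes the bands are 1, 2, and 3 standard deviations around the historical mean and applies the four classic Western Electric zone rules. The function name and sample data are hypothetical, and a production system would likely use seasonal or time-of-day baselines rather than a single global mean.

```python
from statistics import mean, pstdev

# (rule number, window length, points required, band limit in std devs)
ZONE_RULES = (
    (2, 3, 2, 2.0),  # 2 of 3 consecutive points beyond 2 sigma, same side
    (3, 5, 4, 1.0),  # 4 of 5 consecutive points beyond 1 sigma, same side
    (4, 8, 8, 0.0),  # 8 consecutive points on the same side of the mean
)


def western_electric_alerts(history, stream):
    """Return a list of (index, [rule numbers]) for points that trigger alerts."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        raise ValueError("history has no variation; bands are undefined")
    z = [(x - mu) / sigma for x in stream]

    alerts = []
    for i in range(len(z)):
        fired = []
        if abs(z[i]) > 3.0:  # rule 1: a single point beyond 3 sigma
            fired.append(1)
        for rule, window, needed, limit in ZONE_RULES:
            if i + 1 < window:
                continue  # not enough points yet for this rule
            recent = z[i + 1 - window:i + 1]
            above = sum(1 for v in recent if v > limit)
            below = sum(1 for v in recent if v < -limit)
            if max(above, below) >= needed:
                fired.append(rule)
        if fired:
            alerts.append((i, fired))
    return alerts


# Hypothetical metric history (mean 10, standard deviation 1.5)
history = [10, 12, 8, 11, 9, 10, 12, 8]
spike_alerts = western_electric_alerts(history, [10, 11, 15, 10])
drift_alerts = western_electric_alerts(history, [13.5, 10, 13.5])
```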

Example: Supervised Model

§ Always remember Occam's Razor!
– Among competing hypotheses, the one with the fewest assumptions should be selected.
– Avoid needless complexity

§ Start with simple models, and grow more complex as needed
– Linear regression
– Logistic regression
– Decision trees
– Neural Networks
– SVM…
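To make "start simple" concrete, here is a minimal logistic-regression baseline trained with plain batch gradient descent. It is a sketch under assumptions: no regularization, no feature scaling, hypothetical function names, and a tiny toy data set. In practice an established library implementation would normally be used, with this kind of model serving as the baseline that more complex models must beat.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    """Predicted probability of the positive class for feature vector x."""
    return sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)


def train_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = predict(weights, bias, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical one-feature data set: larger values belong to the positive class
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic(X, y)
```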

Example: Supervised Model

[Figure: Neural Network]

Example: Supervised Model: Train the Model(s)

§ There are many aspects of training a neural network
– different activation and error functions
– different training algorithms
– variations of seeds, learning rate, momentum
– self-regulation
– number of hidden layers
– number of nodes
– boosting / bagging
– preventing overfitting
– etc.

Review

§ Review the results of your training, and start all over again!
– Consider segmentation
– Try leaving out the variables with the highest sensitivity
– Subdivide the data to see if there are regions of instability
– Iterate as needed

§ Finally, select your model(s)!

§ But we’re not done…
– Now worry about calibration, upgrade/downgrade, priming time, packaging, integration, model report, …

Model Performance Chart

Intelligent Mainframe Operations

[Figure: "Tasks ready to be dispatched" metric over time, with regions labeled Typical Volatility and Anomaly]

Key Points for Applying Data Science

§ Identify a high-value Business Problem with High Quality Data

§ Determine the class of the problem to solve

§ Utilize business-domain knowledge
– Understand the "ecosystem"
– Define appropriate metrics
– Understand the data in full

§ Develop features and models / Evaluate / Iterate

§ Always keep the business problem in mind!

Recommended Sessions

SESSION #   TITLE                                                                             DATE/TIME
SCX50S      Convenience and Security for banking customers with CA Advanced Authentication    11/17/2016 at 3:00 pm
SCX34S      Securing Mobile Payments: Applying Lessons Learned in the Real World              11/17/2016 at 3:45 pm
SCT05T      Threat Analytics for Privileged Access Management                                 11/17/2016 at 4:30 pm

Don’t Miss Our INTERACTIVE Security Demo Experience!

SNEAK PEEK!

We Want to Hear From You!

§ IT Central is a leading technology review site. CA has them to help generate product reviews for our Security products.

§ ITCS staff may be at this session now! (look for their shirts). If you would like to offer a product review, please ask them after the class, or go by their booth.

Note:
§ Only takes 5-7 mins
§ You have total control over the review
§ It can be anonymous, if required


Questions?

Stay connected at communities.ca.com

Thank you.

Security

For more information on Security, please visit: http://cainc.to/EtfYyw