© Cloudera, Inc. All rights reserved.
Enterprise Metadata Integration
Mirko Kämpf | Cloudera
GraphConnect 2017, London
Who is speaking?
Solutions Architect @ Cloudera
- time series analysis, network analysis, data enrichment pipelines
- personal interest: QA systems and semantic search
Data Science Activities
The Detection of Emerging Trends Using Wikipedia Traffic Data and Context Networks (PLOS ONE, 2015)
Hadoop.TS (IJCA, 2013)
Fluctuations in Wikipedia Access-Rate and Edit-Event Data (Physica A, 2012)
Our Approach: Multilayer Metadata Integration …
Integration of facts to gain business knowledge
• Status dashboards are provided per topic/use case.
• Each dashboard offers facts from multiple layers:
- (L1) Cluster-specific metadata
- (L2) Hadoop-specific ops metadata (only)
- (L3) Application-specific ops metadata
- (L4) Quality metrics and derived facts
• Current project status:
- The graph database Neo4j and Cypher allow context exploration.
- Cluster-spanning metadata exploration is possible.
- Exposure of inherent but sometimes hidden facts becomes as easy as writing an email.
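The layered dashboard idea can be illustrated with a minimal Python sketch. It is not the project's implementation: the fact records, the topic name "clickstream", and the function `build_dashboard` are hypothetical and only show how facts from the four layers could be grouped per topic.

```python
from collections import defaultdict

# Layer labels as listed on the slide; the short keys are used as tags.
LAYERS = {
    "L1": "cluster-specific metadata",
    "L2": "Hadoop-specific ops metadata",
    "L3": "application-specific ops metadata",
    "L4": "quality metrics and derived facts",
}

def build_dashboard(facts, topic):
    """Group the facts of one topic by metadata layer (hypothetical helper)."""
    by_layer = defaultdict(list)
    for fact in facts:
        if fact["topic"] == topic:
            by_layer[fact["layer"]].append((fact["key"], fact["value"]))
    # Always return all four layers, so that empty layers stay visible.
    return {layer: by_layer.get(layer, []) for layer in LAYERS}

# Illustrative facts only; none of these values come from the talk.
facts = [
    {"topic": "clickstream", "layer": "L1", "key": "cluster", "value": "prod-a"},
    {"topic": "clickstream", "layer": "L3", "key": "job_runtime_s", "value": 420},
    {"topic": "clickstream", "layer": "L4", "key": "null_ratio", "value": 0.02},
    {"topic": "billing", "layer": "L1", "key": "cluster", "value": "prod-b"},
]

dashboard = build_dashboard(facts, "clickstream")
for layer, entries in dashboard.items():
    print(layer, LAYERS[layer], entries)
```

Keeping empty layers in the result makes missing metadata visible on the dashboard instead of silently hiding it.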
Agenda
EMI - Enterprise Metadata Integration
• Idea & Vision
• Material
• Skills/Methods
• Tools
How To Become Data Driven?
Treat "data as a resource" for your business.
Think in terms of dataset life cycles.
People have been mining … for centuries!
http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
gold & diamonds, ore & coal, minerals, oil …
The outcome drives the whole economy.
People have been using computers … for decades!
1938: Z1, the world's first freely programmable device, created by Konrad Zuse.
http://www.horst-zuse.homepage.t-online.de/z1.html
2015: The U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory.
http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png
DATA MINING
Blog: About Learning Data Mining & Data Analysis
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
If data is the new oil …
… metadata are the nuggets and brilliants of our age.
Screenshot taken from: https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil
Diamonds are beautiful even as raw material.
A brilliant is the result of an expert's work. You have to cut and grind it!
Even more exciting in combination with other materials and skills …
Process optimization
Requires knowledge gathering and transfer.
Success Factors:
• Idea & Vision
• Material
• Skills/Methods
• Tools
http://www.burkhard-beyer.net/Reportage_Goldschmied.html
Tools and processes evolve … success criteria have been stable.
Let's Think Data Driven!
• Build a long-term strategy!
Not the fancy toolset but rather your data is what matters most!
• After initial success you should carefully control the speed of expansion.
• Maximize accessibility of data!
Example: Google's goal was to make the data of the internet accessible. You should become your own Google!
Dataset Profiles / Flow Descriptors
• Our material is data & metadata:
- Data about data: descriptive data, Dublin Core metadata model, …
- Derived data: statistics extracted from processes, documents, …
- Results of ML/AI procedures: extracted structure and learned models
- Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history.
Science:
According to Wikipedia:
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
https://en.wikipedia.org/wiki/Science
Data Science:
My observation:
Data science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market and business context.
https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif
Result: Visualization of Facts
• An image shows what the text says.
> Multi-channel communication
• Data science benefits from such an approach.
> Today we still use infographics
Difference: The biologist who created the image on the left observed by eye.
Today, data scientists look more into data than into nature.
Process: Knowledge Extraction is a Natural Process
• Combine multiple sources
• Repeat observation
• Incorporate context to explain differences/variation
• Cross-check to identify anomalies
How did we implement EMDM?
- Hadoop based: for scalability
- Open graph data model: for flexibility and connectivity
- Data centric: following the Big Data paradigm
Data Science Process Model (DSPM)
• DSPM defines core artifacts for knowledge management
• Describes analysis/transformation context
• Allows repeatable execution
• Process properties become measurable
• Supports comparison of results from multiple procedures
• All those facts are essential ingredients of business optimization.
• But: Logging & tracking should never block creativity!
• Remember: Scientists often act like artists.
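A minimal Python sketch of what "describes context, allows repeatable execution, makes process properties measurable, supports comparison" could mean in practice. It is an illustration under assumptions, not the DSPM itself: `ProcessRun`, `run_tracked`, and the two toy procedures are hypothetical names introduced here.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class ProcessRun:
    """Hypothetical record of one analysis run and its context."""
    name: str
    context: dict                      # parameters and input references
    result: dict = field(default_factory=dict)
    duration_s: float = 0.0            # a measurable process property

    def context_id(self) -> str:
        # Identical contexts yield identical ids, so runs can be compared.
        payload = json.dumps(self.context, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def run_tracked(name, context, procedure):
    """Execute a procedure with the given context and record the run."""
    start = time.perf_counter()
    result = procedure(**context)
    return ProcessRun(name, context, result, time.perf_counter() - start)

# Two toy procedures applied to the same context.
def mean_of(values):
    return {"score": sum(values) / len(values)}

def max_of(values):
    return {"score": max(values)}

context = {"values": [1, 2, 3, 6]}
run_a = run_tracked("mean", context, mean_of)
run_b = run_tracked("max", context, max_of)

# Results are only compared when both runs share the same context.
comparable = run_a.context_id() == run_b.context_id()
print(comparable, run_a.result, run_b.result)
```

The tracking wraps the procedure rather than changing it, which is one way to keep logging from getting in the analyst's way.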
Toolbox and Management Methods
Data Science Process Model (DSPM)
Representation of domain knowledge (in our case, data science in general)
Human Interaction
Ontology
Toolbox and Management Methods
Ability to solve a problem using IT and data
Technology Aspects - represent and interact with facts & data
Data Governance
Certified QM
Semantic Logging
• Property with name: (K, V) is a key-value pair.
• Property of a thing: S => (K, V), so (S, P, O) is a triple; K becomes P, V becomes O.
• Many of those triples in one common context with name G: G => (S, P, O) is called a quad or named graph.
We have to hide these technical details from users!
Obvious facts have to be connected to the knowledge graph as directly as possible.
• Log4J is the logging standard we build on.
• Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
• Triple representation avoids specific parsing and makes log data part of the linked data graph.
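The step from key-value pair to triple to quad can be sketched in a few lines of Python. This is an illustration under assumptions, not the toolbox's logging code (which builds on Log4J): the `Quad` type, the helper names, and the `urn:example:` identifiers are hypothetical, and literal escaping is left out.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    graph: str      # G: name of the common context
    subject: str    # S: the thing the property belongs to
    predicate: str  # P: the former key K
    obj: str        # O: the former value V

def to_quads(graph, subject, properties):
    """Turn the (K, V) pairs of one subject into quads in graph G."""
    return [Quad(graph, subject, key, str(value))
            for key, value in properties.items()]

def to_line(quad):
    """Render a quad in an N-Quads-like layout (no literal escaping)."""
    return (f'<{quad.subject}> <{quad.predicate}> '
            f'"{quad.obj}" <{quad.graph}> .')

# A log event expressed as structured data instead of a plain string.
event = {"urn:example:status": "FAILED", "urn:example:retries": 3}
quads = to_quads("urn:example:run/2017-05-11", "urn:example:job/42", event)

for quad in quads:
    print(to_line(quad))
```

A caller only passes a subject and a dictionary of properties; the triple and quad mechanics stay inside the helper, in line with the goal of hiding them from users.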
Etosha Toolbox
Data extractors, data transformers,
Ontology-based orchestration,
People and machines contribute facts,
Iterative approach with closed feedback loops,
Scalable environment …
CONCEPT
Multi-layer metadata capturing
Operational metrics
Metrics about fast & static data
Business metrics
Contextualized presentation
Ad-hoc queries for exploration
Graph analytics
> Knowledge exposure
> Self-service DS and BI can speak the same language.
INITIAL IMPLEMENTATION
Results: Better Collaboration for (Hadoop) Knowledge Workers
• Our achievements:
- The open graph model is language-, OS-, and hardware-independent.
- Merging of knowledge partitions enables cluster-spanning metadata exploration.
- Query beans expose facts from multiple stores to web-based interfaces.
• Next steps:
- Improve implicit triplification (query the Solr index and get RDF data).
- Standardize the process and integrate with existing ontologies.
- Grow a community … and enter the Apache Incubator.
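Merging knowledge partitions for cluster-spanning exploration can be pictured as a set union over quads, with one named graph per cluster. The Python sketch below is illustrative only: the cluster names, dataset identifiers, and helper functions are hypothetical, and a real deployment would query a graph store rather than in-memory sets.

```python
def merge_partitions(*partitions):
    """Union several collections of (graph, subject, predicate, object)."""
    merged = set()
    for partition in partitions:
        merged |= set(partition)
    return merged

def graphs_describing(merged, subject):
    """Names of all graphs (here: clusters) that hold facts about a subject."""
    return sorted({g for (g, s, p, o) in merged if s == subject})

def facts_about(merged, subject, predicate):
    """(graph, object) pairs for one subject and predicate across graphs."""
    return sorted((g, o) for (g, s, p, o) in merged
                  if s == subject and p == predicate)

# One hypothetical partition per cluster; the graph name is the cluster.
cluster_a = [
    ("cluster-a", "dataset/logs", "storedAs", "parquet"),
    ("cluster-a", "dataset/logs", "owner", "ops-team"),
]
cluster_b = [
    ("cluster-b", "dataset/logs", "storedAs", "avro"),
    ("cluster-b", "dataset/users", "owner", "crm-team"),
]

merged = merge_partitions(cluster_a, cluster_b)
print(graphs_describing(merged, "dataset/logs"))
print(facts_about(merged, "dataset/logs", "storedAs"))
```

Because each fact keeps its graph name, the merged view still shows which cluster a fact came from.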
Results: Access Facts & Context of Critical Processes
DEMO: https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be