Big Data Warehouses: A Big Data Perspective on Data Warehouses. Seminar for the course "Sistemi di Elaborazione di Grandi Quantità di Dati" (Large-Scale Data Processing Systems). Francesca Zerbato, [email protected]

Data Warehouses and Big Data


Page 1: Data Warehouses and Big Data

Big Data Warehouses
A Big Data Perspective on Data Warehouses

Seminar for the course "Sistemi di Elaborazione di Grandi Quantità di Dati" (Large-Scale Data Processing Systems)

[email protected]

1

Page 2: Data Warehouses and Big Data

Big Data Warehouse

[Figure: a technological and implementation gap separates the traditional data warehouse (what the enterprise has) from the Big Data scenario (what the enterprise wants to become). The hybrid data warehouse architecture shown is the Big Data Warehouse (BDW).]

Page 3: Data Warehouses and Big Data

Presentation outline

1. Data Warehouse: the traditional business intelligence approach
   - Introduction to data warehousing, DFM conceptual model and ROLAP logical design

2. The arrival of Big Data: the need for scalability in DW architecture
   - Types of data in Big Data, business relevance and Big Data architectural requirements

3. Data Vault: an approach to enhance DWs to deal with Big Data challenges
   - Introduction to the Data Vault 2.0 model and architecture

3

Page 4: Data Warehouses and Big Data

Time(out)line
The role of IT, from passive to active

Periods (1970, 1985, 2000):
• Transactional Databases
• Business Intelligence and DW
• Big Data and NoSQL

Goals:
• Reliability, make sure no data is lost.
• Data for the masses, everyone has access to everything.
• Analyze data under different perspectives to make decisions.

Models:
• RDBMS and relational DB model
• Data Warehouse and ROLAP star schema
• Data Vault Model: RDBMS + NoSQL

Page 5: Data Warehouses and Big Data

Data ≠ Information
An explanatory (business intelligence) example

Problem: Sales of lollipops have gone down in the last 6 months.

Data: Sales records, customer data, social network data, market analysis. Data records are grouped by time, region, customer age.

Information: Lollipops are bought by females older than 25 to be eaten by people younger than 10.

Knowledge: Mothers believe that lollipops are bad for children's teeth.

Value: Hire a dentist to advertise lollipops.

Page 6: Data Warehouses and Big Data

Data Warehouse
Definition

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

• Subject-oriented: analysis of subject areas.
• Integrated: data comes from multiple sources.
• Time-variant: historical data are collected.
• Non-volatile: no data modification/removal.

"A data warehouse is a copy of transaction data specifically structured for query and analysis" - Ralph Kimball, major DW technology contributor

6

Page 7: Data Warehouses and Big Data

Data Warehouse
OLTP vs OLAP systems

On-Line Transaction Processing (OLTP): data processing system facilitating the management of transaction-oriented software. Large number of short on-line transactions (INSERT, UPDATE, DELETE). Data is detailed, not historical and highly normalized; joins do not perform well.

On-Line Analytical Processing (OLAP): data processing system enabling the analysis of multidimensional data, interactively and from multiple perspectives. DATA WAREHOUSES are designed to support OLAP operations. Queries are often very complex and involve aggregations. Data is historical, denormalized and redundant; joins are easier and faster.

7

Page 8: Data Warehouses and Big Data

Data Warehouse
Architectural and use requirements

DW use goals:
• Correctness and completeness of integrated data. Single version of the truth.
• Accessibility to users with limited knowledge of computing.
• Data is summarized/aggregated for flexible query and intuitive view.

Requirements for a DW architecture:
• Scalability: hardware and software architecture must be easily scaled.
• Extensibility: it must be possible to add new applications.
• Security: access control is required because of the nature of the data stored. Strategic data are stored.

8

Page 9: Data Warehouses and Big Data

Data Warehouse
3-Tier Architecture

1. Sources: operational data sources, flat files.

2. Reconciled data level: Extraction-Transformation-Loading (ETL) tools obtain a reconciled data level before feeding the DW.

3. Data warehouse level: central data warehouse, data marts and meta-data repository.

Presentation (front-end): analysis and visualization with OLAP tools, data-mining tools, reporting tools, what-if analysis tools.

[Figure: operational data sources; staging level (ETL, reconciled data); data warehouse level (data marts, meta-data); presentation tools.]

Page 10: Data Warehouses and Big Data

Data Warehouse
3-Tier Architecture

Data marts are a subset or an aggregation of the data stored in the primary DW, targeted towards a particular functional area or user group.

Page 11: Data Warehouses and Big Data

Data Warehouse
3-Tier Architecture

Meta-data is "data about data". Business meta-data describes semantics, business rules and constraints. Technical meta-data describes how data is stored and how it should be manipulated.

Page 12: Data Warehouses and Big Data

Data Warehouse
Architecture, another view

[Figure: another view of the data warehouse architecture, starting from structured data sources.]

Page 13: Data Warehouses and Big Data

Data Warehouse
Extraction-Transformation-Loading (ETL) tools

ETL tools feed a single data repository, detailed, comprehensive and of high quality, which may in turn feed the DW (reconciliation process: reconciled data level).
• Offline, carried out when the DW is not in use (at night?).
• Batch processing.
• A subset of data, identified by business goals, is obtained: GIVE A SINGLE VERSION OF THE TRUTH.

Extraction: data are gathered from sources.
• INTERNAL transactional systems, flat files.
• EXTERNAL sources. ODBC, JDBC.

13

Page 14: Data Warehouses and Big Data

Data Warehouse
Extraction-Transformation-Loading (ETL) tools

Transformation: data is put into the warehouse format. Business rules are used to define both the presentation/visualization of data and its persistence characteristics.
• Cleaning: removes errors and inconsistencies and converts data into a standardized format.
• Integration: data is reconciled, both at schema and data level.
• Aggregation: data is summarized according to the DW level of detail.

Loading: the DW is fed with cleaned and transformed data, offline. Initial load (first DW population) or refreshment.
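The extraction, cleaning, aggregation and loading steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales_fact` table and all function names are made up for the example, the integration step is omitted, and SQLite stands in for the warehouse.

```python
import sqlite3
from collections import defaultdict

# Hypothetical source rows, as they might arrive from an operational system.
SOURCE_ROWS = [
    {"date": "2015-01-15", "product": " lollipop ", "qty": "3"},
    {"date": "2015-01-15", "product": "Lollipop", "qty": "2"},
    {"date": "2015-01-16", "product": "candy", "qty": None},  # dirty record
]


def extract(rows):
    """Extraction: gather records from a source (here an in-memory list)."""
    return list(rows)


def clean(rows):
    """Cleaning: drop erroneous records and convert to a standard format."""
    cleaned = []
    for row in rows:
        if row["qty"] is None:
            continue  # a record without a quantity is treated as an error
        cleaned.append({
            "date": row["date"],
            "product": row["product"].strip().lower(),
            "qty": int(row["qty"]),
        })
    return cleaned


def aggregate(rows):
    """Aggregation: summarize to the chosen level of detail (date, product)."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["date"], row["product"])] += row["qty"]
    return sorted((d, p, q) for (d, p), q in totals.items())


def load(conn, facts):
    """Loading: feed the warehouse table with cleaned, transformed data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(date TEXT, product TEXT, qty INTEGER)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)
    conn.commit()


def run_etl(conn, rows):
    load(conn, aggregate(clean(extract(rows))))
```

Note that the dirty record never reaches the warehouse: in a traditional ETL flow, whatever the business rules discard before loading is no longer available for later analysis.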

14

Page 15: Data Warehouses and Big Data

Data Warehouse: data modeling
Multidimensional model: DFM (Dimensional Fact Model), DATA CUBE

Each cell of the cube is a FACT of interest quantified by numerical measures.

Each axis represents a dimension of interest for the analysis.
Hierarchy of attributes:
• Product
• Category (Home)
• Sub-category (Bedroom)

[Figure: data cube with axes Date, Product and Customer; example coordinates: Home, Italy, 15/01/2015.]
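A data cube can be pictured as a mapping from coordinates to measures. The sketch below is a toy illustration, not a real OLAP engine: the cell values are invented, and `roll_up` and `slice_cube` are hypothetical helper names.

```python
from collections import defaultdict

# Each cell is keyed by (date, product category, customer country)
# and holds one numerical measure (for example, quantity sold).
DIMENSIONS = ("date", "product", "customer")
CUBE = {
    ("15/01/2015", "Home", "Italy"): 4,
    ("15/01/2015", "Home", "France"): 1,
    ("16/01/2015", "Home", "Italy"): 2,
    ("16/01/2015", "Garden", "Italy"): 7,
}


def roll_up(cube, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for coords, measure in cube.items():
        result[tuple(coords[i] for i in positions)] += measure
    return dict(result)


def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return {c: m for c, m in cube.items() if c[position] == value}
```

Rolling up to `("product",)` answers "how much per category", while slicing on `customer == "Italy"` restricts the analysis to one member of a dimension.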

Page 16: Data Warehouses and Big Data

Data Warehouse: data analysis
Analyses on the data cube

• Reporting: periodical access to structured information.

• OLAP: analysis of one or more facts of interest at different levels of detail by a sequence of queries that give a multidimensional result.

• Data mining: extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management.

16

Page 17: Data Warehouses and Big Data

Data Warehouse: logical model
STAR schema: relational OLAP data model

Most data cubes are built on the relational model. A star schema is composed of:
• A central relation FT, the Fact Table, representing the fact of interest.
• A set of relations called dimension tables, each of them corresponding to one dimension of the analysis (cube axis).

Every dimension table is characterized by:
• a primary key
• a set of attributes that describe the dimensions of analysis at different levels of aggregation.

FT also contains an attribute for each measure.

17

Page 18: Data Warehouses and Big Data

Data Warehouse: logical model
STAR schema: example

Fact table:
• order_fact: Date_PK, Customer_PK, Product_PK; measures: Quantity_ordered, Unit_price, Total_price

Dimension tables (each with its hierarchy of attributes):
• date_dim: Date_PK, Day, Month, Year
• product_dim: Product_PK, Description, Category, Sub-category
• customer_dim: Customer_PK, Name, City, Country
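The example schema can be tried out with SQLite from Python. The sketch below assumes a few things not on the slide: the column types, the sample rows, the spelling `Sub_category` (a hyphen is not valid in an unquoted SQL identifier) and the function names.

```python
import sqlite3

DDL = """
CREATE TABLE date_dim (
    Date_PK INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE product_dim (
    Product_PK INTEGER PRIMARY KEY, Description TEXT,
    Category TEXT, Sub_category TEXT);
CREATE TABLE customer_dim (
    Customer_PK INTEGER PRIMARY KEY, Name TEXT, City TEXT, Country TEXT);
CREATE TABLE order_fact (
    Date_PK INTEGER REFERENCES date_dim(Date_PK),
    Customer_PK INTEGER REFERENCES customer_dim(Customer_PK),
    Product_PK INTEGER REFERENCES product_dim(Product_PK),
    Quantity_ordered INTEGER, Unit_price REAL, Total_price REAL,
    PRIMARY KEY (Date_PK, Customer_PK, Product_PK));
"""


def build_star(conn):
    """Create the star schema and load a handful of made-up rows."""
    conn.executescript(DDL)
    conn.execute("INSERT INTO date_dim VALUES (1, 15, 1, 2015)")
    conn.executemany(
        "INSERT INTO product_dim VALUES (?, ?, ?, ?)",
        [(1, "Oak bed", "Home", "Bedroom"), (2, "Desk lamp", "Home", "Office")],
    )
    conn.execute("INSERT INTO customer_dim VALUES (1, 'Anna', 'Verona', 'Italy')")
    conn.executemany(
        "INSERT INTO order_fact VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, 1, 1, 300.0, 300.0), (1, 1, 2, 2, 20.0, 40.0)],
    )
    conn.commit()


def revenue_by_category_and_year(conn):
    """Typical star join: the fact table joined to two of its dimensions."""
    return conn.execute(
        """
        SELECT p.Category, d.Year, SUM(f.Total_price)
        FROM order_fact AS f
        JOIN product_dim AS p ON p.Product_PK = f.Product_PK
        JOIN date_dim AS d ON d.Date_PK = f.Date_PK
        GROUP BY p.Category, d.Year
        """
    ).fetchall()
```

Every analytical query follows the same shape: start from the fact table, join only the dimensions needed, and group by attributes at the desired level of the hierarchy.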

Page 19: Data Warehouses and Big Data

Data Warehouse
The arrival of Big Data: can a traditional DW handle all this data?

How can Big Data technologies, such as Hadoop, Hive and HDFS, be used in a DW/BI context?

Page 20: Data Warehouses and Big Data

Big Data Warehouse systems
The arrival of Big Data: can a traditional DW handle all this data?

In Business Intelligence, Big Data technologies can be used:

1. Standalone: with their own query/DB tools and query languages.
• Analytical goals: business requirements are not previously defined.

2. As a complement and support to enhance existing DW technologies: hybrid systems called BIG DATA WAREHOUSES.
• Synergy among Big Data technologies and existing DW: wider range of data (facts are integrated with unstructured and multidimensional data).
• Improve scalability and reduce costs of current DW systems.

20

Page 21: Data Warehouses and Big Data

Big Data Warehouse systems
Hybrid approaches

(A) Hadoop is used only for data ingestion/staging.
[Diagram components: small and big data sources, Hadoop, RDBMS, BI tools.]

(B) Big Data are kept separate from structured data: Hadoop is used as a data management platform in parallel to the RDBMS. Both platforms are used in conjunction for presentation purposes.
[Diagram components: big data sources, small data sources, Hadoop, RDBMS, BI tools.]

(C) Hadoop enhances the RDBMS as a data ingestion/staging tool, but also as a data management and data presentation platform. The "best" of both technologies is exploited.
[Diagram components: small data sources, big data sources, RDBMS, Hadoop, BI tools.]

Page 22: Data Warehouses and Big Data

Big Data Warehouse systems
Some questions

Before addressing Big Data Warehouses we should answer some questions:

• BI on Big Data?

• What kind of data are found in Big Data?

• Can a traditional ETL technology handle Big Data?

• If feasible, does ETL on Big Data make sense?

Page 23: Data Warehouses and Big Data

Corporate Data
The role of Big Data in the corporation

Corporate data is the totality of data found in a corporation.

Examples of corporate data are: analog information, telephone records, e-mails, market research data, call center records, payments, sales, transactions, measurements, interviews, social networks...

One way of classifying the totality of corporate data is to distinguish between STRUCTURED and UNSTRUCTURED DATA.

23

Page 24: Data Warehouses and Big Data

Corporate Data
Structured data

Structured data has a predictable and regularly occurring format (defined length, defined format). Typically:
• it is managed by a Database Management System (DBMS);
• it consists of records or files, attributes, keys and indexes;
• a fixed number of fields is defined.

Examples of structured data are those contained in a relational DB: a data model is clearly defined for data representation, storing, processing, accessing and querying (ACID compliant).

Traditional data warehouses manage structured data!

Page 25: Data Warehouses and Big Data

Corporate Data
Unstructured and semi-structured data: BIG DATA

Unstructured data is unpredictable, and usually does not have an easily computer-recognizable format. Long strings have to be searched (parsed) in order to find a unit of data!

Examples: free text, images, videos, web pages, web server logs, ...

Semi-structured data has tags/markers that help in discerning different data elements, but it lacks a strict data model. Examples of semi-structured data are: RSS feeds, metadata. Formats: XML, JSON, ...
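The practical difference between the two kinds of data shows up as soon as one tries to read a value out of them. The following sketch is illustrative only: the JSON keys, the sample sentence format and the regular expression are assumptions made for the example.

```python
import json
import re


def parse_semi_structured(document):
    """Tags (here JSON keys) identify each data element directly."""
    record = json.loads(document)
    return record["customer"], record["product"]


# Hypothetical pattern: with free text, the analyst has to know or guess
# how the information is phrased before any unit of data can be found.
ORDER_PATTERN = re.compile(r"customer (\w+) ordered (\w+)", re.IGNORECASE)


def parse_unstructured(text):
    """No markers: the whole string is searched to locate a unit of data."""
    match = ORDER_PATTERN.search(text)
    return match.groups() if match else None
```

The semi-structured parser keeps working if new keys are added to the document, whereas the free-text parser silently returns nothing as soon as the wording changes.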

25

Page 26: Data Warehouses and Big Data

Corporate Data
Repetitiveness in Big Data

Unstructured data can be divided into:

- REPETITIVE unstructured data: it occurs many times, often in the same embodiment. Typically, this kind of record comes from machine interactions.
Processing and analysis: Hadoop-centric data.
Examples: analog processing, telephone call records.

- NON-REPETITIVE unstructured data: records are substantially different from each other in form and content.
Processing and analysis: NLP, textual disambiguation: data is put into context and reformatted for standard BI analysis.
Examples: e-mails, healthcare records, market research, meteorological records.

26

Page 27: Data Warehouses and Big Data

Corporate Data
Repetitiveness measures Business Relevance

Business relevance measures the capability of data to provide information that is of interest for a specific business context. Business-relevant information is used to support decision making, solution generation and cost optimization.

REPETITIVE BIG DATA are hardly ever business relevant: out of millions of phone call records, only a few are relevant for governmental purposes.

27

Page 28: Data Warehouses and Big Data

Corporate Data
A complete picture of corporate data

[Figure: tree of corporate data. Corporate data splits into structured data and unstructured and semi-structured data; the latter splits into repetitive and non-repetitive data. The leaves are labeled business relevant, business irrelevant and potentially business relevant.]

So... is all this data useful for supporting decision making?
BIG DATA: a plural version of the truth

28

Page 29: Data Warehouses and Big Data

Big Data requirements for DW improvements
The need for an ecosystem to integrate Hadoop and NoSQL technologies

Big Data require a different approach to data warehousing:
• Volume: storage and processing must be parallelized.
  • Huge workloads, concurrent users and data volumes require optimization of both logical and physical design.
  • The ETL phase is a bottleneck and "nonsense" for Big Data: the goal of Big Data is to gather data to be used in ways that have not been planned.
    • Discover/extract new insights in data: exploratory approach.
    • Processing is on data: lineage and meta-data are required.
    • Traditional ETL does not work well on unstructured data. Manual coding for data integration.
  • Raw data persist in the warehouse: lineage through soft business rules can be postponed according to analysis needs. Big Data call for ELT (Extract-Load-Transform).

29

Page 30: Data Warehouses and Big Data

Big Data requirements for DW improvements (II)
The need for an ecosystem to integrate Hadoop and NoSQL technologies

• Data complexity increases:
  • Variety of data requires specific processing techniques:
    • Textual disambiguation, parsing, machine-generated data analysis.
  • Velocity of data requires almost real-time analysis capabilities:
    • Real-time data should feed directly into the DW: On-Line Transactional Processing can in part be carried out in the warehouse. Real-time data cannot undergo ETL!
  • Veracity of data requires strong integration and traceability.

• Analytical complexity:
  • To be analyzed, Big Data may have to be in a format not foreseen by DW developers.

Page 31: Data Warehouses and Big Data

Big Data requirements for DW improvements (III)
The need for an ecosystem to integrate Hadoop and NoSQL technologies

• Query complexity: temporal analysis and OLAP analysis on cubes are not feasible on Big Data. OLAP is optimized for relational models.

• DW availability: the addition of new data sources might compromise the availability of the overall system. It has to be carried out offline.
  • Parallelization of loading is one solution, but it must be embedded in the system.

Vertical scaling (move to larger computers) + horizontal scaling:
• Functional scaling: organize similar data groups and spread them across DBs.
• Sharding: split data within the areas of functionality across multiple DBs.
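The two forms of horizontal scaling can be combined in a small routing function. This is a minimal sketch under stated assumptions: the database names are hypothetical, and hashing the key with MD5 and taking the remainder is just one simple way to pick a shard.

```python
import hashlib

# Functional scaling: each group of similar data gets its own set of
# databases (the names below are hypothetical).
FUNCTIONAL_AREAS = {
    "customers": ["customers_db_0", "customers_db_1"],
    "orders": ["orders_db_0", "orders_db_1", "orders_db_2"],
}


def route(area, key):
    """Sharding: choose one database inside the functional area,
    based on a stable hash of the record key."""
    shards = FUNCTIONAL_AREAS[area]
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]
```

The same key always lands on the same database, so reads can be routed without a lookup table. A known drawback of plain modulo routing is that changing the number of shards moves most keys.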

Page 32: Data Warehouses and Big Data

Data Vault 2.0 (DV2)
Common Foundational Warehouse Architecture

"The Data Vault Model is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the traditional star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise"

Goal: provide and present information, extracted from data that has been aggregated, summarized, consolidated and put into context.

32

Page 33: Data Warehouses and Big Data

Data Vault 2.0 (DV2)
Aspects

1. Data model: changes to the model for performance and scalability. Raw data (structured + Big Data) are integrated by business keys.

2. Methodology: Scrum and Agile best practices: two- to three-week sprint cycles with adaptations and optimizations.

3. Architecture: inclusion of NoSQL and Big Data systems for unstructured data handling and Big Data integration. Separation of business rules.

4. Implementation: guidelines define how to implement DV2 parts.

Page 34: Data Warehouses and Big Data

Data Vault 2.0 (DV2)
Architecture

Based on the 3-tier DW architecture: (1) staging layer, (2) enterprise data warehouse (EDW) layer and (3) information delivery layer.

Additional components:

1. Hadoop or NoSQL handle Big Data (design rules on where and how).

2. Real-time information flows in/out of the EDW. Operational vault.

3. Hard and soft business rules are split.
• Data interpretation is postponed. Big Data principle: data first, schema later!

The staging layer is losing importance as raw data persist in the EDW!

Page 35: Data Warehouses and Big Data

Data Vault 2.0 (DV2)
Architecture preview

[Figure: DV2 architecture. Labels: operational data sources, SOA/ESB, real-time and batch feeds, staging, Hadoop, hard business rules, Enterprise Data Warehouse, soft business rules, information delivery (data marts, report mart, OLAP tools / star schema).]

Page 36: Data Warehouses and Big Data

DV2 Architecture
Where do NoSQL platforms fit in DV2?

Most common NoSQL platforms are based on Hadoop and HDFS.

- Staging: Hadoop is mostly used for data ingestion and staging of ANY DATA (structured and unstructured) that can then proceed into the EDW.

- EDW: NoSQL DBs are used to store unstructured data.

- Information delivery: Hadoop is used to perform data mining. Mining results are structured data sets that can be copied into relational database engines for ad hoc querying.

The DV2 model allows NoSQL technologies to feed all 3 levels!

36

Page 37: Data Warehouses and Big Data

DV2 Architecture
How does Hadoop technology enhance DW capabilities?

• Cheap hardware for the storage of all kinds of data.
• Local storage (preferred to Storage Area Networks).
• Allows processing directly on data and based on the kind of data:
  • Some Big Data might have a complex structure (web logs, complex sensors).
• Raw data persist in Hadoop: transformation can be redone without the need for extraction. History is easily maintained.
  • Raw data can be re-used to add context or constraints.
• Data mining models extracted with Hadoop can be used as reliable semantic meta-data.

37

Page 38: Data Warehouses and Big Data

DV2 Architecture
Business logic: separation of soft and hard business rules

Business rules are requirements translated into code. In a traditional DW, business rules are applied before the loading phase.

DV2 IDEA: separate data interpretation (done after loading data into the EDW) from data storage and alignment rules. Inside the EDW raw data is preserved!

Hard rules: do not change the content of individual fields. Examples: type alignment, split by record structure, denormalization.

Soft rules: change and interpret data. Examples: standardizing names and addresses, coalescing, concatenating name fields.

In a traditional DW ALL business logic is applied through ETL tools!
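A small sketch may help to see where the line between the two kinds of rule falls. The record layout and both function names are hypothetical; the point is only that the hard rule leaves field content as delivered, while the soft rule derives a new, interpreted value on demand.

```python
from datetime import date


def hard_rule_align_types(raw):
    """Hard rule: align data types for storage; the content of each
    field is kept as delivered by the source."""
    return {
        "customer_no": raw["customer_no"],
        "first_name": raw["first_name"],
        "last_name": raw["last_name"],
        "order_date": date.fromisoformat(raw["order_date"]),
    }


def soft_rule_full_name(record):
    """Soft rule: interpret the data, here by standardizing and
    concatenating the name fields. Returns a new record."""
    first = record["first_name"].strip().title()
    last = record["last_name"].strip().title()
    return {**record, "full_name": f"{first} {last}"}
```

Because the stored record is never overwritten, a different soft rule (for example "last name first") can be applied later to the same raw data without going back to the source.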

38

Page 39: Data Warehouses and Big Data

DV2 Model
Business keys

A business key identifies a key concept in the business. Business keys have a business meaning!

They are unique and have a very low propensity to change. Business keys change only when the business changes!

Examples of business keys are: customer numbers, barcodes, ISBN codes, ISSN codes, e-mail addresses, credit card numbers...

Smart keys are composed of different parts which are given business meaning through position and format.

Business keys and associations are the skeleton of the Data Vault model, which is functional-oriented, not subject-oriented.

39

Page 40: Data Warehouses and Big Data

DV2 Model
Characteristics and components

DV2 Model basic entities:
• Hubs: main business concepts, represented by business keys.

• Links: relationships between hubs, thus between business keys.

• Satellites: context of hubs and links (attributes and time). Real data warehousing components: non-volatile data are stored over time.

Page 41: Data Warehouses and Big Data

DV2 Model
Hubs

Each hub represents a business key, which is valuable for the overall system and might be different from the single keys found in the operational sources.

P1: Business keys are separated by grain and semantic meaning.

Business key: one or more keys, identifying the object.

Hash key: generated surrogate key to ease lookup.

Metadata: LoadDate indicates when the business key first arrived in the EDW; RecordSource keeps track of the source.
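A hub row can be sketched as follows. The choice of MD5, the normalization of the business key (trim and upper-case) and the in-memory dictionary standing in for the hub table are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(*business_keys):
    """Surrogate hash key computed from one or more business keys.
    Keys are normalized first so that the same key always hashes alike."""
    normalized = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(hub, business_key, record_source, load_date=None):
    """Insert the business key only the first time it is seen, so that
    LoadDate records its first arrival in the EDW."""
    hk = hash_key(business_key)
    if hk not in hub:
        hub[hk] = {
            "business_key": business_key,
            "load_date": load_date or datetime.now(timezone.utc),
            "record_source": record_source,
        }
    return hk
```

Loading the same customer number from a second source returns the existing hash key and leaves the original LoadDate and RecordSource untouched.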

41

Page 42: Data Warehouses and Big Data

DV2 Model
Links

Links model transactions, associations, hierarchies and redefinitions of business terms. Links capture past, present and future relations among hubs.

P2: intersections across two or more business keys are placed into link structures.
P3: links have no begin or end dates. They are the expression of the relationship at the time the data arrived in the EDW.

42

Page 43: Data Warehouses and Big Data

DV2 Model
Links: structure

• Hash key: generated surrogate key to ease lookup.
• Hash keys of the hubs connected by the link.
• Metadata.
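The link structure can be sketched in the same style. Column names follow the Customer/Product example of this deck; MD5, the `|` delimiter and the function names are assumptions.

```python
import hashlib


def hash_key(*parts):
    """Hash of one or more normalized business keys (MD5 is an assumption)."""
    joined = "|".join(p.strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def make_link(customer_bk, product_bk, load_date, record_source):
    """Build one link row: its own hash key, the hash keys of the
    connected hubs, and the metadata columns."""
    return {
        "CustProdLink_HK": hash_key(customer_bk, product_bk),
        "Customer_HK": hash_key(customer_bk),
        "Product_HK": hash_key(product_bk),
        "LoadDate": load_date,
        "RecordSource": record_source,
    }
```

The link carries no descriptive attributes and no validity dates: it only records that the two business keys were seen together when the data arrived.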

43

Page 44: Data Warehouses and Big Data

DV2 Model
Links

Links are many-to-many relationships among two or more hubs. They absorb data changes.
Flexibility: a change in business rules does not require link reengineering.

Example:
Business rule: "one carrier handles several airports, but each airport must be handled by one and only one carrier" → weak entity model.
Let's say that, a few years later, any airport can be handled by more than one carrier. In a one-to-many (weak entity) model this would require redesigning the existing structures, whereas a many-to-many link absorbs the change.

The granularity of links is defined by the number of connected hubs.

44

Page 45: Data Warehouses and Big Data

DV2 Model
Satellites

Satellites store all data that describes a business object, relationship or transaction. They add CONTEXT at a given time over a given hub/link.

P4: Satellites are separated by type of data and classification and by rate of change.

Each satellite is attached to only one hub or link. A satellite is identified by the parent's hash key and the timestamp of the change (recall the historical data of a traditional DW!). In addition, it contains the attributes that describe the context of the business object.

Satellites track change!

45

Page 46: Data Warehouses and Big Data

DV2 Model
Satellites: structure

• Parent object hash key: hash key of the parent hub/link.
• LoadDate attribute: timestamp of the satellite.
• Load end date: timestamp that determines the end of the satellite's validity.
• RecordSource: keeps track of the source.
• Hash difference: hash value of all the descriptive data in a satellite.
• Name and attributes.
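How a satellite tracks change can be sketched with a hash difference. The list standing in for the satellite table, the key names, the string dates and the use of MD5 are assumptions for illustration.

```python
import hashlib


def hash_diff(attributes):
    """Hash of all descriptive attributes, taken in a fixed order, so
    that a change can be detected with a single comparison."""
    payload = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite, parent_hk, attributes, load_date, record_source):
    """Append a new version only if the descriptive data changed, and
    end-date the previously current version. Returns True on insert."""
    new_diff = hash_diff(attributes)
    current = next(
        (row for row in satellite
         if row["parent_hk"] == parent_hk and row["load_end_date"] is None),
        None,
    )
    if current is not None:
        if current["hash_diff"] == new_diff:
            return False  # nothing changed, nothing to store
        current["load_end_date"] = load_date
    satellite.append({
        "parent_hk": parent_hk,
        "load_date": load_date,
        "load_end_date": None,
        "record_source": record_source,
        "hash_diff": new_diff,
        **attributes,
    })
    return True
```

Old versions are end-dated rather than deleted, which is how the non-volatile, time-variant character of the warehouse is preserved at the satellite level.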

46

Page 47: Data Warehouses and Big Data

DV2 Model
Heterogeneous satellites

[Figure: example of a logical foreign key between an RDBMS and a Hadoop-stored satellite.]

Hash keys allow cross-system joins to occur between RDBMS and NoSQL/Hadoop platforms.
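A cross-system join of this kind can be imitated on one machine. In the sketch below SQLite plays the RDBMS and a list of JSON lines plays the documents stored on Hadoop; the table layout, the `review` field and the function names are assumptions.

```python
import hashlib
import json
import sqlite3


def hash_key(business_key):
    """Same deterministic hash on both platforms (MD5 is an assumption)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_hub(conn, business_keys):
    """Relational side: a customer hub keyed by hash key."""
    conn.execute(
        "CREATE TABLE customer_hub (Customer_HK TEXT PRIMARY KEY, Customer_BK TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer_hub VALUES (?, ?)",
        [(hash_key(bk), bk) for bk in business_keys],
    )


def join_with_documents(conn, json_lines):
    """Join satellite documents to the hub through the hash key,
    which acts as a logical foreign key across the two systems."""
    hub = dict(conn.execute("SELECT Customer_HK, Customer_BK FROM customer_hub"))
    joined = []
    for line in json_lines:
        doc = json.loads(line)
        business_key = hub.get(doc["Customer_HK"])
        if business_key is not None:
            joined.append((business_key, doc["review"]))
    return joined
```

No database constraint enforces the relationship; it holds only because both sides compute the key from the same business key in the same way.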

Page 48: Data Warehouses and Big Data

DV2 Model
Model example: Customer/Product

• Customer Hub: Customer_HK, Customer_BK, LoadDate, RecordSource
• Product Hub: Product_HK, Product_BK, LoadDate, RecordSource
• CustomerProduct Link: CustProdLink_HK, LoadDate, RecordSource, Customer_HK, Product_HK
• Customer Satellite: Customer_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Name, City, Country
• Product Satellite: Product_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Description, Category, Sub-category
• CustomerProduct Satellite: CustProdLink_HK, LoadDate, LoadEndDate, RecordSource, HashDiff, Quantity_ordered, Unit_price, Total_price

Page 49: Data Warehouses and Big Data

DV2 Model
Model example: Customer/Product

[Figure: the same Customer/Product model (Customer Hub, Product Hub, CustomerProduct Link and the three satellites), with one element annotated "Out of production!".]

Page 50: Data Warehouses and Big Data

Data Vault 2.0 (DV2)
Modeling objectives

• Data integration is based on business keys.
  • Business keys are the keys to the information stored across multiple systems, used to locate and uniquely identify records or data.
• Data sets are traceable across multiple lines of business.
• Modeling can support unstructured and structured data:
  • Hash keys allow the connection between heterogeneous data environments, such as Hadoop and RDBMS, and remove the dependency on "loading".

Parallelization of loads: removes dependencies in loading streams.
Example of such a dependency: with sequence numbers, loading data into Hadoop, perhaps a JSON document, requires looking up the sequence number from a hub in a relational database.
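Why hash keys remove the loading dependency can be shown with two loaders that run side by side. The loaders below are stand-ins that only build rows in memory; their names, the payload and the use of MD5 are assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(business_key):
    """Deterministic key: any loader can compute it on its own."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub_row(business_key):
    """Stand-in for the relational hub load."""
    return {"Customer_HK": hash_key(business_key), "Customer_BK": business_key}


def load_document(business_key, payload):
    """Stand-in for the Hadoop load: the key is computed locally,
    so no lookup against the relational hub is needed."""
    return {"Customer_HK": hash_key(business_key), "payload": payload}


def load_in_parallel(business_key, payload):
    """Run both loads concurrently; neither waits for the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        hub_future = pool.submit(load_hub_row, business_key)
        doc_future = pool.submit(load_document, business_key, payload)
        return hub_future.result(), doc_future.result()
```

With a sequence number, the document load would have to wait until the hub row exists and then query for its number; with a hash key both loads arrive at the same value independently.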

50

Page 51: Data Warehouses and Big Data

Conclusions
When to use a DV2 model

• The DV2 Model allows the split/merge of business keys and data entities:
  • Parallel/distributed system, geographical reasons, security reasons.
  • It is designed to be resilient to environmental changes.

• It seamlessly integrates Big Data technologies with existing RDBMS technologies:
  • Hadoop, MongoDB and many other NoSQL options are easily added.
  • Data cleaning required by a star schema becomes unnecessary: all data is relevant.

Hadoop and RDBMS are side by side in Big Data Warehouses.

51

Page 52: Data Warehouses and Big Data

Big Data Warehouses vs traditional DW
A summarizing view over the two approaches

Design principle: Business expectations
• Traditional Data Warehouse: fact based; pre-designed for specific reporting requirements; single source of the business truth.
• Big Data Warehouse: exploratory analysis; finding of new insights; veracity of results might be questionable.

Design principle: Design methodology
• Traditional Data Warehouse: iterative and waterfall; integrated and consistent model.
• Big Data Warehouse: agile and iterative approach; no data model definition.

Design principle: Data architecture
• Traditional Data Warehouse: not all data is managed and maintained in the EDW: the data sources are previously known; anything new has to go through a rigorous requirements gathering and validation process; scales, but at a potentially higher cost per byte.
• Big Data Warehouse: integrates all possible data structures; scales at relatively low cost; analyzes massive volumes of data without resorting to sampling mechanisms.

Design principle: Data integrity and standards
• Traditional Data Warehouse: driven by RDBMS and ETL tools; centralized data.
• Big Data Warehouse: integration is loosely defined; data and data processing programs are highly distributed.

Page 53: Data Warehouses and Big Data

Thank you

Thank you for your attention!

Q&A?

53

Page 54: Data Warehouses and Big Data

Quick References

Books:
• Data Architecture: A Primer for the Data Scientist. Big Data, Data Warehouse and Data Vault - W.H. Inmon, Daniel Linstedt
• Big Data Imperatives: Enterprise Big Data Warehouse, BI Implementations and Analytics - S. Mohanty, M. Jagadeesh and H. Srivatsa
• Building a Scalable Data Warehouse with Data Vault 2.0 - Dan Linstedt, Michael Olschimke
• Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications - E. Malinowski and E. Zimányi

Other sources:
- Hadoop and the Data Warehouse: When to Use Which - Dr. Amr Awadallah, Founder and CTO, Cloudera; Dan Graham, General Manager, Enterprise Systems, Teradata Corporation
- Big Data in Big Companies - Thomas H. Davenport, Jill Dyche

Data Vault support: QUIPU - http://www.datawarehousemanagement.org

54