1 © Hortonworks Inc.2011 – 2017. All Rights Reserved Enterprise Data Warehouse Optimization Piet Loubser VP Product and Solutions Marketing Hortonworks Dr Barry Devlin Founder & Principal 9sight Consulting

Exploring the Heated – and Completely Unnecessary – Data Lake Debate


Page 1: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

© Hortonworks Inc. 2011 – 2017. All Rights Reserved

Enterprise Data Warehouse Optimization

Piet Loubser, VP Product and Solutions Marketing, Hortonworks

Dr Barry Devlin, Founder & Principal, 9sight Consulting

Page 2: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Copyright © 2017 9sight Consulting, All Rights Reserved

Dr Barry Devlin

Founder & Principal, 9sight Consulting

The EDW Lives On

The Beating Heart of the Data Lake

10 August 2017

Hortonworks Webinar

Page 3: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Dr. Barry Devlin

Founder and Principal, 9sight Consulting, www.9sight.com

Dr. Barry Devlin is a founder of the data warehousing industry, defining its first architecture in 1985. A foremost authority on business intelligence (BI), big data and beyond, he is respected worldwide as a visionary and thought-leader in the evolving industry. Barry has authored two ground-breaking books: the classic "Data Warehouse: From Architecture to Implementation" and "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data" (http://bit.ly/BunI_Book) in 2013.

Barry has over 30 years of experience in the IT industry, previously with IBM, as a consultant, manager and distinguished engineer. As founder and principal of 9sight since 2008, Barry provides strategic consulting and thought-leadership to buyers and vendors of BI and Big Data solutions. He is an associate editor of TDWI's Journal of Business Intelligence, and a regular keynote speaker, teacher and writer on all aspects of information creation and use.

Barry operates worldwide from Cape Town, South Africa.

Email: [email protected]

Twitter: @BarryDevlin

Page 4: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Agenda

1. Past – from a warehouse to a lake

2. Present – a warehouse and a lake

3. Emerging – a warehouse by a lake

4. Conclusions

Page 5: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

The data architecture since the mid-’80s

• Two layers within the Data Warehouse…
  – Enterprise data warehouse: reconciled data
  – Data marts: what the users need

• …fed from and separate from operational systems
  – Data to run the business
  – Created by the processes of the business

• All data created within the enterprise (or within partner ecosystem)

[Figure: layered architecture diagram with the labels Data marts, Enterprise data warehouse, Metadata, Data warehouse, Operational systems]

“An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal (1988)

Page 6: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

The drive toward the data lake since 2010

• Data warehouse architecture “old-fashioned”
  – Linked to (traditional) relational databases
  – Too structured, schema-on-write
  – Too slow/complex to build
  – Lacking support for big data
  – No link to Hadoop

• Data lake proposed as alternative
  – Cheaper, bigger and more flexible
  – Structure-agnostic, schema-on-read (late binding)
  – Supports all data types
  – Agile, flexible, rapid implementation
  – Driven by Hadoop ecosystem
  – Data reservoir: a better (?) architected data lake

[Figure: data warehouse and data lake illustration] Image: Gartner via Bill Schmarzo, infocus.emc.com/william_schmarzo/data-lake-data-reservoir-data-dumpblah-blah-blah/ (2014)
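
The schema-on-write versus schema-on-read contrast on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the column names, the `WAREHOUSE_SCHEMA` mapping and both helper functions are hypothetical and are not taken from the webinar or from any Hadoop or database API.

```python
import json

# Hypothetical fixed schema for a warehouse-style table (illustrative assumption).
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}


def write_with_schema(record, table):
    """Schema-on-write: validate and coerce the record before it is stored."""
    row = {}
    for column, column_type in WAREHOUSE_SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = column_type(record[column])
    table.append(row)


def read_with_schema(raw_lines, wanted):
    """Schema-on-read: raw text is stored untouched; structure is applied per query."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Late binding: only the fields this query asks for are parsed and cast.
        rows.append(
            {name: cast(record[name]) for name, cast in wanted.items() if name in record}
        )
    return rows


# Warehouse side: the record must fit the schema at load time.
warehouse_table = []
write_with_schema({"customer_id": "42", "amount": "19.99"}, warehouse_table)

# Lake side: heterogeneous raw lines are kept as they arrived.
lake_files = [
    '{"customer_id": "42", "amount": "19.99", "device": "mobile"}',
    '{"customer_id": "7", "clicks": 3}',
]
amounts = read_with_schema(lake_files, {"customer_id": int, "amount": float})
```

In this sketch the warehouse path rejects a record that lacks a required column, while the lake path accepts any line and simply returns fewer fields for records that do not carry them. That is the trade-off the slide summarizes as "too structured" versus "structure-agnostic".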

Page 7: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Data lake architecture

[Figure: data lake architecture diagram] Source: www.capgemini.com/blog/capping-it-off/2014/08/you-have-to-manage-your-data-lake-the-fallacy-of-technology-being-magic

Page 8: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

From BI to Business unIntelligence

• People process information

• People: Rational thought and far beyond
  – People make all decisions!

• Process: Logic – predefined, emergent
  – Decision making is a process

• Information: Data, knowledge, meaning
  – Data/information is only the foundation

• Not business intelligence… Business unIntelligence

• Amazon: http://bit.ly/BunI_Book

• Or http://bit.ly/BunI-TP2 (25% discount with code “BIInsights25”)

[Figure: diagram with the labels Information, Process, People]

Page 9: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Business unIntelligence – Information pillars

• One architecture for all types of information
  – Mix/match technology as needed: relational, NoSQL, Hadoop, etc.

• Integration of sources and stores
  – Instantiation gathers inputs
  – Assimilation integrates stored info.

• Data flows as fast as needed and is reconciled when necessary
  – No unnecessary storage or transformations

• Distinct data management/governance approaches as required

[Figure: information pillars diagram with the labels Human-sourced (information), Machine-generated (data), Process-mediated (data), Context-setting (information), Transactional (data), Assimilation, Instantiation; inputs labelled Transactions, Events, Measures, Messages]

Page 10: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Positioning of data lake and warehouse today

• Serve different purposes
  – Functional: run/manage the business
  – Illustrative: predict/influence the future

• Both required
  – Optimized for different strengths
  – Warehouse = accuracy and consistency
  – Lake = timeliness and rawness

• Links between environments
  – Better than copying everything into one (or both)

• Together – foundation for pervasive analytics

[Figure: positioning diagram]
  – Data warehouse: Functional; accurate, consistent data; discarded if outdated; legally binding, traceable process
  – Data lake: Illustrative; timely, raw data; stored forever; creative, free-flowing process
  – Inputs: Transactions (from operational systems); Events, Measures, Messages
  – User access to all data

Page 11: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

A warehouse by a lake (1): Preparation and enrichment

• Challenge: ETL (extract, transform and load) to the data warehouse is complex and computationally expensive

• Transform in:
  – Proprietary ETL server, with high licensing cost
  – Data warehouse server, with impact on analytic tasks

• Solution: Pump some or all data through the data lake
  – Reduced processing cost and/or impact on DW work

[Figure: diagram with the labels Data warehouse, Data lake, Transactions, Op. systems, Events, Measures, Messages, User access to all data]

Page 12: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

A warehouse by a lake (2): Archival

• Challenge: Storing seldom-used (cold) data in a data warehouse is an expensive waste of high-performance hardware

• Archiving to magnetic tape delays and complicates access to off-line data when needed

• Solution: Archive to commodity servers and disks in the data lake
  – Hadoop: no licensing costs
  – Faster access when needed, almost equal to DW
  – Same tools (SQL-based) for access as DW

[Figure: diagram with the labels Data warehouse, Data lake, Transactions, Op. systems, Events, Measures, Messages, User access to all data]
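
As a rough illustration of the archival idea on this slide, the sketch below splits rows into a hot set that stays in the warehouse and a cold set that would move to the lake. The 30-day threshold, the `loaded_on` field and the function name are hypothetical choices for this example only; the slide does not prescribe a specific rule.

```python
from datetime import date


def split_hot_cold(rows, today, hot_days=30):
    """Partition rows by age: recent rows stay hot, older rows become archive candidates."""
    hot, cold = [], []
    for row in rows:
        age_in_days = (today - row["loaded_on"]).days
        if age_in_days <= hot_days:
            hot.append(row)   # keep on high-performance warehouse storage
        else:
            cold.append(row)  # candidate for commodity storage in the lake
    return hot, cold


# Illustrative sample data (assumed, not from the webinar).
sample_rows = [
    {"order_id": 1, "loaded_on": date(2017, 8, 1)},
    {"order_id": 2, "loaded_on": date(2016, 1, 15)},
    {"order_id": 3, "loaded_on": date(2017, 7, 20)},
]
hot_rows, cold_rows = split_hot_cold(sample_rows, today=date(2017, 8, 10))
```

Because the slide notes that the archive remains reachable with the same SQL-based tools, a real deployment would move the cold partition rather than delete it; the sketch only shows the selection step.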

Page 13: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

A warehouse by a lake (3): Access

• Challenge: Data increasingly resides on disparate platforms
  – Traditional business info in relational
  – Business people familiar with SQL
  – Social media, IoT on Hadoop/NoSQL/etc.
  – Copying back and forth is expensive

• Solution: Virtualize access to data on all platforms
  – SQL-based queries
  – Join data across platforms

[Figure: diagram with the labels Data warehouse, Data lake, Transactions, Op. systems, Events, Measures, Messages, User access to all data]
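
The "join data across platforms" point can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: two separate in-memory SQLite databases stand in for the warehouse and the lake, and SQLite's `ATTACH DATABASE` stands in for a virtualization layer. The table names and rows are invented for the example, and no specific data virtualization product is implied.

```python
import sqlite3

# The main in-memory database plays the role of the relational warehouse.
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database plays the role of the data lake.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, event TEXT)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.events VALUES (?, ?)",
    [(1, "tweet"), (1, "sensor_ping"), (2, "tweet")],
)

# One SQL query joins across both stores without copying data between them.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS event_count
    FROM main.customers AS c
    JOIN lake.events AS e ON e.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The business user writes ordinary SQL against both stores, which matches the slide's observation that business people are already familiar with SQL.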

Page 14: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Conclusions

1. Enterprise data warehouse lives on
  – Focused on core business information
  – Traditional relational platforms still preferred

2. Data lake complements data warehouse
  – Focused on externally sourced data
  – Linked to data warehouse in multiple ways

3. Data lake can assist/offload data warehouse
  – Use commodity storage and processing power
  – Reduce costs and improve performance

Page 15: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Dr Barry Devlin

Founder & Principal, 9sight Consulting

Thank You

Page 16: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Piet Loubser, VP Product and Solutions Marketing, Hortonworks

Page 17: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

The New Way of Business Is Fueled By Connected Data

Development
• Connected Customers, Vehicles, Devices
• Socially crowd-sourced requirements
• Digital design and analysis
• Digital prototypes and tests (simulations)

Manufacturing
• Connected Factories, Sensors, Devices
• Human-robotic interaction
• 3D-printing on demand

Distribution
• Connected Trucks, Inventory
• Location, traffic, weather-aware distribution
• Real-time inventory visibility
• Dynamic rerouting

Marketing/Sales
• Connected Customers, Devices
• Omni-channel demand sensing
• Real-Time Recommendations

Service
• Connected Assets
• Remote service monitoring & delivery
• Predictive maintenance
• OTA Updates

Page 18: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

A Connected Data Strategy Connects Data Center and Cloud

[Figure: diagram linking DATA CENTER and CLOUD, with the labels Enterprise Data Lake, Data Flow & Stream Processing, Security Data Lake, Big Data Cloud Service, AWS IaaS, Azure IaaS]

Page 19: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Typical EDW Architecture

Used inefficiently, from $7,500 to $35,000 per TB [1] of data stored and processed

In a typical EDW:
• 50-70% of data is unused and/or cold
• 45-65% of CPU capacity is ETL/ELT
• 25-35% of CPU consumed by ETL is to load unused data
• 30-40% of CPU is consumed by only 5% of ETL workloads
• As little as 2.8% of the data is hot [2]

[Figure: EDW architecture diagram. ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards. DATA SYSTEMS: Systems of Record (RDBMS, ERP, CRM, Other)]

Source: Hortonworks Innovation and Strategy Team and Appfluent Analysis
[1] EY Analysis shows typical range from $10-15k/TB. Hortonworks experience shows a wide range observed in the field, from $35k/TB for massive, in-memory EDW appliances to $7.5k/TB for RDBMS based, home-grown EDW solutions.
[2] For example, for a client keeping a rolling 36-month window of data for reporting in an EDW, only 1 month of the 36 (2.8%) is new/hot.
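
The figures on this slide lend themselves to a quick back-of-the-envelope check. The sketch below reproduces the 2.8% hot-data share from the 36-month example and estimates what cold data might cost at the quoted per-TB range. The 100 TB warehouse size is a hypothetical input chosen for illustration, and the function names are not from the source.

```python
def hot_share(window_months, hot_months=1):
    """Fraction of a rolling window that is new/hot data."""
    return hot_months / window_months


def cold_storage_cost(total_tb, cold_fraction, cost_per_tb):
    """Cost attributable to cold data at a given per-TB price."""
    return total_tb * cold_fraction * cost_per_tb


# 1 month out of a rolling 36-month window, as in the slide's footnote.
share_percent = round(hot_share(36) * 100, 1)

# Hypothetical 100 TB EDW: low end (50% cold at $7,500/TB)
# and high end (70% cold at $35,000/TB) of the quoted ranges.
low_estimate = cold_storage_cost(100, 0.50, 7_500)
high_estimate = cold_storage_cost(100, 0.70, 35_000)
```

Under these assumptions the hot share rounds to 2.8%, and cold data in a 100 TB warehouse would account for roughly $375,000 to $2.45 million of storage and processing cost, which is the spend the offload argument targets.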

Page 20: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Hortonworks Connection: Services and Solutions for Your Success

[Figure: offering overview with the following labels]
• Hortonworks Solutions: Enterprise Data Warehouse Optimization; Cyber Security and Threat Management; Internet of Things and Streaming Analytics; Data Science Experience; Advanced SQL
• Data Services
• Data Center: Hortonworks Data Suite, HDF, HDP
• Cloud: Hortonworks Data Cloud, AWS, HDInsight
• Hortonworks Connection: Enablement Subscription; SmartSense™; Premier Operational Support; Educational Services; Professional Services; Community Connection

Page 21: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Enterprise Data Warehouse Optimization

• Dramatic Cost Reductions: Reduce cost of your EDW implementation by offloading ETL processes and archiving cold data

• Deploy Business Intelligence on Hadoop: Empower business users with powerful reporting, new applications, visualization tools, and artificial intelligence

• Support More Types of Unstructured Data: Index and search images, videos, text & sound files

Page 22: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

EDW Plus Hadoop helps you optimize and reduce costs associated with your EDW

Archive cold data away from EDW
• Move cold or rarely used data to Hadoop as active archive
• Store more of your data longer, cheaper

Offload costly ETL process
• Free your EDW to perform high-value functions like analytics & reporting, not ETL
• Use Hadoop for advanced or massive-scale ETL/ELT

[Figure: architecture diagram. ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards. DATA SYSTEMS: Systems of Record (RDBMS, ERP, CRM, Other). Other labels: ELT; Cold Data, Deeper Archive & New Sources; Enterprise Data Warehouse; Hot; Data Science; OLAP on Hadoop]

Page 23: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

EDW Optimization: ETL Offload

→ The Problem:
  – EDWs consume between 50% and 90% of CPU just on ETL/ELT tasks.
  – These jobs interfere with more business-critical tasks like BI and advanced analytics.

→ The Solution:
  – Hive and HDP deliver ETL that scales to petabytes.
  – Syncsort DMX-h for simple drag-and-drop ETL workflows.
  – Economical scale-out processing on commodity servers.

→ The Result:
  – Better SLAs for mission-critical analytics.
  – Limit EDW expansion or retire old systems.

[Figure: EDW Optimization Solution diagram with the labels ETL/ELT, Data Mart, Data Landing & Deep Archive, Cube Mart, End User Applications, End Users and Apps]

Page 24: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

EDW Optimization: Active Archive

→ The Problem:
  – Increasing data volumes and cost pressure force data to be archived to tape.
  – Archived data not available for analytics, or must be retrieved at great expense.

→ The Solution:
  – Adopting Hadoop delivers cost per terabyte on par with tape backup solutions.
  – Data in Hadoop can be analyzed by all major BI tools, allowing analytics on archive data.

→ The Result:
  – Data always available for analytics.
  – Store years of data rather than months.

[Figure: EDW Optimization Solution diagram with the labels ETL/ELT, Data Mart, Data Landing & Deep Archive, Cube Mart, End User Applications, End Users and Apps]

Page 25: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

EDW Optimization: Fast BI on Hadoop

→ The Problem:
  – Proprietary EDW systems were adopted for Fast BI and deep slice-and-dice analytics, but EDW prices are unsustainably high.

→ The Solution:
  – Interactive SQL is a reality on Hadoop today.
  – Partner Solutions (IBM BigSQL, Kyvos, Jethro) add powerful SQL and OLAP capabilities for deep drilldown at scale.

→ The Result:
  – Query terabytes of data in seconds.
  – Connect your favorite BI tools like Tableau and Excel through SQL and MDX interfaces.
  – The EDW Optimization Solution is tailor-made to deliver Fast BI on Hadoop.

[Figure: EDW Optimization Solution diagram with the labels ETL/ELT, Data Mart, Data Landing & Deep Archive, Cube Mart, End User Applications, End Users and Apps]

Page 26: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Centrica Transforms Service For Utility Customers

• 3 Million Customers: can access “smart energy reports”
• ETL efficiency gains: from 11 hours to 45 minutes/job
• 300 GB/Day Ingest: rationalizes work of field engineers
• Decommissioned some EDWs: saving millions annually

SITUATION
• Data fragmentation hid business-wide patterns from analysts
• Existing infrastructure made loading data difficult & caused analytic bottlenecks
• Goal: reduce costs, streamline processes for a single view of customers

Use cases
• DATA DISCOVERY: Smart Meter Data
• PREDICTIVE ANALYTICS: Engineer Schedule Optimization
• SINGLE VIEW: Customer Segment Analysis
• SINGLE VIEW: Product Cross-Sell
• PREDICTIVE ANALYTICS: Tailored Services
• SINGLE VIEW: Smart Meter Mobile App
• DATA ENRICHMENT: On-Site Data Capture
• ACTIVE ARCHIVE: EDW Offload
• ETL OFFLOAD: Streaming Ingest

“Focusing on innovation, learning to forget traditional legacy ways of working and approaching it in new ways creates unexpected behavioural changes, because people feel freer and they also feel valued.” Dajit Rehal, Senior Systems Director
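
To put the ETL figure quoted on this slide (11 hours down to 45 minutes per job) into proportion, the following small calculation derives the implied speed-up and run-time reduction. It uses only the two durations stated on the slide; the function name is illustrative.

```python
def speedup(before_minutes, after_minutes):
    """Return (speed-up factor, fractional reduction in run time)."""
    factor = before_minutes / after_minutes
    reduction = 1 - after_minutes / before_minutes
    return factor, reduction


# 11 hours expressed in minutes, compared with 45 minutes per job.
factor, reduction = speedup(before_minutes=11 * 60, after_minutes=45)
factor_rounded = round(factor, 1)            # about 14.7 times faster
reduction_percent = round(reduction * 100)   # about 93% less run time
```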

Page 27: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

EDW Plus Hadoop helps you land and enrich more data to respond faster to new business requests

Archive cold data away from EDW
• Move cold or rarely used data to Hadoop as active archive
• Store more of your data longer, cheaper

Offload costly ETL process
• Free your EDW to perform high-value functions like analytics & reporting, not ETL
• Use Hadoop for advanced or massive-scale ETL/ELT

Land & enrich more data to create more value-add analytics
• Use Hadoop to ingest new data sources, such as web and machine data, for new analytical context from unstructured and semi-structured sources
• Create an analytical sandbox for advanced data science

[Figure: architecture diagram. ANALYTICS: Data Marts, Business Analytics, Visualization & Dashboards. DATA SYSTEMS: Systems of Record (RDBMS, ERP, CRM, Other). NEW SOURCES: Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured. Other labels: ELT; Ingest, Stream, Events; Cold Data, Deeper Archive & New Sources; Enterprise Data Warehouse; Hot; Data Science; OLAP on Hadoop]

Page 28: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Prescient Harnesses Machine Learning for Traveler Safety Warnings

SITUATION
• Could only produce one assessment every 3-4 days
• Performs risk management
• Uses humans to identify false positives
• Needed efficient way to store raw data for analytics

• 49,500 Data Sources: ingested by HDF into HDP
• 700% Productivity Improvement: for geospatial analysts
• 5 Petabytes of Data: stored in HDP connected EMC
• Hybrid Architecture: HDF connects data center to cloud

Use cases
• ETL OFFLOAD: Sensor Data Ingest
• DATA DISCOVERY: Threat Assessments
• SINGLE VIEW: Global Threat Map
• PREDICTIVE ANALYTICS: Threat-Proximity Mobile Alerts
• ACTIVE ARCHIVE: Streaming Threat Archive
• DATA ENRICHMENT: Provenance Metadata

“We know that when we define a high-threat area in a given area of the world, that it is underpinned by very specific data sources. It’s data-driven, and we can point to those sources, if ever asked, and say, ‘Here’s why.’” Mike Bishop, Chief Systems Architect

Page 29: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Why Hortonworks?

Powering All Data
• Data-at-Rest, Data-in-Motion
• Cloud, On-Premises
• Structured, unstructured

Powered By 100% Open Source
• Rapid innovation
• Dramatic cost reduction

Enterprise Ready
• Governance
• Fine grained security
• Lineage and data provenance

hortonworks.com/get-started/big-data-scorecard/
Forrester Wave: Big Data Warehouse, Q2 2017

Page 30: Exploring the Heated-and Completely Unnecessary- Data Lake Debate

Thank You