Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
@SnowflakeDB@SnowflakeDB #CloudAnalytics17
LONDON
BringingYourDataTogetherintheCloudToddBeaucheneGlobalAlliancesArchitect,SnowflakeComputing
“Data!Data!Data!Ican'tmakebrickswithoutclay.”-SherlockHolmes
Agenda
• CloudDataEcosystem• DataSources• Methodologies• DataIntegrationSolutions• Conclusion
Cloud Data EcosystemData Integration Business Intelligence &
AnalyticsData Warehouse
Enterprise apps
Data Sources
Corporate
Web
Mobile
IoT
Data Sources
Data SourcesOn-Premises• Typicallybackedbyalocaltransactionaldatabase
• Alldataliveswithinthefirewall
• Customerhasfullaccesstoalldataandsystem
Cloud• Typicallybackedbyaclouddatabase(i.e.RDS)
• CanrunincustomerVPC
• Typicallyoffersfeweroptionsthanon-premises
SaaS• Typically data is only
available via API• Outside of customer
firewall or VPC• Customer has very
little control over handling of data
Real World Example: Consolidated DashboardChallenges• Long-termprojectwithhigh-levelgoals
• Diversedatasources
• Differentrefreshcycles
• Inconsistentresults
Solutions• Agileprojectwithfocused,short-termgoals
• DedicatedschemainEDW
• DailyETLProcess
• DataqualitycheckswithinETL
Methodologies
MethodologiesBulkLoading– Trunc andLoad• Runsatregularintervals• Fulldatasetloadedduringeachrunandexistingdataispurged
• Leastefficientoption,butverysimpletomanage
• Highdatavolumeseveryrun• Morecommonlyusedfordimensiontables
DailyDifferentials• RunsduringnightlyETLwindow• Requireschangedatacapturetoidentifychangedrows
• Generallyconsistsofaseriesofstepswhereeachdependsontheprevioussteps
• Mustincludelogictohandleslowlychangingdimensions
MethodologiesInsert-only– Date-based• ExtractsdatabydaterangetoeliminateneedforCDC
• Simplifiedprocessing• Commonlyusedforfacttables• Changestodatafrompreviousperiodsrequiredeletionofalldataforthegivenrange
DatabaseReplication• Generallyrunsinnear-real-time• Requiresatoolthatistightlyintegratedwiththesourcedatabase
• Schemasmustmatchbetweensourceanddestination
MethodologiesBatchProcessing• Generallyusedwhendataisbeingpushedfromthesource
• Batchfrequencydependsonthevolumeandvelocityofthedata
• Requiresautomatedprocesstoloadbatchesintothedatawarehouse.
Streaming• Generallyusedforhighvolumedata
• Event-basedratherthanrow-based
• Oftenrequiresmicro-batchingofdataforloadintorelationaldatabase
• Rawdatamustusuallybetransformedtosupportanalytics
Data IntegrationSolutions
Data Integration SolutionsCustomCode• Flexiblebutcomplex
• Leveragesin-databaseprocessing
• Challengingtomanageandmaintain
ETL• Simplifieddatatransformationwithnocode
• Built-independencyanderrorhandling
• ReducesdatavolumeswithinEDW
ELT• Leverages benefits of
ETL while shifting data processing to EDW
• Requires tight integration between Data Integration and EDW
• Raw and transformed data in one place
Data Integration SolutionsOn-Premises• Customerownshardwareandsoftwareinstall/configuration
• Don’thavetodealwithfirewalltoaccesslocalsources
Cloud• Customerownssoftwareinstall/configurationbutnothardware
• CanrunincustomerVPCtoprovidedirectaccesstodatawithinVPCorbehindfirewall
SaaS• Fully managed by
service provider• Configurable options
vary by solution• Must find secure ways
to access data not stored inside firewall
Conclusion
Cloud Data Warehousing Best Practices• Leveragethescalablecomputelayertodothebulkofthedata
processing• Isolateloadandtransformjobsfromqueriestopreventresource
contention• Eliminatephysicaldatamartsbyleveragingascalabledataplatform• QAiskey,makesureallchangesmadetodataintegrationtasksare
testedbeforetheyrolltoproduction• Whenmigratingitisimportanttoconvertonesourceatatime