Big Data Technology Ecosystem
Mark Burnette, Pentaho Director Sales Engineering, Hitachi Vantara
Agenda
• End-to-End Data Delivery Platform
• Ecosystem of Data Technologies
• Mapping an End-to-End Solution
• Case Studies
• Pentaho Key Capabilities
• Summary
• Q&A
End-to-End Data Delivery Platform
Ingest
• Data Agnostic
• Metadata-Driven Ingestion
• Data Orchestration
Process
• Native Hadoop Integration
• Scale Up & Scale Out
• Blend Unstructured Data
Publish
• Streamlined Data Refinery
• Data Virtualization
• Machine Learning
Report
• Production Reporting
• Custom Dashboards
• Self-Service Dashboards
• Interactive Analysis
• Embedded Analytics
Delivering Insight
Ingest, Process, Publish, Report
Consumers, Data Analysts, Data Scientists, Data Engineers
Production Reporting, Custom Dashboards, Interactive Analysis, Self-Service Dashboards
Data Integration & Orchestration
Big Data Ecosystem
• Relational Database
• Analytical Databases
• NoSQL Database
• HDFS MapReduce
• SQL on Hadoop
• Distributed Search
• Message Streaming
• Event Stream Processing (ESP)
• Complex Event Processing (CEP)
Data Source Attributes
Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Latency (Reporting): Scheduled, Prompted, Interactive
Technologies: Relational Database, Analytical Databases, NoSQL Database, HDFS MapReduce, SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
Relational Database
Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2
Volume (Data Size): Small, Medium, Large
Operational databases for OLTP apps that require high transaction loads and user concurrency. They can “scale up” to larger data volumes but lack the ability to easily “scale out” for large data processing.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
Latency (Reporting): Scheduled, Prompted, Interactive
Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
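The "frequent small CRUD queries" workload described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: SQLite stands in for the server databases named on the slide, and the `orders` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for an OLTP database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total REAL NOT NULL)"
)

# Create: each small write runs in its own transaction.
with conn:
    conn.execute(
        "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)",
        (1, "acme", 19.99),
    )

# Read: a point lookup by primary key.
row = conn.execute(
    "SELECT customer, total FROM orders WHERE id = ?", (1,)
).fetchone()

# Update: change a single row in place.
with conn:
    conn.execute("UPDATE orders SET total = ? WHERE id = ?", (24.99, 1))
updated_total = conn.execute(
    "SELECT total FROM orders WHERE id = ?", (1,)
).fetchone()[0]

# Delete: remove the row again.
with conn:
    conn.execute("DELETE FROM orders WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Each statement touches one row by key, which is the access pattern a fixed schema with integrity constraints serves well.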
Analytical Database
Columnar, In-Memory, MPP, OLAP
Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
Volume (Data Size): Small, Medium, Large
Data warehouse/mart databases to support BI and advanced analytics workloads. MPP architecture gives the ability to “scale out” to large data volumes at a financial cost.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, offering improved speed and scalability over RDBMS but still limited to structured data.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
Latency (Reporting): Scheduled, Prompted, Interactive
All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance for analytic or interactive query workloads on large data.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase
Volume (Data Size): Small, Medium, Large
Good for web applications: less web app code to write, debug, and maintain. Scale out: horizontal scaling with auto-sharding of data to support millions of web app users. Compromises on consistency (ACID transactions) in favor of scale and up-time.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Hierarchical, key-value, or document design to capture all types of data in a single location.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Schema-less design allows for rapid or continuous ingest at scale. Good storage option for the high-throughput, low-latency requirements of streaming applications for real-time views of data. Seen as a key component of the Lambda architecture.
Latency (Reporting): Scheduled, Prompted, Interactive
Low-level query languages, lack of skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
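The idea behind auto-sharding can be illustrated with a minimal sketch that routes each document to a shard by hashing its key. This is not how any of the products listed above implement sharding (real systems use range partitioning, consistent hashing, or similar); `shard_for` and `ShardedStore` are hypothetical names used only for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key (simple hash-modulo routing)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-value store that spreads documents across in-memory shards."""

    def __init__(self, num_shards: int):
        # Each dict stands in for one node of a horizontally scaled cluster.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, document: dict) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = document

    def get(self, key: str):
        # The same hash sends reads to the shard that holds the key.
        return self.shards[shard_for(key, len(self.shards))].get(key)
```

Because every key maps deterministically to one shard, reads and writes for different users can proceed on different nodes without coordination, which is where the scale-out benefit comes from.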
HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, Microsoft HDInsight
Volume (Data Size): Small, Medium, Large
The Hadoop Distributed File System is designed to distribute and replicate file blocks, horizontally scaled across multiple commodity data nodes. MapReduce programming takes compute to the data for batch processing of large data volumes.
Variety (Data Type): Structured, Semi-Structured, Unstructured
The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
HDFS and MapReduce are designed for distributing batch processing workloads on large data sets, not for micro-batch or streaming use cases.
Latency (Reporting): Scheduled, Prompted, Interactive
MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
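The MapReduce programming model mentioned above can be sketched in plain Python as a word count over a list of text blocks. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(block: str):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    for word in block.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    # In Hadoop each block would be mapped on the node that stores it;
    # here the blocks are simply processed one after another.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big cluster"])` counts "big" twice because both blocks contribute a pair for that key before the reduce step.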
SQL on Hadoop
Batch-oriented, Interactive, and In-Memory
Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
Volume (Data Size): Small, Medium, Large
SQL queries on a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, and Spark and run on different storage formats such as HDFS and HBase.
Variety (Data Type): Structured, Semi-Structured, Unstructured
SQL was designed for structured data. Hadoop files may contain nested data, variable data, and schema-less data. A SQL-on-Hadoop engine must be able to translate all these forms of data to flat relational data and optimize queries (Impala/Drill).
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
SQL-on-Hadoop engines require smart and advanced workload managers for multi-user workloads and are designed for query processing, not stream processing.
Latency (Reporting): Scheduled, Prompted, Interactive
Ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to 1.6 minutes or more.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
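The translation of nested or variable records into flat relational rows can be sketched as follows. This is a simplified illustration under the assumption that records are nested dictionaries; it does not reflect how Impala or Drill do this internally, and `flatten` and `to_rows` are hypothetical helpers.

```python
def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Turn a nested record into a single-level dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def to_rows(records):
    """Build a common column list and one tuple per record; gaps become None."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({c for r in flat_records for c in r})
    rows = [tuple(r.get(c) for c in columns) for r in flat_records]
    return columns, rows
```

Records with missing fields still produce a row of the same width, with `None` playing the role of SQL NULL, which is what lets a relational query run over schema-less input.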
Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch
Volume (Data Size): Small, Medium, Large
Search engines have to deal with large systems with millions of documents and are designed for index and search query processing at scale with clustering and distributed architecture.
Variety (Data Type): Structured, Semi-Structured, Unstructured
XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), file system, Git, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, RabbitMQ, Redis, and Twitter.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Elasticsearch is scalable to very large clusters with near-real-time search. The demands of real-time web applications require search results in near real time as new content is generated by users. There is some contention when handling concurrent search and index requests.
Latency (Reporting): Scheduled, Prompted, Interactive
Both use a key-value pair query language. Solr is much more oriented towards text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not interactive analytical reporting.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
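Index and search query processing can be illustrated with a toy inverted index, the core data structure behind Lucene-style engines. This is a single-node sketch with naive whitespace tokenization; the `InvertedIndex` class is hypothetical and is not the Elasticsearch or Solr API.

```python
from collections import defaultdict


class InvertedIndex:
    """Map each term to the set of document ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text: str) -> None:
        # New documents become searchable as soon as they are indexed.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query: str) -> set:
        """Return ids of documents containing every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            results &= self.postings.get(term, set())
        return results
```

A query only touches the posting sets for its own terms, so lookup cost depends on how common the terms are rather than on the total number of documents.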
Message Streaming
Kafka, JMS, AMQP
Volume (Data Size): Small, Medium, Large
Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Data sources such as the Internet of Things, sensors, clickstream, and transactional systems.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Real-time streaming providing high throughput for both publishing and subscribing, with constant performance even with many terabytes of stored messages. Designed for streaming, and the batch size can be configured for brokering micro-batches of messages.
Latency (Reporting): Scheduled, Prompted, Interactive
Stream topics need to be processed by additional technology such as PDI, ESP, CEP, or query processing engines for reporting.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
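Grouping a message stream into micro-batches of a configurable size can be sketched in plain Python. This illustrates the batching idea only and is not the Kafka client API; `micro_batches` is a hypothetical helper.

```python
from itertools import islice


def micro_batches(messages, batch_size: int):
    """Yield lists of up to batch_size messages from any iterable stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch
```

With `batch_size=1` this degenerates to message-at-a-time delivery, while larger values trade a little latency for fewer, larger hand-offs to the downstream consumer.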
Event Stream Processing (ESP)
Apache Storm
Volume (Data Size): Small, Medium, Large
Apache Storm is a distributed “event-at-a-time” stream processing system for processing large volumes in parallel with sub-second latency.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, from sources such as the Internet of Things, sensors, and transactional systems.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Storm is extremely fast, with the ability to process over a million messages per second per node. Compromises on fault tolerance by offering “at least once” semantics in favor of speed.
Latency (Reporting): Scheduled, Prompted, Interactive
ESP provides the most recent processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
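What "at least once" semantics means in practice can be shown with a small sketch: an event is redelivered until it is acknowledged, so a lost acknowledgement causes the same event to be processed twice. This is a plain-Python illustration, not the Storm API; `deliver_at_least_once`, `handler`, and the simulated lost acknowledgement are all assumptions.

```python
def deliver_at_least_once(events, handler, max_attempts: int = 3) -> None:
    """Process events one at a time, redelivering until the handler acks."""
    for event in events:
        for _ in range(max_attempts):
            if handler(event):  # True means the event was acknowledged
                break
        else:
            raise RuntimeError(f"event {event!r} was never acknowledged")


processed = []
lost_acks = {"b"}  # hypothetical: the first ack for "b" is lost


def handler(event) -> bool:
    processed.append(event)  # the side effect happens before the ack
    if event in lost_acks:
        lost_acks.discard(event)  # lose the ack only once
        return False
    return True


deliver_at_least_once(["a", "b", "c"], handler)
# processed is now ["a", "b", "b", "c"]: no event is lost, but "b" is duplicated
```

No event is dropped, but downstream logic has to tolerate duplicates, for example by making the handler idempotent.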
Complex Event Processing (CEP)
Spark, Flink
Volume (Data Size): Small, Medium, Large
Spark and Flink are distributed “micro-batch” stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.
Variety (Data Type): Structured, Semi-Structured, Unstructured
Complex event processing for the Internet of Things, sensors, and transactional systems. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations.
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Micro-batch processing engines with a few seconds of latency, which is not as fast as Storm, but with better fault tolerance, guaranteeing “exactly once” semantics for stateful computations. Great for machine learning computations.
Latency (Reporting): Scheduled, Prompted, Interactive
CEP provides the most recent processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock goes up by 10% and stays up for 3 hours or more".
Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
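The detection-oriented stock alert in the example use case can be sketched as a small stateful pattern detector. This is a plain-Python illustration rather than Spark or Flink code; `detect_sustained_rise`, its parameters, and the assumption that ticks arrive as time-ordered (timestamp, price) pairs are all hypothetical.

```python
from datetime import datetime, timedelta


def detect_sustained_rise(ticks, baseline, rise=0.10, hold=timedelta(hours=3)):
    """Return the first timestamp at which the price has stayed at least
    `rise` above `baseline` for `hold`, or None if that never happens.

    ticks: iterable of (timestamp, price) pairs in time order. The price is
    assumed to stay above the threshold between consecutive qualifying ticks.
    """
    threshold = baseline * (1 + rise)
    above_since = None  # state carried across events
    for timestamp, price in ticks:
        if price >= threshold:
            if above_since is None:
                above_since = timestamp  # start of a candidate pattern
            if timestamp - above_since >= hold:
                return timestamp  # pattern complete: fire the alert
        else:
            above_since = None  # a dip below the threshold resets the pattern
    return None
```

The detector keeps a single piece of state, the time the price first crossed the threshold, which is the kind of stateful computation where exactly-once processing matters: a replayed or dropped tick could fire or suppress the alert incorrectly.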
Mapping a Solution
Matrix for Analytics Performance (MAP)
Technologies: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MapReduce), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Latency (Reporting): Scheduled, Prompted, Interactive
Legend: Good Fit, Not Optimal, Core Competency, Not Recommended
Pentaho Data Integration (PDI)
• Big Data Sources
• Hadoop / Data Lake
• Analytic Data Sets
• Traditional Data
• Data Warehouse
• Data Marts
• Line of Business
• Analytics
• Extranet Deployments
• Embedded Analytics
• On-Demand Data Mart
• Self-Service Analytics
• Centralized Analytics at Scale
Big Data Projects
A Single Flow
Data Prep, Data Engineering, Analytics
Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis, Analysis & Dashboards
Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation
Key Takeaways
• Data architecture modernization involves many technologies
• Understanding the ecosystem of data technologies
• Mapping an end-to-end solution
• Pentaho key capabilities