Big Data Technology Ecosystem

Mark Burnette
Pentaho Director, Sales Engineering, Hitachi Vantara

Agenda

• End-to-End Data Delivery Platform
• Ecosystem of Data Technologies
• Mapping an End-to-End Solution
• Case Studies
• Pentaho Key Capabilities
• Summary
• Q&A

End-to-End Data Delivery Platform

Ingest, Process, Report, Publish

• Data Agnostic
• Metadata Driven Ingestion
• Data Orchestration

• Native Hadoop Integration
• Scale Up & Scale Out
• Blend Unstructured Data

• Streamlined Data Refinery
• Data Virtualization
• Machine Learning

• Production Reporting
• Custom Dashboards
• Self-Service Dashboards
• Interactive Analysis
• Embedded Analytics

Delivering Insight

Ingest, Process, Report, Publish

Consumers, Data Analyst, Data Scientists, Data Engineers

Production Reporting
Custom Dashboards
Interactive Analysis
Self-Service Dashboards

Data Integration & Orchestration

Big Data Ecosystem

Analytical Databases
SQL on Hadoop
Relational Database
NoSQL Database
Message Streaming
HDFS MapReduce
Distributed Search
Event Stream Processing (ESP)
Complex Event Processing (CEP)

Data Source Attributes

Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, RT Streaming
Latency (Reporting): Scheduled, Prompted, Interactive


Relational Database
MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2

Volume (Data Size): Small, Medium, Large
Operational databases for OLTP apps that require high transaction loads and user concurrency. They can “scale up” to larger data volumes but lack the ability to easily “scale out” for large data processing.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.

Latency (Reporting): Scheduled, Prompted, Interactive
Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
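The small CRUD workload that relational databases are optimized for can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module with an in-memory database; the orders table, its columns, and the run_crud_demo function are hypothetical names chosen for the example, not part of the original slides.

```python
import sqlite3


def run_crud_demo():
    """Run one create, read, update, delete cycle on a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
    )

    # Create: "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
            (1, "acme", 120.0),
        )

    # Read: a small point lookup by primary key.
    read_back = conn.execute(
        "SELECT customer, amount FROM orders WHERE id = ?", (1,)
    ).fetchone()

    # Update: change a single row inside its own transaction.
    with conn:
        conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (95.5, 1))
    updated = conn.execute(
        "SELECT amount FROM orders WHERE id = ?", (1,)
    ).fetchone()[0]

    # Delete: remove the row again.
    with conn:
        conn.execute("DELETE FROM orders WHERE id = ?", (1,))
    remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    conn.close()
    return read_back, updated, remaining
```

Each statement touches a single row by key inside a short transaction, which is the access pattern the slide contrasts with large analytic scans.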

Analytical Database
Columnar, In-Memory, MPP, OLAP
Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica

Volume (Data Size): Small, Medium, Large
Data warehouse/mart databases to support BI and advanced analytics workloads. MPP architecture gives the ability to “scale out” to large data volumes at a financial cost.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, offering improved speed and scalability over RDBMS but still limited to structured data.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.

Latency (Reporting): Scheduled, Prompted, Interactive
All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance for analytic or interactive query workloads on large data.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
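As a rough illustration of the columnar idea named on the analytical database slide, the sketch below pivots row records into one list per column, so an aggregate reads only the columns it needs. It is plain Python, not any vendor's engine; SALES_ROWS, to_columns, and total_revenue_by_region are hypothetical names introduced for this example.

```python
# Hypothetical sample data in a row-oriented layout (one dict per record).
SALES_ROWS = [
    {"region": "east", "units": 3, "revenue": 30.0},
    {"region": "west", "units": 1, "revenue": 12.5},
    {"region": "east", "units": 2, "revenue": 20.0},
]


def to_columns(rows):
    """Pivot row records into one list per column (a columnar layout)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_revenue_by_region(columns):
    """Aggregate revenue per region, reading only the two columns involved."""
    totals = {}
    for region, revenue in zip(columns["region"], columns["revenue"]):
        totals[region] = totals.get(region, 0.0) + revenue
    return totals
```

The aggregate never touches the units column, which is the property that lets columnar stores scan less data for analytic queries.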

NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase

Volume (Data Size): Small, Medium, Large
Good for web applications: less web app code to write, debug and maintain. Scale out: horizontal scaling with auto-sharding of data to support millions of web app users. Compromises on consistency (ACID transactions) in favor of scale and up-time.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Hierarchical, key-value or document design to capture all types of data in a single location.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Schema-less design allows for rapid or continuous ingest at scale. Good storage option for the high-throughput, low-latency requirements of streaming applications for real-time views of data. Seen as a key component of the Lambda architecture.

Latency (Reporting): Scheduled, Prompted, Interactive
Low-level query languages, lack of skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
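The auto-sharding and schema-less storage described on the NoSQL slide can be sketched with a toy in-memory store. This is an assumption-laden illustration: ShardedDocumentStore is a hypothetical class, and simple hash-modulo placement stands in for the range-based or consistent-hashing schemes that real products use.

```python
import hashlib


class ShardedDocumentStore:
    """Toy key-to-document store that spreads keys across in-memory shards."""

    def __init__(self, shard_count):
        # Each shard is a plain dict; a real system would use separate nodes.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_index(self, key):
        # A stable hash keeps placement deterministic across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, document):
        # Documents are schema-less: any dict shape is accepted.
        self.shards[self._shard_index(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_index(key)].get(key)
```

Because the shard is derived from the key alone, reads and writes for one key always land on the same shard, and no shard needs to know another's schema.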

HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight

Volume (Data Size): Small, Medium, Large
Hadoop Distributed File System is designed to distribute and replicate file blocks, horizontally scaled across multiple commodity data nodes. MapReduce programming takes compute to the data for batch processing of large data volumes.

Variety (Data Type): Structured, Semi-Structured, Unstructured
File system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
HDFS and MapReduce are designed for distributing batch processing workloads on large data sets, not for micro-batch or streaming use cases.

Latency (Reporting): Scheduled, Prompted, Interactive
MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
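The MapReduce programming model mentioned on this slide can be sketched in a single process with the classic word count. This is plain Python that mimics the map, shuffle, and reduce phases; it does not use Hadoop's API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map calls run on the nodes that hold each file block and the shuffle moves data over the network, which is why the model suits batch work rather than low-latency queries.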

SQL on Hadoop
Batch-oriented, Interactive, and In-Memory
Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL

Volume (Data Size): Small, Medium, Large
SQL queries on a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, and Spark, and run on different storage formats such as HDFS and HBase.

Variety (Data Type): Structured, Semi-Structured, Unstructured
SQL was designed for structured data. Hadoop files may contain nested data, variable data, and schema-less data. A SQL-on-Hadoop engine must be able to translate all these forms of data to flat relational data and optimize queries (Impala/Drill).

Velocity (Processing): Batch, Micro-Batch, RT Streaming
SQL-on-Hadoop engines require smart and advanced workload managers for multi-user workloads and are designed for query processing, not stream processing.

Latency (Reporting): Scheduled, Prompted, Interactive
Ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to over 1.6 minutes or more.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
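The translation of nested data to flat relational rows that this slide asks of a SQL-on-Hadoop engine can be sketched as below. This is a simplified illustration in plain Python, not how Impala or Drill implement it; flatten_record, explode, and the dotted column naming are assumptions made for the example.

```python
def flatten_record(record, prefix=""):
    """Flatten nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


def explode(record, list_field):
    """Emit one flat row per element of a repeated (list) field."""
    base = {key: value for key, value in record.items() if key != list_field}
    rows = []
    for item in record.get(list_field, []):
        row = flatten_record(base)
        row.update(flatten_record({list_field: item}))
        rows.append(row)
    return rows
```

A nested order with two line items becomes two rows that repeat the order-level columns, which is the shape a SQL query planner can work with.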

Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch

Volume (Data Size): Small, Medium, Large
Search engines have to deal with large systems with millions of documents and are designed for index and search query processing at scale with clustering and a distributed architecture.

Variety (Data Type): Structured, Semi-Structured, Unstructured
XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), FileSystem, Git, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, RabbitMQ, Redis, and Twitter.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
ES is scalable to very large clusters with near real-time search. The demands of real-time web applications require search results in near real time as new content is generated by users. Some contention when handling concurrent search and index requests.

Latency (Reporting): Scheduled, Prompted, Interactive
Both Elasticsearch and Solr use a key-value pair query language. Solr is much more oriented towards text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not interactive analytical reporting.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
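The index and search query processing described on the distributed search slide rests on an inverted index, which can be sketched in a few lines. This is a single-node toy in plain Python with whitespace tokenization; InvertedIndex is a hypothetical class and does not reflect the Lucene, Solr, or Elasticsearch APIs.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index time: record the document under every term it contains.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        # Query time: intersect the posting sets of all query terms (AND).
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

Lookups cost roughly the size of the matching posting sets rather than the size of the corpus, which is why the structure suits interactive search but not general analytical reporting.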

Message Streaming
Kafka, JMS, AMQP

Volume (Data Size): Small, Medium, Large
Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Data sources such as the internet of things, sensors, clickstream, and transactional systems.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Real-time streaming providing high throughput for both publishing and subscribing, with constant performance even with many terabytes of stored messages. Designed for streaming, and the batch size can be configured for brokering micro-batches of messages.

Latency (Reporting): Scheduled, Prompted, Interactive
Stream topics need to be processed by additional technology such as PDI, ESP, CEP, or query processing engines for reporting.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
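The publish/subscribe model with a configurable batch size can be sketched as an append-only log with one read offset per consumer. This is an in-memory illustration of the general idea only; TopicLog and its methods are hypothetical and are not the Kafka, JMS, or AMQP client APIs.

```python
class TopicLog:
    """Toy append-only topic with an independent read offset per consumer."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        # Producers only ever append; stored messages are never modified.
        self._messages.append(message)

    def poll(self, consumer, max_batch):
        # Return up to max_batch unread messages and advance this consumer's offset.
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_batch]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Setting max_batch to 1 gives message-at-a-time delivery, while a larger value hands downstream processing a micro-batch; each consumer progresses through the same log independently.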

Event Stream Processing (ESP)
Message Streaming: Apache Storm

Volume (Data Size): Small, Medium, Large
Apache Storm is a distributed “event-at-a-time” stream processing system for processing large volumes in parallel with sub-second latency.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, from sources such as the internet of things, sensors, and transactional systems.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Storm is extremely fast, with the ability to process over a million messages per second per node. Compromises on fault tolerance by offering “at least once semantics” in favor of speed.

Latency (Reporting): Scheduled, Prompted, Interactive
ESP provides the most recent processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
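The stock ticker use case on the ESP slide can be sketched as event-at-a-time processing: each tick is handled as it arrives and compared with the last price seen for the same symbol. This is plain Python rather than a Storm topology; ticker_arrows and the (symbol, price) tuple shape are assumptions for the example.

```python
def ticker_arrows(ticks):
    """Yield (symbol, price, direction) for each tick, one event at a time.

    direction is "up", "down", or "flat" relative to the previous price
    seen for the same symbol; the first tick for a symbol is "flat".
    """
    last_price = {}
    for symbol, price in ticks:
        previous = last_price.get(symbol)
        if previous is None or price == previous:
            direction = "flat"
        elif price > previous:
            direction = "up"
        else:
            direction = "down"
        # Only the latest price per symbol is kept as state.
        last_price[symbol] = price
        yield symbol, price, direction
```

Because the function is a generator, each result is available as soon as its input event is consumed, which mirrors the low-latency behavior the slide attributes to ESP.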

Complex Event Processing (CEP)
Message Streaming: Spark, Flink

Volume (Data Size): Small, Medium, Large
Spark and Flink are distributed “micro-batch” stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Complex event processing for internet of things, sensors, and transactional systems. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Micro-batch processing engines with a few seconds of latency that are not as fast as Storm, but have better fault tolerance, guaranteeing “exactly once semantics” for stateful computations. Great for machine learning computations.

Latency (Reporting): Scheduled, Prompted, Interactive
CEP provides the most recent processed data for all types of reporting. Example CEP use case: a user sets up an alert on the stock market saying "let me know if GOOG stock went up by 10% and stayed up for 3 hours or more".

Legend: Good Fit, Not Optimal, Not Recommended, Core Competency
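The detection-oriented alert in the CEP example ("up by 10% and stayed up for 3 hours or more") can be sketched as a small stateful pattern matcher. This is plain Python, not Spark or Flink code; detect_sustained_rise, its parameters, and the (timestamp_seconds, price) event shape are assumptions, and events are assumed to arrive in time order.

```python
def detect_sustained_rise(events, baseline, rise=0.10, hold_seconds=3 * 3600):
    """Return the timestamp at which the alert fires, or None if it never does.

    The alert fires once the price has stayed at least `rise` above
    `baseline` for `hold_seconds` without dipping below that level.
    """
    above_since = None
    for timestamp, price in events:
        if (price - baseline) / baseline >= rise:
            if above_since is None:
                # Start of a candidate run above the threshold.
                above_since = timestamp
            if timestamp - above_since >= hold_seconds:
                return timestamp
        else:
            # Any dip below the threshold resets the pattern.
            above_since = None
    return None
```

Unlike the single-event ticker logic of ESP, the match here depends on a combination of events over time, which is the distinction the slide draws between ESP and CEP.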

Mapping a Solution

Big Data Ecosystem: Analytical Databases, SQL on Hadoop, Message Streaming, HDFS MapReduce, Distributed Search

Matrix for Analytics Performance (MAP)

Columns: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MR), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)

Rows:
Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, RT Streaming
Latency (Reporting): Scheduled, Prompted, Interactive

Legend: Good Fit, Not Optimal, Core Competency, Not Recommended

PENTAHO DATA INTEGRATION
BIG DATA SOURCES
HADOOP / DATA LAKE
ANALYTIC DATA SETS
TRADITIONAL DATA
DATA WAREHOUSE
DATA MARTS
LINE OF BUSINESS
ANALYTICS

EXTRANET DEPLOYMENTS
EMBEDDED ANALYTICS
ON-DEMAND DATA MART
SELF-SERVICE ANALYTICS
CENTRALIZED ANALYTICS AT SCALE

Big Data Projects

A Single Flow

Data Prep, Data Engineering, Analytics

Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis

Analysis & Dashboards

Administration, Security, Lifecycle Management

Data Provenance

Dynamic Data Pipeline, Monitoring, Automation

Key Takeaways

• Data architecture modernization involves many technologies
• Understanding the ecosystem of data technologies
• Mapping an end-to-end solution
• Pentaho key capabilities
