The Road to Digital Transformation
Dell EMC Cloudera Syncsort ETL Offload Hadoop Solution
December 2016
Armando Acosta
Dell EMC
Sean Anderson
Cloudera
Mark Muncy
Syncsort
Ted Arden
Dell EMC
Dell - Internal Use - Confidential
The digital transformation will cause disruption
48% don’t know what their industry will look like in 3 years
78% feel threatened by digital startups
45% fear they may become obsolete in 3-5 years
Business leaders see a chaotic, uncertain future ahead
Source: Digital Transformation Index, October 2016. Research by Vanson Bourne & Dell Technologies exploring the implications of digital disruption around the world, how companies are transforming to meet changing customer demands, and business leaders’ plans to succeed in the connected future.
Businesses still have a huge opportunity to get this right
73% say a centralized tech strategy needs to be a priority
72% plan to expand their software development capabilities
66% are incentivized to invest in IT infrastructure and digital skills leadership
This is how leaders plan to leap ahead
Source: Digital Transformation Index, October 2016. Research by Vanson Bourne & Dell Technologies exploring the implications of digital disruption around the world, how companies are transforming to meet changing customer demands, and business leaders’ plans to succeed in the connected future.
Leaders agreed the following digital business attributes are imperatives to success
Predictively spot new opportunities
Demonstrate transparency and trust
Deliver unique and personalized experiences
Innovate in agile ways
Operate in real time
Big data and analytics will be at the core of enabling all these attributes
Source: Digital Transformation Index, October 2016. Research by Vanson Bourne & Dell Technologies exploring the implications of digital disruption around the world, how companies are transforming to meet changing customer demands, and business leaders’ plans to succeed in the connected future.
Data-driven organizations are more effective
50% greater revenue growth for businesses that leverage data effectively
But 44% of organizations do not know how to start…
Become data-driven. A journey begins with a single step.
Align IT / business goals
Improve operational efficiency
Transform your organization
Data from Dell Global Technology Adoption Index, November 2015
Align business and IT
Organizational goals
• Improve outcomes
• Control costs
• Empower end users
Dell helps by
• Utilizing ALL data to deliver deeper insights and enhanced data-driven decision making.
• Reducing TCO and seamlessly integrating with existing investments to enable greater ROI.
• Providing secure anywhere, anytime access to data and analytics for improved productivity.
Ted Arden, Dell EMC
Traditional tools are not working
#1 challenge: Organizations cite TCO as the biggest obstacle to data integration tools
70% of all data warehouses are performance and capacity constrained (*Gartner)
80%: Data integration and transformation drive a majority of the EDW capacity
Dell accelerates time to value by lowering data transformation costs and improving performance by augmenting the Enterprise Data Warehouse (EDW)
The Dell EMC Cloudera Syncsort ETL Offload Hadoop Solution reduces Hadoop deployment to weeks, lets teams develop Hadoop ETL jobs within hours, and helps them become fully productive within days after deployment
Too many workloads in the EDW
Modernize the data pipeline with Hadoop
Traditional data pipeline
• Disparate data sources
• Data staging tool: extract and load data, clean and parse data
• Enterprise data warehouse + ETL: data transformation jobs
• Business reporting and query
The results
• Longer data transformation job times
• Not meeting SLAs for business reporting
• Slow ad hoc query
• Too costly to scale
Modern data pipeline
• Disparate data sources
• Hadoop + ETL: data transformation jobs (clean, parse, transform)
• Enterprise data warehouse: business reporting and query
The results
• Reduced data transformation job times
• Improved SLAs for business reporting
• Fast ad hoc query
• Scales economically
Solution stack, components, and customer value
• Reference architecture (ETL Offload): PE R730XD, networking. Customer value: faster deployment, from months to weeks.
• Hadoop distribution: Cloudera 5.9. Customer value: data management and security.
• Data transformation: Syncsort DMX-h version 9.1. Customer value: convert SQL jobs into native Hadoop execution.
• Deployment (business application): Dell Services. Customer value: build operational efficiency with Hadoop.
No other vendor offers this solution
Dell data solutions drive operational efficiency
• Control costs: reduce data warehouse administrative costs by up to 76%
• Improve productivity: transform data 60% faster for analysis
• Simplify ongoing operations: develop and design complex data transformation jobs up to 54% faster
Dell EMC Cloudera Syncsort ETL offload Hadoop Solution
Solution benefits
• Integrates easily with Hadoop®
• No coding necessary for easy deployment
• No need for expertise on Apache Pig™, Hive™, and Sqoop™
• Closes the skills gap using Syncsort
Differentiation
• Reduces EDW admin costs by up to 76% [1]
• Transforms data 60% faster for analysis [2]
• Designs transformation jobs up to 54% faster [3]
Primary use case: Scale-out solution to optimize data management, processing, and analytics
Pod Network
2x Dell EMC Networking S4048 10GbE Pod Switches
1x S3124 iDRAC Switch
Data Nodes
10x Dell EMC PowerEdge R730xd with 3.5” Drives – 48 TB, or
10x PowerEdge R730xd with 2.5” Drives – 24 TB, or
20x PowerEdge FC630 / FD332 – 32 TB
Infrastructure Nodes
1x Dell EMC PowerEdge™ R630 Admin Node
3x PowerEdge R730xd Name Nodes
1x PowerEdge R730xd Edge Node
or
1x PowerEdge FC630 Admin Node
3x PowerEdge FC630 Name Nodes
1x PowerEdge FC630 Edge Node
Cluster Network
2x Dell EMC Networking S6000 40GbE Cluster Switches
Cloudera™ Enterprise
Syncsort™ DMX-h™
[1] Cost advantages report
[2] Performance advantages report
[3] Design advantages report
Operational Efficiency: From use case to action
Source, 1. Connect, 2. Analyze, 3. Act
Preventive Maintenance
IT Resource Capacity and Utilization
Operational Process Improvement
Business Process Cost Optimization
Cyber Security Analytics
Improved Forecasting
Compliance and Reporting
Operational data sources: enterprise data warehouse, relational management database, data mart
Extract, transform, load: parse, clean, translate, sort, aggregate, group
Business reporting and query: enterprise data warehouse, relational management database, data mart
Compute + Data
Services • Management • Infrastructure • Security • Dell Financial Services
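The transformation steps named on this slide (parse, clean, translate, sort, aggregate, group) can be illustrated with a minimal Python sketch. The "region,amount" record layout, the `REGION_NAMES` lookup, and the function name `run_etl` are assumptions made for illustration only; they are not part of the Dell EMC, Cloudera, or Syncsort products.

```python
from collections import defaultdict

# Hypothetical lookup table used for the "translate" step.
REGION_NAMES = {"US": "United States", "DE": "Germany"}


def run_etl(lines):
    """Parse, clean, translate, group, aggregate, and sort raw text rows."""
    totals = defaultdict(float)
    for line in lines:
        # Parse: split "region,amount" into trimmed fields.
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # Clean: drop malformed rows.
        code, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue  # Clean: drop rows with non-numeric amounts.
        # Translate: map region codes to display names.
        region = REGION_NAMES.get(code.upper(), "Unknown")
        # Group and aggregate: sum amounts per region.
        totals[region] += value
    # Sort: order the result by region name.
    return sorted(totals.items())


if __name__ == "__main__":
    raw = ["us, 10.5", "DE,4", "bad row", "US,2.5", "fr,abc"]
    print(run_etl(raw))
```

In an offload scenario this kind of logic would run on the Hadoop cluster rather than inside the data warehouse; the sketch only shows the order of the steps.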
Sean Anderson, Cloudera
© Cloudera, Inc. All rights reserved.
Traditional Monolithic Analytic Databases
No Cloud Elasticity or Cloud Storage Integration
Rigid Data Model with Tightly Coupled Storage/Compute
Limited to SQL with Data Movement Necessary
Static Sizing
Challenges Across the Business
Enterprise Architect: Existing Systems Hitting Their Limits
• How long does it take to bring in more data/use cases? And what would the cost be?
• What is your process for scaling today?
• What is your plan for cloud?
IT/DBA: Missed SLAs & Overloaded Bottleneck
• How much time do you spend troubleshooting vs developing new uses?
• How long does it take to deliver on business requests?
SQL Developer & Business Analyst: Limited Data & Insights of Latent Value
• What limits on users, data, and time period exist?
• How long does it take to get new reports/data?
• Are you able to run actionable real-time analysis?
Security Team & Data Steward: Meet Compliance Needs & Protect Data
• How do you manage siloed security & governance across workloads and systems?
• Is sensitive data available for analysis?
Cloudera’s Analytic Database Solution
• Navigator Optimizer: identify, offload, & optimize workloads to Hadoop
• Hue: intelligent SQL editor
• Navigator: audit, lineage, encryption, key management, & policy lifecycles
• BI Partners: integration with the leading BI tools
• Impala: interactive query engine for BI & SQL analytics
• Hive-on-Spark: large-scale ETL & batch processing engine
Multi-Storage, Multi-Environment
The DCC Rule
• Duplication: Improve performance by easily detecting workload duplication and recommending top queries to optimize
• Compatibility: Reduce development time by leveraging existing query compatibilities with Hadoop tools and get guidance for query rewrites
• Complexity: Maximize your optimization opportunities by exposing complex access patterns that make the best use of Hadoop’s architecture
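The duplication part of the DCC rule can be pictured with a small Python sketch that groups queries differing only in literals and whitespace. This is an assumed heuristic for illustration, not how Cloudera Navigator Optimizer works internally; the function names `normalize` and `top_duplicates` are hypothetical.

```python
import re
from collections import Counter


def normalize(sql):
    """Reduce a SQL string to a rough fingerprint (an assumed heuristic)."""
    text = sql.strip().rstrip(";").lower()
    text = re.sub(r"'[^']*'", "?", text)          # replace string literals
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)  # replace numeric literals
    return re.sub(r"\s+", " ", text)              # collapse runs of whitespace


def top_duplicates(queries, limit=3):
    """Return the most frequent fingerprints that occur more than once."""
    counts = Counter(normalize(q) for q in queries)
    return [(fp, n) for fp, n in counts.most_common(limit) if n > 1]


if __name__ == "__main__":
    log = [
        "SELECT * FROM sales WHERE region = 'EU';",
        "select *  from sales where region = 'US'",
        "SELECT id FROM customers WHERE age > 30",
        "SELECT * FROM sales WHERE region = 'APAC'",
    ]
    for fingerprint, count in top_duplicates(log):
        print(count, fingerprint)
```

Queries that share a fingerprint are candidates for being optimized once rather than many times. A real workload analyzer would need a proper SQL parser, which this sketch deliberately avoids.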
Cloudera Navigator Optimizer
Unlock Your Best Hadoop Strategy, Instantly
Active Data Optimization for Hadoop to save you time and money
• Instant workload insights
• Intelligent optimization guidance
• Reduce Hadoop workload development effort
Mark Muncy, Syncsort
Goals of the Modern Data Architecture
• Centralize all your data: Collect raw data from every source within the enterprise, regardless of complexity. Only when you can collect and retain all your data can you see the full picture.
• Turn raw data into insight: Cleanse, blend, and transform your data, and give it context and meaning so decision makers can execute.
• Maintain governance, compliance, and security standards: Increase consistency and confidence in decision making by preserving the confidentiality, integrity, and availability of information. Protect data from unauthenticated and unauthorized access.
• Eliminate complexities within IT: Your Modern Data Architecture should automate and optimize your data needs, keep pace with the evolution of technology, and homogenize platforms and infrastructures.
Syncsort Confidential and Proprietary - do not copy or distribute
Shift Data and ELT Workloads out of Data Warehouses
Simplify Big Data Integration with Syncsort
• Access: Get best-in-class data ingestion capabilities for Hadoop. Mainframes, RDBMS, MPP, JSON, Parquet, Avro, ORC, NoSQL, Kafka and more.
• Integrate: Single interface for streaming and batch processes. Single data pipeline for all enterprise data, batch or streaming.
• Comply: Secure data access, data governance and lineage. Seamless integration with Kerberos, Apache Ranger, Apache Ambari, Cloudera Manager, Cloudera Navigator and Sentry.
• Simplify: Design once, deploy anywhere, and insulate your organization from a rapidly changing ecosystem. Future-proof your applications for new compute frameworks, on premises or in the cloud.
Access: Bring ALL Enterprise Data Securely to the Data Lake
• Collect virtually any data from mainframe to relational, cloud and NoSQL sources
• Batch & streaming sources
• Access, re-format and load data directly into Hive & Parquet. No staging required!
• Pull hundreds of tables at once into your data hub, whole DB schemas in one invocation
• Load more data into Hadoop in less time
Build Your Enterprise Data Hub
Access: Get Your Database data into Hadoop, At the Press of a Button
• Pull multiple data sources and funnel into your data lake: extract and move whole DB schemas in one invocation
• One-step data movement, auto-generating jobs
• Process multiple funnels in parallel on your edge node or from data nodes
– Leverages DMX-h high speed data engine via DTL
– Generated applications can be imported into GUI
• In-flight transformations
– Filtering, funnel dependency ordering, mixed source/target, data type filtering, table exclusion/inclusion
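Table inclusion/exclusion and dependency ordering, as listed above, can be sketched in a few lines of Python. This is an illustrative model only and does not reflect how DMX DataFunnel is implemented; `plan_funnel`, its parameters, and the glob-style patterns are assumptions.

```python
from fnmatch import fnmatch
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_funnel(tables, depends_on, include=("*",), exclude=()):
    """Select tables by include/exclude patterns, then order them so that
    every table is loaded after the tables it depends on."""
    selected = [
        t for t in tables
        if any(fnmatch(t, p) for p in include)
        and not any(fnmatch(t, p) for p in exclude)
    ]
    chosen = set(selected)
    # Keep only dependencies that are themselves part of the selection.
    graph = {
        t: {d for d in depends_on.get(t, ()) if d in chosen}
        for t in selected
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    tables = ["customers", "orders", "order_items", "tmp_scratch"]
    deps = {"orders": ["customers"], "order_items": ["orders"]}
    print(plan_funnel(tables, deps, exclude=("tmp_*",)))
```

Ordering by dependency means a child table is never loaded before its parent, which matters when the target enforces or assumes referential integrity.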
DMX DataFunnel™
Integrate: Achieve the Fastest Path from Raw Data to Insight
• Prepare data on-the-fly
• Load into Hadoop without staging
• Write directly into Big Data formats (Parquet, Hive, etc.)
• Connect fast to NoSQL databases (Cassandra, HBase, etc.)
• Cloud Connectivity: Amazon AWS, Google Cloud Platform, Microsoft Azure
• Get the fastest, most efficient data joins and sorts
• Dynamic planning/optimization at runtime
• Create Tableau & QlikView files with one click
• Fastest parallel loads to Amazon Redshift, Greenplum, Netezza, Oracle, Teradata & Vertica
Feed Business Intelligence Visualization
A single tool for designing both streaming and batch jobs
Integrate: Single Interface for Streaming & Batch
• Kafka, Spark, Apache NiFi, HDF
• Combine legacy batch and cutting edge streaming data sources
• Easy development in GUI – no need to write Scala, C or Java code
Simplify Streaming Data Integration
Comply: Secure, Manage & Monitor Your Cluster
• Kerberos-secured clusters
– Authenticated browsing
– Authenticated sampling
• Apache Sentry security certified
• Cloudera Manager
– Deploy DMX-h across cluster
– Monitor DMX-h jobs
Comply: Get Governance, Metadata and Lineage
• Metadata and data lineage for Hive, Avro and Parquet through HCatalog
• Metadata lineage export from DMX
– Simplify audits, analytics dashboards, metrics
– Integrate with enterprise metadata repositories
• Cloudera Navigator certified integration
– Extends HCatalog metadata
– HDFS, YARN, Spark and other metadata
– Lineage, tagging
– Business and structural metadata
Simplify: Design Once, Deploy Anywhere
• Use existing ETL skills
• No need to worry about mappers, reducers, big side or small side of joins, and so on
• Automatic optimization for best performance, load balancing, etc.
• No changes or tuning required, even if you change execution frameworks
• Future-proof job designs for emerging compute frameworks, e.g. Spark
Single GUI, Execute Anywhere!
Intelligent Execution - Insulate your organization from underlying complexities of Hadoop.
Using the Dell | Cloudera | Syncsort solution for Hadoop, an entry-level technician developed and deployed Hadoop ETL jobs in 53.7% less time than a Hadoop expert
Simplify: Reclaim days of valuable time
• Fact dimension load with type 2 SCD (Load)
• Data validation and pre-processing (Validate)
• Vendor mainframe file integration (Int.)
Source: http://en.community.dell.com/techcenter/blueprints/m/resources
Cut Development Time in Half!
8.3 days vs. 3.8 days
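For readers unfamiliar with the "type 2 SCD" load mentioned on this slide: a type 2 slowly changing dimension keeps history by closing the old row and adding a new current row whenever a tracked attribute changes. The following is a minimal in-memory Python sketch of that idea; the column names (`valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are assumptions for illustration and are unrelated to how DMX-h implements the job.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Apply a type 2 slowly changing dimension update in place.

    dimension: list of dict rows with 'valid_from', 'valid_to', 'is_current'.
    incoming:  list of dict rows holding the latest source values.
    """
    current = {row[key]: row for row in dimension if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # No tracked attribute changed; keep the current version.
        if old is not None:
            # Close the previous version instead of overwriting it.
            old["valid_to"] = today
            old["is_current"] = False
        # Add a new current version so that history is preserved.
        fresh = {**new, "valid_from": today, "valid_to": None, "is_current": True}
        dimension.append(fresh)
        current[new[key]] = fresh
    return dimension


if __name__ == "__main__":
    dim = [{"customer_id": 1, "city": "Austin",
            "valid_from": date(2016, 1, 1), "valid_to": None, "is_current": True}]
    updates = [{"customer_id": 1, "city": "Dallas"},
               {"customer_id": 2, "city": "Boston"}]
    for row in apply_scd2(dim, updates, "customer_id", ["city"], date(2016, 12, 1)):
        print(row)
```

Running the same update twice leaves the dimension unchanged the second time, because unchanged rows are skipped.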
Thank You