Hydrator: Code-free Data Pipelines for Hadoop, Spark, and HBase
Jonathan Gray, CEO @ Cask
Global Big Data Conference - August 30th, 2016
cask.co
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
About Me
Hadoop Enables New Apps and Patterns
ENTERPRISE DATA LAKES | BIG DATA ANALYTICS | PRODUCTION DATA APPS
Batch and Realtime Data Ingestion: Any type of data from any type of source in any volume
Batch and Streaming ETL: Code-free self-service creation and management of pipelines
SQL Exploration and Data Science: All data is automatically accessible via SQL and client SDKs
Data as a Service: Easily expose generic or custom REST APIs on any data
360° Customer View: Integrate data from any source and expose through queries and APIs
Realtime Dashboards: Perform realtime OLAP aggregations and serve them through REST APIs
Time Series Analysis: Store, process and serve massive volumes of time-series data
Realtime Log Analytics: Ingestion and processing of high-throughput streaming log events
Recommendation Engines: Build models in batch using historical data and serve them in realtime
Anomaly Detection Systems: Process streaming events and predictably compare them in realtime to historical data
NRT Event Monitoring: Reliably monitor large streams of data and perform defined actions within a specified time
Internet of Things: Ingestion, storage and processing of events that is highly available, scalable and consistent
Web Analytics and Reporting Use Case
Transfer web log data from S3 to a Hadoop cluster every hour for backup, and also perform analytics and enable realtime reporting of metrics such as the number of successful and failed responses, most popular pages, etc.
The Challenges:
✦ Hadoop ETL pipelines are stitched together using hard-to-maintain, brittle scripts
✦ Not enough people have expertise in all the Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka), or expertise is lacking in general
✦ Pipelines are hard to debug and validate, resulting in frequent failures in the production environment
✦ Difficult to integrate into SQL / BI reporting solutions for business users
✦ As use cases advance into Data Science, Machine Learning, and Predictive Analytics, you need to include scientists and advanced ML programmers
The Many Faces of Hadoop
Developer: Advanced Programming; Focused on App Logic
Data Scientist: Basic Dev & Complex Analytics; Focused on Data & Algorithms
IT Pro / Ops: Configuring & Monitoring; Focused on Infrastructure & SLAs
LOB / Product: Decision Making & Driving Revenue; Focused on Apps & Insights
Challenge: The tools are missing to connect these users and take apps from prototype to production
Enter Cask
Key Customers and Partners
Named a Gartner Cool Vendor 2016
Founded in 2011 by early Hadoop engineers from Facebook and Yahoo!
Introducing the Data Application Platform
Deployment Models: On-premises, Hybrid, Cloud
Governance, Operations
Pre-packaged Integrations
Orchestration/Automation/Workflows
Core Application and Data Integration
Role-based User Experience: Developer, Data Scientist, IT/Ops
Introducing the Cask Data App Platform
Open Source, Integrated Framework for
Building and Running Data Applications
on Hadoop and Spark
• Supports all major Hadoop distros
• Integrates the latest Big Data technologies
• 100% open source and highly extensible
What’s in CDAP?
Cask Hydrator: A self-service, re-configurable, code-free framework to build, run and operate real-time or batch data pipelines in the cloud or on-premises.
Cask Tracker: A self-service tool for tracking the flow of data in and out of the data lake. Track, index and search technical, business and operational metadata of applications and pipelines.
CDAP: An integration platform that integrates and abstracts the underlying Hadoop technologies. Build data analytics solutions in the cloud or on-premises.
A self-service, code-free framework to build, run and operate data pipelines
on Apache Hadoop and Spark
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible
INGEST: any data from any source in real-time and batch
BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS: any data to any destination in real-time and batch
Hydrator Data Pipelines provide the ability to automate complex workflows that involve fetching data, possibly from multiple data sources, combining it, performing non-trivial transformations and aggregations on the data, writing it to one or more data sinks, and making it available for applications and analytics.
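The workflow described above (fetch from several sources, combine, transform, aggregate, write to a sink) can be sketched in plain Python. This is only an illustration of the shape of such a pipeline, not Hydrator's API: `run_pipeline`, `drop_invalid`, `total_by_customer`, and the in-memory order records are all hypothetical names and data invented for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Transform = Callable[[Iterable[Record]], Iterable[Record]]


def run_pipeline(source: Iterable[Record],
                 transforms: List[Transform],
                 sink: Callable[[Iterable[Record]], None]) -> None:
    """Pull records from the source, apply each transform in order, write to the sink."""
    records = source
    for transform in transforms:
        records = transform(records)
    sink(records)


def drop_invalid(records: Iterable[Record]) -> Iterable[Record]:
    """Transform: keep only records that carry a numeric 'amount' field."""
    return (r for r in records if isinstance(r.get("amount"), (int, float)))


def total_by_customer(records: Iterable[Record]) -> Iterable[Record]:
    """Aggregation: sum 'amount' per 'customer', emitted in customer order."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return ({"customer": c, "total": t} for c, t in sorted(totals.items()))


# Two hypothetical in-memory sources standing in for real systems.
orders_a = [{"customer": "alice", "amount": 10.0},
            {"customer": "bob", "amount": "n/a"}]   # invalid amount, dropped
orders_b = [{"customer": "alice", "amount": 5.0},
            {"customer": "bob", "amount": 2.5}]

output: List[Record] = []   # a list standing in for a real data sink
run_pipeline(orders_a + orders_b, [drop_invalid, total_by_customer], output.extend)
print(output)
```

Keeping sources, transforms, and sinks as separate pieces is what lets a tool assemble them visually; the code above only mimics that separation in memory.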
Stack of Data Enablers
Hydrator Studio
✦Drag-and-drop GUI for visual Data Pipeline creation
✦Rich library of pre-built sources, transforms, sinks for data ingestion and ETL use cases
✦Separation of pipeline creation from execution framework - MapReduce, Spark, Spark Streaming etc.
✦Hadoop-native and Hadoop Distro agnostic
Hydrator Data Pipeline
✦Captures Metadata, Audit, Lineage info, discovered and visualized using Cask Tracker
✦Notifications, scheduling, and monitoring with centralized metrics and log collection for ease of operability
✦ Simple Java API to build your own sources, transforms, and sinks with class loading isolation
✦ JavaScript and Python transforms
✦ Include arbitrary Spark jobs
Out of the box Integrations
✦ Elastic, SFTP, Cassandra, Kafka, RDBMS, EDW and many more sources and sinks
✦ Parse/Encode/Hash, Distinct/Group By, Custom JavaScript/Python Transforms
Custom Plugins
✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API
Pipeline Implementation
Diagram: Logical Pipeline, Planner, Physical Workflow, MR/Spark Executions, CDAP
✦Planner converts logical pipeline to a physical execution plan
✦Optimizes and bundles functions into one or more MR/Spark jobs
✦CDAP is the runtime environment where all the components of the data pipeline are executed
✦CDAP provides centralized log and metrics collection, transaction, lineage and audit information
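To make the planning step concrete, here is a toy sketch of turning a linear logical pipeline into a list of jobs. The grouping rule (each job may contain at most one stage that needs a shuffle, so a second shuffle stage starts a new job) is an assumption chosen for illustration and is not CDAP's actual planning algorithm; `Stage`, `plan_jobs`, and the stage names are hypothetical.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    name: str
    needs_shuffle: bool   # True for stages such as group-by or distinct


def plan_jobs(stages: List[Stage]) -> List[List[str]]:
    """Bundle consecutive stages into jobs, allowing one shuffle stage per job.

    Assumed rule for illustration only: a stage that needs a shuffle starts
    a new job if the current job already contains a shuffle stage.
    """
    jobs: List[List[str]] = []
    current: List[str] = []
    shuffle_used = False
    for stage in stages:
        if stage.needs_shuffle and shuffle_used:
            jobs.append(current)
            current = []
            shuffle_used = False
        current.append(stage.name)
        shuffle_used = shuffle_used or stage.needs_shuffle
    if current:
        jobs.append(current)
    return jobs


# Hypothetical logical pipeline with two aggregating stages.
logical = [
    Stage("S3Source", False),
    Stage("LogParser", False),
    Stage("GroupByIP", True),
    Stage("Projection", False),
    Stage("Distinct", True),
    Stage("HDFSSink", False),
]

physical = plan_jobs(logical)
for number, job in enumerate(physical, start=1):
    print(f"job {number}: {' -> '.join(job)}")
```

With this rule the six logical stages collapse into two jobs, which is the kind of bundling the slide describes; a real planner also has to handle branching pipelines and engine-specific constraints.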
New CDAP 3.5 - Latest Features
Security (Authentication and Authorization): Support for fine-grained role-based authorization of entities in CDAP. Integration with Sentry and Ranger.
Security (Impersonation and Encryption): Ability to run CDAP and CDAP Apps as specified users and ability to encrypt/decrypt sensitive configuration.
Tracker (Data Usage Analytics): Learn how datasets are being used and which top applications access them.
Metadata Taxonomy: Support for annotating business metadata based on a business-specified taxonomy.
Hydrator (Spark Streaming): Build and run Hydrator real-time pipelines using Spark Streaming.
Hydrator (Preview Mode): Ability to preview pipelines with real or injected data before deploying (Standalone).
Hydrator (Join & Action): Capability to join multiple streams (inner & outer) and ability to configure actions that run binaries on designated nodes.
Hydrator (Plugins): Support for XML, Mainframe (COBOL Copybook), Value Mapper, Normalizer, Denormalizer, JsonToXml, SSH Action, Excel Reader, Solr & Spark ML.
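For readers unfamiliar with the inner and outer joins mentioned in the Join & Action feature, the difference can be shown with a small in-memory sketch. This is generic join logic in plain Python, not the Hydrator join plugin; `join_records`, the `how` values, and the sample records are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, object]


def join_records(left: List[Record], right: List[Record],
                 key: str, how: str = "inner") -> List[Record]:
    """Join two record lists on `key`.

    how="inner":      emit only left records that have a match on the right.
    how="left_outer": also emit unmatched left records, without right fields.
    """
    right_by_key: Dict[object, List[Record]] = {}
    for r in right:
        right_by_key.setdefault(r[key], []).append(r)

    joined: List[Record] = []
    for l in left:
        matches = right_by_key.get(l[key], [])
        if matches:
            for m in matches:
                joined.append({**l, **m})
        elif how == "left_outer":
            joined.append(dict(l))
    return joined


# Hypothetical inputs: page views and user profiles keyed by user id.
views = [{"user": 1, "page": "/home"}, {"user": 2, "page": "/docs"}]
users = [{"user": 1, "name": "alice"}]

inner = join_records(views, users, "user", how="inner")
outer = join_records(views, users, "user", how="left_outer")
print(inner)
print(outer)
```

The inner join drops the view from user 2 because no profile matches, while the left outer join keeps it with the profile fields absent.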
Demo Example: Load Log Files from S3 to HDFS and perform aggregations/analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs in HDFS as Avro files
• Parse the access log lines into individual fields
• Calculate the total number of requests by IP and status code
• Find the IPs that received the most successful status codes and the most error codes
Sample web access log (Combined Log Format):
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, HTTP Method, URI, HTTP Status, Response Size, Referrer URI, Client Info
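The parse and aggregate steps of the demo can be approximated outside Hydrator with a short Python sketch. The regular expression below is one reasonable reading of the Combined Log Format, not the parser the demo pipeline uses; the first log line is the sample from the slide, while the other two lines are made up for this example so that the counts are not trivial.

```python
import re
from collections import Counter

# One possible pattern for Combined Log Format lines (an assumption, not the demo's parser).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<client>[^"]*)"'
)

LINES = [
    # Sample line from the slide.
    '69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] '
    '"GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit '
    'Chrome/38.0.2125.122 Safari/537.36"',
    # Made-up lines for illustration only.
    '69.181.160.120 - - [08/Feb/2015:04:36:41 +0000] '
    '"GET /index.html HTTP/1.1" 200 1024 "-" "curl/7.35.0"',
    '10.0.0.7 - - [08/Feb/2015:04:36:42 +0000] '
    '"GET /missing HTTP/1.1" 404 0 "-" "curl/7.35.0"',
]


def parse_line(line: str) -> dict:
    """Split one access log line into named fields; raise on malformed input."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unparseable log line: {line!r}")
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


records = [parse_line(line) for line in LINES]

# Total number of requests by (IP, status code).
requests_by_ip_status = Counter((r["ip"], r["status"]) for r in records)

# IPs with the most successful (2xx) and the most error (4xx/5xx) responses.
success_by_ip = Counter(r["ip"] for r in records if 200 <= r["status"] < 300)
errors_by_ip = Counter(r["ip"] for r in records if r["status"] >= 400)

print(requests_by_ip_status)
print(success_by_ip.most_common(1), errors_by_ip.most_common(1))
```

In the demo these steps run as pipeline stages on Hadoop; the sketch only shows the field extraction and the two aggregations on a handful of lines.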
Thanks!
Jonathan Gray @jgrayla
Download CDAP w/ Hydrator: http://cask.co/downloads/