arpan-bhandari
Big Data Trends
• Resource sharing/isolation frameworks: YARN, Mesos, etc.
• Shared cluster workers (resources)
• Multiple execution engines: MapReduce, Spark, Hama, Storm, Giraph, etc.
• ETL of data from all possible sources into the data lake
Hadoop-based solution life stages (as on the ground): cyclic execution
• Actors and stages: Business User, Data Analyst, MapReduce Dev, Logic & Data Test, Devops, Staging Data, Production
• Questions raised in each cycle: Bad logic? Resource utilization? Bad data? Monitoring needs?
Challenges in Analytical Solutions
1. No common platform across actors to detect root causes
2. Incremental imports may ingest bad data
3. Cluster resources are shared, so optimal utilization is key
4. Getting a model right in custom MapReduce on the first attempts is like hitting the bull's eye
5. Bad logic or bad data
Intersecting Solution Lifecycle Stages
• Solution Development
• Quality Test
• Devops
• Bulk & Incremental Data
Jumbune
• Flow Analyzer
• Data Validation
• Cluster Monitor
• Job Profiler
“A catalyst to accelerate realization of analytical solutions”
Niche offerings
• In-depth, code-level analysis of cluster-wide flow
• Record-level data violation reports
• No deployment on workers: ultra-light agent installation on the Hadoop master only
• Ability to turn cluster monitoring on or off at will, which lessens resource load
• Customizable rack-aware monitoring
• Correlated profiling analysis of phases, throughput and resource consumption
• Ability to work across all Hadoop distributions
Components - Recommended Environments
Dev
• Flow Debugger
• Data Validation
• MR Job Profiler
QA
• Data Validation
Stage + Perf
• MR Job Profiler
Prod
• Cluster Monitoring
• Data Validation
Supported Deployments
• Cloud: Azure, EC2
• On premise
• All major Hadoop distributions
MapReduce Flow Debugger
• Verifies the flow of input records through the user's MapReduce implementation
• Drill-down visualization helps developers quickly identify the problem
• Only tool that helps developers find MapReduce implementation faults without any extra coding
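The idea of verifying record flow can be illustrated with a minimal Python sketch. It is not Jumbune's implementation: the `traced_map` wrapper, the counter names and `word_mapper` are hypothetical, and the point is only to show how counting records in and out of a mapper reveals where records are dropped.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, int]


def traced_map(mapper: Callable[[str], Iterable[Pair]],
               records: Iterable[str],
               counters: Counter) -> Iterator[Pair]:
    """Run `mapper` over `records`, counting inputs, outputs and dropped records."""
    for record in records:
        counters["map_input"] += 1
        emitted = 0
        for pair in mapper(record):
            emitted += 1
            counters["map_output"] += 1
            yield pair
        if emitted == 0:
            # The record entered the mapper but produced nothing.
            counters["map_dropped"] += 1


def word_mapper(line: str) -> Iterator[Pair]:
    """Hypothetical mapper: emits (word, 1); blank lines emit nothing."""
    for word in line.split():
        yield word, 1


counters = Counter()
pairs = list(traced_map(word_mapper, ["a b", "", "c"], counters))
print(dict(counters))
```

A developer comparing `map_input` with `map_dropped` can see at a glance whether the mapper silently discards records, which is the kind of question a flow debugger answers without extra logging code in the job itself.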
Data Validator
• Detects inconsistencies in data through:
– Null checks
– Data type checks
– Regular expression checks
• Generic way of specifying validation rules
• Provides a record-level report of the anomalies found
• Currently supports HDFS as the data lake file system
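The three kinds of checks and the record-level report can be sketched in Python as follows. This is an illustration under assumptions, not Jumbune's rule format: the rule layout (field index mapped to named predicates), the helper names and the comma delimiter are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Check = Tuple[str, Callable[[Optional[str]], bool]]


def not_null(value: Optional[str]) -> bool:
    """Null check: the field must exist and be non-blank."""
    return value is not None and value.strip() != ""


def is_int(value: Optional[str]) -> bool:
    """Data type check: the field must parse as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def matches(pattern: str) -> Callable[[Optional[str]], bool]:
    """Regular expression check: the whole field must match `pattern`."""
    compiled = re.compile(pattern)
    return lambda value: value is not None and compiled.fullmatch(value) is not None


def validate(lines: Iterable[str], rules: Dict[int, List[Check]],
             delimiter: str = ",") -> List[Tuple[int, int, str]]:
    """Return one (line number, field index, failed check) entry per violation."""
    violations = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        for index, checks in rules.items():
            value = fields[index] if index < len(fields) else None
            for name, predicate in checks:
                if not predicate(value):
                    violations.append((line_no, index, name))
    return violations


# Hypothetical rules: field 0 is a non-null integer id, field 2 is an ISO date.
rules = {0: [("null", not_null), ("int", is_int)],
         2: [("regex", matches(r"\d{4}-\d{2}-\d{2}"))]}
report = validate(["1,alice,2014-06-01", ",bob,2014-06-02", "x,carol,June 3"], rules)
print(report)
```

Keeping the rules as plain data separate from the scanning loop is one way to get a generic rule specification: new checks are added by registering another predicate rather than changing the validator.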
MR Job Profiling
• Per job, phase-wise:
– Performance for each JVM
– Data flow rate
– Resource usage
• Per-job heap sites for Mapper & Reducer
• Per-job CPU cycles for Mapper & Reducer
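As a rough illustration of a phase-wise data flow rate, the sketch below aggregates per-phase samples into MB/s. The `PhaseSample` record and its fields are hypothetical; the slides do not describe how Jumbune collects or stores these measurements.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class PhaseSample:
    phase: str             # e.g. "map", "shuffle", "reduce"
    bytes_processed: int   # bytes handled during the sample
    seconds: float         # wall-clock duration of the sample


def flow_rates(samples: Iterable[PhaseSample]) -> Dict[str, float]:
    """Return throughput in MB/s per phase, pooling samples of the same phase."""
    totals: Dict[str, tuple] = {}
    for s in samples:
        total_bytes, total_seconds = totals.get(s.phase, (0, 0.0))
        totals[s.phase] = (total_bytes + s.bytes_processed, total_seconds + s.seconds)
    # Phases with zero recorded time are skipped to avoid division by zero.
    return {phase: (b / 1_000_000) / t for phase, (b, t) in totals.items() if t > 0}


samples = [PhaseSample("map", 200_000_000, 10.0),
           PhaseSample("map", 100_000_000, 5.0),
           PhaseSample("reduce", 50_000_000, 10.0)]
print(flow_rates(samples))
```

Comparing rates across phases is one simple way to see which phase limits a job's throughput.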
Hadoop Cluster Monitoring
• Data-centre- and rack-aware node view of YARN and non-YARN daemons
• Dynamic, interval-based monitoring
• Hadoop JMX and node resource statistics
• Per-file, node-wise replica placement (which nodes hold replicas of a given file?)
• HDFS data placement view (is HDFS balanced?)
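The two placement questions above can be sketched in Python over already-collected metadata. The input shapes (`block_locations`, `used`, `capacity`) are hypothetical and the code does not talk to a cluster. The 10 percent threshold is an assumption chosen to mirror the default of the HDFS balancer, which treats a DataNode as balanced when its utilization is within the threshold of the cluster average.

```python
from typing import Dict, List


def replica_nodes(block_locations: Dict[str, List[List[str]]], file_path: str) -> List[str]:
    """Nodes holding at least one replica of any block of `file_path`."""
    nodes = set()
    for nodes_for_block in block_locations.get(file_path, []):
        nodes.update(nodes_for_block)
    return sorted(nodes)


def unbalanced_nodes(used: Dict[str, float], capacity: Dict[str, float],
                     threshold_pct: float = 10.0) -> Dict[str, float]:
    """Map each out-of-balance node to its deviation from cluster utilization (in points)."""
    cluster_util = 100.0 * sum(used.values()) / sum(capacity.values())
    deviations = {}
    for node, node_used in used.items():
        node_util = 100.0 * node_used / capacity[node]
        if abs(node_util - cluster_util) > threshold_pct:
            deviations[node] = node_util - cluster_util
    return deviations


# Hypothetical metadata: one file with two blocks, each replicated three times.
block_locations = {"/data/a.csv": [["n1", "n2", "n3"], ["n2", "n3", "n4"]]}
print(replica_nodes(block_locations, "/data/a.csv"))

used = {"n1": 90.0, "n2": 50.0, "n3": 40.0}
capacity = {"n1": 100.0, "n2": 100.0, "n3": 100.0}
print(unbalanced_nodes(used, capacity))
```

An empty result from `unbalanced_nodes` corresponds to a "yes" on the "is HDFS balanced?" view; a non-empty one names the over- and under-utilized nodes.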
How are we building Jumbune?
Let’s Collaborate
Website
• http://jumbune.org
Contribute
• http://github.com/impetus-opensource/jumbune
• http://jumbune.org/jira/JUM
Social
• Follow @jumbune, use #jumbune
• Jumbune Group: http://linkd.in/1mUmcYm
Forums
• Users: [email protected]
• Dev: [email protected]
• Issues: [email protected]
Downloads
• http://jumbune.org
• https://bintray.com/jumbune/downloads/jumbune