Big Data Ready Enterprise

Big Data Ready Enterprise (BDRE) | Open Source Product 1

An Open Source Product

Big Data Ready Enterprise (BDRE)


Speakers

Arijit Banerjee •  BDRE Product Architect

- Wipro Technologies Rahul Sarda •  Big Data Practice Head

- Wipro Technologies


Agenda

How BDRE addresses the needs across the lifecycle

Fast track implementations using BDRE

Demo

BDRE In Action: Implementations Underway

Typical enterprise deployment view with BDRE

2

3

4

5

6

BDRE Roadmap 7

1 Typical Big Data use cases and common challenges


Typical Uses Cases & Common Challenges

When implementing and operationalizing at large scale

Jumpstart the framework implementation

Shorten the implementation cycle for applications

Reduce repetitive work across multi-step process

Robust application deployment support

Support flexible operations & SLA

Assure quality of service

Batch Data Processing

Enterprise Data Provisioning

Platform

Complex Pipeline Transformation &

Semantic Processing

Data as a Service

Event Stream & Micro batch Processing

Low Latency Store

Migration of EDW workloads

Enterprise Analytical Platform

How to ?


How BDRE addresses the needs across the lifecycle

Pluggable Architecture

Community Driven

Distribution Compatible

Implementation Jumpstart

Key Benefits

Basic Hadoop – at the base

“Pre-built features”

BDRE

§  Operational Metadata

§  Ingestion Accelerators

§  Ad-hoc Data Movement

§  Visual Data Pipeline

§  Workflow Automation

§  One Touch Deployment

§  SLA Management

§  Rich Visualization

Implementation effort from scratch

APPPLICATIONS

HADOOP

Reduce implementation effort

Complements Effort Already Spend


Data Quality and Data Profiling

Operational Metadata & Lineage

Data Ingestion via Multiple Sources

One Touch Deployment

Analytics & Visualization

Job Automation & Security Integration

FastTrack Implementation using BDRE Key things that can be rapidly implemented using the product

§  Job designer with dependency, metadata & batch Information tagging

§  UI based Semantic & ETL Framework §  Authentication : Integration with Kerberos & JAAS

§  DWH features: SCD2 implementation

§  Supports Hive, Pig, Map Reduce, Spark, R, Python

§  Automated central deployment and application management.

§  Registry of all workflow processes / templates §  Automated Workflow Generator – Oozie & others

§  Automated Process flow Planner

§  Workflow Designer

§  Support for Executing Models – R, Python, Spark §  Zero Coding UI based configuration for common

use cases §  User Interface based metadata interaction& search

§  Data Exploration integration with notebooks

§  Visual Representation of workflow

§  Abstraction layer: Component to ingest variety of data (CPY, XML, DB, Mainframes)

§  Streaming Data Ingest – 16 sources with Twitter, Flume, IOT, logs, message queue

§  File Monitoring: Component to check validity of incoming data at file and record level

§  C2C Hive Table Migration

§  Enforce Data Quality and Data processing rules (during ingestion or post ingestion)

§  DQ Analysis, Integrity & Failure Handling §  Data Loading - Test Data Generation

§  Real time state of jobs & run control §  Visual flow of the Business Workflow / Process

§  Multi & Incremental Process Pipelining §  Visuals for Process Lineage & Auditing

§  Job Performance Analysis & Optimization


Demo


Key Features

Job registry

Dependency management

Batch management/tracking.

Run control

Execution status

Ingestion registry

Execution statistics logging (key/value)

Hive queries and data lineage information.

Job monitoring & alerting

HTML 5 user interface and REST APIsa

Access operational metadata

Realtime job monitoring

Dependency Pipeline

Hive data lineage

SLA Monitoring

Plugin management

Operational Metadata Management

Plugins

Teradata Queries in workflow Support

Test data manufacturer

Web Crawler

C2C Hive Table Migration

Data Quality

Tabular data from RDBMS

Streaming data from multiple sources

File ingestion by directory monitoring

Automation

One click deploy/schedule

One click app store

Job export and sharing

Custom job automation

Oozie workflow generation Workflow designer Batch lineage

Generalized UI Data Ingestion Framework

Workflow Automation


BDRE In Action: Implementations Underway

For leading global bank reduce the implementation cycle by 40+% for linking millions of transactions daily across client groups and it lines of business covering more than 65 countries

For the largest consumer electronics proof of concept for redistribution of the data across multiple clusters with enablement of incremental workloads.

For large retailer managing SLA's, automated scheduling for its 1000+ jobs across multiple line of business reducing the effort by 50% and enabled automated deployment

Improved regulatory reporting cycle by 40%

Enabled annual savings of$2.5mn for EDW optimization

Support faster data migration across distribution

For a large bank in UK we are enabling real time workloads using new age technologies enabling several reusable templates for data transformations, prof i l ing and real t ime ingests covering 9000+ data sources

For large energy giant support the movement of predictive models on parasitic event detection in advance.

For a large retailer enabling migration of on-premise EDW workloads for more than 1200+ complex entities by moving the workload on to Amazon EMR & Redshift

Estimated savings of $3mn for real time streaming Platform

Cloud enablement with projected savings of $2mn

Data exploration and Predictive modeling


Typical enterprise deployment view with BDRE

Eventing Framework

Espresso Email

SLA notification

Proactive Reporting

Job

Job

Job

Job

Job

Hadoop Cluster

Data Quality Workflow

Non Hadoop

Workflows Ingestion Workflow

Semantic Workflow

Bulk Data generation Workflow

APP Store (Git Repo)

NN RM

Oozie Work flow Generator

Job Deploy Scripts

Job Export/Import

Edge Node

BDRE UI App

App Server

JAAS

BDRE Rest API

Browser

Operational Metadata

RDBMS Metastore

Rule Engine(for DQ)


Features in pipeline

Enhancing Foundation Optimizing Core Creating Value

§  Support for execution of predictive models in Spark ML

§  Secured & Robust Data Governance

§  Real Time Data Transformation workflows

§  Graphical Query builder §  Rich plugin library §  Self Service BI §  Data Quality Rule

Builder

§  Plugins for common unstructured data processing

§  PDF

§  Images

§  Videos §  Centralize monitoring

across platforms


Get Involved

Project Site §  http://wiproopensourcepractice.github.io/openbdre/

Source Code §  https://github.com/wiproopensourcepractice/openbdre

Community Dashboard §  https://github.com/wiproopensourcepractice/openbdre

BDRE Core Team Arijit Banerjee Sri Harsha Boda Kapil Paliwal Mishi Vidya Sinku Sudam Madhav Prem Kumar Rahul Sarda

THANK YOU


Additional Slides


BDRE Screenshots


Batch Batch

Metadata driven Operational Management BDRE provides out of the box run control, SLA management and workflow generation

UI Metadata

API Layer

Batch & Real Time data provisioning with profiling & data quality

checks

Wor

kflo

w

Workflow1

Action Action

Action Action Action

Action Action Action Action

Action

Workflow1

Action Action



Action

Workflow1

Action Action



Action

Semantic Processing &

Transformation Pipeline

Enqueues batches on completion

Queue

Batch

Batch Batch

Dedicated batch queue



Workflow1

Action Action




Data exploration & Predictive

modeling


Queue

Batch

Batch

Batch

Queue

Process 1 Process 2

Process 3 & 4 Process 1 Process 2

Process 3

Process 4

Logical pipeline between the processes

Batch

Action


BDRE Metadata Management system

§ Source_batch_id BIGINT § target _batch_id BIGINT § queue_id BIGINT § insert_ts TIMESTAMP § source_process_id INT § start_process_id INT § end_ts TIMESTAMP § batch_state INT § batch_marketing INT § Batch_marketing VARCHAR (45) § process_id INT

BATCH_CONSUMP_QUEUE

§ batch_state_ INT § Description VARCHAR (45)

BATCH_STATUS

§ source_batch_id BIGINT § target_batch_id BIGINT § queue_id BIGINT § source_process_id INT § insert_ts TIMESTAMP § start_ts TIMESTAMP § end_ts TIMESTAMP § batch_state INT § batch_marketing VARCHAR (45) § process_id INT

ARCHIVE_CONSUMP_QUEUE

§ instance_exec_id BIGINT § process_id INT § start_ts TIMESTAMP § End_ts TIMESTAMP § exec_state INT

INSTANCE_EXEC

§ batch_id BIGINT § sever_id INT § Path VARCHAR (45) § File_size BIGINT § File_hasg VARCHAR (45 § creation_ts TIMESTAMP

FILE

§ server_id INT § server_type VARCHAR (45) § server_name VARCHAR (45) § server_metainfo VARCHAR (…) § login_user VARCHAR (45) § ssh_private_key VARCHAR (…) § server_ip VARCHAR (45)

SERVERS

§ etl_process_id INT § raw_table_id INT § base_table_id INT § insert_type SMALLINT § drop_raw BOOLEAN § raw_view_id INT

ETL_DRIVER

§ batch_id BIGINT § source_process_run_id BIGINT § batch_type VARCHAR (45)

BATCH

§ process_id INT § config_group VARCHAR (10) § key VARCHAR (45) § value VARCHAR (2048) § description VARCHAR (1028)

PROPERTIES

§ process_id INT § description VARCHAR (256) § add_ts TIMESTAMP § process_name VARCHAR (45) § bus_domain_id INT § process_type_id INT § parent_process_id INT § can_recover BOOLEAN § enqueuing_process_id INT § batch_cut_pattern VARCHAR

(45) § next_process_id VARCHAR (456)

PROCESS

§ bus_domain_id INT § decription VARCHAR (256) § bus_domain_name VARCHAR

(45) § bus_domain_owner VARCHAR

(45)

BUS_DOMAIN

§ Exec_state_id INT § Description VARCHAR (45)

EXEC_STATUS

§ process_type_id INT § process_type_name VARCHAR

(45)

PROCESS_TYPE

§ table_id INT § comments VARCHAR (256) § location_type VARCHAR (45) § dbname VARCHAR (45) § batch_id_partition_col VARCHAR

(45) § table_name VARCHAR (45) § type VARCHAR (45) § ddl VARCHAR (2048)

HIVE_TABLES


Automated Oozie Workflow Generation Process id Parent id Next Steps Enqueuer Id

100 null 101,102 null

101 100 103 null

102 100 103 Null

103 100 100 null

Fork

InitJob 100

103

HaltStep 101

HaltStep 102

InitStep 103

HaltStep 103

HaltJob 100

Join

101 102

InitStep 101 InitStep 101


Intra and Inter Process Dependency

Process 401

Process 402 Process 400

Process 101

Process 102

Process 103

Process 203

Process 204

Process 205

Process 202

Process 201

Process 100

Process 200

Process 301

Process 302

Process 303

Process 304

Process 300 Pid Enq id Parent id

300 Null Null

301 100 300

302 Null 300

303 Null 300

304 200 300


Partition Based on Run Execution

Insert overwrite

Base View Select * from base where

runid=compact;

File Load job run# 4




History View Select * from base where runid !=

compact;

This partition is over written on every file load. This contains compacted data (one single latest records for each key). The value of ‘compact’ is basically MAX

value of BIGINT datatype

In this diagram one single table contains both complete history and current records in two different partitions. If full history not

needed the runid partitions are auto-dropped after compaction.

Base Table

Runid=1

Runid=2

Runid=3

Runid=4

Runid=Compact

These runid partitions are created after each file load. The runid is automatically supplied by BDRE run control system.

Compaction Job Runs automatically after File Load

job. It operates on the last compacted partition and the latest run partition

and rewrites into the compacted partition


Job Status Management

InitJob

HaltJob (success)

TermJob (failure)

InitStep

HaltStep (Success)

TermStep (Failure)

BDRE Operational

Metadata

Fail queue

Success queue

Consumer

JIRA

MQ

§  Halt and TermJob APIs can send message to MQ for proactive alerting §  Alternatively BDRE could directly connect to any alerting/ticket mgmt system

skipping the MQ


Replicating Similar Jobs with add-on Process

Data Ingestion Process 1

Semantic Process 2

Analytics Process 3

Properties For 1

Properties For 2

Properties For 3

Properties Template Core Process Template

CREATE NEW REPLICA/PUSH CHANGES


Hive Partition Pruning

Populated by process X

Insert overwrite T2 Partition (country ,state ,runid ) Select coi1, col2 , $ {target_batched}, country_col, State_col, ${instanceexecid} from T1 Where batched between ($ {min_batch} , $ {max_barch_;

Propcess X: Hive Query running from Oozie

BDRE Operational

Metadata

initjob

target_batched=201 instanceexecid=3 min_src_batch=3 max_src_batch=5

Partition based on batchid

batchid=1

batchid=2

batchid=3

batchid=4

batchid=4

Base Hive Table T1

Y : Upstream of X

First business partition

Second Business partition

Partition on runid

Country=us

State=ma

State=ri

State=ca

State=tx

Country=uk State=en

Runid=1

Runid=1

Runid=1

State-=sk

Runid=1

Runid=1

Runid=1

Semantic Hive Table T2

Already processed in earlier runs

Unprocessed batches from the queue


Rule definition

Rule engine UI

Data Quality Component

Map only MR job

Mapper 1 Mapper 2 Mapper n

Rules

Bad records Good records

Hadoop Original file with all records

Guvnor API


300 301

302 303

304

Batch Management

101

102

103

201

202

203

204

205

400

401 402 Batch

Batch

Batch Batch

Queue

Batch

Batch Queue

Batch

Logical pipeline between the processes



Workflow id 400

Batch

A row is added to the queue table for all downstream upon each successful execution of an upstream process.

Downstream looks up the queue and process all pending batches en-queued by

upstream.

Multiple source batch consumed = one target batch

is produced

Workflow 300

100

200

Workflow id 200

Workflow id 100

Technology

Big Data Ready Enterprise