The Evolution of Big Data Pipelines at Intuit


The Evolution of Big Data Pipelines at Intuit
June 30, 2016

#hadoopsummit #HS16SJ

Your Speakers

Lokesh Rajaram, Senior Software Engineer, Intuit

Likes photography

Rekha Joshi, Principal Software Engineer, Intuit

Currently likes Chopped

The Plan

Unicellular Amoeba

Multicellular Humans

Cannot Evolve? Disappear…

Gone!

Evolution of Big Data

Our Mission

To improve our customers’ financial lives so profoundly … they can’t imagine going back to the old way!

Consumers · Small Businesses · Accounting Professionals

Who we serve

42M file their own taxes with TurboTax

2.3M run their small businesses with QuickBooks

7M manage their personal finances with Mint

The Numbers Are Growing

65+ Applications, 25% of US GDP

Era of DOS → Era of Windows → Era of Web → Era of the Cloud

Intuit - An Evolution Case Study

Compliant data

Mobile First

1980s · 1990s · 2000s · 2010 · 2016

• Employees: 150 • Customers: 1.3M • Revenue: $33M

• Employees: 4,500 • Customers: 5.6M • Revenue: $1.04B

• Employees: 7,700 • Customers: 37M • Revenue: $4.2B

Regulatory data → Transactional data → Batch data → Real-time data → Complex, secure data

Data Is The Decision Maker

Evolution of Big Data Pipelines – The Need

Secure Cloud Environment

Single Cohesive Data Pipeline

A/B Testing

Personalization

Streaming

Profile Store

Fraud Detection

Support Varied Use Cases

and more..

Evolution of Big Data Pipelines

Thin Slices - Minimum Viable Product

Evolution of Big Data Pipelines – The Recipe

Taking the Data In

Transforming Data

Handling The Indigestion With Scale

Evolution of Big Data Pipelines – The Recipe

No Snowflake Solutions

Getting Vested Stakeholders' Agreement

Establishing The Standards

Evolution of Big Data Pipelines – The Recipe

Breaking The Silos

Moving The Organization In One Direction

Evolution of Big Data Pipelines – The Recipe

● Making The Configuration Knobs Work
● At Scale:
  o Latency
  o Throughput
● Schema, PII, Metadata, Changes, Audit, Governance
● Controlled Access ←→ Innovation
● Error Monitoring
● Cluster Deployment
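The latency/throughput knobs above can be sketched with standard Kafka producer settings. This is a hedged illustration: the parameter names follow the stock Kafka producer config, but the values and the `pick_profile` helper are assumptions, not Intuit's actual tuning.

```python
# Illustrative Kafka producer profiles for the latency vs. throughput
# trade-off. Values are assumptions for the sketch, not production tuning.

LOW_LATENCY = {
    "linger.ms": 0,              # send immediately, no batching delay
    "batch.size": 16384,         # small batches keep per-event latency low
    "acks": "1",                 # leader-only ack shortens the round trip
    "compression.type": "none",  # skip compression CPU on the hot path
}

HIGH_THROUGHPUT = {
    "linger.ms": 50,             # wait to fill larger batches
    "batch.size": 262144,        # big batches amortize per-request overhead
    "acks": "all",               # full ISR ack trades latency for durability
    "compression.type": "snappy",# cheap compression boosts effective throughput
}

def pick_profile(max_latency_ms: int) -> dict:
    """Choose a config profile from a simple latency budget (hypothetical rule)."""
    return LOW_LATENCY if max_latency_ms < 10 else HIGH_THROUGHPUT
```

The point is that the same pipeline code can serve both streaming and batch-ish consumers by turning only these knobs.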

Organization Evolution Data Evolution

SDK

User-entered data

Apache Kafka

Collector: User-entered and clickstream data

Real-time processing

Personalization Engine

Profile Store

Big Data Pipeline Slice View
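The slice above (SDK → Apache Kafka → collector → real-time processing → personalization engine / profile store) can be sketched as a minimal collector step. The field names and routing rule below are hypothetical, chosen only to show the shape of an enrich-then-route stage.

```python
import json
import time

def enrich(raw: bytes) -> dict:
    """Collector step: parse a user-entered/clickstream event and attach
    pipeline metadata. Field names are illustrative, not an actual schema."""
    event = json.loads(raw)
    event["ingest_ts"] = time.time()      # when the collector saw the event
    event["pipeline_stage"] = "collector"
    return event

def route(event: dict) -> str:
    """Fan events out to downstream consumers by type (hypothetical rule)."""
    if event.get("type") == "profile_update":
        return "profile-store"
    return "personalization"
```

In the real pipeline both functions would sit behind Kafka consumers; here they are plain functions so the data flow is visible.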

Big Data Pipeline Components

Monitoring The Pipeline

AWS resource alarms

Custom App Metrics

JVM and App Metrics

Custom process alerts

Logging and alerting
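A custom process alert of the kind listed above might be a simple consumer-lag check. This is a sketch under assumptions: the threshold and the dict shape are illustrative, and real offsets would come from the broker rather than function arguments.

```python
def check_lag(latest_offset: int, committed_offset: int,
              threshold: int = 10_000) -> dict:
    """Custom process alert (sketch): fire when consumer lag — how far a
    consumer trails the newest message — exceeds a threshold."""
    lag = latest_offset - committed_offset
    return {"lag": lag, "alert": lag > threshold}
```

Hooked up to a scheduler, the `alert` flag would page on-call or feed the logging/alerting layer above.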

Evolution In Stages

Evolution - Stage 0: Disparate And Chaotic

Disparate Databases

Data Pipeline (an example)

• Collect event stream data into one location

• Handle ~200K events/sec

• Payload ~3-5KB

• Enrich messages and load them into Hive within a defined SLA
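A quick capacity check for the numbers above: 200K events/sec at 3-5KB per event implies roughly 0.6-1 GB/s of raw ingest, which is what makes the scale bullets non-trivial. The arithmetic:

```python
# Back-of-the-envelope ingest bandwidth for the pipeline example above.
EVENTS_PER_SEC = 200_000  # ~200K events/sec, from the slide

def ingest_rate_mb_per_sec(events_per_sec: int, payload_kb: float) -> float:
    """Raw ingest bandwidth in MB/s for a given event rate and payload size."""
    return events_per_sec * payload_kb / 1024

low = ingest_rate_mb_per_sec(EVENTS_PER_SEC, 3)   # ~586 MB/s at 3KB payloads
high = ingest_rate_mb_per_sec(EVENTS_PER_SEC, 5)  # ~977 MB/s at 5KB payloads
```

Any broker, network, or Hive-load sizing has to start from this envelope.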

Evolution - Stage 1

Event Stream

Oozie

Sqoop

Netezza Loader

Hive QL operations

Storm

Samza

Flume

Evolution - Stage 2

Event Stream

{ ReST }

Evolution - Stage 3 (HA & DR)

SDK

{ ReST }

SDK

{ ReST }

Mirroring

Challenges & Opportunities

Set of Changes

• Network upgrades

• Increase pipe

• Broker

• MirrorMaker

• Host TCP

Evolution - Stage 4 (Streaming + Batch)

SDK

{ ReST }

SDK

{ ReST }

Mirroring

Evolution - Stage 5 (Cloud only)

SDK

{ ReST }

Kafka Connectors

Evolution - Stage 5 (Cloud only - Future state)

Pipeline Essentials

SDK { ReST }

SDK { ReST }

Traffic Rate Monitoring

Trust by Verification

• Test all Observable End-points
• Functional
• Data Loss
• Data Parity

• Measure for SLA
• Baseline Tests
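The data-loss and data-parity checks above can be sketched as a count comparison between source and sink. The loss threshold here is illustrative; a real check would also compare checksums or sampled records, not just counts.

```python
def parity_check(source_count: int, sink_count: int,
                 max_loss_pct: float = 0.01) -> dict:
    """Trust-by-verification sketch: compare record counts at the pipeline's
    source and sink and flag loss above an (assumed) tolerance."""
    lost = source_count - sink_count
    loss_pct = 100.0 * lost / source_count if source_count else 0.0
    return {"lost": lost, "loss_pct": loss_pct, "ok": loss_pct <= max_loss_pct}
```

Run after each batch window (or continuously on streaming counters), this is the simplest observable end-point test that catches silent drops.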

Interested in Joining?

goo.gl/BLPfyR

Thank You!