Enterprise Grade Spark Processing at Totango 2015-11-10 Oren Raboy, VP Eng. @ Totango

How Totango uses Apache Spark



AGENDA (PART 1)

• Background about Totango and our data architecture

• Spark in the Totango Architecture

• Quality: Testing Spark code in production

About us:

• Founded: 2010
• Offices: TLV, SF
• Team: ~60
• Customers: ~200

We help online businesses make their customers more successful through the use of data.

Totango:

• ~500M accounts
• ~$5B revenue under management
• ~100M events per day

Our Customers: The World's Leading Cloud Services

ANALYTICS

• Usage Metrics

• Trends over time

• Trends across customers

• Health score

AUTOMATION

• Alerts

• Triggered Workflows

• Email Campaigns

Totango Data Architecture

[Architecture diagram: event sources (Pixel, 3rd Party (SFDC), CSV) flow through the Collection layer (behind an ELB) into Kinesis streams and S3, feeding Real-time processing and Batch processing respectively, with results exposed through the Serving Layer]

• Hosted on AWS
• ‘Lambda Architecture’
• AWS and open-source technologies
• Java with a dash of Python

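The deck doesn't show any collection code; purely as a rough sketch of how a single usage event might land in one of the Kinesis streams above, using boto3, with the stream name, region and event fields being assumptions for illustration:

import json
import boto3

# Hypothetical collection-side snippet: push one usage event into a Kinesis stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {
    "account_id": "acct-42",
    "user_id": "u-1",
    "activity": "login",
    "timestamp": "2015-11-10T00:00:00Z",
}

kinesis.put_record(
    StreamName="totango-raw-events",   # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["account_id"],  # keeps an account's events on the same shard
)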

Batch Processing

• Executed once a day (midnight at the customer's local time)
• Each task calculates a set of account metrics (e.g. Health, Change)
• One Spark cluster runs all tasks for all customers
• Pipeline executed by Pipeline Runner, using Spotify Luigi (a minimal Luigi sketch follows this slide)

[Pipeline diagram: Raw Events → calc some metrics / calc other metrics / more → merge results → some dependent computation → merge results into final document → Account Documents]
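As a minimal sketch of how such a pipeline can be wired together with Spotify Luigi (task names, the file layout and the toy metric are hypothetical; the real tasks launch Spark jobs and write to S3 rather than local files):

import datetime

import luigi


class CalcSomeMetrics(luigi.Task):
    """Hypothetical task: compute one set of account metrics for a given day."""
    date = luigi.DateParameter()

    def output(self):
        # The real pipeline would target S3; local files keep the sketch self-contained.
        return luigi.LocalTarget(f"out/some_metrics_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("account_id,metric\nacct-42,17\n")


class MergeResults(luigi.Task):
    """Hypothetical task: merge upstream metric outputs into one document per account."""
    date = luigi.DateParameter()

    def requires(self):
        # Luigi derives the dependency graph from requires(); add more upstream tasks here.
        return [CalcSomeMetrics(date=self.date)]

    def output(self):
        return luigi.LocalTarget(f"out/account_documents_{self.date}.csv")

    def run(self):
        with self.output().open("w") as out:
            for target in self.input():
                with target.open("r") as f:
                    out.write(f.read())


if __name__ == "__main__":
    # Run the whole DAG for one day using the in-process scheduler.
    luigi.build([MergeResults(date=datetime.date(2015, 11, 10))], local_scheduler=True)

Luigi considers a task complete when its output target already exists, so re-running the pipeline only redoes the work whose outputs are missing.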

Environment

• Multi-tenant: shared infrastructure for all Totango customers (Services)
• Daily, hourly and on-demand schedules
• Standalone Spark cluster on AWS EC2 instances
• Input and output on S3; final results are also indexed in Elasticsearch

[Diagram: the same pipeline (Raw Events → calc some metrics / calc other metrics / more → merge results → some dependent computation → merge results into final document → Account Documents) repeated per service: Service A, …, Service XYZ]
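The metric jobs themselves aren't shown in the deck (and at Totango they are written in Java); purely as an illustration, a PySpark version of one "calc some metrics" task might look like the following, with the bucket names, event schema and metric itself being assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("calc-some-metrics").getOrCreate()

# Hypothetical layout: one day of raw JSON events per service under an S3 prefix.
events = spark.read.json("s3a://example-raw-events/service_a/2015-11-10/")

# Toy usage metric: number of events per account for the day.
metrics = (
    events
    .groupBy("account_id")
    .agg(F.count("*").alias("daily_activity_count"))
)

# Results go back to S3; a downstream task merges them into the final account
# document, which is also indexed in Elasticsearch for serving.
metrics.write.mode("overwrite").json("s3a://example-metrics/service_a/2015-11-10/some_metrics/")

spark.stop()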

Challenge: Quality

Requirements from the infrastructure:
• Reliability: calculate metrics accurately at all times
• Velocity: frequent releases of new data processing code

Challenge: High quality and highly automated regression testing

[Diagram: the same pipeline, with the 'calc some metric' stage replaced by a NEW VERSION]

How do we make sure the new version didn't break anything?

[Diagram: the OLD VERSION pipeline and a NEW VERSION running in SHADOW process the same Raw Events side by side into Account Documents; their output CSVs are compared]

Testing In Production: How

• Before deployment, run the release candidate 'side by side' with the older version
• The new version runs in Shadow mode and does not propagate its results
• Compare the old and new versions' results and output unexpected diffs (a comparison sketch follows below)
• Deploy to production only if there are no diffs across all customer data sets
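The comparison step is essentially a keyed diff of the two result sets. A rough sketch of the idea, assuming hypothetical file names and a simple account_id-keyed CSV layout:

import csv


def load_results(path):
    # Read a results CSV into {account_id: row} for easy comparison.
    with open(path, newline="") as f:
        return {row["account_id"]: row for row in csv.DictReader(f)}


def diff_results(old_path, new_path):
    # Collect every account whose metrics differ between the old and the new (shadow) run.
    old, new = load_results(old_path), load_results(new_path)
    return [
        (account_id, old.get(account_id), new.get(account_id))
        for account_id in sorted(set(old) | set(new))
        if old.get(account_id) != new.get(account_id)
    ]


if __name__ == "__main__":
    # Hypothetical file names for one customer's daily output.
    for account_id, old_row, new_row in diff_results("old_version_results.csv", "new_version_results.csv"):
        print(f"DIFF {account_id}: old={old_row} new={new_row}")

An empty diff report across all customer data sets is the green light to promote the new version.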

Deployment Flow

1. Unit testing
2. Test environment: integration testing
3. Side-by-side testing in production of new code
4. New code rolled out, old version kept side by side as a backup
5. Rollout complete!

• We know the new version works correctly

• We do not need to think of all the corner test-cases

• We do not need to write lots of regression tests

QUESTIONS?

• labs.totango.com <-- engineering team blog
• [email protected] <-- me!
• Yes, we are hiring!