Cloud based low cost, low maintenance, scalable data platform Apoorva Gaurav

Myntra.com's Big Data Platform

DESCRIPTION

This is the presentation given at the Fifth Elephant Conference 2013. It describes how we created a cloud-based big data platform with low maintenance and running costs. The key technologies used are Twitter Finagle, Apache Kafka, Apache ZooKeeper, Amazon S3 and Amazon EMR.

Page 1: Myntra.com's Big Data Platform

Cloud based low cost, low maintenance, scalable data platform

Apoorva Gaurav

Page 2: Myntra.com's Big Data Platform

Why hunt elephant to sell shoes ?

Page 3: Myntra.com's Big Data Platform

Why hunt elephant to sell shoes ?

WHOM

HOW

WHAT

Page 4: Myntra.com's Big Data Platform

Use case : List products based on CTR

● Take all impressions of a product and action performed

● Some products are more attractive than others

● Give benefit to such products

Page 5: Myntra.com's Big Data Platform

Use case : List products based on CTR

● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc

● >100K products, > 500 million impressions a day --- DIFFICULT TO SCALE
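The aggregation behind the CTR query can be sketched in a few lines of Python. This is only an illustration of the computation, not the platform's actual job: the row layout `(product_id, appeared, clicked)` mirrors the columns of `tbl_prod_log` in the query, and the sample rows are made up.

```python
from collections import defaultdict

def ctr_by_product(rows):
    """Return (product_id, ctr) pairs sorted by CTR, highest first.

    rows: iterable of (product_id, appeared, clicked) tuples; this layout
    is an assumption modelled on the columns used in the slide's query.
    """
    appeared = defaultdict(int)
    clicked = defaultdict(int)
    for product_id, a, c in rows:
        appeared[product_id] += a
        clicked[product_id] += c
    # Skip products with no impressions to avoid dividing by zero.
    ctr = {p: clicked[p] / appeared[p] for p in appeared if appeared[p] > 0}
    return sorted(ctr.items(), key=lambda kv: kv[1], reverse=True)

# Made-up sample log: one row per impression.
rows = [("p1", 1, 0), ("p1", 1, 1), ("p2", 1, 1), ("p3", 1, 0)]
ranked = ctr_by_product(rows)
print(ranked)
```

The per-product sums are independent of each other, which is what makes this kind of query a natural fit for a map-reduce job once the log no longer fits in a single database.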

Page 6: Myntra.com's Big Data Platform

Use case : User segmentation

● Different users have different browsing patterns

● Segment them based on their history

● Provide them different experience

Page 7: Myntra.com's Big Data Platform

Use case : User segmentation

● select depth, count(cookie_id) from user_log group by depth

● > 1m users daily, multiple browsers, devices

● DIFFICULT TO SCALE
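A minimal Python sketch of the same grouping is shown here. It assumes each log row is a `(cookie_id, depth)` pair and that `depth` means how far a visitor browsed; the deck does not define the column, so both are assumptions.

```python
from collections import Counter

def users_per_depth(rows):
    """Count cookies at each browsing depth.

    rows: iterable of (cookie_id, depth) pairs (assumed layout).
    """
    return Counter(depth for _cookie_id, depth in rows)

# Made-up sample: two shallow visitors, one medium, one deep.
rows = [("c1", 1), ("c2", 3), ("c3", 1), ("c4", 7)]
histogram = users_per_depth(rows)
print(dict(histogram))
```

Because one person can appear under several cookies (multiple browsers and devices), a count like this measures cookies rather than people unless identities are merged first.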

Page 8: Myntra.com's Big Data Platform

Use case : Recommend similar products

● Compute score of products based on various attributes

● Compute score of a user based on products (s)he browses

● Recommend similar products

Page 9: Myntra.com's Big Data Platform

Use case : Recommend similar products

● select id, (w1.att1 + w2.att2 + ... + wN.attN) as score from products

● select userid, (v1.score1 + v2.score2 + ... + vN.scoreN)

● > 1m users, > 100K products --- DIFFICULT TO COMPUTE
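The two weighted sums can be sketched as follows. The weights, attribute vectors and the final "closest score" ranking are illustrative assumptions; the deck only states that product and user scores are weighted sums and that similar products are recommended.

```python
def weighted_score(weights, values):
    """Return w1*x1 + w2*x2 + ... + wN*xN."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical attribute weights and product attribute vectors.
attribute_weights = [0.5, 0.3, 0.2]
products = {
    "p1": [1.0, 0.0, 1.0],
    "p2": [0.0, 1.0, 1.0],
    "p3": [1.0, 0.0, 0.0],
    "p4": [1.0, 1.0, 1.0],
}
product_scores = {
    pid: weighted_score(attribute_weights, attrs)
    for pid, attrs in products.items()
}

# A user's score is a weighted sum over the products they browsed.
browsed = ["p1", "p2"]
browse_weights = [0.6, 0.4]  # hypothetical, e.g. favouring recent views
user_score = weighted_score(
    browse_weights, [product_scores[pid] for pid in browsed]
)

# Assumption: "similar" means the product score closest to the user score.
candidates = [pid for pid in products if pid not in browsed]
recommendations = sorted(
    candidates, key=lambda pid: abs(product_scores[pid] - user_score)
)
print(user_score, recommendations)
```

Scoring every user against every product is a users-times-products computation, which is why the slide flags it as hard to compute at more than 1m users and 100K products.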

Page 10: Myntra.com's Big Data Platform

Constraints

● Fast paced

● Tangible results

● Limited budget

● Low engineering bandwidth

Page 11: Myntra.com's Big Data Platform

Design goals

● Solution should be able to scale up and down

● Record data now, ask questions later

● Generic data model

● Segregate reads from writes

● Low running cost

● Low maintenance overhead

Page 12: Myntra.com's Big Data Platform

Cloud computing

Pros

● No setup cost

● Pay as you use

● Scaling is a breeze

● Managed services

Cons

● Performance

● Reliability

● Data security

● Control

Page 13: Myntra.com's Big Data Platform

A very basic Big Data system

Highly available
Very low latency
Initial filtering
Storage agnostic

Scale up and down easily
Essentially distributed
Very easy to use

Highly reliable
Huge capacity
Cater to any data model
Cheap

Page 14: Myntra.com's Big Data Platform

Architecture Diagram

Page 15: Myntra.com's Big Data Platform

Architecture Diagram

Hadoop on cloud
Easy to scale up and down
Pay as you use

Infinite capacity
11 nines of durability
Flat file storage
Cheap

Persistent distributed queue
100K msg/sec
Events can be played back

Highly concurrent server
Very easy to use
Flexible

Much easier to introduce HA, reliability etc
Both server and client side data

Segregate and upload events to S3
Scales horizontally

Distributed config mgmt
Fault tolerant
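The "segregate and upload events to S3" step can be sketched as grouping events by a key prefix built from the event type and a time bucket. The prefix layout follows the `addToCart/y=2013/m=06/d=14/h=13/min=30` paths shown on the learnings slide; the 30-minute bucket size, the event dictionary shape and the function names are assumptions, and the actual upload call is left out.

```python
from collections import defaultdict
from datetime import datetime

def s3_prefix(event_type, ts, bucket_minutes=30):
    """Build a key prefix such as addToCart/y=2013/m=06/d=14/h=13/min=30.

    bucket_minutes is an assumption; the deck only shows min=30.
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        f"{event_type}/y={ts:%Y}/m={ts:%m}/d={ts:%d}"
        f"/h={ts:%H}/min={minute:02d}"
    )

def segregate(events):
    """Group events by prefix so each group can be uploaded as one object."""
    batches = defaultdict(list)
    for event in events:
        batches[s3_prefix(event["type"], event["ts"])].append(event)
    return dict(batches)

# Made-up events; "type" and "ts" are hypothetical field names.
events = [
    {"type": "addToCart", "ts": datetime(2013, 6, 14, 13, 42)},
    {"type": "addToCart", "ts": datetime(2013, 6, 14, 13, 5)},
    {"type": "orderConfirmation", "ts": datetime(2013, 6, 14, 13, 31)},
]
batches = segregate(events)
print(sorted(batches))
```

With keys laid out this way, a job that needs one event type for one day or hour can select its input by key prefix alone.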

Page 16: Myntra.com's Big Data Platform

Some numbers

● ~20 million events logged daily

● Corresponds to ~800 million data points and ~25 GB

● Close to 100 jobs a day

● The biggest job has a footprint of ~2 billion events

● Platform costs ~$20 daily; jobs ~$15 daily
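A few derived figures follow directly from these numbers. The short calculation below uses only the values on the slide, reads "GB" as 10^9 bytes (an assumption), and treats all results as approximate, since the inputs are themselves rounded.

```python
# Figures from the slide (all approximate).
events_per_day = 20_000_000
data_points_per_day = 800_000_000
bytes_per_day = 25 * 10**9  # assumes decimal gigabytes
platform_cost_usd = 20
jobs_cost_usd = 15

data_points_per_event = data_points_per_day / events_per_day
bytes_per_event = bytes_per_day / events_per_day
daily_cost_usd = platform_cost_usd + jobs_cost_usd
cost_per_million_events_usd = daily_cost_usd / (events_per_day / 1_000_000)

print(data_points_per_event)        # about 40 data points per event
print(bytes_per_event)              # about 1250 bytes per event
print(daily_cost_usd)               # about 35 USD per day in total
print(cost_per_million_events_usd)  # about 1.75 USD per million events
```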

Page 17: Myntra.com's Big Data Platform

Some key learnings

● One can code in English (Finagle)

    myService = handleExceptions andThen recordInKafka andThen respond

● Need not be in C or Erlang to be performant (Kafka)

● Can search without index

    s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
    s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30

● Spot EMR clusters efficiently

● m1.small are not small

● awk + grep = awesome

● Apache mailing lists SUCK!!!
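The `andThen` chaining in the Finagle one-liner composes filters in front of a final service. The sketch below mimics that idea in plain Python; it is an analogue for illustration, not Finagle's API, and `and_then`, the in-memory `recorded` list (standing in for a Kafka producer) and the request shape are all hypothetical.

```python
def and_then(flt, service):
    """Wrap a service with a filter, yielding a new service."""
    def composed(request):
        return flt(request, service)
    return composed

def handle_exceptions(request, service):
    # Turn any downstream failure into an error response.
    try:
        return service(request)
    except Exception as exc:
        return {"status": 500, "error": str(exc)}

recorded = []  # stand-in for a Kafka producer (assumption)

def record_in_kafka(request, service):
    # Record the event, then pass the request on.
    recorded.append(request)
    return service(request)

def respond(request):
    if "event" not in request:
        raise ValueError("missing event")
    return {"status": 200}

# Reads much like the slide: handleExceptions andThen recordInKafka andThen respond
my_service = and_then(handle_exceptions, and_then(record_in_kafka, respond))

ok = my_service({"event": "addToCart"})
bad = my_service({})
print(ok, bad)
```

Each filter sees the request before the next stage does, so cross-cutting concerns such as error handling and logging stay out of the final handler.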

Page 18: Myntra.com's Big Data Platform

Thank you!! & we are [email protected]