Myntra.com's Big Data Platform

Cloud-based, low-cost, low-maintenance, scalable data platform

Apoorva Gaurav

Why hunt an elephant to sell shoes?

WHOM

HOW

WHAT

Use case : List products based on CTR

● Take all impressions of a product and the action performed on each

● Some products are more attractive than others

● Give benefit to such products

● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc

● >100K products, > 500 million impressions a day --- DIFFICULT TO SCALE
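To make the query concrete, here is a minimal in-memory sketch of the same aggregation in Python. The row shape `(product_id, appeared, clicked)` mirrors the columns the query reads; the function name and the sample rows are hypothetical. It only illustrates the logic, since at the volumes quoted above the work has to be distributed.

```python
from collections import defaultdict


def rank_by_ctr(log_rows):
    """Return (product_id, ctr) pairs sorted by CTR, highest first.

    Each row is (product_id, appeared, clicked), mirroring the columns
    the query reads from tbl_prod_log.
    """
    appeared = defaultdict(int)
    clicked = defaultdict(int)
    for product_id, n_appeared, n_clicked in log_rows:
        appeared[product_id] += n_appeared
        clicked[product_id] += n_clicked
    ctr = {
        pid: clicked[pid] / appeared[pid]
        for pid in appeared
        if appeared[pid] > 0  # skip products that were never shown
    }
    return sorted(ctr.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows: (product_id, appeared, clicked)
sample_log = [
    ("shoe-1", 1, 1),
    ("shoe-1", 1, 0),
    ("shirt-7", 1, 0),
    ("shirt-7", 1, 0),
    ("bag-3", 1, 1),
]
ranking = rank_by_ctr(sample_log)
```

The division is done in floating point on purpose: in many SQL dialects, dividing two integer sums truncates to an integer.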

Use case : User segmentation

● Different users have different browsing patterns

● Segment them based on their history

● Provide them different experience

● select depth, count(cookie_id) from user_log group by depth

● > 1m users daily, multiple browsers, devices

● DIFFICULT TO SCALE
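The depth histogram can be sketched in a few lines of Python. The row shape `(cookie_id, depth)` follows the columns named in the query; the sample data is hypothetical, and a real run over the daily log would need a distributed job.

```python
from collections import Counter


def users_per_depth(user_log):
    """Count log rows per browsing depth (count(cookie_id) grouped by depth)."""
    return Counter(depth for _cookie_id, depth in user_log)


# Hypothetical sample rows: (cookie_id, depth)
sample_user_log = [
    ("cookie-a", 1),
    ("cookie-b", 1),
    ("cookie-c", 4),
    ("cookie-d", 9),
]
depth_counts = users_per_depth(sample_user_log)
```

Like the query, this counts rows rather than distinct cookies, so a cookie logged twice at the same depth is counted twice.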

Use case : Recommend similar products

● Compute score of products based on various attributes

● Compute score of a user based on products (s)he browses

● Recommend similar products

● select id, (w1*att1 + w2*att2 + ... + wN*attN) as score from products

● select userid, (v1*score1 + v2*score2 + ... + vN*scoreN)

● > 1m users, > 100K products --- DIFFICULT TO COMPUTE
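Both queries are weighted sums, which the sketch below spells out in Python. The attribute values, the weights, and the final "nearest score" ranking are assumptions made for illustration; the slides do not say how similarity is measured once the scores exist.

```python
def weighted_score(values, weights):
    """Return w1*x1 + w2*x2 + ... + wN*xN."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical attribute vectors and weights.
product_attrs = {
    "p1": [1.0, 2.0],
    "p2": [3.0, 0.0],
    "p3": [1.0, 1.5],
    "p4": [5.0, 5.0],
}
attr_weights = [0.5, 0.25]
product_scores = {
    pid: weighted_score(attrs, attr_weights)
    for pid, attrs in product_attrs.items()
}

# A user's score is a weighted sum over the scores of browsed products.
browsed = ["p1", "p2"]
view_weights = [0.5, 0.5]
user_score = weighted_score([product_scores[p] for p in browsed], view_weights)

# Assumption: "similar" means an unseen product whose score is closest
# to the user's score.
recommendations = sorted(
    (pid for pid in product_scores if pid not in browsed),
    key=lambda pid: abs(product_scores[pid] - user_score),
)
```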

Constraints

● Fast paced

● Tangible results

● Limited budget

● Low engineering bandwidth

Design goals

● Solution should be able to scale up and down

● Record data now, ask questions later

● Generic data model

● Segregate reads from writes

● Low running cost

● Low maintenance overhead

Cloud computing

Pros

● No setup cost

● Pay as you use

● Scaling is a breeze

● Managed services

Cons

● Performance

● Reliability

● Data security

● Control

A very basic Big Data system

● Highly available; very low latency; initial filtering; storage agnostic

● Scale up and down easily; essentially distributed; very easy to use

● Highly reliable; huge capacity; cater to any data model; cheap

Architecture Diagram

● Hadoop on cloud; easy to scale up and down; pay as you use

● Infinite capacity; 11 nines of durability; flat file storage; cheap

● Persistent distributed queue; 100K msg/sec; events can be played back

● Highly concurrent server; very easy to use; flexible

● Much easier to introduce HA, reliability, etc.; both server-side and client-side data

● Segregate and upload events to S3; scales horizontally

● Distributed config management; fault tolerant
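The "segregate and upload events to S3" step can be sketched as grouping events under a time-partitioned prefix of the shape quoted in the key learnings (`s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30`). The function names, the event tuple shape, and the 30-minute bucket size are assumptions; the slides show only the resulting paths, and no upload is performed here.

```python
from collections import defaultdict
from datetime import datetime


def partition_prefix(bucket, event_type, ts, bucket_minutes=30):
    """Build the prefix for an event, rounding minutes down to the bucket start.

    bucket_minutes=30 is an assumption inferred from the min=30 example.
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        f"s3://{bucket}/{event_type}"
        f"/y={ts.year:04d}/m={ts.month:02d}/d={ts.day:02d}"
        f"/h={ts.hour:02d}/min={minute:02d}"
    )


def segregate(bucket, events):
    """Group (event_type, timestamp, payload) tuples by destination prefix."""
    grouped = defaultdict(list)
    for event_type, ts, payload in events:
        grouped[partition_prefix(bucket, event_type, ts)].append(payload)
    return dict(grouped)


# Hypothetical events
events = [
    ("addToCart", datetime(2013, 6, 14, 13, 42), {"product_id": 1}),
    ("addToCart", datetime(2013, 6, 14, 13, 55), {"product_id": 2}),
    ("orderConfirmation", datetime(2013, 6, 14, 13, 5), {"order_id": 9}),
]
batches = segregate("<BUCKET>", events)
```

With this layout a job that needs one event type for one hour only lists the matching prefix, which is what makes searching without an index workable.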

Some numbers

● ~20 million events logged daily

● Corresponds to ~800 million data points and ~25 GB

● Close to 100 jobs a day

● The biggest job has a footprint of ~2 billion events

● Platform costs ~$20 daily; jobs ~$15 daily
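A quick back-of-the-envelope check of what these figures imply per event. All inputs are the approximate numbers above, and a gigabyte is assumed to be 10^9 bytes, so the results are rough orders of magnitude.

```python
events_per_day = 20_000_000
data_points_per_day = 800_000_000
bytes_per_day = 25 * 10**9        # assumes decimal gigabytes
platform_cost_usd = 20
jobs_cost_usd = 15

points_per_event = data_points_per_day / events_per_day   # about 40
bytes_per_event = bytes_per_day / events_per_day          # about 1.25 KB
daily_cost_usd = platform_cost_usd + jobs_cost_usd        # about $35
cost_per_million_events = daily_cost_usd / (events_per_day / 1_000_000)
```

Under these assumptions each event carries roughly 40 data points in about 1.25 KB, and the platform runs at under two dollars per million events.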

Some key learnings

● One can code in English (Finagle)

myService = handleExceptions andThen recordInKafka andThen respond

● Need not be in C or Erlang to be performant (Kafka)

● Can search without an index

s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30

s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30

● Spot EMR clusters efficiently

● m1.small are not small

● awk + grep = awesome

● Apache mailing lists SUCK!!!
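The Finagle one-liner in the key learnings chains filters in front of a service with `andThen`. The sketch below imitates that composition pattern in plain Python to show why it reads like a sentence. It is not Finagle's API: the `Filter` class, `and_then`, and the in-memory list standing in for a Kafka producer are all hypothetical.

```python
class Filter:
    """Wraps fn(request, service) -> response, where service is the next step."""

    def __init__(self, fn):
        self.fn = fn

    def and_then(self, nxt):
        if isinstance(nxt, Filter):
            # Filter followed by filter gives a bigger filter.
            return Filter(
                lambda req, service: self.fn(req, lambda r: nxt.fn(r, service))
            )
        # Filter followed by a plain callable gives a complete service.
        return lambda req: self.fn(req, nxt)


recorded = []  # stand-in for a Kafka producer (assumption)


def _handle_exceptions(req, service):
    try:
        return service(req)
    except Exception as exc:
        return {"status": 500, "error": str(exc)}


def _record(req, service):
    recorded.append(req)
    return service(req)


def respond(req):
    if "event" not in req:
        raise ValueError("missing event type")
    return {"status": 200}


handle_exceptions = Filter(_handle_exceptions)
record_in_kafka = Filter(_record)
my_service = handle_exceptions.and_then(record_in_kafka).and_then(respond)
```

Calling `my_service({"event": "addToCart"})` records the request and returns a 200 response; a request that makes `respond` raise is turned into a 500 response by the outermost filter.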

Thank you!! & we are [email protected]
