This presentation was given at the Fifth Elephant Conference 2013. It describes how we built a cloud-based big data platform with low maintenance and running costs. The key technologies used are Twitter Finagle, Apache Kafka, Apache ZooKeeper, Amazon S3 and Amazon EMR.
Cloud-based low-cost, low-maintenance, scalable data platform
Apoorva Gaurav
Why hunt elephant to sell shoes ?
WHOM
HOW
WHAT
Use case : List products based on CTR
● Take all impressions of a product and action performed
● Some products are more attractive than others
● Give benefit to such products
● select product_id, sum(clicked)/sum(appeared) as ctr from tbl_prod_log group by product_id order by ctr desc
● > 100K products, > 500 million impressions a day: DIFFICULT TO SCALE
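The CTR query above can be sketched outside SQL as a plain aggregation. This is a minimal illustration, not the platform's actual job: the row shape `(product_id, appeared, clicked)` and the function name `ctr_by_product` are assumptions made for the example.

```python
from collections import defaultdict


def ctr_by_product(rows):
    """Rank products by click-through rate.

    rows: iterable of (product_id, appeared, clicked) tuples, mirroring
    the columns of tbl_prod_log used in the query (assumed row shape).
    """
    appeared = defaultdict(int)
    clicked = defaultdict(int)
    for product_id, n_appeared, n_clicked in rows:
        appeared[product_id] += n_appeared
        clicked[product_id] += n_clicked
    # Skip products that never appeared to avoid dividing by zero.
    ctrs = {p: clicked[p] / appeared[p] for p in appeared if appeared[p] > 0}
    # Equivalent of "order by ctr desc".
    return sorted(ctrs.items(), key=lambda kv: kv[1], reverse=True)


sample = [("p1", 1, 0), ("p2", 1, 1), ("p1", 1, 1), ("p2", 1, 1), ("p3", 1, 0)]
ranking = ctr_by_product(sample)
```

The aggregation is a per-key sum followed by a division, which is why it maps naturally onto a distributed group-by once a single database can no longer hold the daily impressions.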
Use case : User segmentation
● Different users have different browsing patterns
● Segment them based on their history
● Provide them different experience
● select depth, count(cookie_id) from user_log group by depth
● > 1m users daily, multiple browsers, devices
● DIFFICULT TO SCALE
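As a rough sketch of the segmentation count, the same grouping can be written as a counter over log rows. The row shape `(cookie_id, depth)` and the name `users_per_depth` are assumptions for illustration; like `count(cookie_id)` in the query, it counts rows rather than distinct cookies.

```python
from collections import Counter


def users_per_depth(rows):
    """Count log rows per browsing depth.

    rows: iterable of (cookie_id, depth) tuples (assumed row shape).
    """
    return dict(Counter(depth for _cookie_id, depth in rows))


sample = [("c1", 1), ("c2", 3), ("c3", 1), ("c4", 7)]
segments = users_per_depth(sample)
```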
Use case : Recommend similar products
● Compute score of products based on various attributes
● Compute score of a user based on products (s)he browses
● Recommend similar products
● select id, (w1.att1 + w2.att2 + ... + wN.attN) as score from products
● select userid, (v1.score1 + v2.score2 + ... + vN.scoreN)
● > 1m users, > 100K products: DIFFICULT TO COMPUTE
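The two scoring queries read as weighted sums, assuming `w1.att1` means weight `w1` multiplied by attribute `att1` (an interpretation, since the slides do not spell it out). A minimal sketch under that assumption, with made-up weights and values:

```python
def weighted_sum(values, weights):
    """Return sum(w_i * x_i); both sequences must have the same length."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values))


# Hypothetical product with three attributes and illustrative weights.
product_score = weighted_sum([1, 2, 3], weights=[2, 0, 1])

# Hypothetical user who browsed two products scored 5 and 4.
user_score = weighted_sum([5, 4], weights=[1, 2])
```

Each score is cheap on its own; the difficulty the slide points at comes from repeating it across more than a million users and more than 100K products.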
Constraints
● Fast paced
● Tangible results
● Limited budget
● Low engineering bandwidth
Design goals
● Solution should be able to scale up and down
● Record data now, ask questions later
● Generic data model
● Segregate reads from writes
● Low running cost
● Low maintenance overhead
Cloud computing
Pros
● No setup cost
● Pay as you use
● Scaling is a breeze
● Managed services
Cons
● Performance
● Reliability
● Data security
● Control
A very basic Big Data system
● Highly available, very low latency, initial filtering, storage agnostic
● Scale up and down easily, essentially distributed, very easy to use
● Highly reliable, huge capacity, caters to any data model, cheap
Architecture Diagram
● Hadoop on cloud: easy to scale up and down, pay as you use
● Infinite capacity, 11 nines of durability, flat file storage, cheap
● Persistent distributed queue: 100K msg/sec, events can be played back
● Highly concurrent server: very easy to use, flexible
● Much easier to introduce HA, reliability etc.
● Both server and client side data
● Segregate and upload events to S3; scales horizontally
● Distributed config management: fault tolerant
Some numbers
● ~20 million events logged daily
● Corresponds to ~800 million data points
● and ~25 GB
● Close to 100 jobs a day
● The biggest job has a footprint of ~2 billion events
● Platform costs ~$20 daily; jobs ~$15 daily
Some key learnings
● One can code in English (Finagle)
myService = handleExceptions andThen recordInKafka andThen respond
● Need not be in C or Erlang to be performant (Kafka)
● Can search without index
s3://<BUCKET>/addToCart/y=2013/m=06/d=14/h=13/min=30
s3://<BUCKET>/orderConfirmation/y=2013/m=06/d=14/h=13/min=30
● Spot EMR clusters efficiently
● m1.small instances are not small
● awk + grep = awesome
● Apache mailing lists SUCK!!!
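The "search without index" point rests on the time-partitioned S3 key layout shown in the two example paths. A minimal sketch of building such a prefix follows; the function name `event_prefix` and the 30-minute bucketing are assumptions inferred from the `min=30` segment, not a confirmed detail of the platform.

```python
from datetime import datetime


def event_prefix(bucket, event_type, ts, bucket_minutes=30):
    """Build a time-partitioned S3 prefix for an event type.

    bucket_minutes is an assumed granularity: minutes are rounded down
    to the start of their window (for example 13:42 falls in min=30).
    """
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    return (
        f"s3://{bucket}/{event_type}/"
        f"y={ts.year:04d}/m={ts.month:02d}/d={ts.day:02d}/"
        f"h={ts.hour:02d}/min={minute:02d}"
    )


prefix = event_prefix("<BUCKET>", "addToCart", datetime(2013, 6, 14, 13, 42))
```

With keys laid out this way, a job that needs one event type for one hour only lists the matching prefix, so no separate index has to be maintained.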
Thank you!! & we are [email protected]