Analytics for the fastest growing companies
λ architecture applied at Exponea
Martin Strýček, 10.3.2016
Choose your stack
Intuitive approach
• Collect data to (no)SQL DB
• Run live queries against the (no)SQL database
  • The UI works like an SQL generator
  • But this may (and eventually will) result in slow queries
• Batch preprocessing of data
  • Continuous changes of report definitions
  • Delays / overnight results / no more night
Conversion Funnel
Valid solutions - SQL
FROM events e1
LEFT JOIN events e2
  ON e1.customer_id = e2.customer_id
  AND e2.type = 'view_item'
  AND e1.timestamp < e2.timestamp
LEFT JOIN events e3
  ON e2.customer_id = e3.customer_id
  AND e3.type = 'add_to_cart'
  AND e2.timestamp < e3.timestamp
Valid solutions - NoSQL
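var map = function () {
  var steps = ['view_item', 'add_to_cart', 'buy'];
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
  var i = 0;
  // Walk the customer's events in order and advance one funnel step per match.
  for (var j in this.value.events) {
    var event = this.value.events[j];
    if (event['type'] == steps[i]) {
      counts[i]++;
      i++;
      if (i === steps.length) break;
    }
  }
  // Emit only for customers who reached at least the first step.
  if (i > 0) emit('funnel', {'counts': counts});
};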
Valid solutions - NoSQL
var reduce = function (key, values) {
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
  // Sum the per-customer step counts element by element.
  for (var i in values) {
    for (var j in values[i].counts) {
      counts[j] += values[i].counts[j];
    }
  }
  return {'counts': counts};
};

db.embeded_customers.mapReduce(map, reduce, {out: 'customers_matched_funnel_1'}).find();
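The map and reduce steps above can be tried without a MongoDB instance. The following is a minimal sketch in plain JavaScript, assuming one document per customer with time-ordered events; the sample data and the names `sampleCustomers`, `mapCustomer`, and `reduceCounts` are hypothetical and not part of the slides. Unlike the original, the counts array is sized to the number of steps rather than fixed at ten.

```javascript
// Hypothetical sample data: one document per customer, events ordered by time.
const sampleCustomers = [
  { value: { events: [{ type: 'view_item' }, { type: 'add_to_cart' }, { type: 'buy' }] } },
  { value: { events: [{ type: 'view_item' }, { type: 'view_item' }] } },
  { value: { events: [{ type: 'add_to_cart' }] } },
];

const funnelSteps = ['view_item', 'add_to_cart', 'buy'];

// Map phase: for one customer, mark every funnel step reached in order.
// Returns null when the customer never entered the funnel (nothing is emitted).
function mapCustomer(doc) {
  const counts = new Array(funnelSteps.length).fill(0);
  let i = 0;
  for (const event of doc.value.events) {
    if (event.type === funnelSteps[i]) {
      counts[i]++;
      i++;
      if (i === funnelSteps.length) break;
    }
  }
  return i > 0 ? { counts } : null;
}

// Reduce phase: element-wise sum of the per-customer counts.
function reduceCounts(values) {
  const counts = new Array(funnelSteps.length).fill(0);
  for (const v of values) {
    v.counts.forEach((c, j) => { counts[j] += c; });
  }
  return { counts };
}

const emitted = sampleCustomers.map(mapCustomer).filter((v) => v !== null);
const funnelResult = reduceCounts(emitted);
console.log(funnelResult.counts); // [ 2, 1, 1 ]
```

Two customers viewed an item, one of them added it to the cart, and that same customer bought it; the third customer never entered the funnel and is not counted.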
Alternative solutions - custom in-memory database
IMF – Customer data structure
IMF – Basic structure
project1
  customer1
    event1: timestamp, properties
      property1, value1 ...
    event2: ... properties
      property1, value1 ...
    …
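The hierarchy above (project, then customer, then time-stamped events with properties) could be modelled roughly as nested maps. This is only an illustrative sketch: the class `InMemoryStore` and its methods are hypothetical names, and the slides do not describe how the real IMF lays out memory.

```javascript
// Hypothetical in-memory layout: project -> customer -> time-ordered events.
class InMemoryStore {
  constructor() {
    // projectId -> Map(customerId -> array of events)
    this.projects = new Map();
  }

  // Adds an event of the form { type, timestamp, properties }.
  addEvent(projectId, customerId, event) {
    if (!this.projects.has(projectId)) this.projects.set(projectId, new Map());
    const projectCustomers = this.projects.get(projectId);
    if (!projectCustomers.has(customerId)) projectCustomers.set(customerId, []);
    const events = projectCustomers.get(customerId);
    events.push(event);
    // Keep events sorted by timestamp so sequential scans such as funnels work.
    events.sort((a, b) => a.timestamp - b.timestamp);
  }

  // Returns the customer's events, or an empty array for unknown ids.
  eventsOf(projectId, customerId) {
    const projectCustomers = this.projects.get(projectId);
    return (projectCustomers && projectCustomers.get(customerId)) || [];
  }
}

const store = new InMemoryStore();
store.addEvent('project1', 'customer1',
  { type: 'add_to_cart', timestamp: 20, properties: { property1: 'value1' } });
store.addEvent('project1', 'customer1',
  { type: 'view_item', timestamp: 10, properties: { property1: 'value1' } });

console.log(store.eventsOf('project1', 'customer1').map((e) => e.type));
// [ 'view_item', 'add_to_cart' ]
```

Keeping all events of one customer together is what makes a funnel a single sequential scan per customer instead of a self-join.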
IMF
• Sharding
  • Customer ID as the sharding key
• Replication
  • The IMF master knows how many shards and replicas are connected
• Loading
  • From a stream of data
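Using the customer ID as the sharding key means every event of one customer lands on the same shard. A minimal sketch of such routing is shown below; the hash (an FNV-1a-style function over the string's character codes) and the names `hashCustomerId` and `shardFor` are assumptions for illustration, since the slides do not say how IMF maps customers to shards.

```javascript
// Hypothetical 32-bit FNV-1a-style hash over the string's character codes.
function hashCustomerId(id) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    // Math.imul performs 32-bit integer multiplication; >>> 0 keeps it unsigned.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps a customer ID to a shard index in the range [0, shardCount).
function shardFor(customerId, shardCount) {
  return hashCustomerId(String(customerId)) % shardCount;
}

// All events of the same customer are routed to the same shard.
const incoming = [
  { customerId: 'customer1', type: 'view_item' },
  { customerId: 'customer2', type: 'view_item' },
  { customerId: 'customer1', type: 'buy' },
];
const shardCount = 4;
const routed = incoming.map((e) => ({ ...e, shard: shardFor(e.customerId, shardCount) }));
console.log(routed[0].shard === routed[2].shard); // true
```

Because a customer never spans shards, per-customer queries such as funnels can run on each shard independently and only the small per-shard results need to be combined.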
App architecture
λ architecture
We have speed, we need volume
• Fast layer is solved
• Big data requirements
  • Loading old data into the fast layer
  • Zero data expiration
  • Access to data from BI tools
  • Custom queries
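In the general lambda pattern, a query combines a precomputed batch view (volume) with the recent events held by the fast layer (speed). The sketch below illustrates that idea only; the data, the `horizon` field, and the function `queryCounts` are hypothetical, and the slides do not describe how Exponea merges the two layers.

```javascript
// Hypothetical batch view: event counts per type for everything up to `horizon`.
const batchView = { horizon: 1000, counts: { view_item: 120, buy: 7 } };

// Hypothetical recent events held by the fast layer.
const recentEvents = [
  { type: 'view_item', timestamp: 990 },   // already covered by the batch view
  { type: 'view_item', timestamp: 1005 },
  { type: 'buy', timestamp: 1010 },
  { type: 'add_to_cart', timestamp: 1012 },
];

// Combines both layers: batch counts plus events newer than the batch horizon.
function queryCounts(batch, events) {
  const result = { ...batch.counts };
  for (const e of events) {
    if (e.timestamp <= batch.horizon) continue; // avoid double counting
    result[e.type] = (result[e.type] || 0) + 1;
  }
  return result;
}

const merged = queryCounts(batchView, recentEvents);
console.log(merged); // { view_item: 121, buy: 8, add_to_cart: 1 }
```

The horizon check is the important part: without it, events that the batch layer has already processed would be counted twice.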
λ architecture
MapR
• MapR filesystem
  • Direct access to files that are stored within the cluster
  • Faster than HDFS
• MapR distribution
  • No dependency hell
λ architecture
Data collection API: Realtime vs Async
• Realtime
  • Customer segments
  • Website customization
  • Recommendations
  • Personalization
• Async
  • Do not lose data
  • Event-driven campaigns
Data Collection
Realtime – web customization
Event trigger campaign
λ architecture at Exponea
Takeaways
• Lambda solves two contradictory challenges
  • Process data fast
  • Process very big data
• Apache Spark is a good choice for both the speed and batch layers, but our IMF is way faster :-)
Thank you.