Delivering a 'Big Data Ready' minimum viable product


DESCRIPTION

Most big data discussions take an "a posteriori" view: an organization overwhelmed by huge amounts of log files and numerous data sources scattered among its departments decides to put some order into the mess and get some value out of the "big data", usually by building a Hadoop cluster. In this presentation I take the opposite direction and try to demonstrate how to proactively design and build product architectures that remain simple and lean while anticipating big data complexities and solving them easily and elegantly from day one.


DELIVERING A 'BIG DATA READY' MVP

Gregory Chomatas

Dublin Google Developers Group - 2013 July 30th

http://linkedin.com/in/gchomatas

SW Engineer

CHOMATAS GREGORY

t: @gchomatas

http://www.astroboa.org

Entrepreneur

Betaconcept / Astroboa: Founder

Aquinetix: Co-founder / CTO

7 YEARS AGO I REALIZED...

TOO MUCH RDBMS SODA

LOTS OF OBJECT-RELATIONAL MISMATCH

DB IS NOT THE CENTER OF MY APPLICATION

Database Driven Design vs Domain Driven Design / Behaviour Driven Design

AT THAT TIME NOT MANY ALTERNATIVES EXISTED

so we decided to roll our own datastore solution...

ASTROBOA TO THE RESCUE

Hybrid Document-Graph Store focused on data semantics

Similar to Google Datastore & OrientDB

External 'app independent' Semantic Data Model

Model as you go

Security per Entity instance / property

Versioned Entities

Automated REST APIs encapsulating the data layer

Hyperlinked Resources

Polyglot Persistence (Experimental)*

* Not available in the public version

THE 'BIGNESS' IN BIG DATA

Two main paths to the realization of 'BIGNESS'

Luckily both paths converge to common principles & tools that can manage BIG Complexity & BIG Volume

BIG DATA ENLIGHTENMENT

BIG 'DATA PROBLEMS' (COMPLEXITY)

single point of failure / resilience

cross data center

human fault tolerance

store / search unstructured or semi-structured data

flexible data modeling (e.g. traverse relationships)

data versioning

polyglot programming

multitenancy

share / data as a service

semantic web / multiple formats - endpoints

Flexible Options / Ease of operations

'BIG DATA' PROBLEMS (VOLUME)

high volume

high velocity

real-time APIs / act in real time

data as others' service / dirty data from open sources

log collection / aggregation

LINEAR / HORIZONTAL SCALING

I AM NOT A BIG DATA START-UP!

Start-up = Growth (5% - 10%) / week

1000 writes per aquaculture farm per day

120 farms on public beta = 120000 writes/day

1st month: 176 farms = 176000 writes/day

6th month: 1181 farms = 1.2M writes/day

1st year: 17045 farms = 17M writes/day (200/sec)

2nd year: 2421143 farms = 2.4B writes/day (27777/sec)
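The growth figures on this slide can be approximately reproduced with compound weekly growth. The sketch below assumes 10% growth per week, 4 weeks per month and 52 weeks per year; those conventions are inferred from the numbers rather than stated on the slide, and the function names are illustrative.

```python
# Assumptions (inferred, not stated on the slide): 10% compound weekly
# growth, 4 weeks per month, 52 weeks per year.
WRITES_PER_FARM_PER_DAY = 1000
SECONDS_PER_DAY = 24 * 60 * 60


def projected_farms(initial_farms: int, weekly_growth: float, weeks: int) -> float:
    """Compound the farm count once per week."""
    return initial_farms * (1 + weekly_growth) ** weeks


def writes_per_second(farms: float) -> float:
    """Average write rate if writes were spread evenly over the day."""
    return farms * WRITES_PER_FARM_PER_DAY / SECONDS_PER_DAY


if __name__ == "__main__":
    milestones = [("1st month", 4), ("6th month", 24), ("1st year", 52), ("2nd year", 104)]
    for label, weeks in milestones:
        farms = projected_farms(120, 0.10, weeks)
        print(f"{label}: ~{farms:,.0f} farms, ~{writes_per_second(farms):,.0f} writes/sec")
```

The per-second rates are daily averages; real traffic would peak well above them, which only strengthens the slide's point.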

ARE YOU SURE?

a successful SaaS is a big data service

IT'S JUST AN MVP - WE WILL ADD ALL THIS BIG DATA STUFF LATER

A Big Data architecture can be simpler than a traditional one

The right datastore can increase productivity

Keep it simple but do not compromise the architectural concepts

Balance between technical debt & technical equity

An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness

"In business there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham

KEY BIG DATA ARCHITECTURE FEATURES

Distributed Storage

APPLICATION database vs INTEGRATION database

Mix several data models / polyglot persistence

External Data Schema / Common Data Structures

Data Store encapsulated by an API (Data Services)

Append only / save changes vs state (event sourcing)
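The "append only / save changes vs state" idea can be illustrated with a minimal event-sourcing sketch. The `FeedEvent` and `EventLog` names and the in-memory list are hypothetical stand-ins for a real distributed store; the point is only that state is derived by replaying immutable events instead of being updated in place.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FeedEvent:
    """One immutable fact: a quantity of feed delivered to a cage."""
    cage: str
    quantity_fed: float


@dataclass
class EventLog:
    """Append-only log: events are added, never updated or deleted."""
    events: List[FeedEvent] = field(default_factory=list)

    def append(self, event: FeedEvent) -> None:
        self.events.append(event)

    def total_fed_per_cage(self) -> Dict[str, float]:
        """Derive the current state by replaying every stored event."""
        totals: Dict[str, float] = {}
        for event in self.events:
            totals[event.cage] = totals.get(event.cage, 0.0) + event.quantity_fed
        return totals
```

Because nothing is overwritten, a bug in `total_fed_per_cage` can be fixed and the state recomputed from the same log, which is one way to get the "human fault tolerance" mentioned earlier.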

KEY BIG DATA ARCHITECTURE FEATURES

Distributed Computing

Asynchronous processing

Real Time Event Processing / Streaming

Simple decoupled services exposed through REST or RPC APIs (business services)

Thick web clients / mobile apps using the REST or Streaming APIs

Client-level multivariate data analysis & complex visualization

THE LAMBDA ARCHITECTURE by Nathan Marz and James Warren

store raw, immutable, perpetual data

query = function(all data)

combine batch & realtime stream processing to compute arbitrary functions on arbitrary data
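A toy sketch of the three ideas above, assuming a hypothetical event shape of `(farm_id, quantity_fed)` and in-memory structures in place of real batch and stream processing systems: the batch view is recomputed from all raw data, the speed layer covers events the last batch run has not seen, and a query merges both.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (farm_id, quantity_fed)
Event = Tuple[str, float]


def batch_view(master_dataset: Iterable[Event]) -> Counter:
    """Batch layer: recompute the view from all raw, immutable data."""
    view: Counter = Counter()
    for farm, quantity in master_dataset:
        view[farm] += quantity
    return view


class SpeedLayer:
    """Realtime layer: incrementally tracks events newer than the last batch run."""

    def __init__(self) -> None:
        self.view: Counter = Counter()

    def on_event(self, event: Event) -> None:
        farm, quantity = event
        self.view[farm] += quantity


def query(farm: str, batch: Counter, realtime: Counter) -> float:
    """Serving side: merge the batch view with the realtime view."""
    return batch[farm] + realtime[farm]
```

In a real deployment the realtime view would be discarded once a later batch run has absorbed its events; that bookkeeping is omitted here.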

THE LAMBDA ARCHITECTURE

ULTIMATE DESIGN RULE

KEEP it SIMPLE

THE CONVENTIONAL ARCHITECTURE

NEW DATASTORE CRITERIA

Distributed

auto-shard

peer-to-peer

Easy to change schema & queries

Simple to install, configure, operate: one component

Minimize impedance mismatch

Boost productivity

DIRECTLY STORE MY AGGREGATES

    {
        "date": "2013-02-28",
        "allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
        "cage": "1GuuHWTaQc-kpPcRV5uBGA",
        "feed": "7IWmy2FATcS9Vh0RB1onXQ",
        "quantity_approved": 12.5,
        "farm": "__uBZUr3RWOqOSkszfbRLw",
        "species": "KDU-2LCjRRynby9HLifc3g",
        "batch": "i6MgxixnSCGwGWb0037wlQ",
        "execution": {
            "feeder": "swp4jhi4Tm6VxY1nueX2yw",
            "quantity_fed": 12.5,
            "species_position_start": "top",
            "species_position_end": "middle",
            "start": "2013-02-28T07:59:57.668Z",
            "end": "2013-02-28T08:00:03.216Z",
            "feeder_position_end": {
                "lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
                "altitude": 40,
                "accuracy": 12
            }
        }
    }

THE CANDIDATES

    Key-Value       Document           Column        Graph
    Riak            MongoDB            Cassandra     Neo4J
    Redis           CouchBase          HBase         InfiniteGraph
    Pr. Voldemort   OrientDB           Hypertable    OrientDB
    MemcacheDB      ElasticSearch      Accumulo      Titan
    DynamoDB        Google Datastore   SimpleDB      Virtuoso

MY COOL DATASTORE TIP

elasticsearch document store

No other NoSQL store comes close to the out-of-the-box utility and usability of ElasticSearch

schemaless, multitenant, replicating & sharding document store that implements extensible & advanced search features (geospatial, faceting, filtering, etc.)

REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities

REST Search API with full text search out of the box

MULTI-TENANT friendly with REST API for creating / updating DBs & entity types

Dynamic / Semi-Dynamic / Fixed schema

ELASTICSEARCH POWER

index over 95 GB/hour/node

8-node cluster: sub-200ms response for complex searches on 10B+ records

    (oracle OR mysql) AND replication
    apple AND ip*d
    john AND city:Dublin
    species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
    taxi cub AND ("Dublin"^2 OR "Cork")

FACETED BROWSING

    "facets": {
        "locations": {
            "terms": {"field": "city"}
        }
    }

    "terms": [
        {"term": "Dublin", "count": 130},
        {"term": "Cork", "count": 20},
        {"term": "Galway", "count": 1}
    ]
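Conceptually, a terms facet counts how many matching documents carry each distinct value of a field. The following sketch mimics that over an in-memory list of dictionaries; it is an illustration of the idea, not how ElasticSearch computes facets internally, and `terms_facet` is a made-up helper name.

```python
from collections import Counter
from typing import Dict, Iterable, List


def terms_facet(documents: Iterable[dict], field: str) -> List[Dict[str, object]]:
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return [{"term": term, "count": count} for term, count in counts.most_common()]
```

Documents that lack the field are simply skipped, so the counts can sum to less than the number of documents.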

HISTOGRAMS / GEO DISTANCE

    "facets": {
        "Feed_Histogram": {
            "date_histogram": {
                "key_field": "date",
                "value_field": "execution.quantity_fed",
                "interval": "month"
            }
        }
    }

    "filter": {
        "geo_distance_range": {
            "from": "200km",
            "to": "400km",
            "pin.location": {"lat": 40, "lon": -70}
        }
    }

    "filter": {
        "geo_polygon": {
            "person.location": {
                "points": [
                    {"lat": 40, "lon": -70},
                    {"lat": 30, "lon": -80},
                    {"lat": 20, "lon": -90}
                ]
            }
        }
    }

    "filter": {
        "geo_distance": {
            "distance": "200km",
            "pin.location": {"lat": 40, "lon": -70}
        }
    }

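To make the semantics of a geo distance filter concrete, here is a small sketch that keeps only the documents within a given great-circle distance of a point. It uses the haversine formula on a spherical Earth, which is an assumption of this example; the flat `location` field and the helper names are hypothetical, and ElasticSearch's own distance calculation may differ.

```python
import math
from typing import Dict, Iterable, List

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_distance_filter(docs: Iterable[dict], center: Dict[str, float], max_km: float) -> List[dict]:
    """Keep documents whose `location` lies within `max_km` of `center`."""
    return [
        doc for doc in docs
        if haversine_km(center["lat"], center["lon"],
                        doc["location"]["lat"], doc["location"]["lon"]) <= max_km
    ]
```

A distance range filter would add a lower bound to the same comparison.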
RDBMS OUT - DOCUMENT STORE IN

WHAT ABOUT MY RELATIONS

LET'S GO POLYGLOT

THE TITAN GRAPH DB

Distributed

Pluggable storage (Cassandra, HBase, BerkeleyDB)

Indexing with ElasticSearch & Lucene

Blueprints Interface

Gremlin Query Language

Rexster Server adds JSON-based REST interface

EASY GRAPH TRAVERSAL WITH GREMLIN

    // calculate basic collaborative filtering for user 'Gregory'
    m = [:]
    g.V('name', 'Gregory').out('likes').in('likes').out('likes').groupCount(m)
    m.sort{ -it.value }
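For readers unfamiliar with Gremlin, the traversal can be read as three hops: the items the user likes, the users who like those items, and everything those users like, counted per item. The sketch below walks the same three hops over a plain dictionary mapping each user to the set of items they like; the data structure and the `recommend` name are assumptions of this example, not part of Titan or Gremlin.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def recommend(likes: Dict[str, Set[str]], user: str) -> List[Tuple[str, int]]:
    """Count items reachable via user -> item -> other users -> items."""
    counts: Counter = Counter()
    for item in likes.get(user, set()):              # out('likes')
        for other_items in likes.values():           # in('likes')
            if item in other_items:
                for candidate in other_items:        # out('likes')
                    counts[candidate] += 1
    return counts.most_common()                      # sorted by descending count
```

Like the one-line traversal, this does not exclude the user's own likes or paths that pass back through the user; a practical recommender would filter those out.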

START ON A SINGLE MACHINE

DATASTORE SELECTION TIPS (1)

Use polyglot persistence with multiple data models

Start with a Document Store as your system of record

Mix it with a key-value Store for keeping sessions, shopping cart, user prefs, counters, caching

Mix it with a Graph store to keep and traverse entity relationships

Use a Column Store as your system of record if you need performance rather than flexibility and you know your data model & queries well

Keep a relational db for queries on transient data (reporting on inter-aggregate relationships)

DATASTORE SELECTION TIPS (2)

Prefer one-component stores rather than many moving parts

Choose a store that makes it easy to experiment with schema and query changes & supports easy data migrations

Prefer stores that can work with both dynamic & fixed schemas (there is always an implicit schema)

In early prototypes avoid Column stores as they have a high cost on schema and query changes

DATASTORE SELECTION TIPS (3)

Choose stores that support auto-sharding

Prefer peer-to-peer replication rather than master-slave

Replication factor N=3 is a good standard choice

Consistency adjustment (quorum): W > N/2, W + R > N
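The two quorum inequalities can be checked mechanically. With N replicas, W > N/2 means two conflicting writes cannot both be acknowledged by disjoint sets of replicas, and W + R > N means every read set overlaps the most recent write set. The sketch below enumerates the (W, R) pairs that satisfy both; the function name is illustrative.

```python
from typing import List, Tuple


def consistent_settings(n: int) -> List[Tuple[int, int]]:
    """All (W, R) pairs with W > N/2 and W + R > N for replication factor N."""
    return [
        (w, r)
        for w in range(1, n + 1)
        for r in range(1, n + 1)
        if w > n / 2 and w + r > n
    ]


if __name__ == "__main__":
    print(consistent_settings(3))
```

For N=3 the common choice is W=2, R=2, which tolerates one unavailable replica for both reads and writes.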

ALL THAT SAID...

APP CONTEXT is always the determining factor for selecting your store

as well as...

Safety / Stability

Productivity

Community

Performance

Tooling / Ease of operation

DATA MODELING TIPS

Remember that you fit your model to the datastore and not vice versa (APPLICATION vs INTEGRATION DB)

Use a Schema

Build your aggregates or column families according to your use cases, i.e. DENORMALIZE per your query requirements

Aggregates form the boundaries for ACID operations (transactions)

Pre-compute Question Focused Datasets (materialized views) to provide data organized differently from their primary aggregates

ARE WE FINISHED YET? NOT QUITE!

Do something with our monolithic app

SPLIT THE MONOLITHIC APPLICATION

Wrap datastores into DATA SERVICES

Create BUSINESS SERVICES on top of Data Services

Prefer RESTful APIs for services (ROA)

Use a Binary Serialization Framework to create RPC APIs if performance is a concern (ROA / SOA)

Move MVC* to fat mobile / web client apps that consume the APIs

JavaScript in the browser is one of the world's most widely distributed execution environments & deployment is trivial!

DECOUPLED SERVICES

FAT CLIENT

SINGLE PAGE APP

API FRAMEWORK / DSL

    class API < Grape::API
      version 'v1', :using => :header, :vendor => 'aquinetix.com'
      default_format :json
      content_type :json, "application/json"
      content_type :tsv, "text/tab-separated-values"
      formatter :tsv, Aquinetix::TsvFormatter
      content_type :kml, "text/xml"
      formatter :kml, Aquinetix::KmlFormatter
      mount CageAPI
      mount CageEventsAPI
      mount DeviceAPI
      mount FeedAPI
      mount FeedingAPI
      mount LossCountEventAPI
      mount OxygenSamplingEventAPI
      mount SigninAPI
      mount TemperatureSamplingEventAPI
      mount UserAPI
      add_swagger_documentation markdown: true, base_path: "http://..."
    end

API FRAMEWORK / DSL

    class FeedingAPI < Grape::API
      resource :feedings do
        desc 'Create a new feeding'
        post do
          execute_farm_obj_create_request 'Feeding'
        end

        desc 'Perform a FULL or PARTIAL update of an existing feeding'
        params do
          requires :id, type: String, desc: "The id (UUID) of..."
          optional :fields, type: String, desc: "Which fields..."
        end
        put '/:id' do
          execute_farm_obj_update_request 'Feeding'
        end

        desc 'Get a feeding by its id (UUID)'
        params do
          requires :id, :type => String, :desc => "Feeding id."
        end
        get '/:id' do
          execute_farm_obj_instance_get_request 'Feeding'
        end
      end
    end

SWAGGER UI

MVC* AT THE CLIENT

Mobile app with backbone.js & phonegap

Management / BI Console with AngularJS

Visualization with D3.js

Multivariate Dataset Analysis at the browser with crossfilter.js

App workflow & build with yeoman, grunt, bower

* MVP, MVVM, MVC, MVW

ASYNCHRONOUS / REAL TIME PROCESSING & STREAMING API

RabbitMQ + RabbitMQ Web-Stomp Plugin at the server

SockJS, Stomp.js libs at the client

Real-time event stream processing with ESPER

Alternative message brokers:

node.js + zeromq

kestrel

pusher

kafka (>100k msg/sec)

Alternative real-time stream processing: Storm

USE CASES

count ratings, votes, click-throughs

block abusive crawlers

rate-limit APIs

detect spamming attempts

track performance and trigger alerts

batch process logs

SUBSCRIBE TO STOMP TOPICS FROM JS

    ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
    @client = Stomp.over(ws)
    @client.connect('aquinetix', 'password', ((x) => @on_connect(x)), @on_error, "/")

    on_connect: (x) ->
      console.log "Connected to message broker"
      @feeding_subscr_id = @client.subscribe '/topic/feeding', (message) =>
        feeding = JSON.parse(message.body)
        Aq_Manager.events.trigger 'feeding_execution:arrived', feeding

      @position_subscr_id = @client.subscribe '/topic/position', (message) =>
        position = JSON.parse(message.body)
        Aq_Manager.events.trigger 'worker_position:arrived', position

    @client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))

REAL TIME EVENT PROCESSING WITH ESPER

    select count(*) as tps, max(retweetCount) as maxRetweets
    from TwitterEvent.win:time_batch(1 sec)

    select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
           MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
    from FraudWarningEvent.win:time(30 min) as fraud,
         WithdrawalEvent.win:time(30 sec) as withdraw
    where fraud.accountNumber = withdraw.accountNumber
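The first statement counts events and takes the maximum retweet count over consecutive one-second batches. A rough offline equivalent is sketched below, assuming a hypothetical event shape of `(timestamp_seconds, retweet_count)`. As a simplification, batches here are aligned to multiples of the batch length and empty batches are not reported, which may differ from how Esper schedules its batch windows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp_seconds, retweet_count)
Event = Tuple[float, int]


def time_batch(events: Iterable[Event], batch_seconds: float = 1.0) -> List[Tuple[int, int, int]]:
    """Return (batch_index, event_count, max_retweets) per non-empty batch."""
    batches: Dict[int, List[int]] = defaultdict(list)
    for timestamp, retweets in events:
        batches[int(timestamp // batch_seconds)].append(retweets)
    return [(index, len(values), max(values)) for index, values in sorted(batches.items())]
```

A streaming engine produces the same aggregates incrementally as events arrive instead of grouping a finished list.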

LOG ACTIVITY AND OPERATIONAL DATA

Today a critical part of the production features of websites

Logstash + ElasticSearch + Kibana 3

WRAP UP

Should availability, robustness & scalability be added to your hypotheses & value proposition?

if YES then:

Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:

Increased team productivity, increased readiness and agility, sustainability

Build your data models around your use cases rather than around your database and experiment with a polyglot persistence strategy

Start with the technologies that are easiest to install, configure & operate.

Keep it SIMPLE & SUSTAINABLE

LINKS / REFERENCES

http://www.rabbitmq.com/web-stomp.html

https://github.com/jmesnil/stomp-websocket/

Introduction to NoSQL - Martin Fowler, goto; conference

Martin Fowler at NoSQL Matters conference

Book on the Lambda Architecture

Talk on Lambda Architecture

William Pietri - Going the Distance: Building a Sustainable Startup

Don't Let the Minimum Win Over the Viable - Harvard Business Review

ElasticSearch Document DB & Search Engine

Cassandra Column DB

Titan Graph DB

Astroboa Semantic Document Store

https://github.com/sockjs/sockjs-client

https://github.com/robey/kestrel

https://github.com/JustinTulloss/zeromq.node

http://kafka.apache.org/index.html

https://github.com/nathanmarz/storm

https://developers.helloreverb.com/swagger/

https://github.com/wordnik/swagger-ui