DELIVERING A 'BIG DATA READY' MVP
Gregory Chomatas
Dublin Google Developers Group - 2013 July 30th
http://linkedin.com/in/gchomatas
SW Engineer
CHOMATAS GREGORY
t: @gchomatas
http://www.astroboa.org
Entrepreneur
Betaconcept / Astroboa: Founder
Aquinetix: Co-founder / CTO
7 YEARS AGO I REALIZED...
TOO MUCH RDBMS SODA
LOTS OF OBJECT-RELATIONAL MISMATCH
DB IS NOT THE CENTER OF MY APPLICATION
Database Driven Design vs Domain Driven Design / Behaviour Driven Design
AT THAT TIME NOT MANY ALTERNATIVES EXISTED
so we decided to roll our own data store solution...
ASTROBOA TO THE RESCUE
Hybrid Document-Graph Store focused on data semantics
Similar to Google Datastore & OrientDB
External 'app independent' Semantic Data Model
Model as you go
Security per Entity instance / property
Versioned Entities
Automated REST APIs encapsulating the data layer
Hyperlinked Resources
Polyglot Persistence (Experimental)*
*Not available in the public version
THE 'BIGNESS' IN BIG DATA
Two main paths to the realization of 'BIGNESS'
Luckily both paths converge to common principles & tools that can manage BIG Complexity & BIG Volume
BIG DATA ENLIGHTENMENT
BIG 'DATA PROBLEMS' (COMPLEXITY)
single point of failure / resilience
cross data center
human fault tolerance
store / search unstructured or semi-structured data
flexible data modeling (e.g. traverse relationships)
data versioning
polyglot programming
multitenancy
share / data as a service
semantic web / multiple formats - endpoints
Flexible Options / Ease of operations
'BIG DATA' PROBLEMS (VOLUME)
high volume
high velocity
real-time APIs / act in real time
data as others' service / dirty data from open sources
log collection / aggregation
LINEAR / HORIZONTAL SCALING
I AM NOT A BIG DATA START-UP!
Start-up = Growth (5%-10%) / week
1000 writes per aquaculture farm per day
120 farms on public beta = 120000 writes/day
1st month: 176 farms = 176000 writes/day
6th month: 1181 farms = 1.2M writes/day
1st year: 17045 farms = 17M writes/day (200/sec)
2nd year: 2421143 farms = 2.4B writes/day (27777/sec)
ARE YOU SURE? A successful SaaS is a big data service
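The farm counts above seem consistent with compounding at the upper end of the growth range, 10% per week. A minimal sketch of that arithmetic, assuming 4, 24, 52 and 104 weeks for the four milestones (the week counts and all function names are assumptions for illustration, not taken from the slides):

```python
# Compound weekly growth of farms and the resulting write load.
# Assumption: 10% growth per week, starting from the 120 beta farms.
INITIAL_FARMS = 120
WEEKLY_GROWTH = 0.10
WRITES_PER_FARM_PER_DAY = 1000
SECONDS_PER_DAY = 86_400


def farms_after(weeks: int) -> float:
    """Number of farms after compounding the weekly growth rate."""
    return INITIAL_FARMS * (1 + WEEKLY_GROWTH) ** weeks


def writes_per_day(weeks: int) -> float:
    """Daily write volume for the farm count reached after `weeks`."""
    return farms_after(weeks) * WRITES_PER_FARM_PER_DAY


def writes_per_second(weeks: int) -> float:
    """Average write rate, assuming writes are spread evenly over the day."""
    return writes_per_day(weeks) / SECONDS_PER_DAY


if __name__ == "__main__":
    milestones = [("1st month", 4), ("6th month", 24),
                  ("1st year", 52), ("2nd year", 104)]
    for label, weeks in milestones:
        print(f"{label}: {farms_after(weeks):,.0f} farms, "
              f"{writes_per_day(weeks):,.0f} writes/day, "
              f"{writes_per_second(weeks):,.0f} writes/sec")
```

The point of the slide survives any reasonable choice of parameters: exponential growth turns a modest write load into a big data load within two years.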
IT'S JUST AN MVP - WE WILL ADD ALL THIS BIG DATA STUFF LATER
A Big Data architecture can be simpler than a traditional one
The right data store can increase productivity
Keep it simple but do not compromise the architectural concepts
Balance between technical debt & technical equity
An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness
"In business there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham
KEY BIG DATA ARCHITECTURE FEATURES
Distributed Storage
APPLICATION database vs INTEGRATION database
Mix several data models / polyglot persistence
External Data Schema / Common Data Structures
Data Store encapsulated by an API (Data Services)
Append only / save changes vs state (event sourcing)
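The last point, saving changes rather than state, can be sketched in a few lines. This is a minimal in-memory illustration, assuming a hypothetical cage stock ledger; the class and field names are invented for the example and are not part of any Aquinetix or Astroboa API:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StockChange:
    """One immutable change event (hypothetical shape)."""
    cage: str
    delta: int   # +N fish stocked, -N fish lost
    reason: str


class CageLedger:
    """Append-only log of changes; current state is derived, never stored."""

    def __init__(self) -> None:
        self._events: List[StockChange] = []

    def record(self, change: StockChange) -> None:
        # Changes are only ever appended, never updated or deleted.
        self._events.append(change)

    def current_stock(self, cage: str) -> int:
        # State is a fold over the full history of changes.
        return sum(e.delta for e in self._events if e.cage == cage)

    def history(self, cage: str) -> List[StockChange]:
        return [e for e in self._events if e.cage == cage]


ledger = CageLedger()
ledger.record(StockChange("cage-1", +1000, "initial stocking"))
ledger.record(StockChange("cage-1", -12, "loss count"))
print(ledger.current_stock("cage-1"))
```

Because nothing is overwritten, a human mistake can be corrected by appending a compensating event, and any past state can be recomputed from the log.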
KEY BIG DATA ARCHITECTURE FEATURES
Distributed Computing
Asynchronous processing
Real Time Event Processing / Streaming
Simple decoupled services exposed through REST or RPC APIs (business services)
Thick web clients / mobile apps using the REST or Streaming APIs
Client-level multivariate data analysis & complex visualization
THE LAMBDA ARCHITECTURE by Nathan Marz and James Warren
store raw, immutable, perpetual data
query = function(all data)
combine batch & realtime stream processing to compute arbitrary functions on arbitrary data
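The three ideas above can be illustrated with a toy, single-process sketch. It assumes a hypothetical "total feed per farm" query; a real deployment would use a distributed batch layer and a stream processor, and every name below is invented for the example:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeedingEvent:
    """Raw, immutable fact (hypothetical shape)."""
    farm: str
    quantity_fed: float
    timestamp: int


class LambdaSketch:
    def __init__(self) -> None:
        self.master: List[FeedingEvent] = []  # raw, append-only master dataset
        self.batch_view: Counter = Counter()   # recomputed from all data
        self.realtime_view: Counter = Counter()  # covers data since last batch

    def append(self, event: FeedingEvent) -> None:
        # New data is stored raw and also folded into the realtime view.
        self.master.append(event)
        self.realtime_view[event.farm] += event.quantity_fed

    def run_batch(self) -> None:
        # Batch layer: query = function(all data), recomputed from scratch.
        view: Counter = Counter()
        for event in self.master:
            view[event.farm] += event.quantity_fed
        self.batch_view = view
        # The batch view now covers everything, so the realtime view resets.
        self.realtime_view = Counter()

    def total_fed(self, farm: str) -> float:
        # Serving: merge the batch view with the realtime view.
        return self.batch_view[farm] + self.realtime_view[farm]


store = LambdaSketch()
store.append(FeedingEvent("farm-a", 12.5, 1))
store.append(FeedingEvent("farm-a", 7.5, 2))
store.run_batch()
store.append(FeedingEvent("farm-a", 5.0, 3))
print(store.total_fed("farm-a"))
```

The sketch resets the realtime view synchronously; in practice the batch run takes time, so the realtime layer has to keep covering the events that arrive while the batch is running.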
THE LAMBDA ARCHITECTURE
ULTIMATE DESIGN RULE
KEEP it SIMPLE
THE CONVENTIONAL ARCHITECTURE
auto-shard
new data store criteria
Distributed
Easy to change schema & queries
Simple to install, configure, operate
One component
peer-to-peer
Minimize impedance mismatch
Boost productivity
DIRECTLY STORE MY AGGREGATES
    {
      "date": "2013-02-28",
      "allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
      "cage": "1GuuHWTaQc-kpPcRV5uBGA",
      "feed": "7IWmy2FATcS9Vh0RB1onXQ",
      "quantity_approved": 12.5,
      "farm": "__uBZUr3RWOqOSkszfbRLw",
      "species": "KDU-2LCjRRynby9HLifc3g",
      "batch": "i6MgxixnSCGwGWb0037wlQ",
      "execution": {
        "feeder": "swp4jhi4Tm6VxY1nueX2yw",
        "quantity_fed": 12.5,
        "species_position_start": "top",
        "species_position_end": "middle",
        "start": "2013-02-28T07:59:57.668Z",
        "end": "2013-02-28T08:00:03.216Z",
        "feeder_position_end": {
          "lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
          "altitude": 40,
          "accuracy": 12
        }
      }
    }
THE CANDIDATES
Key-Value       Document           Column       Graph
Riak            MongoDB            Cassandra    Neo4J
Redis           CouchBase          HBase        InfiniteGraph
Pr. Voldemort   OrientDB           Hypertable   OrientDB
MemcacheDB      ElasticSearch      Accumulo     Titan
DynamoDB        Google Datastore   SimpleDB     Virtuoso
MY COOL DATA STORE TIP
elasticsearch document store
No other NoSQL store comes close to the out of the box utility and usability of ElasticSearch
schemaless, multitenant, replicating & sharding document store that implements extensible & advanced search features (geospatial, faceting, filtering, etc.)
REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities
REST Search API with full text search out of the box
MULTI-TENANT friendly with REST API for creating / updating DBs & entity types
Dynamic / Semi-Dynamic / Fixed schema
ELASTICSEARCH POWER
index over 95 GB/h/node
8-node cluster: sub-200 ms response for complex searches on 10B+ records
(oracle OR mysql) AND replication
apple AND ip*d
john AND city:Dublin
species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
taxi cub AND ("Dublin"^2 OR "Cork")
"facets":{"locations":{"terms":{"field":"city"}}}
"terms":[{"term":"Dublin","count":130},{"term":"Cork","count":20},{"term":"Galway","count":1}]
FACETED BROWSING
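The terms facet request and response above amount to a "group by field, count, order by count" over the matching documents. A minimal local sketch of that computation, assuming plain dictionaries as documents (this mimics the shape of the response only and does not call ElasticSearch; `terms_facet` is a hypothetical helper):

```python
from collections import Counter
from typing import Dict, Iterable, List


def terms_facet(docs: Iterable[Dict], field: str) -> List[Dict]:
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"term": term, "count": count}
            for term, count in counts.most_common()]


# Hypothetical sample documents.
docs = [
    {"name": "john", "city": "Dublin"},
    {"name": "mary", "city": "Dublin"},
    {"name": "sean", "city": "Cork"},
    {"name": "aoife", "city": "Dublin"},
    {"name": "liam", "city": "Cork"},
    {"name": "nora", "city": "Galway"},
]
print(terms_facet(docs, "city"))
```

In a faceted browsing UI, each returned term becomes a clickable filter with its count shown next to it.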
HISTOGRAMS / GEO DISTANCE
"facets":{"Feed_Histogram":{"date_histogram":{"key_field":"date","value_field":"execution.quantity_fed","interval":"month"}}}
"filter":{"geo_distance_range":{"from":"200km","to":"400km","pin.location":{"lat":40,"lon":-70}}}
"filter":{"geo_polygon":{"person.location":{"points":[{"lat":40,"lon":-70},{"lat":30,"lon":-80},{"lat":20,"lon":-90}]}}}
"filter":{"geo_distance":{"distance":"200km","pin.location":{"lat":40,"lon":-70}}}
RDBMS OUT - DOCUMENT STORE IN
WHAT ABOUT MY RELATIONS
LET'S GO POLYGLOT
THE TITAN GRAPH DB
Distributed
Pluggable storage (Cassandra, HBase, BerkeleyDB)
Indexing with ElasticSearch & Lucene
Blueprints Interface
Gremlin Query Language
Rexster Server adds JSON-based REST interface
EASY GRAPH TRAVERSAL WITH GREMLIN
    // calculate basic collaborative filtering for user 'Gregory'
    m = [:]
    g.V('name','Gregory').out('likes').in('likes').out('likes').groupCount(m)
    m.sort{-it.value}
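For readers unfamiliar with Gremlin, the same traversal can be sketched over a plain in-memory structure. This is an illustration of the logic only, assuming a hypothetical dictionary of "likes" edges; it is not how Titan executes the query:

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def collaborative_filter(likes: Dict[str, Set[str]],
                         user: str) -> List[Tuple[str, int]]:
    """Count paths user -> item -> other user -> item, most frequent first."""
    # Reverse index: item -> users who like it (the in('likes') step).
    liked_by: Dict[str, Set[str]] = {}
    for person, items in likes.items():
        for item in items:
            liked_by.setdefault(item, set()).add(person)

    counts: Counter = Counter()
    for item in likes[user]:                 # out('likes')
        for other in liked_by[item]:         # in('likes')
            for candidate in likes[other]:   # out('likes')
                counts[candidate] += 1       # groupCount(m)
    return counts.most_common()              # m.sort{-it.value}


# Hypothetical sample graph.
likes = {
    "Gregory": {"a", "b"},
    "Anna": {"a", "c"},
    "Bob": {"b", "c", "d"},
}
print(collaborative_filter(likes, "Gregory"))
```

Like the Gremlin one-liner, this basic version does not exclude the user's own paths or the items the user already likes; a production recommender would filter those out.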
START ON A SINGLE MACHINE
DATA STORE SELECTION TIPS (1)
Use polyglot persistence with multiple data models
Start with a Document Store as your system of record
Mix it with a key-value Store for keeping sessions, shopping cart, user prefs, counters, caching
Mix it with a Graph store to keep and traverse entity relationships
Use a Column Store as your system of record if you need performance rather than flexibility and you know your data model & queries well
Keep a relational db for queries on transient data (reporting on inter-aggregate relationships)
DATA STORE SELECTION TIPS (2)
Prefer one-component stores rather than many moving parts
Choose a store that makes it easy to experiment with schema and query changes & supports easy data migrations
Prefer stores that can work with both dynamic & fixed schemas (there is always an implicit schema)
In early prototypes avoid Column stores as they have a high cost on schema and query changes
DATA STORE SELECTION TIPS (3)
Choose stores that support auto-sharding
Prefer peer-to-peer replication rather than master-slave
Replication factor N=3 is a good standard choice
Consistency Adjustment
Quorum: W > N/2, W + R > N
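The two quorum inequalities can be checked mechanically. Here N is the replication factor, W the number of replicas that must acknowledge a write, and R the number of replicas consulted on a read. A minimal sketch with hypothetical helper names:

```python
def write_quorum_ok(n: int, w: int) -> bool:
    """W > N/2: two conflicting writes cannot both reach a majority."""
    return w > n / 2


def read_overlaps_write(n: int, w: int, r: int) -> bool:
    """W + R > N: every read set shares at least one replica with the last write set."""
    return w + r > n


def is_quorum_consistent(n: int, w: int, r: int) -> bool:
    """Both conditions from the slide must hold."""
    return write_quorum_ok(n, w) and read_overlaps_write(n, w, r)


if __name__ == "__main__":
    n = 3  # the replication factor suggested above
    for w in range(1, n + 1):
        for r in range(1, n + 1):
            status = "consistent" if is_quorum_consistent(n, w, r) else "eventual"
            print(f"N={n} W={w} R={r}: {status}")
```

With N=3, the common choice W=2, R=2 satisfies both conditions while still tolerating the loss of one replica for reads and writes.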
ALL THAT SAID...
APP CONTEXT is always the determining factor for selecting your store
as well as...
Safety / Stability
Productivity
Community
Performance
Tooling / Ease of operation
DATA MODELING TIPS
Remember that you fit your model to the data store and not vice versa (APPLICATION vs INTEGRATION DB)
Use a Schema
Build your aggregates or column families according to your use cases, i.e. DENORMALIZE per your query requirements
Aggregates form the boundaries for ACID operations (transactions)
Pre-compute Question Focused Datasets (materialized views) to provide data organized differently from their primary aggregates
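The last tip can be made concrete with a small sketch: a question-focused dataset answering "how much feed did each farm use per month?", pre-computed from feeding aggregates. The document shape follows the feeding aggregate shown earlier in the deck; the function name and sample values are assumptions for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def feed_by_farm_and_month(feedings: Iterable[Dict]) -> Dict[Tuple[str, str], float]:
    """Materialized view keyed by (farm id, 'YYYY-MM') with total quantity fed."""
    view: Dict[Tuple[str, str], float] = defaultdict(float)
    for feeding in feedings:
        month = feeding["date"][:7]  # "2013-02-28" -> "2013-02"
        view[(feeding["farm"], month)] += feeding["execution"]["quantity_fed"]
    return dict(view)


# Hypothetical sample aggregates (ids shortened for readability).
feedings = [
    {"date": "2013-02-27", "farm": "farm-1", "execution": {"quantity_fed": 10.0}},
    {"date": "2013-02-28", "farm": "farm-1", "execution": {"quantity_fed": 12.5}},
    {"date": "2013-03-01", "farm": "farm-1", "execution": {"quantity_fed": 8.0}},
    {"date": "2013-02-28", "farm": "farm-2", "execution": {"quantity_fed": 4.5}},
]
print(feed_by_farm_and_month(feedings))
```

The view is derived data: it can be dropped and rebuilt from the primary aggregates whenever the question changes.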
ARE WE FINISHED YET? NOT QUITE!
Do something with our monolithic app
SPLIT THE MONOLITHIC APPLICATION
Wrap data stores into DATA SERVICES
Create BUSINESS SERVICES on top of Data Services
Prefer RESTful APIs for services (ROA)
Use a Binary Serialization Framework to create RPC APIs if performance is a concern (ROA/SOA)
Move MVC* to fat mobile / web client apps that consume the APIs
JavaScript in the browser is one of the world's most widely distributed execution environments & Deployment is trivial!
DECOUPLED SERVICES
FAT CLIENT
SINGLE PAGE APP
API FRAMEWORK / DSL
    class API < Grape::API
      version 'v1', :using => :header, :vendor => 'aquinetix.com'
      default_format :json
      content_type :json, "application/json"
      content_type :tsv, "text/tab-separated-values"
      formatter :tsv, Aquinetix::TsvFormatter
      content_type :kml, "text/xml"
      formatter :kml, Aquinetix::KmlFormatter
      mount CageAPI
      mount CageEventsAPI
      mount DeviceAPI
      mount FeedAPI
      mount FeedingAPI
      mount LossCountEventAPI
      mount OxygenSamplingEventAPI
      mount SigninAPI
      mount TemperatureSamplingEventAPI
      mount UserAPI
      add_swagger_documentation markdown: true, base_path: "http://..."
    end
API FRAMEWORK / DSL
    class FeedingAPI < Grape::API
      resource :feedings do
        desc 'Create a new feeding'
        post do
          execute_farm_obj_create_request 'Feeding'
        end

        desc 'Perform a FULL or PARTIAL update of an existing feeding'
        params do
          requires :id, type: String, desc: "The id (UUID) of..."
          optional :fields, type: String, desc: "Which fields..."
        end
        put '/:id' do
          execute_farm_obj_update_request 'Feeding'
        end

        desc 'Get a feeding by its id (UUID)'
        params do
          requires :id, :type => String, :desc => "Feeding id."
        end
        get '/:id' do
          execute_farm_obj_instance_get_request 'Feeding'
        end
      end
    end
SWAGGER UI
MVC* AT THE CLIENT
Mobile app with backbone.js & phonegap
Management / BI Console with AngularJS
Visualization with D3.js
Multivariate Dataset Analysis at the browser with crossfilter.js
App workflow & build with yeoman, grunt, bower
*MVP, MVVM, MVC, MVW
ASYNCHRONOUS / REAL TIME PROCESSING & STREAMING API
RabbitMQ + RabbitMQ Web-Stomp Plugin at the server
SockJS, Stomp.js libs at the client
Real-time event stream processing with ESPER
Alternative message brokers:
node.js + zeromq
kestrel
pusher
kafka (>100k msg/sec)
Alternative Real-time stream processing: Storm
USE CASES
count ratings, votes, click-throughs
block abusive crawlers
rate-limit APIs
detect spamming attempts
track performance and trigger alerts
batch process logs
SUBSCRIBE TO STOMP TOPICS FROM JS
    ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
    @client = Stomp.over(ws)
    @client.connect('aquinetix', 'password', ((x) => @on_connect(x)), @on_error, "/")

    on_connect: (x) ->
      console.log "Connected to message broker"
      @client.subscribe '/topic/feeding', (message) =>
        feeding = JSON.parse(message.body)
        Aq_Manager.events.trigger 'feeding_execution:arrived', feeding
      @client.subscribe '/topic/position', (message) =>
        position = JSON.parse(message.body)
        Aq_Manager.events.trigger 'worker_position:arrived', position

    @client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))
REAL TIME EVENT PROCESSING WITH ESPER
    select count(*) as tps, max(retweetCount) as maxRetweets
    from TwitterEvent.win:time_batch(1 sec)

    select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
           MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
    from FraudWarningEvent.win:time(30 min) as fraud,
         WithdrawalEvent.win:time(30 sec) as withdraw
    where fraud.accountNumber = withdraw.accountNumber
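The first statement groups events into consecutive one-second batches and reports a count and a maximum per batch. A rough offline sketch of that windowing, assuming events arrive as (timestamp, retweet count) pairs and that windows are aligned to multiples of the window length (an assumption made here for simplicity; the function name is hypothetical, and unlike a live engine this version works on a finite list and skips empty windows):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def time_batch(events: Iterable[Tuple[float, int]],
               window_sec: float = 1.0) -> List[Tuple[float, int, int]]:
    """Return (window_start, tps, max_retweets) per non-empty tumbling window."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for timestamp, retweet_count in events:
        # Assign each event to exactly one non-overlapping window.
        buckets[int(timestamp // window_sec)].append(retweet_count)
    return [(index * window_sec, len(values), max(values))
            for index, values in sorted(buckets.items())]


# Hypothetical sample stream: (seconds since start, retweetCount).
events = [(0.1, 5), (0.9, 2), (1.2, 7), (3.5, 1)]
for window_start, tps, max_retweets in time_batch(events):
    print(window_start, tps, max_retweets)
```

The second statement is a different pattern: a join between two sliding time windows on the account number, which would need per-stream buffers with expiry rather than tumbling buckets.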
LOG ACTIVITY AND OPERATIONAL DATA
Today a critical part of the production features of websites
Logstash + ElasticSearch + Kibana 3
WRAP UP
Should availability, robustness & scalability be added to your hypotheses & value proposition?
if YES then:
Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:
Increased team productivity, Increased readiness and agility, Sustainability
Build your data models around your use cases rather than around your database and experiment with a polyglot persistence strategy
Start with the technologies that are easiest to install, configure & operate.
Keep it SIMPLE & SUSTAINABLE
LINKS / REFERENCES
http://www.rabbitmq.com/web-stomp.html
https://github.com/jmesnil/stomp-websocket/
Introduction to NoSQL - Martin Fowler, GOTO conference
Martin Fowler at NoSQL Matters conference
Book on the Lambda Architecture
Talk on Lambda Architecture
William Pietri - Going the Distance: Building a Sustainable Startup
Don't Let the Minimum Win Over the Viable - Harvard Business Review
ElasticSearch Document DB & Search Engine
Cassandra Column DB
Titan Graph DB
Astroboa Semantic Document Store
https://github.com/sockjs/sockjs-client
https://github.com/robey/kestrel
https://github.com/JustinTulloss/zeromq.node
http://kafka.apache.org/index.html
https://github.com/nathanmarz/storm
https://developers.helloreverb.com/swagger/
https://github.com/wordnik/swagger-ui