DESCRIPTION
In most cases talking about big data follows an "a posteriori" view where an organization overwhelmed by huge amounts of log files and numerous data sources scattered among its departments decides to put some order to the mess and get some value out of the "big data", usually building a Hadoop cluster. In this presentation I take the opposite direction and try to demonstrate how to proactively design and build product architectures that manage to remain simple and lean while at the same time anticipate the big data complexities and solve them easily and elegantly from day one.
DELIVERING A 'BIG DATA READY' MVP
Gregory Chomatas
Dublin Google Developers Group - 2013 July 30th
CHOMATAS GREGORY
SW Engineer
Entrepreneur
Betaconcept / Astroboa: Founder
Aquinetix: Co-founder / CTO
http://linkedin.com/in/gchomatas
t: @gchomatas
http://www.astroboa.org
7 YEARS AGO I REALIZED...
TOO MUCH RDBMS SODA
LOTS OF OBJECT-RELATIONAL MISMATCH
DB IS NOT THE CENTER OF MY APPLICATION
Database Driven Design vs Domain Driven Design / Behaviour Driven Design
AT THAT TIME NOT MANY ALTERNATIVES EXISTED
so we decided to roll our own datastore solution...
ASTROBOA TO THE RESCUE
Hybrid Document-Graph Store focused on data semantics
Similar to Google Datastore & OrientDB
External 'app independent' Semantic Data Model
Model as you go
Security per Entity instance / property
Versioned Entities
Automated REST APIs encapsulating the data layer
Hyperlinked Resources
Polyglot Persistence (Experimental)*
*Not available in the public version
THE 'BIGNESS' IN BIG DATA
Two main paths to the realization of 'BIGNESS'
Luckily both paths converge to common principles & tools that can manage BIG Complexity & BIG Volume
BIG DATA ENLIGHTENMENT
BIG 'DATA PROBLEMS' (COMPLEXITY)
single point of failure / resilience
cross data center
human fault tolerance
store / search unstructured or semi-structured data
flexible data modeling (e.g. traverse relationships)
data versioning
polyglot programming
multitenancy
share / data as a service
semantic web / multiple formats - endpoints
Flexible Options / Ease of operations
'BIG DATA' PROBLEMS (VOLUME)
high volume
high velocity
real-time APIs / act in real time
data as others' service / dirty data from open sources
log collection / aggregation
LINEAR / HORIZONTAL SCALING
I AM NOT A BIG DATA START-UP!
Start-up = Growth (5% - 10%) / week
1000 writes per aquaculture farm per day
120 farms on public beta = 120000 writes/day
1st month: 176 farms = 176000 writes/day
6th month: 1181 farms = 1.2M writes/day
1st year: 17045 farms = 17M writes/day (200/sec)
2nd year: 2421143 farms = 2.4B writes/day (27777/sec)
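The figures on this slide appear to follow from compound growth. A minimal sketch that approximately reproduces them, assuming 10% growth per week, 4 weeks per month and 52 weeks per year (the slide does not state these assumptions, and the function names are illustrative only):

```python
# Assumptions (not stated on the slide): growth compounds at 10% per week,
# a month is 4 weeks and a year is 52 weeks.
WRITES_PER_FARM_PER_DAY = 1000
BETA_FARMS = 120
WEEKLY_GROWTH = 0.10
SECONDS_PER_DAY = 24 * 60 * 60


def farms_after(weeks: int) -> float:
    """Number of farms after compounding the weekly growth for `weeks` weeks."""
    return BETA_FARMS * (1 + WEEKLY_GROWTH) ** weeks


def writes_per_day(weeks: int) -> float:
    """Daily write volume if every farm produces the same number of writes."""
    return farms_after(weeks) * WRITES_PER_FARM_PER_DAY


def projection_rows():
    """Return (label, farms, writes/day, writes/sec) for the slide's milestones."""
    milestones = [("public beta", 0), ("1st month", 4), ("6th month", 24),
                  ("1st year", 52), ("2nd year", 104)]
    rows = []
    for label, weeks in milestones:
        per_day = writes_per_day(weeks)
        rows.append((label, farms_after(weeks), per_day, per_day / SECONDS_PER_DAY))
    return rows


if __name__ == "__main__":
    for label, farms, per_day, per_sec in projection_rows():
        print(f"{label:>11}: {farms:12,.0f} farms "
              f"{per_day:16,.0f} writes/day {per_sec:10,.0f} writes/sec")
```

The point of the exercise is the shape of the curve rather than the exact numbers: two years of steady weekly growth turn a modest beta into tens of thousands of writes per second.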
ARE YOU SURE?
a successful SaaS is a big data service
IT'S JUST AN MVP - WE WILL ADD ALL THIS BIG DATA STUFF LATER
A Big Data architecture can be simpler than a traditional one
The right datastore can increase productivity
Keep it simple but do not compromise the architectural concepts
Balance between technical debt & technical equity
An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness
"In business, there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham
KEY BIG DATA ARCHITECTURE FEATURES
Distributed Storage
APPLICATION database vs INTEGRATION database
Mix several data models / polyglot persistence
External Data Schema / Common Data Structures
Data Store encapsulated by an API (Data Services)
Append only / save changes vs state (event sourcing)
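To make the "append only / save changes vs state" idea concrete, here is a minimal in-memory sketch of event sourcing. The event kinds ("stocked", "loss_counted"), the class names and the cage ids are hypothetical and only serve the illustration; a real system would persist the log in a datastore.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """An immutable fact; events are appended, never updated or deleted."""
    cage: str
    kind: str       # hypothetical kinds: "stocked" or "loss_counted"
    quantity: int


class EventLog:
    """Append-only log; current state is derived, not stored."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Dict[str, int]:
        """Derive the current fish count per cage by folding over every event."""
        state: Dict[str, int] = {}
        for e in self._events:
            delta = e.quantity if e.kind == "stocked" else -e.quantity
            state[e.cage] = state.get(e.cage, 0) + delta
        return state


log = EventLog()
log.append(Event("cage-1", "stocked", 5000))
log.append(Event("cage-1", "loss_counted", 120))
log.append(Event("cage-2", "stocked", 3000))
current_state = log.replay()  # state is recomputed from the changes
```

Because nothing is overwritten, a buggy derivation can be fixed and the state rebuilt from the log, which is one way to get the human fault tolerance mentioned earlier in the deck.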
KEY BIG DATA ARCHITECTURE FEATURES
Distributed Computing
Asynchronous processing
Real Time Event Processing / Streaming
Simple decoupled services exposed through REST or RPC APIs (business services)
Thick web clients / mob. apps using the REST or Streaming APIs
Client-level multivariate data analysis & complex visualization
THE LAMBDA ARCHITECTURE
by Nathan Marz and James Warren
store raw, immutable, perpetual data
query = function(all data)
combine batch & realtime stream processing to compute arbitrary functions on arbitrary data
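A toy sketch of the idea "query = function(all data)", with a batch view recomputed from the raw data and a realtime view covering what the last batch run has not seen yet. The record shape (farm id, quantity fed) and all names are assumptions made for the illustration; real deployments would typically use a batch framework and a stream processor instead of in-memory counters.

```python
from collections import Counter
from typing import Iterable, Tuple

Record = Tuple[str, float]  # (farm id, quantity fed); a hypothetical record shape


def batch_view(master_dataset: Iterable[Record]) -> Counter:
    """Batch layer: recompute the view from scratch over all immutable raw data."""
    view: Counter = Counter()
    for farm, qty in master_dataset:
        view[farm] += qty
    return view


class RealtimeView:
    """Speed layer: incrementally covers records the last batch run has not seen."""

    def __init__(self) -> None:
        self.view: Counter = Counter()

    def on_record(self, record: Record) -> None:
        farm, qty = record
        self.view[farm] += qty


def query(farm: str, batch: Counter, realtime: RealtimeView) -> float:
    """Serving side: merge the batch and realtime views to answer the query."""
    return batch[farm] + realtime.view[farm]


master = [("farm-a", 12.5), ("farm-b", 8.0), ("farm-a", 10.0)]
precomputed = batch_view(master)          # periodically recomputed
speed = RealtimeView()
speed.on_record(("farm-a", 2.5))          # arrived after the last batch run
total_fed_farm_a = query("farm-a", precomputed, speed)
```

When the next batch run absorbs the recent records, the realtime view for that period can be discarded, so errors in the incremental path are eventually corrected by recomputation.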
THE LAMBDA ARCHITECTURE
ULTIMATE DESIGN RULE
KEEP it SIMPLE
THE CONVENTIONAL ARCHITECTURE
new datastore criteria
Distributed
auto-shard
peer-to-peer
Easy to change schema & queries
Simple to install, configure, operate (one component)
Minimize impedance mismatch
Boost productivity
DIRECTLY STORE MY AGGREGATES
{
  "date": "2013-02-28",
  "allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
  "cage": "1GuuHWTaQc-kpPcRV5uBGA",
  "feed": "7IWmy2FATcS9Vh0RB1onXQ",
  "quantity_approved": 12.5,
  "farm": "__uBZUr3RWOqOSkszfbRLw",
  "species": "KDU-2LCjRRynby9HLifc3g",
  "batch": "i6MgxixnSCGwGWb0037wlQ",
  "execution": {
    "feeder": "swp4jhi4Tm6VxY1nueX2yw",
    "quantity_fed": 12.5,
    "species_position_start": "top",
    "species_position_end": "middle",
    "start": "2013-02-28T07:59:57.668Z",
    "end": "2013-02-28T08:00:03.216Z",
    "feeder_position_end": {
      "lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
      "altitude": 40,
      "accuracy": 12
    }
  }
}
THE CANDIDATES
Key-Value      Document          Column      Graph
Riak           MongoDB           Cassandra   Neo4J
Redis          CouchBase         HBase       InfiniteGraph
Pr. Voldemort  OrientDB          Hypertable  OrientDB
MemcacheDB     ElasticSearch     Accumulo    Titan
DynamoDB       Google Datastore  SimpleDB    Virtuoso
MY COOL DATASTORE TIP
elasticsearch document store
No other NoSQL store comes close to the out of the box utility and usability of ElasticSearch
schemaless, multitenant, replicating & sharding document store that implements extensible & advanced search features (geospatial, faceting, filtering, etc.)
REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities
REST Search API with full text search out of the box
MULTI-TENANT friendly with REST API for creating / updating DBs & entity types
Dynamic / Semi-Dynamic / Fixed schema
ELASTICSEARCH POWER
index over 95 GB/h/node
8-node cluster: sub-200ms response for complex searches on 10B+ records
(oracle OR mysql) AND replication
apple AND ip*d
john AND city:Dublin
species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
taxi cub AND ("Dublin"^2 OR "Cork")
FACETED BROWSING
"facets": {"locations": {"terms": {"field": "city"}}}
"terms": [{"term": "Dublin", "count": 130}, {"term": "Cork", "count": 20}, {"term": "Galway", "count": 1}]
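Conceptually, a terms facet counts the documents per distinct value of a field. The following sketch mimics that result shape locally in plain Python; it does not call ElasticSearch, and the function name and sample documents are assumptions made for the illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def terms_facet(docs: Iterable[dict], field: str) -> List[Dict[str, object]]:
    """Count documents per distinct value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return [{"term": term, "count": count} for term, count in counts.most_common()]


# Hypothetical sample documents; only the "city" field matters here.
docs = [{"city": "Dublin"}] * 3 + [{"city": "Cork"}] * 2 + [{"city": "Galway"}]
locations = terms_facet(docs, "city")
```

In a faceted browsing UI each returned term becomes a clickable filter, with the count shown next to it.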
HISTOGRAMS / GEO DISTANCE
"facets": {
  "Feed_Histogram": {
    "date_histogram": {
      "key_field": "date",
      "value_field": "execution.quantity_fed",
      "interval": "month"
    }
  }
}
"filter": {"geo_distance_range": {"from": "200km", "to": "400km", "pin.location": {"lat": 40, "lon": -70}}}
"filter": {"geo_polygon": {"person.location": {"points": [{"lat": 40, "lon": -70}, {"lat": 30, "lon": -80}, {"lat": 20, "lon": -90}]}}}
"filter": {"geo_distance": {"distance": "200km", "pin.location": {"lat": 40, "lon": -70}}}
RDBMS OUT - DOCUMENT STORE IN
WHAT ABOUT MY RELATIONS
LET'S GO POLYGLOT
THE TITAN GRAPH DB
Distributed
Pluggable storage (Cassandra, HBase, BerkeleyDB)
Indexing with ElasticSearch & Lucene
Blueprints Interface
Gremlin Query Language
Rexster Server adds JSON-based REST interface
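EASY GRAPH TRAVERSAL WITH GREMLIN
// calculate basic collaborative filtering for user 'Gregory'
m = [:]
// g.V(key, value) looks vertices up by property; g.v(...) expects vertex ids
g.V('name', 'Gregory').out('likes').in('likes').out('likes').groupCount(m)
// rank the liked items by descending count
m.sort{-it.value}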
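For readers unfamiliar with Gremlin, the same three-hop traversal can be sketched in plain Python over an in-memory "likes" mapping. The users, items and function names below are hypothetical; the sketch only mirrors the out / in / out / groupCount steps and, like the query above, also counts the starting user's own items.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical "likes" edges: user -> liked items.
likes: Dict[str, List[str]] = {
    "Gregory": ["item1", "item2"],
    "Anna": ["item1", "item3"],
    "Brian": ["item2", "item3", "item4"],
}


def liked_by(item: str) -> List[str]:
    """Users with a 'likes' edge pointing at `item` (the in('likes') step)."""
    return [user for user, items in likes.items() if item in items]


def collaborative_filter(user: str) -> List[Tuple[str, int]]:
    """Count how often each item is reached via user -> item -> user -> item."""
    m: Counter = Counter()
    for item in likes[user]:                # out('likes')
        for other in liked_by(item):        # in('likes')
            for candidate in likes[other]:  # out('likes')
                m[candidate] += 1           # groupCount(m)
    return m.most_common()                  # sorted by descending count


ranking = collaborative_filter("Gregory")
```

A practical recommender would usually drop the items the user already likes before presenting the ranking.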
START ON A SINGLE MACHINE
DATASTORE SELECTION TIPS (1)
Use polyglot persistence with multiple data models
Start with a Document Store as your system of record
Mix it with a key-value Store for keeping sessions, shopping cart, user prefs, counters, caching
Mix it with a Graph store to keep and traverse entity relationships
Use a Column Store as your system of record if you need performance rather than flexibility and you know your data model & queries well
Keep a relational db for queries on transient data (reporting on inter-aggregate relationships)
DATASTORE SELECTION TIPS (2)
Prefer one-component stores rather than many moving parts
Choose a store that makes it easy to experiment with schema and query changes & supports easy data migrations
Prefer stores that can work with both dynamic & fixed schemas (there is always an implicit schema)
In early prototypes avoid Column stores as they have a high cost on schema and query changes
DATASTORE SELECTION TIPS (3)
Choose stores that support auto-sharding
Prefer peer-to-peer replication rather than master-slave
Replication factor N=3 is a good standard choice
Consistency Adjustment / Quorum: W > N/2, W + R > N
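The quorum rule can be checked mechanically. W > N/2 means two conflicting writes cannot both reach a majority, and W + R > N means every read set overlaps the most recent write set. A small sketch that enumerates the (W, R) pairs satisfying both conditions for a given replication factor (the function names are illustrative):

```python
from typing import List, Tuple


def satisfies_quorum(n: int, w: int, r: int) -> bool:
    """True when W > N/2 and W + R > N for N replicas."""
    return 2 * w > n and w + r > n


def quorum_settings(n: int) -> List[Tuple[int, int]]:
    """All (W, R) pairs with 1 <= W, R <= N that satisfy the quorum rule."""
    return [(w, r)
            for w in range(1, n + 1)
            for r in range(1, n + 1)
            if satisfies_quorum(n, w, r)]


N = 3
valid = quorum_settings(N)
```

For N=3 the cheapest balanced choice is W=2, R=2, which tolerates one unavailable replica for both reads and writes.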
ALL THAT SAID...
APP CONTEXT is always the determining factor for selecting your store
as well as...
Safety / Stability
Productivity
Community
Performance
Tooling / Ease of operation
DATA MODELING TIPS
Remember that you fit your model to the datastore and not Vice Versa (APPLICATION vs INTEGRATION DB)
Use a Schema
Build your aggregates or column families according to your use cases, i.e. DENORMALIZE per your query requirements
Aggregates form the boundaries for ACID operations (transactions)
Pre-compute Question Focused Datasets (materialized views) to provide data organized differently from their primary aggregates
ARE WE FINISHED YET? NOT QUITE!
Do something with our monolithic app
SPLIT THE MONOLITHIC APPLICATION
Wrap datastores into DATA SERVICES
Create BUSINESS SERVICES on top of Data Services
Prefer RESTful APIs for services (ROA)
Use a Binary Serialization Framework to create RPC APIs if performance is a concern (ROA/SOA)
Move MVC* to fat mobile / web client apps that consume the APIs
JavaScript in the browser is one of the world's most widely distributed execution environments & Deployment is trivial!
DECOUPLED SERVICES
FAT CLIENT
SINGLE PAGE APP
API FRAMEWORK / DSL
class API < Grape::API
  version 'v1', :using => :header, :vendor => 'aquinetix.com'
  default_format :json
  content_type :json, "application/json"
  content_type :tsv, "text/tab-separated-values"
  formatter :tsv, Aquinetix::TsvFormatter
  content_type :kml, "text/xml"
  formatter :kml, Aquinetix::KmlFormatter
  mount CageAPI
  mount CageEventsAPI
  mount DeviceAPI
  mount FeedAPI
  mount FeedingAPI
  mount LossCountEventAPI
  mount OxygenSamplingEventAPI
  mount SigninAPI
  mount TemperatureSamplingEventAPI
  mount UserAPI
  add_swagger_documentation markdown: true, base_path: "http://..."
end
API FRAMEWORK / DSL
class FeedingAPI < Grape::API
  resource :feedings do
    desc 'Create a new feeding'
    post do
      execute_farm_obj_create_request 'Feeding'
    end
    desc 'Perform a FULL or PARTIAL update of an existing feeding'
    params do
      requires :id, type: String, desc: "The id (UUID) of..."
      optional :fields, type: String, desc: "Which fields..."
    end
    put '/:id' do
      execute_farm_obj_update_request 'Feeding'
    end
    desc 'Get a feeding by its id (UUID)'
    params do
      requires :id, :type => String, :desc => "Feeding id."
    end
    get '/:id' do
      execute_farm_obj_instance_get_request 'Feeding'
    end
  end
end
SWAGGER UI
MVC* AT THE CLIENT
Mobile app with backbone.js & phonegap
Management / BI Console with AngularJS
Visualization with D3.js
Multivariate Dataset Analysis at the browser with crossfilter.js
App workflow & build with yeoman, grunt, bower
*MVP, MVVM, MVC, MVW
ASYNCHRONOUS / REAL TIME PROCESSING & STREAMING API
RabbitMQ + RabbitMQ Web-Stomp Plugin at the server
SockJS, Stomp.js libs at the client
Real-time event stream processing with ESPER
Alternative message brokers:
node.js + zeromq
kestrel
pusher
kafka (> 100k msg/sec)
Alternative Real-time stream processing: Storm
USE CASES
count ratings, votes, click-throughs
block abusive crawlers
rate-limit apis
detect spamming attempts
track performance and trigger alerts
batch process logs
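As an illustration of the "rate-limit apis" use case, here is a minimal sliding-window limiter kept in memory. It is a sketch under simplifying assumptions (single process, timestamps passed in by the caller, hypothetical class and client names); in the architecture described here the request events would more likely arrive through the message broker or the stream processor.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within any `window_sec` window."""

    def __init__(self, max_requests: int, window_sec: float) -> None:
        self.max_requests = max_requests
        self.window_sec = window_sec
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_sec:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=2, window_sec=1.0)
decisions = [limiter.allow("crawler", t) for t in (0.0, 0.5, 0.9, 1.0)]
```

Rejected requests are not recorded, so a client that keeps hammering the API is let through again as soon as its earlier accepted requests age out of the window.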
SUBSCRIBE TO STOMP TOPICS FROM JS
ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
@client = Stomp.over(ws)
@client.connect('aquinetix', 'password', ((x) => @on_connect(x)), @on_error, "/")

on_connect: (x) ->
  console.log "Connected to message broker"
  @feeding_subscr_id = @client.subscribe '/topic/feeding', (message) =>
    feeding = JSON.parse(message.body)
    Aq_Manager.events.trigger 'feeding_execution:arrived', feeding
  @position_subscr_id = @client.subscribe '/topic/position', (message) =>
    position = JSON.parse(message.body)
    Aq_Manager.events.trigger 'worker_position:arrived', position

@client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))
REAL TIME EVENT PROCESSING WITH ESPER
select count(*) as tps, max(retweetCount) as maxRetweets
from TwitterEvent.win:time_batch(1 sec)

select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
  MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
from FraudWarningEvent.win:time(30 min) as fraud,
  WithdrawalEvent.win:time(30 sec) as withdraw
where fraud.accountNumber = withdraw.accountNumber
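To show what the first statement computes, the sketch below groups events into fixed one-second batches and reports the event count and the maximum retweet count per batch. It is a plain Python approximation, not Esper: the windows are aligned to multiples of the window length and empty windows are skipped, which may differ from how the engine schedules its batches. The event tuples and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def time_batch_stats(events: Iterable[Tuple[float, int]],
                     window_sec: float = 1.0) -> List[Dict[str, float]]:
    """Per window: number of events (tps for a 1 sec window) and max retweet count.

    `events` holds (timestamp in seconds, retweet count) pairs.
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for timestamp, retweets in events:
        buckets[int(timestamp // window_sec)].append(retweets)
    return [
        {"window_start": index * window_sec, "tps": len(values), "maxRetweets": max(values)}
        for index, values in sorted(buckets.items())
    ]


# Hypothetical stream of (timestamp, retweetCount) events.
stats = time_batch_stats([(0.1, 5), (0.7, 12), (1.2, 3), (3.5, 40)])
```

The value of a stream processor is that it maintains such windows continuously over an unbounded stream, instead of over a list that is already complete.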
LOG ACTIVITY AND OPERATIONAL DATA
Today a critical part of the production features of websites
Logstash + ElasticSearch + Kibana 3
WRAP UP
Should availability, robustness & scalability be added to your hypotheses & value proposition?
if YES then:
Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:
Increased team productivity, Increased readiness and agility, Sustainability
Build your data models around your use cases rather than around your database and experiment with a polyglot persistence strategy
Start with the technologies that are easiest to install, configure & operate.
Keep it SIMPLE & SUSTAINABLE
LINKS / REFERENCES
http://www.rabbitmq.com/web-stomp.html
https://github.com/jmesnil/stomp-websocket/
Introduction to NoSQL - Martin Fowler, goto; conference
Martin Fowler at NoSQL Matters conference
Book on the Lambda Architecture
Talk on Lambda Architecture
William Pietri - Going the Distance: Building a Sustainable Startup
Don't Let the Minimum Win Over the Viable - Harvard Business Review
ElasticSearch Document DB & Search Engine
Cassandra Column DB
Titan Graph DB
Astroboa Semantic Document Store
https://github.com/sockjs/sockjs-client
https://github.com/robey/kestrel
https://github.com/JustinTulloss/zeromq.node
http://kafka.apache.org/index.html
https://github.com/nathanmarz/storm
https://developers.helloreverb.com/swagger/
https://github.com/wordnik/swagger-ui