The Chubby lock service for loosely-coupled distributed systems
Presented by Petko Nikolov
Cornell University
3/12/09
Before Chubby Came About…
• Wide range of distributed systems
• Clients in the 10,000s
• How to do primary election?
  – Ad hoc (no harm from duplicated work)
  – Operator intervention (correctness essential)
• Disorganized
• Costly
• Low availability
Motivation
• Need for service that provides
  – synchronization (leader election, shared env. info.)
  – reliability
  – availability
  – easy-to-understand semantics
  – performance, throughput, latency only secondary
• NOT research
Outline
• Primary Election
  – Paxos
• Design
• Use and Observations
• Related Work
Primary Election
• Distributed consensus problem
• Asynchronous communication
  – loss, delay, reordering
• FLP impossibility result
• Solution: Paxos protocol
Paxos: Problem
• Collection of processes proposing values
  – only proposed value may be chosen
  – only single value chosen
  – learn of chosen value only when it has been
• Proposers, acceptors, learners
• Asynchronous, non-Byzantine model
  – arbitrary speeds, fail by stopping, restart
  – messages not corrupted
Paxos: Algorithm
Phase 1
(a) Proposer sends prepare request with #n
(b) Acceptor: if n > # of any other prepare it has replied to, respond with promise.
Phase 2
(a) If majority reply, proposer sends accept with value v
(b) Acceptor accepts unless it responded to prepare with # higher than n
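The two phases above can be sketched in a few lines of Python. This is a minimal, single-process illustration, not a real implementation: messages are plain method calls, nothing is persisted, and the class and function names (`Acceptor`, `propose`) are made up for this example. The rule that a proposer adopts the value of the highest-numbered proposal already accepted is not on the slide; it is taken from Paxos Made Simple.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = -1                         # highest prepare number answered
    accepted: Optional[Tuple[int, str]] = None   # (n, value) last accepted

    def on_prepare(self, n: int):
        """Phase 1(b): promise only if n exceeds every prepare already answered."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return None  # ignored here; a real acceptor might send a rejection

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was answered."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases synchronously; return the chosen value, or None."""
    majority = len(acceptors) // 2 + 1

    # Phase 1(a): send prepare(n) and collect the promises.
    replies = [r for r in (a.on_prepare(n) for a in acceptors) if r is not None]
    if len(replies) < majority:
        return None

    # If any acceptor already accepted a proposal, adopt the value of the
    # highest-numbered one (rule from Paxos Made Simple, not on the slide).
    prior = [acc for _, acc in replies if acc is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]

    # Phase 2(a): send accept(n, value) and count acceptances.
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(5)]
first = propose(acceptors, n=1, value="replica-A")
second = propose(acceptors, n=2, value="replica-B")   # must re-propose replica-A
stale = propose(acceptors, n=1, value="replica-C")    # number too low, no promises
```

Once a value is chosen, a later proposer with a higher number can only re-propose that same value, which is why at most one value is ever chosen.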
Paxos: Algorithm
• Learning of chosen value
  – distinguished learner optimization
    • has pitfalls
• Making progress
  – distinguished proposer
• Usually, every process plays all roles
  – primary as distinguished proposer and learner
Paxos: State Machines
• Replicated state machine
  – same state if same sequence of ops. performed
• Client sends requests to server
  – replicated with Paxos
• Paxos used to agree on order of client ops.
  – can have failures / more than 1 master
  – Paxos guarantees only 1 value chosen & replicated
Paxos: View Change
Design
• Lock service (and not consensus library)
• Serve small files
• Support large-scale concurrent file viewing
• Event notification mechanism
• Caching of files (consistent caching)
• Security (access control)
• Coarse-grained locks
Design: Rationale
• Lock service vs. Paxos library
• Advantages
  – maintain program structure, comm. patterns
  – mechanism for advertising results
  – persuading programmers to use it
  – reduce # of client servers needed to make progress
Design: Rationale
• Coarse-grained locks
  – less load on lock server
  – less delay when lock server fails
  – should survive lock server failures
  – fewer lock servers and less availability required
• Fine-grained locks
  – heavier lock server load, more client stalling on fail
  – can be implemented on client side
Design: System Structure
• Two main components:
  – server (Chubby cell)
  – client library
  – communicate via RPC
• Proxy
  – optional
  – more on this later
Design: Chubby Cell
• Set of replicas (typically 5)
• Use Paxos to elect master
  – promise not to elect new master for some time (master lease)
• Maintain copies of simple database
• Writes satisfied by majority quorum
• Reads satisfied by master alone
• Replacement system for failed replicas
Design: Chubby Clients
• Link against library
• Master location requests to replicas
• All requests sent directly to master
Design: Files, Dirs, Handles
• FS interface
  – /ls/cs6464-cell/lab2/test
  – specialized API
  – also via interface used by GFS
• Does not support/maintain/reveal
  – moving files
  – path-dependent permission semantics
  – dir modified times / file last-access times
Design: Nodes
• Permanent vs. ephemeral
• Metadata
  – three names of ACLs (R/W/change ACL name)
    • authentication built into RPC
  – 4 monotonically increasing 64-bit numbers
    • instance, content generation, lock generation, ACL gen.
  – 64-bit file-content checksum
Design: Handles
• Analogous to UNIX file descriptors
• Check digits
  – prevent client creating/guessing handles
• Support for use across master changes
  – sequence number
  – mode information for recreating state
Design: Locks and Sequencers
• Any node can act as lock (shared or exclusive)
• Advisory (vs. mandatory)
  – protect resources at remote services
  – debugging/admin. purposes
  – no value in extra guards by mandatory locks
• Write permission needed to acquire
  – prevents unprivileged reader blocking progress
Design: Locks and Sequencers
• Complex in async environment
• Use sequence #’s in interactions using locks
• Sequencer
  – opaque byte-string
  – state of lock immediately after acquisition
  – passed by client to servers, servers validate
• Alternative: lock-delay
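To make the sequencer idea concrete, here is a small Python sketch under loud assumptions: `LockService` and `FileServer` are hypothetical in-memory stand-ins, the sequencer is a plain record rather than an opaque byte-string, and the lock path is invented. It only shows why a server that validates sequencers can reject a delayed request from a client that has since lost the lock.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    mode: str         # "exclusive" or "shared"
    generation: int   # lock generation number at acquisition time


class LockService:
    """Hypothetical in-memory lock service that hands out sequencers."""

    def __init__(self) -> None:
        self._generation: Dict[str, int] = {}
        self._holder: Dict[str, Optional[str]] = {}

    def acquire(self, lock_name: str, client: str) -> Optional[Sequencer]:
        if self._holder.get(lock_name) is not None:
            return None  # lock is already held
        generation = self._generation.get(lock_name, 0) + 1
        self._generation[lock_name] = generation
        self._holder[lock_name] = client
        return Sequencer(lock_name, "exclusive", generation)

    def release(self, lock_name: str) -> None:
        self._holder[lock_name] = None

    def check_sequencer(self, seq: Sequencer) -> bool:
        """True only while the lock is held in the generation seq names."""
        return (self._holder.get(seq.lock_name) is not None
                and self._generation.get(seq.lock_name) == seq.generation)


class FileServer:
    """Hypothetical resource server that validates sequencers before writing."""

    def __init__(self, locks: LockService) -> None:
        self._locks = locks
        self.data = b""

    def write(self, seq: Sequencer, data: bytes) -> bool:
        if not self._locks.check_sequencer(seq):
            return False  # stale sequencer: the lock has changed hands
        self.data = data
        return True


locks = LockService()
server = FileServer(locks)
seq_a = locks.acquire("/ls/cell/primary", "client-A")
ok_first = server.write(seq_a, b"from A")
locks.release("/ls/cell/primary")            # e.g. client A's session expired
seq_b = locks.acquire("/ls/cell/primary", "client-B")
ok_stale = server.write(seq_a, b"late write from A")   # rejected
ok_new = server.write(seq_b, b"from B")
```

The generation number is what lets the server tell the old holder's request apart from the new holder's, even though both name the same lock.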
Design: Events
• Client subscribes when creating handle
• Delivered async via up-call from client library
• Event types
  – file contents modified
  – child node added/removed/modified
  – Chubby master failed over
  – handle/lock have become invalid
  – lock acquired / conflicting lock request (rarely used)
Design: API
• Open() (only call using named node)
  – how handle will be used (access checks here)
  – events to subscribe to
  – lock-delay
  – whether new file/dir should be created
• Close() vs. Poison()
• Other ops:
  – GetContentsAndStat(), SetContents(), Delete(), Acquire(), TryAcquire(), Release(), GetSequencer(), SetSequencer(), CheckSequencer()
Design: API
• Primary election example
• Candidates attempt to open lock file / get lock
  – winner writes identity with SetContents()
  – replicas find out with GetContentsAndStat(), possibly after file-modification event
• Primary obtains sequencer (GetSequencer())
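The election pattern above might look roughly like this. `FakeLockFile` is a hypothetical single-process stand-in whose method names only mirror the Chubby calls on the slide (TryAcquire, SetContents, GetContentsAndStat, GetSequencer); it is not the real client library, and the tuple it returns as a sequencer is an illustrative simplification.

```python
from typing import Dict, List, Optional, Tuple


class FakeLockFile:
    """In-memory stand-in for one Chubby lock file (assumption, not the real API)."""

    def __init__(self) -> None:
        self._holder: Optional[str] = None
        self._contents = b""
        self._lock_generation = 0

    def try_acquire(self, client: str) -> bool:
        if self._holder is None:
            self._holder = client
            self._lock_generation += 1
            return True
        return False

    def set_contents(self, data: bytes) -> None:
        self._contents = data

    def get_contents_and_stat(self) -> Tuple[bytes, Dict[str, int]]:
        return self._contents, {"lock_generation": self._lock_generation}

    def get_sequencer(self, client: str) -> Optional[Tuple[str, int]]:
        if self._holder != client:
            return None  # only the lock holder can obtain a sequencer
        return ("exclusive", self._lock_generation)


def elect_primary(lock_file: FakeLockFile, candidates: List[str]) -> str:
    """Every candidate tries the lock; the winner advertises its identity."""
    for candidate in candidates:
        if lock_file.try_acquire(candidate):
            lock_file.set_contents(candidate.encode())
    # Losers (and any other client) learn who the primary is from the file.
    contents, _stat = lock_file.get_contents_and_stat()
    return contents.decode()


lock_file = FakeLockFile()
primary = elect_primary(lock_file, ["replica-1", "replica-2", "replica-3"])
sequencer = lock_file.get_sequencer(primary)
loser_sequencer = lock_file.get_sequencer("replica-2")
```

In a real deployment the candidates would be separate processes, and the losers would typically wait for a file-modification event instead of polling.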
Design: Sessions and KeepAlives
• Session maintained through KeepAlives
• Handles, locks, cached data remain valid
  – client must acknowledge invalidation messages
• Terminated explicitly, or after lease timeout
• Lease timeout advanced when
  – session created
  – master fail-over occurs
  – master responds to KeepAlive RPC
Design: Sessions and KeepAlives
• Master responds close to lease timeout
• Client sends another KeepAlive immediately
Design: Sessions and KeepAlives
• Handles, locks, cached data remain valid
  – client must acknowledge invalidation messages
• Cache invalidations piggybacked on KeepAlive
  – client must invalidate to maintain session
  – RPCs flow from client to master
  – allows operation through firewalls
Design: Sessions and KeepAlives
• Client maintains local lease timeout
  – conservative approximation
  – must assume known restrictions on clock skew
• When local lease expires
  – disable cache
  – session in jeopardy, client waits in grace period
  – cache enabled on reconnect
• Application informed about session changes
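A rough sketch of the client-side bookkeeping described above. The function names, the drift bound, and the numeric values are illustrative assumptions, not taken from the slides: the client measures its lease from the moment it sent the KeepAlive (so reply flight time cannot make it optimistic) and shortens it by an assumed bound on clock drift.

```python
def local_lease_expiry(keepalive_sent_at: float,
                       lease_seconds: float,
                       max_clock_drift: float) -> float:
    """Conservative local estimate of when the master's lease runs out.

    keepalive_sent_at: local time at which the KeepAlive request was sent.
    max_clock_drift:   assumed bound on relative clock rate error (e.g. 0.05).
    """
    return keepalive_sent_at + lease_seconds * (1.0 - max_clock_drift)


def session_state(now: float, lease_expiry: float, grace_seconds: float) -> str:
    """Classify the session as seen by the client library."""
    if now < lease_expiry:
        return "valid"       # cache enabled, handles and locks usable
    if now < lease_expiry + grace_seconds:
        return "jeopardy"    # cache disabled, waiting for a master
    return "expired"         # application told the session is gone


# Illustrative numbers only: 12 s lease, 5% drift bound, 45 s grace period.
expiry = local_lease_expiry(keepalive_sent_at=100.0,
                            lease_seconds=12.0,
                            max_clock_drift=0.05)
state_early = session_state(105.0, expiry, grace_seconds=45.0)
state_late = session_state(120.0, expiry, grace_seconds=45.0)
state_gone = session_state(200.0, expiry, grace_seconds=45.0)
```

Erring on the short side is safe: the client may disable its cache slightly too early, but it never trusts cached data after the master could have let the session lapse.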
Design: Caching
• Client caches file data, node meta-data
  – write-through held in memory
• Invalidation
  – master keeps list of what clients may have cached
  – writes block, master sends invalidations
  – clients flush changed data, ack. with KeepAlive
  – data uncachable until invalidation acked
    • allows reads to happen without delay
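The invalidation protocol above, reduced to a synchronous toy. `Master` and `Client` are hypothetical classes, the file path is invented, and a direct method call stands in for the invalidation that Chubby piggybacks on KeepAlive replies; the call returning plays the role of the acknowledgement.

```python
from typing import Dict, Optional, Set


class Client:
    """Hypothetical client with a consistent, invalidation-based cache."""

    def __init__(self, master: "Master") -> None:
        self._master = master
        self.cache: Dict[str, Optional[bytes]] = {}

    def read(self, path: str) -> Optional[bytes]:
        if path not in self.cache:
            self.cache[path] = self._master.read(self, path)
        return self.cache[path]

    def invalidate(self, path: str) -> None:
        # Flush the entry; returning from this call acts as the ack.
        self.cache.pop(path, None)


class Master:
    """Hypothetical master that tracks which clients may cache each node."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self._cachers: Dict[str, Set[Client]] = {}

    def read(self, client: Client, path: str) -> Optional[bytes]:
        self._cachers.setdefault(path, set()).add(client)
        return self.files.get(path)

    def write(self, path: str, data: bytes) -> None:
        # The write blocks until every possible cacher has invalidated.
        for client in self._cachers.pop(path, set()):
            client.invalidate(path)
        self.files[path] = data


master = Master()
master.write("/ls/cell/config", b"v1")
reader = Client(master)
before = reader.read("/ls/cell/config")
master.write("/ls/cell/config", b"v2")
cached_after_write = "/ls/cell/config" in reader.cache
after = reader.read("/ls/cell/config")
```

Because the master invalidates rather than pushes the new data, a client that never reads the file again pays nothing for later writes.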
Design: Caching
• Invalidates data but does not update
  – updating arbitrarily inefficient
• Strict vs. weak consistency
  – weaker models harder to use for programmers
  – do not want to alter preexisting comm. protocols
• Handles and locks cached as well
  – event informs client of conflicting lock request
• Absence of files cached
Design: Fail-overs
• In-memory state discarded
  – sessions, handles, locks, etc.
• Lease timer “stops”
• Quick re-election
  – clients reconnect before leases expire
• Slow re-election
  – clients disable cache, enter grace period
  – allows sessions across fail-overs
Design: Fail-overs
Steps of newly-elected master:
1. Pick new epoch number
2. Respond only to master location requests
3. Build in-memory state for sessions/locks from DB
4. Respond to KeepAlives
5. Emit fail-over events to caches
6. Wait for acknowledgements / session expire
7. Allow all operations to proceed
Design: Fail-overs
Steps of newly-elected master (cont’d):
8. Handle created pre-fail-over used
  – master recreates in memory, honors call
  – if closed, record that in memory
10. Delete ephemeral files w/o open handles after an interval
• Fail-over code source of many bugs
Design: Database
• First Chubby used replicated Berkeley DB
  – with master lease added on
• Replication code was new
  – did not want to take the risk
• Implemented own simple database
  – distributed using consensus protocol
Design: Backup
• Every few hours
• Snapshot of database to GFS server
  – different building
    • building damage, cyclic dependencies
• Disaster recovery
• Initialize new replica
  – avoid load on in-service replicas
Design: Mirroring
• Collection of files mirrored across cells
• Mostly for configuration files
  – /ls/global/master mirrored to /ls/cell/slave
    • global cell’s replicas spread around world
  – Chubby’s own ACLs
  – files advertising presence/location
  – pointers to Bigtable cells
  – etc.
Mechanisms for Scaling
• Clients individual processes (not machines)
  – observed 90,000 clients for a single master
• Server machines identical to client ones
• Most effective scaling: reduce communication
• Regulate # of Chubby cells
• Increase lease time
• Caching
• Protocol-conversion servers
Scaling: Proxies
• Proxies pass requests from clients to cell
• Can handle KeepAlives and reads
• Not writes, but they are << 1% of workload
• KeepAlive traffic by far most dominant
• Disadvantages:
  – additional RPC for writes / first-time reads
  – increased unavailability probability
  – fail-over strategy not ideal (will come back to this)
Scaling: Partitioning
• Namespace partitioned between servers
• N partitions, each with master and replicas
• Node D/C stored on P(D/C) = hash(D) mod N
  – meta-data for D may be on different partition
• Little cross-partition comm. desirable
  – permission checks
  – directory deletion
  – caching helps mitigate this
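The placement rule P(D/C) = hash(D) mod N can be illustrated as follows. The slides do not say which hash function is used, so CRC-32 here is purely an assumption chosen because it is stable across runs (Python's built-in `hash()` for strings is randomized per process); the paths and `partition_for` are also made up for the example.

```python
import posixpath
import zlib


def partition_for(node_path: str, n_partitions: int) -> int:
    """Partition index for node D/C, computed from its directory D only."""
    directory = posixpath.dirname(node_path)        # D in the slide's notation
    return zlib.crc32(directory.encode("utf-8")) % n_partitions


N = 4
# Siblings share a directory, so they always land on the same partition.
p_x = partition_for("/ls/cell/app/x", N)
p_y = partition_for("/ls/cell/app/y", N)

# The directory node itself is placed by hashing *its* parent, so its
# meta-data may live on a different partition than its children.
p_dir = partition_for("/ls/cell/app", N)
```

Hashing on the directory keeps a directory's children together, which is why only a few operations, such as permission checks and directory deletion, need cross-partition communication.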
Use and Observations
• Many files for naming
• Config, ACL, meta-data common
• 10 clients use each cached file, on avg.
• Few locks held, no shared locks
• KeepAlives dominate RPC traffic
Use: Outages
• Sample of cells
  – 61 outages over few weeks (700 cell-days)
  – due to network congestion, maintenance, overload, errors in software, hardware, operators
• 52 outages under 30 s
  – applications not significantly affected
• Few dozen cell-years of operation
  – data lost on 6 occasions (bugs & operator error)
Use: Java Clients
• Most of Google infrastructure is in C++
• Growing # of Java applications
• Googlers dislike JNI
  – would rather translate library to Java
  – maintaining it would require great expense
• Java users run protocol-conversion server
  – exports protocol similar to Chubby’s client API
Use: Name Service
• Most popular use of Chubby
  – provides name service for most Google systems
• DNS uses TTL values
  – entries must be refreshed within that time
  – huge (and variable) load on DNS server
• Chubby’s caching uses invalidations, no polling
  – client builds up needed entries in cache
  – name entries further grouped in batches
Use: Name Service
• Name service
  – no full consistency needed
  – reduce load with protocol-conversion server
• Chubby DNS server
  – naming data available to DNS clients
  – eases transition between names
  – accommodates browsers
Use: Fail-over Problems
• Master writes sessions to DB when created
  – start of many processes at once = overload
• DB modified: store session at first write op.
  – read-only sessions: at random on KeepAlives
  – spread out writes to DB in time
• Young read-only sessions may be “discarded”
  – may read stale data for a while after fail-over
  – very low probability
Use: Fail-over Problems
• New design: no sessions in database
  – recreate them like handles after fail-over
  – new master waits full lease time before ops.
    • little effect, very low probability
• Proxy servers can manage sessions
  – allowed to change session a lock is associated with
    • permits takeover of session by another proxy on fail
  – master gives new proxy chance to claim locks before relinquishing them
Use: Abusive Clients
• Company environment assumed
• Requests to use Chubby thoroughly reviewed
• Abuses:
  – lack of aggressive caching
    • absence of files, open file handles
  – lack of quotas
    • 256 kB limit on file size introduced
    • encouraged use of appropriate storage systems
  – publish/subscribe
Use: Lessons Learned
• Developers rarely consider availability
  – should plan for short Chubby outages
  – crashed applications on fail-over event
• Fine-grained locking not essential
• Poor API choices
  – handles acquiring locks cannot be shared
• RPC use affects transport protocols
  – forced to send KeepAlives by UDP for timeliness
Related Work
Chubby
• locks, storage system, session/lease in one service
• target audience: wide range
• higher-level interface
• lost lock expensive for clients
• could use locks and sequencers with other systems
Boxwood
• 3 separate services
  – lock, Paxos, failure detection
  – could be used independently
• fewer, more sophisticated developers
• different default parameters
• lacks grace period
• uses locks primarily within
Summary
• Distributed lock service
  – coarse-grained synchronization for Google’s distributed systems
• Design based on well-known ideas
  – distributed consensus, caching, notifications, file-system interface
• Primary internal name service
• Repository for files requiring high availability
References
The Chubby lock service for loosely-coupled distributed systems, Mike Burrows. Appears in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November, 2006. http://labs.google.com/papers/chubby-osdi06.pdf
Paxos Made Simple, Leslie Lamport. Appears in ACM SIGACT News (Distributed Computing Column), Vol. 32, No. 4 (December 2001), pages 51-58. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
Also, Paxos Made Practical by David Mazieres. http://www.cs.cornell.edu/courses/cs6464/2009sp/papers/paxos_practical.pdf