GFS
Arvind Krishnamurthy (based on slides from Tom Anderson & Dan Ports)
Google Stack
– GFS: large-scale storage for bulk data
– Chubby: Paxos storage for coordination
– BigTable: semi-structured data storage
– MapReduce: big data computation on key-value pairs
– MegaStore, Spanner: transactional storage with geo-replication
GFS
• Needed: a distributed file system for storing the results of the web crawl and the search index
• Why not use NFS? (That is, some existing distributed file system.)
• Why not use NFS?
– Very different workload characteristics!
– Design GFS for Google apps, Google apps for GFS
• Requirements:
– Fault tolerance, availability, throughput, scale
– Concurrent streaming reads and writes
GFS Workload
• Producer/consumer
– Hundreds of web crawling clients
– Periodic batch analytic jobs like MapReduce
– Throughput, not latency
• Big data sets (for the time):
– 1000 servers, 300 TB of data stored
• Later: BigTable tablet log and SSTables
• Even later: what does the workload look like now?
GFS Workload
• A few million 100 MB+ files
– Many are huge
• Reads:
– Mostly large streaming reads
– Some sorted random reads
• Writes:
– Most files written once, never updated
– Most writes are appends, e.g., by concurrent workers
GFS Interface
• App-level library
– Not a kernel file system
– Not a POSIX file system
• create, delete, open, close, read, write, append
– Metadata operations are linearizable
– File data is eventually consistent (stale reads possible)
• Inexpensive file and directory snapshots
Life without random writes
• Results of a previous crawl:
www.page1.com -> www.my.blogspot.com
www.page2.com -> www.my.blogspot.com
• New results: page2 no longer has the link, but there is a new page, page3:
www.page1.com -> www.my.blogspot.com
www.page3.com -> www.my.blogspot.com
• Option: delete old record (page2); insert new record (page3)
– Requires locking, hard to implement
• GFS: append new records to the file atomically
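As a rough illustration (not the actual GFS client API; `gfs_client` and its `record_append` are hypothetical stand-ins), a crawler worker could publish its results purely by appending records:

```python
# Sketch only: gfs_client.record_append() is a hypothetical stand-in for an
# app-level GFS client library call, not a real API.

def publish_crawl_results(gfs_client, links):
    """Append one record per discovered link; no locking, no in-place updates."""
    for src, dst in links:
        record = f"{src} -> {dst}\n".encode()
        # Atomic record append: GFS chooses the offset, so many concurrent
        # crawler workers can safely append to the same file.
        gfs_client.record_append("/crawl/links", record)

# Readers later scan the whole file and keep only the newest record per source page.
```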
GFS Architecture
• Each file is stored as 64 MB chunks
• Each chunk is replicated on 3+ chunkservers
• A single master stores metadata
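Since chunks are a fixed 64 MB, a client can map a byte offset to a chunk index with simple arithmetic before asking the master for that chunk's locations; a small sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def chunk_index(offset: int) -> int:
    """Which chunk of the file holds this byte offset."""
    return offset // CHUNK_SIZE

def chunk_range(index: int) -> tuple[int, int]:
    """Byte range [start, end) covered by a chunk."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE

# Example: byte 200,000,000 falls in chunk 2 (chunks 0..2 cover the first 192 MB).
assert chunk_index(200_000_000) == 2
```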
“Single” Master Architecture
• Master stores metadata:
– File namespace, file name -> chunk list
– Chunk ID -> list of chunkservers holding it
– Metadata stored in memory (~64 B/chunk)
• Master does not store file contents
– All requests for file data go directly to chunkservers
• Hot standby replication using shadow masters
– Fast recovery
• All metadata operations are linearizable
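A rough sketch of the two in-memory tables this implies (field names are illustrative, not taken from the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    # Namespace entry: file name -> ordered list of chunk handles (IDs).
    chunk_handles: list[str] = field(default_factory=list)

@dataclass
class ChunkMetadata:
    # Chunk handle -> chunkservers currently holding a replica.
    # Learned from chunkserver heartbeats rather than persisted in the log.
    replicas: list[str] = field(default_factory=list)
    version: int = 0

# Master state, all held in memory (~64 B per chunk in the real system).
namespace: dict[str, FileMetadata] = {}
chunks: dict[str, ChunkMetadata] = {}
```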
Master Fault Tolerance
• One master, plus a set of replicas
– Master chosen by Chubby
• Master logs (some) metadata operations
– Changes to the namespace, ACLs, file -> chunk IDs
– Not chunk ID -> chunkserver; why not?
• Replicate each operation at the shadow masters and log it to disk, then execute the op
• Periodic checkpoint of the master's in-memory data
– Allows the master to truncate the log, speeding recovery
– Checkpointing proceeds in parallel with new ops
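A toy illustration of the log-then-apply and checkpoint discipline, assuming a much-simplified master that only tracks the namespace (the real master also replicates the log to shadow masters and checkpoints in a background thread):

```python
import json, pickle

class ToyMaster:
    def __init__(self):
        self.namespace = {}   # file name -> list of chunk IDs
        self.log = []         # real system: appended to disk and shadow masters

    def apply(self, op):
        # 1) Log the namespace mutation durably (and at replicas) ...
        self.log.append(json.dumps(op))
        # 2) ... then execute it against the in-memory state.
        if op["type"] == "create":
            self.namespace[op["name"]] = []
        elif op["type"] == "add_chunk":
            self.namespace[op["name"]].append(op["chunk_id"])

    def checkpoint(self, path):
        # Snapshot the in-memory state so the log can be truncated;
        # recovery = load the latest checkpoint, then replay the remaining log.
        with open(path, "wb") as f:
            pickle.dump(self.namespace, f)
        self.log.clear()
```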
Handling Write Operations
• A mutation is a write or an append
• Goal: minimize master involvement
• Lease mechanism
– Master picks one replica as primary; gives it a lease
– Primary defines a serial order of mutations
• Data flow decoupled from control flow
Write Operations
• Application originates the write request
• GFS client translates the request from (fname, data) to (fname, chunk index) and sends it to the master
• Master responds with the chunk handle and the (primary + secondary) replica locations
• Client pushes the write data to all locations; the data is stored in the chunkservers’ internal buffers
• Client sends the write command to the primary
Write Operations (contd.)
• Primary determines the serial order for the data instances stored in its buffer and writes the instances in that order to the chunk
• Primary sends the serial order to the secondaries and tells them to perform the write
• Secondaries respond to the primary
• Primary responds back to the client
• If the write fails at one of the chunkservers, the client is informed and retries the write/append, but another client may read stale data from that chunkserver
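Putting the steps together, a schematic of the client side of a single-chunk write; `master`, `connect`, and their methods are hypothetical stubs, not a real RPC interface:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_write(master, connect, fname, offset, data):
    """Schematic client side of a GFS write; error handling and
    chunk-boundary splitting are omitted."""
    # 1. Translate (fname, offset) into a chunk index and ask the master.
    index = offset // CHUNK_SIZE
    handle, primary, secondaries = master.get_chunk_locations(fname, index)

    # 2. Push the data to every replica; it sits in chunkserver buffers,
    #    not yet applied to the chunk.
    for server in [primary] + secondaries:
        connect(server).push_data(handle, data)

    # 3. Ask the primary to commit: it picks the serial order, applies it
    #    locally, and forwards that order to the secondaries.
    ok = connect(primary).write(handle, offset % CHUNK_SIZE, data, secondaries)

    # 4. On failure the client simply retries; until a retry succeeds,
    #    another client may read stale data from the failed chunkserver.
    return ok
```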
At-Least-Once Append
• If there is a failure at the primary or any replica, retry the append (at a new offset)
– The append will eventually succeed!
– It may succeed multiple times!
• The app client library is responsible for
– Detecting corrupted copies of appended records
– Ignoring extra copies (during streaming reads)
• Why not append exactly once?
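One plausible way for the app library to do this, sketched below: writers embed a unique record ID and checksum in each appended record, and readers drop corrupted copies and duplicates during streaming reads (the record format and helper names are illustrative):

```python
import zlib

def encode_record(record_id: str, payload: bytes) -> bytes:
    # Self-describing record: id, length, CRC, newline, then the payload.
    header = f"{record_id}|{len(payload)}|{zlib.crc32(payload)}\n".encode()
    return header + payload

def read_records(raw_records):
    """Yield each valid record once, skipping duplicates and corrupted copies."""
    seen = set()
    for raw in raw_records:
        header, _, payload = raw.partition(b"\n")
        try:
            record_id, length, crc = header.decode().split("|")
        except ValueError:
            continue                      # corrupted header: skip
        if len(payload) != int(length) or zlib.crc32(payload) != int(crc):
            continue                      # corrupted copy: skip
        if record_id in seen:
            continue                      # duplicate from a retried append: skip
        seen.add(record_id)
        yield record_id, payload
```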
Question
Does the BigTable tablet server use “at least once append” for its operation log?
Caching
• GFS caches file metadata on clients
– Ex: chunk ID -> chunkservers
– Used as a hint: invalidate on use
– A 1 TB file => 16K chunks
• GFS does not cache file data on clients
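A sketch of the invalidate-on-use hint cache, assuming a hypothetical `master.lookup` call:

```python
class ChunkLocationCache:
    def __init__(self, master):
        self.master = master
        self.cache = {}   # chunk ID -> list of chunkservers (a hint; may be stale)

    def locations(self, chunk_id):
        # Use the cached hint if present, otherwise ask the master.
        if chunk_id not in self.cache:
            self.cache[chunk_id] = self.master.lookup(chunk_id)
        return self.cache[chunk_id]

    def invalidate(self, chunk_id):
        # Called when a read fails because the hint was stale;
        # the next access refetches fresh locations from the master.
        self.cache.pop(chunk_id, None)
```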
Garbage Collection
• File delete => rename to a hidden file
• Background task at the master
– Deletes hidden files
– Deletes any unreferenced chunks
• Simpler than foreground deletion
– What if a chunkserver is partitioned during the delete?
• Background GC is needed anyway
– Stale/orphaned chunks
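A much-simplified sketch of this lazy-deletion scheme, assuming the namespace is just a dict from file name to chunk-ID list and using an illustrative 3-day grace period:

```python
import time

HIDE_SECONDS = 3 * 24 * 3600   # illustrative grace period for hidden files

def delete_file(namespace, name):
    # Deletion is just a rename; the data stays reachable during the grace period.
    namespace[f".deleted.{name}.{int(time.time())}"] = namespace.pop(name)

def gc_pass(namespace, all_chunks_on_servers):
    now = time.time()
    # 1. Remove hidden files whose grace period has expired.
    for hidden in [n for n in namespace if n.startswith(".deleted.")]:
        if now - int(hidden.rsplit(".", 1)[1]) > HIDE_SECONDS:
            del namespace[hidden]
    # 2. Any chunk a chunkserver reports that no file references is garbage.
    referenced = {c for chunk_list in namespace.values() for c in chunk_list}
    return [c for c in all_chunks_on_servers if c not in referenced]
```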
Data Corruption
• Files are stored on Linux, and Linux has bugs
– Sometimes silent corruptions
• Files are stored on disks, and disks are not fail-stop
– Stored blocks can become corrupted over time
– Ex: writes to sectors on nearby tracks
– Rare events become common at scale
• Chunkservers maintain CRCs for each 64 KB block of a chunk
– Local log of CRC updates
– Verify CRCs before returning read data
– Periodic revalidation to detect background failures
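A sketch of this checksum scheme, using `zlib.crc32` as a stand-in for the chunkserver's CRC: one checksum per 64 KB block, and every block overlapping a read is verified before data is returned:

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksum granularity within a chunk

def block_crcs(chunk_data: bytes) -> list[int]:
    """CRC for each 64 KB block; updated as the chunk is written."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def read_verified(chunk_data: bytes, crcs: list[int], offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before returning data."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != crcs[b]:
            raise IOError(f"corrupted block {b}: report to master and re-replicate")
    return chunk_data[offset:offset + length]
```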
Discussion
• Is this a good design?
• Can we improve on it?
• Will it scale to even larger workloads?
~15 years later
• Scale is much bigger:
– Now 10K servers instead of 1K
– Now 100 PB instead of 100 TB
• Bigger workload change: updates to small files!
• Around 2010: incremental updates of the Google search index
GFS -> Colossus
• GFS scaled to ~50 million files, ~10 PB
• Developers had to organize their apps around large append-only files (see BigTable)
• Latency-sensitive applications suffered
• GFS was eventually replaced with a new design, Colossus
Metadata scalability
• Main scalability limit: the single master stores all metadata
• HDFS has the same problem (single NameNode)
• Approach: partition the metadata among multiple masters
• The new system supports ~100M files per master and smaller chunk sizes: 1 MB instead of 64 MB
Reducing Storage Overhead
• Replication: 3x storage to keep two extra copies (tolerates two failures)
• Erasure coding is more flexible: m data pieces, n check pieces
– e.g., RAID-5: 2 data disks, 1 parity disk (XOR of the other two) => tolerates 1 failure with only 1.5x storage
• Sub-chunk writes are more expensive (read-modify-write)
• After a failure: get all the other pieces, regenerate the missing one
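A tiny illustration of the RAID-5 style XOR parity mentioned above: the check piece is the XOR of the data pieces, so any single lost piece can be rebuilt from the others.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Two data pieces plus one parity piece (1.5x storage, tolerates 1 failure).
d0, d1 = b"chunk-piece-0...", b"chunk-piece-1..."
parity = xor_blocks(d0, d1)

# If d1 is lost, regenerate it from the surviving piece and the parity.
recovered = xor_blocks(d0, parity)
assert recovered == d1
```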
Erasure Coding
• 3-way replication: 3x overhead, 2 failures tolerated, easy recovery
• Google Colossus: (6,3) Reed-Solomon code
– 1.5x overhead, 3 failures tolerated
• Facebook HDFS: (10,4) Reed-Solomon
– 1.4x overhead, 4 failures tolerated, expensive recovery
• Azure: more advanced code (12,4)
– 1.33x overhead, 4 failures tolerated, same recovery cost as Colossus
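These overhead figures follow directly from the code parameters: storing m data pieces plus n check pieces costs (m + n)/m times the logical data size. A quick check:

```python
def overhead(data_pieces: int, check_pieces: int) -> float:
    """Storage overhead of an (m, n) erasure code: (m + n) / m."""
    return (data_pieces + check_pieces) / data_pieces

print(overhead(1, 2))    # 3.0   -> 3-way replication (1 data copy + 2 extra)
print(overhead(6, 3))    # 1.5   -> Colossus (6,3) Reed-Solomon
print(overhead(10, 4))   # 1.4   -> Facebook HDFS (10,4)
print(overhead(12, 4))   # ~1.33 -> Azure (12,4)
```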