GFS
Arvind Krishnamurthy (based on slides from Tom Anderson & Dan Ports)
Google Stack
– GFS: large-scale storage for bulk data
– Chubby: Paxos storage for coordination
– BigTable: semi-structured data storage
– MapReduce: big data computation on key-value pairs
– MegaStore, Spanner: transactional storage with geo-replication
GFS
• Needed: a distributed file system for storing the results of the web crawl and the search index
• Why not use NFS? (That is, some existing distributed file system.)
• Why not use NFS?
– Very different workload characteristics!
– Design GFS for Google apps, Google apps for GFS
• Requirements:
– Fault tolerance, availability, throughput, scale
– Concurrent streaming reads and writes
GFS Workload
• Producer/consumer
– Hundreds of web crawling clients
– Periodic batch analytic jobs like MapReduce
– Throughput, not latency
• Big data sets (for the time):
– 1000 servers, 300 TB of data stored
• Later: BigTable tablet log and SSTables
• Even later: what does the workload look like now?
GFS Workload
• A few million 100 MB+ files
– Many are huge
• Reads:
– Mostly large streaming reads
– Some sorted random reads
• Writes:
– Most files written once, never updated
– Most writes are appends, e.g., by concurrent workers
GFS Interface
• App-level library
– Not a kernel file system
– Not a POSIX file system
• create, delete, open, close, read, write, append
– Metadata operations are linearizable
– File data is eventually consistent (stale reads possible)
• Inexpensive file and directory snapshots
Life without random writes
• Results of a previous crawl:
www.page1.com -> www.my.blogspot.com
www.page2.com -> www.my.blogspot.com
• New results: page2 no longer has the link, but there is a new page, page3:
www.page1.com -> www.my.blogspot.com
www.page3.com -> www.my.blogspot.com
• Option: delete old record (page2); insert new record (page3)
– Requires locking, hard to implement
• GFS: append new records to the file atomically
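As a rough illustration (not the actual GFS client API; `gfs_client` and its `record_append` are hypothetical stand-ins), a crawler worker could publish its results purely by appending records:

```python
# Sketch only: gfs_client.record_append() is a hypothetical stand-in for an
# app-level GFS client library call, not a real API.

def publish_crawl_results(gfs_client, links):
    """Append one record per discovered link; no locking, no in-place updates."""
    for src, dst in links:
        record = f"{src} -> {dst}\n".encode()
        # Atomic record append: GFS chooses the offset, so many concurrent
        # crawler workers can safely append to the same file.
        gfs_client.record_append("/crawl/links", record)

# Readers later scan the whole file and keep only the newest record per source page.
```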
GFS Architecture
• Each file is stored as 64 MB chunks
• Each chunk is replicated on 3+ chunkservers
• A single master stores metadata
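Since chunks are a fixed 64 MB, a client can map a byte offset to a chunk index with simple arithmetic before asking the master for that chunk's locations; a small sketch:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def chunk_index(offset: int) -> int:
    """Which chunk of the file holds this byte offset."""
    return offset // CHUNK_SIZE

def chunk_range(index: int) -> tuple[int, int]:
    """Byte range [start, end) covered by a chunk."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE

# Example: byte 200,000,000 falls in chunk 2 (chunks 0..2 cover the first 192 MB).
assert chunk_index(200_000_000) == 2
```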
“Single” Master Architecture
• Master stores metadata:
– File namespace, file name -> chunk list
– Chunk ID -> list of chunkservers holding it
– Metadata stored in memory (~64 B/chunk)
• Master does not store file contents
– All requests for file data go directly to chunkservers
• Hot standby replication using shadow masters
– Fast recovery
• All metadata operations are linearizable
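A rough sketch of the two in-memory tables this implies (field names are illustrative, not taken from the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    # Namespace entry: file name -> ordered list of chunk handles (IDs).
    chunk_handles: list[str] = field(default_factory=list)

@dataclass
class ChunkMetadata:
    # Chunk handle -> chunkservers currently holding a replica.
    # Learned from chunkserver heartbeats rather than persisted in the log.
    replicas: list[str] = field(default_factory=list)
    version: int = 0

# Master state, all held in memory (~64 B per chunk in the real system).
namespace: dict[str, FileMetadata] = {}
chunks: dict[str, ChunkMetadata] = {}
```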
Master Fault Tolerance
• One master, plus a set of replicas
– Master chosen by Chubby
• Master logs (some) metadata operations
– Changes to the namespace, ACLs, file -> chunk IDs
– Not chunk ID -> chunkserver; why not?
• Replicate each operation at the shadow masters and log it to disk, then execute the op
• Periodic checkpoint of the master's in-memory data
– Allows the master to truncate the log, speeding recovery
– Checkpointing proceeds in parallel with new ops
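A toy illustration of the log-then-apply and checkpoint discipline, assuming a much-simplified master that only tracks the namespace (the real master also replicates the log to shadow masters and checkpoints in a background thread):

```python
import json, pickle

class ToyMaster:
    def __init__(self):
        self.namespace = {}   # file name -> list of chunk IDs
        self.log = []         # real system: appended to disk and shadow masters

    def apply(self, op):
        # 1) Log the namespace mutation durably (and at replicas) ...
        self.log.append(json.dumps(op))
        # 2) ... then execute it against the in-memory state.
        if op["type"] == "create":
            self.namespace[op["name"]] = []
        elif op["type"] == "add_chunk":
            self.namespace[op["name"]].append(op["chunk_id"])

    def checkpoint(self, path):
        # Snapshot the in-memory state so the log can be truncated;
        # recovery = load the latest checkpoint, then replay the remaining log.
        with open(path, "wb") as f:
            pickle.dump(self.namespace, f)
        self.log.clear()
```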
Handling Write Operations
• A mutation is a write or an append
• Goal: minimize master involvement
• Lease mechanism
– Master picks one replica as primary; gives it a lease
– Primary defines a serial order of mutations
• Data flow decoupled from control flow
Write Operations
• Application originates the write request
• GFS client translates the request from (fname, data) to (fname, chunk index) and sends it to the master
• Master responds with the chunk handle and the (primary + secondary) replica locations
• Client pushes the write data to all locations; the data is stored in the chunkservers’ internal buffers
• Client sends the write command to the primary
Write Operations (contd.)
• Primary determines the serial order for the data instances stored in its buffer and writes the instances in that order to the chunk
• Primary sends the serial order to the secondaries and tells them to perform the write
• Secondaries respond to the primary
• Primary responds back to the client
• If the write fails at one of the chunkservers, the client is informed and retries the write/append, but another client may read stale data from that chunkserver
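Putting the steps together, a schematic of the client side of a single-chunk write; `master`, `connect`, and their methods are hypothetical stubs, not a real RPC interface:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_write(master, connect, fname, offset, data):
    """Schematic client side of a GFS write; error handling and
    chunk-boundary splitting are omitted."""
    # 1. Translate (fname, offset) into a chunk index and ask the master.
    index = offset // CHUNK_SIZE
    handle, primary, secondaries = master.get_chunk_locations(fname, index)

    # 2. Push the data to every replica; it sits in chunkserver buffers,
    #    not yet applied to the chunk.
    for server in [primary] + secondaries:
        connect(server).push_data(handle, data)

    # 3. Ask the primary to commit: it picks the serial order, applies it
    #    locally, and forwards that order to the secondaries.
    ok = connect(primary).write(handle, offset % CHUNK_SIZE, data, secondaries)

    # 4. On failure the client simply retries; until a retry succeeds,
    #    another client may read stale data from the failed chunkserver.
    return ok
```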
At-Least-Once Append
• If there is a failure at the primary or any replica, retry the append (at a new offset)
– The append will eventually succeed!
– It may succeed multiple times!
• The app client library is responsible for
– Detecting corrupted copies of appended records
– Ignoring extra copies (during streaming reads)
• Why not append exactly once?
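One plausible way for the app library to do this, sketched below: writers embed a unique record ID and checksum in each appended record, and readers drop corrupted copies and duplicates during streaming reads (the record format and helper names are illustrative):

```python
import zlib

def encode_record(record_id: str, payload: bytes) -> bytes:
    # Self-describing record: id, length, CRC, newline, then the payload.
    header = f"{record_id}|{len(payload)}|{zlib.crc32(payload)}\n".encode()
    return header + payload

def read_records(raw_records):
    """Yield each valid record once, skipping duplicates and corrupted copies."""
    seen = set()
    for raw in raw_records:
        header, _, payload = raw.partition(b"\n")
        try:
            record_id, length, crc = header.decode().split("|")
        except ValueError:
            continue                      # corrupted header: skip
        if len(payload) != int(length) or zlib.crc32(payload) != int(crc):
            continue                      # corrupted copy: skip
        if record_id in seen:
            continue                      # duplicate from a retried append: skip
        seen.add(record_id)
        yield record_id, payload
```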
Question
Does the BigTable tablet server use “at least once append” for its operation log?
Caching
• GFS caches file metadata on clients
– Ex: chunk ID -> chunkservers
– Used as a hint: invalidate on use
– A 1 TB file => 16K chunks
• GFS does not cache file data on clients
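A sketch of the invalidate-on-use hint cache, assuming a hypothetical `master.lookup` call:

```python
class ChunkLocationCache:
    def __init__(self, master):
        self.master = master
        self.cache = {}   # chunk ID -> list of chunkservers (a hint; may be stale)

    def locations(self, chunk_id):
        # Use the cached hint if present, otherwise ask the master.
        if chunk_id not in self.cache:
            self.cache[chunk_id] = self.master.lookup(chunk_id)
        return self.cache[chunk_id]

    def invalidate(self, chunk_id):
        # Called when a read fails because the hint was stale;
        # the next access refetches fresh locations from the master.
        self.cache.pop(chunk_id, None)
```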
Garbage Collection
• File delete => rename to a hidden file
• Background task at the master
– Deletes hidden files
– Deletes any unreferenced chunks
• Simpler than foreground deletion
– What if a chunkserver is partitioned during the delete?
• Background GC is needed anyway
– Stale/orphaned chunks
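A much-simplified sketch of this lazy-deletion scheme, assuming the namespace is just a dict from file name to chunk-ID list and using an illustrative 3-day grace period:

```python
import time

HIDE_SECONDS = 3 * 24 * 3600   # illustrative grace period for hidden files

def delete_file(namespace, name):
    # Deletion is just a rename; the data stays reachable during the grace period.
    namespace[f".deleted.{name}.{int(time.time())}"] = namespace.pop(name)

def gc_pass(namespace, all_chunks_on_servers):
    now = time.time()
    # 1. Remove hidden files whose grace period has expired.
    for hidden in [n for n in namespace if n.startswith(".deleted.")]:
        if now - int(hidden.rsplit(".", 1)[1]) > HIDE_SECONDS:
            del namespace[hidden]
    # 2. Any chunk a chunkserver reports that no file references is garbage.
    referenced = {c for chunk_list in namespace.values() for c in chunk_list}
    return [c for c in all_chunks_on_servers if c not in referenced]
```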
Data Corruption
• Files are stored on Linux, and Linux has bugs
– Sometimes silent corruptions
• Files are stored on disks, and disks are not fail-stop
– Stored blocks can become corrupted over time
– Ex: writes to sectors on nearby tracks
– Rare events become common at scale
• Chunkservers maintain CRCs for each 64 KB block of a chunk
– Local log of CRC updates
– Verify CRCs before returning read data
– Periodic revalidation to detect background failures
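A sketch of this checksum scheme, using `zlib.crc32` as a stand-in for the chunkserver's CRC: one checksum per 64 KB block, and every block overlapping a read is verified before data is returned:

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksum granularity within a chunk

def block_crcs(chunk_data: bytes) -> list[int]:
    """CRC for each 64 KB block; updated as the chunk is written."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def read_verified(chunk_data: bytes, crcs: list[int], offset: int, length: int) -> bytes:
    """Verify every block overlapping the requested range before returning data."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != crcs[b]:
            raise IOError(f"corrupted block {b}: report to master and re-replicate")
    return chunk_data[offset:offset + length]
```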
Discussion
• Is this a good design?
• Can we improve on it?
• Will it scale to even larger workloads?
~15 years later
• Scale is much bigger:
– Now 10K servers instead of 1K
– Now 100 PB instead of 100 TB
• Bigger workload change: updates to small files!
• Around 2010: incremental updates of the Google search index
GFS -> Colossus
• GFS scaled to ~50 million files, ~10 PB
• Developers had to organize their apps around large append-only files (see BigTable)
• Latency-sensitive applications suffered
• GFS was eventually replaced with a new design, Colossus
Metadata scalability
• Main scalability limit: the single master stores all metadata
• HDFS has the same problem (single NameNode)
• Approach: partition the metadata among multiple masters
• The new system supports ~100M files per master and smaller chunk sizes: 1 MB instead of 64 MB
Reducing Storage Overhead
• Replication: 3x storage to keep two extra copies (tolerates two failures)
• Erasure coding is more flexible: m data pieces, n check pieces
– e.g., RAID-5: 2 data disks, 1 parity disk (XOR of the other two) => tolerates 1 failure with only 1.5x storage
• Sub-chunk writes are more expensive (read-modify-write)
• After a failure: get all the other pieces, regenerate the missing one
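A tiny illustration of the RAID-5 style XOR parity mentioned above: the check piece is the XOR of the data pieces, so any single lost piece can be rebuilt from the others.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Two data pieces plus one parity piece (1.5x storage, tolerates 1 failure).
d0, d1 = b"chunk-piece-0...", b"chunk-piece-1..."
parity = xor_blocks(d0, d1)

# If d1 is lost, regenerate it from the surviving piece and the parity.
recovered = xor_blocks(d0, parity)
assert recovered == d1
```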
Erasure Coding
• 3-way replication: 3x overhead, 2 failures tolerated, easy recovery
• Google Colossus: (6,3) Reed-Solomon code
– 1.5x overhead, 3 failures tolerated
• Facebook HDFS: (10,4) Reed-Solomon
– 1.4x overhead, 4 failures tolerated, expensive recovery
• Azure: more advanced code (12,4)
– 1.33x overhead, 4 failures tolerated, same recovery cost as Colossus
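These overhead figures follow directly from the code parameters: storing m data pieces plus n check pieces costs (m + n)/m times the logical data size. A quick check:

```python
def overhead(data_pieces: int, check_pieces: int) -> float:
    """Storage overhead of an (m, n) erasure code: (m + n) / m."""
    return (data_pieces + check_pieces) / data_pieces

print(overhead(1, 2))    # 3.0   -> 3-way replication (1 data copy + 2 extra)
print(overhead(6, 3))    # 1.5   -> Colossus (6,3) Reed-Solomon
print(overhead(10, 4))   # 1.4   -> Facebook HDFS (10,4)
print(overhead(12, 4))   # ~1.33 -> Azure (12,4)
```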