25
GFS Arvind Krishnamurthy (based on slides from Tom Anderson & Dan Ports)

Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFS

ArvindKrishnamurthy(basedonslidesfromTomAnderson

&DanPorts)

Page 2: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GoogleStack

– GFS:large-scalestorageforbulkdata– Chubby:Paxosstorageforcoordination– BigTable:semi-structureddatastorage–MapReduce:bigdatacomputationonkey-valuepairs

–MegaStore,Spanner:transactionalstoragewithgeo-replication

Page 3: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFS

• Needed: distributedfilesystemforstoringresultsofwebcrawlandsearchindex

• WhynotuseNFS?(Thatis,someexistingdistributedfilesystem.)

Page 4: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFS

• Needed: distributedfilesystemforstoringresultsofwebcrawlandsearchindex

• WhynotuseNFS?– verydifferentworkloadcharacteristics!– designGFSforGoogleapps,Googleappsfor GFS

• Requirements:– Faulttolerance,availability,throughput,scale– Concurrentstreamingreadsandwrites

Page 5: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFSWorkload

• Producer/consumer– Hundredsofwebcrawlingclients– PeriodicbatchanalyticjobslikeMapReduce– Throughput,notlatency

• Bigdatasets(forthetime):– 1000servers,300TBofdatastored

• Later:BigTable tabletlogandSSTables• Evenlater:Workloadnow?

Page 6: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFSWorkload

• Fewmillion100MB+files–Many arehuge

• Reads:–Mostlylargestreamingreads– Somesortedrandomreads

• Writes:–Mostfileswrittenonce,neverupdated–Mostwritesareappends,e.g.,concurrentworkers

Page 7: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFSInterface

• app-levellibrary– notakernelfilesystem– NotaPOSIXfilesystem

• create,delete,open,close,read,write,append–Metadataoperationsarelinearizable– Filedata eventuallyconsistent (stalereads)

• Inexpensivefile,directorysnapshots

Page 8: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

Lifewithoutrandomwrites

• Results of a previous crawl:www.page1.com -> www.my.blogspot.comwww.page2.com -> www.my.blogspot.com• New results: page2 no longer has the link, but there is a new

page, page3:www.page1.com -> www.my.blogspot.comwww.page3.com -> www.my.blogspot.com• Option: delete old record (page2); insert new record (page3)– requires locking, hard to implement• GFS: append new records to the file atomically

Page 9: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFSArchitecture

• eachfilestoredas64MBchunks

• eachchunkon3+chunkservers

• singlemasterstoresmetadata

Page 10: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

“Single”MasterArchitecture

• Masterstoresmetadata:– Filenamespace,filename->chunklist– chunkID->listofchunkserversholdingit– Metadatastoredinmemory(~64B/chunk)

• Masterdoesnot storefile contents– Allrequestsforfiledatagodirectlytochunkservers

• Hotstandbyreplicationusingshadowmasters– Fastrecovery

• Allmetadataoperationsarelinearizable

Page 11: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

MasterFaultTolerance

• Onemaster,setofreplicas–MasterchosenbyChubby

• Masterlogs(some)metadataoperations– Changestonamespace,ACLs,file->chunkIDs– NotchunkID->chunkserver;whynot?

• Replicateoperationsatshadowmastersandlogtodisk,thenexecuteop

• Periodiccheckpointofmasterin-memorydata– Allowsmastertotruncatelog,speedrecovery– Checkpointproceedsinparallelwithnewops

Page 12: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

HandlingWriteOperations

• Mutationiswriteorappend• Goal:minimizemasterinvolvement

• Leasemechanism– Masterpicksonereplicaasprimary;givesitalease

– Primarydefinesaserialorderofmutations

• Dataflowdecoupledfromcontrolflow

Page 13: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

WriteOperations

• Applicationoriginateswriterequest

• GFSclienttranslatesrequestfrom(fname,data)-->(fname,chunk-index)sendsittomaster

• Masterrespondswithchunkhandleand(primary+secondary)replicalocations

• Clientpusheswritedatatoalllocations;dataisstoredinchunkservers’internalbuffers

• Clientsendswritecommandtoprimary

Page 14: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

WriteOperations(contd.)• Primarydeterminesserialorderfordatainstancesstoredinitsbufferandwritestheinstancesinthatordertothechunk

• Primarysendsserialordertothesecondariesandtellsthemtoperformthewrite

• Secondariesrespondtotheprimary

• Primaryrespondsbacktoclient

• Ifwritefailsatoneofthechunkservers,clientisinformedandretriesthewrite/append,butanotherclientmayreadstaledatafromchunkserver

Page 15: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

AtLeastOnceAppend

• Iffailureatprimaryoranyreplica,retryappend(atnewoffset)– Appendwilleventuallysucceed!–Maysucceedmultipletimes!

• Appclientlibraryresponsiblefor– Detectingcorruptedcopiesofappendedrecords– Ignoringextracopies(duringstreamingreads)

• Whynotappendexactlyonce?

Page 16: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

Question

DoestheBigTable tabletserveruse“atleastonceappend”foritsoperationlog?

Page 17: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

Caching

• GFScachesfilemetadataonclients– Ex:chunkID->chunkservers– Usedasahint:invalidateonuse– TBfile=>16Kchunks

• GFSdoesnotcache filedata onclients

Page 18: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GarbageCollection

• Filedelete=>renametoahiddenfile• Backgroundtaskatmaster– Deleteshiddenfiles– Deletesanyunreferencedchunks

• Simplerthanforegrounddeletion–Whatifchunkserverispartitionedduringdelete?

• NeedbackgroundGCanyway– Stale/orphanchunks

Page 19: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

DataCorruption

• FilesstoredonLinux,andLinuxhasbugs– Sometimessilentcorruptions

• Filesstoredondisk,anddisksarenotfail-stop– Storedblockscanbecomecorruptedovertime– Ex:writestosectorsonnearbytracks– Rareeventsbecomecommonatscale

• Chunkservers maintainper-chunkCRCs(64KB)– LocallogofCRCupdates– VerifyCRCsbeforereturningreaddata– Periodicrevalidationtodetectbackgroundfailures

Page 20: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

Discussion

• Isthisagooddesign?• Canweimproveonit?• Willitscaletoevenlargerworkloads?

Page 21: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

~15yearslater

• Scaleismuchbigger:– now10Kserversinsteadof1K– now100PBinsteadof100TB

• Biggerworkloadchange:updatestosmallfiles!• Around2010:incrementalupdatesoftheGooglesearch index

Page 22: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

GFS->Colossus

• GFSscaledto~50millionfiles,~10PB• Developershadtoorganizetheirappsaroundlargeappend-onlyfiles(seeBigTable)

• Latency-sensitiveapplicationssuffered• GFSeventuallyreplacedwithanewdesign,Colossus

Page 23: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

Metadatascalability

• Mainscalabilitylimit:singlemasterstoresallmetadata

• HDFShassameproblem(singleNameNode)• Approach:partitionthemetadataamongmultiplemasters

• Newsystemsupports~100Mfilespermasterandsmallerchunksizes:1MBinsteadof64MB

Page 24: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

ReducingStorageOverhead

• Replication:3xstoragetohandletwocopies

• Erasurecodingmoreflexible:mpieces,ncheckpieces

– e.g.,RAID-5:2disks,1paritydisk(XORofothertwo)=>1failurew/only1.5storage

• Sub-chunkwritesmoreexpensive(read-modify-write)

• Afterafailure: getalltheotherpieces,generatemissingone

Page 25: Arvind Krishnamurthy (based on slides from Tom Anderson ...€¦ · Life without random writes • Results of a previous crawl: > > • New results: page2 no longer has the link,

ErasureCoding

• 3-wayreplication:3xoverhead,2failurestolerated,easyrecovery

• GoogleColossus:(6,3)Reed-Solomoncode1.5xoverhead,3failures

• FacebookHDFS:(10,4)Reed-Solomon1.4xoverhead,4failures,expensiverecovery

• Azure:moreadvancedcode(12,4)1.33x,4failures,samerecoverycostasColossus