DHT2 - O Brother, Where Art Thou with Shyam Ranganathan

DHT2 - O Brother, Where Art Thou?
Shyamsundar Ranganathan
Developer

Session aims to explore... "The hypothetical treasure at the end of the journey"

Why DHT2 "The plan..."
DHT2 design "Known adventures along the way!"
Challenges in DHT2 "The strange characters"
Challenges because of DHT2 "Trouble escaping the chain gang!"
Where are we with DHT2

Loosely inspired by the movie: https://en.wikipedia.org/wiki/O_Brother,_Where_Art_Thou%3F

Why DHT2

DHT pitfalls
  Directories on all subvolumes
  Layout per directory
  Rebalance IO path handling and nonoptimal data movement
This impacts scale and correctness!

Correctness can be addressed in DHT,
  Broader locking semantics for dentry operations
  Possibly single layout adoption
But, increases complexity and could cost performance!

With DHT2 the goal is to fix all of the above, retaining or improving performance

DHT2 Design: The file system objects

View the file system as a collection of related objects
  "wait a second... isn't that what inodes and data pointers are?"
  Yes, but they are not distributed!

Directory objects denote hierarchy, storing <name, inode#> tables

File object maintains inode related metadata
Actual file data is maintained in data object(s)

The file system objects (example)

Client View
. ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1

[Figure: the client view mapped onto objects. Directory objects (Dir1, Dir2), file objects (inodes/dinodes), and data objects (file data) are shown in three views: the different objects segregated by type, the namespace hierarchy representation, and the data association.]
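The object split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DHT2 code: the class names, the `resolve` helper, the short hex GFIDs, and the root GFID `0001` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DirObject:
    """Directory object: a <name, inode#> table (the inode# is a GFID here)."""
    gfid: str
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class FileObject:
    """File object: inode-related metadata only, no file data."""
    gfid: str
    size: int = 0
    mode: int = 0o644


@dataclass
class DataObject:
    """Data object: the actual file contents, keyed by the file's GFID."""
    gfid: str
    data: bytes = b""


# Client view: root -> {Dir1 -> {Dir2, File2}, File1}. GFIDs are illustrative.
ROOT_GFID = "0001"  # hypothetical root GFID
root = DirObject(ROOT_GFID, {"Dir1": "BA11", "File1": "00EF"})
dir1 = DirObject("BA11", {"Dir2": "7525", "File2": "BAC5"})
dir2 = DirObject("7525")
file2 = FileObject("BAC5", size=5)
data2 = DataObject("BAC5", b"hello")

dirs = {d.gfid: d for d in (root, dir1, dir2)}


def resolve(path: str) -> Optional[str]:
    """Walk directory objects name by name; return the final GFID or None."""
    gfid = ROOT_GFID
    for name in filter(None, path.split("/")):
        directory = dirs.get(gfid)
        if directory is None or name not in directory.entries:
            return None
        gfid = directory.entries[name]
    return gfid
```

The point of the sketch is that the hierarchy lives only in the directory tables; the file object and the data object are found by GFID alone.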

DHT2 Design: Distribution details

Distribute inodes using GFID in the metadata ring
  No hierarchy, a directory object lives only on one subvolume

Use GFID as the data object# in the data ring

Distribution is hence not name dependent, and we just use a single layout per ring
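A minimal sketch of name-independent placement, assuming a plain hash of the GFID as a stand-in placement function (the deck's actual scheme uses buckets encoded in the GFID). The `Ring` class, the subvolume names, and the short GFIDs are hypothetical.

```python
import hashlib
from typing import Dict, List


class Ring:
    """A ring with a single layout: objects are placed by GFID alone."""

    def __init__(self, subvolumes: List[str]) -> None:
        self.subvolumes = list(subvolumes)

    def locate(self, gfid: str) -> str:
        # Stand-in placement: hash the GFID and pick a subvolume.
        digest = hashlib.sha256(gfid.encode("ascii")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.subvolumes)
        return self.subvolumes[index]


metadata_ring = Ring(["meta-0", "meta-1"])          # few bricks
data_ring = Ring([f"data-{i}" for i in range(8)])   # many bricks

# A rename only edits the parent's <name, GFID> table.
dir1_entries: Dict[str, str] = {"File2": "BAC5", "Dir2": "7525"}
before = (metadata_ring.locate("BAC5"), data_ring.locate("BAC5"))

dir1_entries["Renamed"] = dir1_entries.pop("File2")
gfid = dir1_entries["Renamed"]
after = (metadata_ring.locate(gfid), data_ring.locate(gfid))
```

Because the name never enters `locate`, `before` and `after` are the same pair of subvolumes: the rename moves no object in either ring.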

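Distribution details (example)

[Figure: the directory objects (BA11, 7525) and file objects are placed on the Metadata Ring (few bricks); the data objects are placed on the Data Ring (many bricks). Names are switched to GFIDs, and the name is added to the dinodes. Directory entries shown: <File1, 00EF>, <Dir1, BA11>, <File2, BAC5>, <Dir2, 7525>.]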

DHT2 Design: Distribution details (contd.)

Layout is based on bucket to subvolume assignment
  Where, buckets >> subvolumes
  Bucket ID is encoded into first n bytes of the GFID
  Trivial GFID based operations

Colocates file object with parent object
  File object# statically inherits parent directory# bucket ID
  Optimized readdirp and lookup operations (no hopping unless non-trivially renamed, or a link file)
  IOW, optimized (pGFID, basename) based operations
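The bucket encoding above can be sketched as follows. This is an illustration under stated assumptions, not the DHT2 implementation: n is taken to be 2 bytes, the GFID is treated as a 16-byte UUID with its leading bytes overwritten, and all function names are hypothetical.

```python
import uuid
from typing import Dict, List

BUCKET_BYTES = 2                        # assumption: "first n bytes", n = 2
NUM_BUCKETS = 1 << (8 * BUCKET_BYTES)   # buckets >> subvolumes


def new_gfid(bucket_id: int) -> bytes:
    """Return a 16-byte GFID whose first BUCKET_BYTES bytes hold bucket_id."""
    if not 0 <= bucket_id < NUM_BUCKETS:
        raise ValueError("bucket id out of range")
    prefix = bucket_id.to_bytes(BUCKET_BYTES, "big")
    return prefix + uuid.uuid4().bytes[BUCKET_BYTES:]


def bucket_of(gfid: bytes) -> int:
    """Recover the bucket ID directly from the GFID."""
    return int.from_bytes(gfid[:BUCKET_BYTES], "big")


def new_file_gfid(parent_gfid: bytes) -> bytes:
    """A file object inherits its parent directory's bucket ID."""
    return new_gfid(bucket_of(parent_gfid))


def build_layout(subvols: List[str]) -> Dict[int, str]:
    """Hypothetical round-robin bucket to subvolume assignment."""
    return {b: subvols[b % len(subvols)] for b in range(NUM_BUCKETS)}


def subvolume_for(gfid: bytes, layout: Dict[int, str]) -> str:
    return layout[bucket_of(gfid)]


layout = build_layout(["subvol-0", "subvol-1", "subvol-2"])
dir_gfid = new_gfid(0x7525)
file_gfid = new_file_gfid(dir_gfid)
```

With this shape, a (pGFID, basename) lookup can go straight to the subvolume that owns the parent's bucket, since the file object sits in the same bucket unless it was renamed across directories or is a link file.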

[Figure, built up in four steps over the directory objects (7525, BA11), a data object, the buckets, and the bricks/subvols:
  Add bricks/subvolumes
  Assign buckets to bricks
  Place directories based on bucket encoded in the GFID
  Colocate the files under a directory with the same bucket ID]

DHT2 Design: Rebalance

Reassign buckets to/from newer/removed subvolumes
  fix-layout is instantaneous
  Files travel with directories (same bucket colocation)

Expand the cluster, but perform no rebalance
  aka just add-brick and let min-free-disk+link-to do its job
  This is the tough one, use layout versions/histories to pull this off

Split DHT2 into client-server pieces
  Handle IO traffic, locking during rebalance
  Better consistency model for transactions

Ability to have different expansion strategies for the 2 rings
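The bucket reassignment idea can be sketched as a pure table update. The balancing policy below (move just enough buckets to even out the counts) is an assumption chosen for the example, and the function and subvolume names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_buckets(num_buckets: int, subvols: List[str]) -> Dict[int, str]:
    """Initial round-robin bucket to subvolume assignment."""
    return {b: subvols[b % len(subvols)] for b in range(num_buckets)}


def add_subvolume(layout: Dict[int, str], new_subvol: str) -> Dict[int, str]:
    """Return a new layout with an even share of buckets moved to new_subvol."""
    new_layout = dict(layout)
    counts = Counter(layout.values())
    target = len(layout) // (len(counts) + 1)  # fair share per subvolume
    moved = 0
    for bucket in sorted(layout):
        if moved >= target:
            break
        owner = new_layout[bucket]
        if counts[owner] > target:
            # Only the table entry changes here; no directory is consulted.
            new_layout[bucket] = new_subvol
            counts[owner] -= 1
            moved += 1
    return new_layout


old_layout = assign_buckets(12, ["s0", "s1", "s2"])
new_layout = add_subvolume(old_layout, "s3")
moved_buckets = [b for b in old_layout if old_layout[b] != new_layout[b]]
```

Only objects whose bucket appears in `moved_buckets` would need to migrate, and a directory moves together with its files because they share a bucket.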

Challenges in DHT2

Rename ELOOP checking requires hierarchy
  Object backpointers

Time and size information should be in sync between data and metadata objects
  Dirty inode, tracked via open fd

Orphan GFID cleanup
  Enter transactions/journals!

Directories as files/in a DB
  Reduce local FS inode proliferation
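To illustrate why the rename check needs backpointers: with directory objects placed by GFID, the only way to learn a directory's ancestors is to follow a stored parent pointer upward. A minimal sketch, assuming each directory object records its parent's GFID in a `parent_of` map (hypothetical), with the errno chosen to mirror the slide's wording:

```python
import errno
from typing import Dict, Optional

ROOT = "0001"  # hypothetical root GFID


def check_rename_loop(parent_of: Dict[str, Optional[str]],
                      src: str, new_parent: str) -> None:
    """Raise OSError(ELOOP) if moving directory src under new_parent
    would make src its own ancestor."""
    seen = set()
    current: Optional[str] = new_parent
    # Walk backpointers from the destination up towards the root.
    while current is not None and current not in seen:
        if current == src:
            raise OSError(errno.ELOOP,
                          "rename would move a directory under itself")
        seen.add(current)
        current = parent_of.get(current)


# Backpointers for root -> Dir1 (BA11) -> Dir2 (7525).
parent_of = {ROOT: None, "BA11": ROOT, "7525": "BA11"}
```

Moving `7525` under the root passes the check, while moving `BA11` under `7525` raises. In a distributed setting each step of the walk may touch a different subvolume, which is part of what makes this a challenge.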

Challenges because of DHT2

IO path cannot depend on hierarchy (Ex: quota)
Quick-read cannot fetch data in lookups
Anon-fd based operations cannot track dirty inodes
Others
  Will changelog play well?
  EC has to bother with only data?
  Tier may need a rethink
  Sharding may accrue cost of missing anon-fd and data/metadata split of shards

Unknowns!

Where are we with DHT2

Introduced DHT Version 2 in Barcelona summit, 2015

Followed up with 2 discussions upstream on core concepts [1] [2]

Followed up with a POC and some slides/documents to demonstrate the concepts [3]

In limbo since then, but not out of the picture yet!

Targeting an experimental release with 4.0

Questions?

"The treasure you seek shall not be the treasure you find."

References

[1] DHT2 Design Discussion: https://goo.gl/tLpqJO
[2] DHT2 Design Discussion, Round 2: https://goo.gl/dCAO36
[3] POC trail…: http://www.gluster.org/pipermail/gluster-devel/2015-August/046369.html

Other threads of interest:

- http://www.gluster.org/pipermail/gluster-devel/2016-March/048874.html

- http://www.gluster.org/pipermail/gluster-devel/2015-November/047098.html

- http://www.gluster.org/pipermail/gluster-devel/2015-September/046630.html
