DHT2 - O Brother, Where Art Thou with Shyam Ranganathan

DHT2 - O Brother, Where Art Thou?
Shyamsundar Ranganathan, Developer

Session aims to explore... "The hypothetical treasure at the end of the journey"

- Why DHT2 "The plan..."
- DHT2 design "Known adventures along the way!"
- Challenges in DHT2 "The strange characters"
- Challenges because of DHT2 "Trouble escaping the chain gang!"
- Where are we with DHT2

Loosely inspired by the movie: https://en.wikipedia.org/wiki/O_Brother,_Where_Art_Thou%3F

Why DHT2

DHT pitfalls:
- Directories on all subvolumes
- Layout per directory
- Rebalance IO path handling and nonoptimal data movement

This impacts scale and correctness!

Correctness can be addressed in DHT:
- Broader locking semantics for dentry operations
- Possibly single layout adoption

But, increases complexity and could cost performance!

With DHT2 the goal is to fix all of the above, retaining or improving performance

DHT2 Design: The file system objects

- View the file system as a collection of related objects
  - "wait a second... isn't that what inodes and data pointers are?"
  - Yes, but they are not distributed!
- Directory objects denote hierarchy, storing <name,inode#> tables
- File object maintains inode related metadata
- Actual file data is maintained in data object(s)
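The object split described above can be sketched as three small tables. This is a minimal illustration only: the class names, the integer inode numbers, and the `resolve`/`read_file` helpers are hypothetical and not taken from the DHT2 code; the tree matches the client view used in the examples that follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DirObject:
    """A directory is just a table of <name, inode#> entries."""
    entries: Dict[str, int] = field(default_factory=dict)


@dataclass
class FileObject:
    """Inode-like metadata; the bytes live in separate data objects."""
    size: int = 0
    data_ids: List[int] = field(default_factory=list)


ROOT_INO = 1  # hypothetical inode numbers throughout

# root -> {Dir1 -> {Dir2, File2}, File1}
dirs = {
    1: DirObject({"Dir1": 2, "File1": 3}),
    2: DirObject({"Dir2": 4, "File2": 5}),
    4: DirObject(),
}
files = {3: FileObject(5, [100]), 5: FileObject(3, [101])}
data = {100: b"hello", 101: b"abc"}


def resolve(path: str) -> int:
    """Walk the path one component at a time through the directory tables."""
    ino = ROOT_INO
    for name in filter(None, path.split("/")):
        ino = dirs[ino].entries[name]
    return ino


def read_file(path: str) -> bytes:
    """Name -> file object (metadata) -> data object(s)."""
    fobj = files[resolve(path)]
    return b"".join(data[d] for d in fobj.data_ids)
```

Nothing in the three tables requires them to live on the same machine, which is the point of treating them as separate objects.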

The file system objects (example)

Client View
. ('root')
├── Dir1
│   ├── Dir2
│   └── File2
└── File1

[Figure: the client view above drawn as objects. Legend: Dir Object, File Object, Data Object. Panels: inodes/dinode, File data. Objects shown: root, Dir1, Dir2, File1, File2, and their data. Successive builds of the slide are captioned:]
- The different objects, segregated by type
- Namespace hierarchy representation
- Data association

DHT2 Design: Distribution details

- Distribute inodes using GFID in the metadata ring
  - No hierarchy, a directory object lives only on one subvolume
- Use GFID as the data object# in the data ring
- Distribution is hence not name dependent, and we just use a single layout per ring
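A small sketch of what "not name dependent" buys: the GFID alone selects a subvolume in each ring, so a rename only touches the parent's <name, GFID> entry. The ring contents and the CRC-based `place` function are stand-ins chosen for illustration, not the placement DHT2 actually uses.

```python
import zlib

# Hypothetical subvolume names; the slides only say "few" and "many" bricks.
METADATA_RING = ["meta-0", "meta-1"]
DATA_RING = ["data-0", "data-1", "data-2", "data-3"]


def place(gfid: str, ring: list) -> str:
    """Stand-in placement: hash the GFID onto the ring's single layout."""
    return ring[zlib.crc32(gfid.encode()) % len(ring)]


def locate(gfid: str) -> dict:
    """The same GFID keys both rings; the file name never enters into it."""
    return {"inode": place(gfid, METADATA_RING), "data": place(gfid, DATA_RING)}


# A rename changes only the directory table entry, not the GFID.
root_entries = {"File1": "00EF"}
where_before = locate(root_entries["File1"])
root_entries["File1-renamed"] = root_entries.pop("File1")
where_after = locate(root_entries["File1-renamed"])
```

Here `where_before` and `where_after` are identical, so no object has to move and no link file is needed for a simple rename.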

Distribution details (example)

[Figure: the example objects placed on two rings, the Metadata Ring (few bricks) and the Data Ring (many bricks). Legend: Dir Object, File Object, Data Object. GFIDs shown: 0001, 00EF, BA11, BAC5, 7525. Directory entries shown: <File1, 00EF> <Dir1, BA11> and <File2, BAC5> <Dir2, 7525>. Caption: Switch names to GFID, add name to dinodes]

DHT2 Design: Distribution details (contd.)

- Layout is based on bucket to subvolume assignment
  - Where, buckets >> subvolumes
  - Bucket ID is encoded into first n bytes of the GFID
  - Trivial GFID based operations
- Collocates file object with parent object
  - File object# statically inherits parent directory# bucket ID
  - Optimized readdirp and lookup operations (no hopping unless non-trivially renamed, or a link file)
  - IOW, optimized (pGFID, basename) based operations
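The bucket scheme can be sketched with the 4-hex-digit GFIDs from the example slides. Several things here are assumptions made for illustration: n = 1 byte for the bucket ID, 256 buckets, round-robin bucket assignment, the subvolume names, and a random bucket for each new directory (the slides do not say how a directory's bucket is picked).

```python
import secrets

BUCKET_BYTES = 1  # the "first n bytes" of the GFID; n = 1 is an assumption
GFID_BYTES = 2    # shortened to match the 4-hex-digit GFIDs on the slides

SUBVOLS = ["meta-0", "meta-1", "meta-2"]  # hypothetical subvolume names

# Layout: bucket ID -> subvolume, with buckets >> subvolumes.
layout = {f"{b:02X}": SUBVOLS[b % len(SUBVOLS)] for b in range(256)}


def bucket_of(gfid: str) -> str:
    """The bucket ID is simply the leading bytes of the GFID."""
    return gfid[: 2 * BUCKET_BYTES].upper()


def subvol_for(gfid: str) -> str:
    """Trivial GFID based lookup: bucket, then layout table."""
    return layout[bucket_of(gfid)]


def new_dir_gfid() -> str:
    """Assumption: a new directory gets a random GFID, hence a random bucket."""
    return secrets.token_hex(GFID_BYTES).upper()


def new_file_gfid(parent_gfid: str) -> str:
    """A file statically inherits its parent directory's bucket ID."""
    suffix = secrets.token_hex(GFID_BYTES - BUCKET_BYTES).upper()
    return bucket_of(parent_gfid) + suffix
```

With the example GFIDs, `subvol_for("BAC5")` (File2) and `subvol_for("BA11")` (Dir1) land on the same subvolume, so a lookup or readdirp on Dir1 can stat File2 without another hop.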

Distribution details (example)

[Figure: the same two-ring picture (Metadata Ring, few bricks; Data Ring, many bricks) with Bricks/Subvols and Buckets added. GFIDs shown: 0001, 00EF, BA11, BAC5, 7525. Directory entries shown: <File1, 00EF> <Dir1, BA11> and <File2, BAC5> <Dir2, 7525>. Bucket labels shown: 00, 75, BA. Successive builds of the slide are captioned:]
- Add bricks/subvolumes
- Assign buckets to bricks
- Place directories based on bucket encoded in the GFID
- Colocate the files under a directory with the same bucket ID

DHT2 Design: Rebalance

- Reassign buckets to/from newer/removed subvolumes
  - fix-layout is instantaneous
  - Files travel with directories (same bucket colocation)
- Expand the cluster, but perform no rebalance
  - aka just add-brick and let min-free-disk+link-to do its job
  - This is the tough one, use layout versions/histories to pull this off?
- Split DHT2 into client-server pieces
  - Handle IO traffic, locking during rebalance
  - Better consistency model for transactions
- Ability to have different expansion strategies for the 2 rings
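Rebalance as bucket reassignment can be sketched as a pure table update. The bucket count, the subvolume names, and the "move just enough buckets to even out" policy are assumptions for illustration; the slides do not specify which buckets move.

```python
from collections import Counter


def assign(buckets, subvols):
    """Initial layout: spread buckets round-robin over the subvolumes."""
    return {b: subvols[i % len(subvols)] for i, b in enumerate(buckets)}


def add_subvol(layout, new_subvol):
    """Return a new layout plus the list of buckets handed to new_subvol.

    Only the bucket -> subvolume table changes here; the objects in the
    moved buckets would still have to be migrated afterwards.
    """
    layout = dict(layout)  # leave the caller's layout untouched
    subvols = set(layout.values()) | {new_subvol}
    target = len(layout) // len(subvols)  # fair share for the new subvolume
    counts = Counter(layout.values())
    moved = []
    for bucket in sorted(layout):
        if len(moved) == target:
            break
        owner = layout[bucket]
        if counts[owner] > target:  # only take from over-full subvolumes
            counts[owner] -= 1
            layout[bucket] = new_subvol
            moved.append(bucket)
    return layout, moved


buckets = [f"{b:02X}" for b in range(256)]
layout = assign(buckets, ["subvol-0", "subvol-1", "subvol-2"])
new_layout, moved = add_subvol(layout, "subvol-3")
```

Because a directory and its files share a bucket, every bucket in `moved` carries whole directories with it, and buckets not in `moved` keep their owner.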

Challenges in DHT2

- Rename ELOOP checking requires hierarchy
  - Object backpointers
- Time and size information should be in sync between data and metadata objects
  - Dirty inode, tracked via open fd
- Orphan GFID cleanup
  - Enter transactions/journals!
- Directories as files/in a DB
  - Reduce local FS inode proliferation
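The first challenge above can be sketched as a walk over backpointers. This assumes each directory object stores its parent's GFID; the `parent_of` table, the function name, and the error message are hypothetical, and the GFIDs are the ones from the example slides.

```python
import errno

ROOT = "0001"

# Backpointers: directory GFID -> parent directory GFID (hypothetical store).
# root (0001) contains Dir1 (BA11), which contains Dir2 (7525).
parent_of = {"BA11": "0001", "7525": "BA11"}


def check_rename(src_dir: str, new_parent: str, parent_of: dict) -> None:
    """Raise ELOOP if moving src_dir under new_parent would create a cycle.

    Walks from new_parent up to the root; meeting src_dir on the way means
    the destination lies inside the directory being moved.
    """
    cur = new_parent
    while True:
        if cur == src_dir:
            raise OSError(errno.ELOOP, "destination is inside the source directory")
        if cur == ROOT:
            return
        cur = parent_of[cur]
```

Moving Dir2 up to the root passes the check, while moving Dir1 under Dir2 raises. In a distributed setting each step of the walk may land on a different subvolume, which is part of what makes this check costly without a hierarchy on disk.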

Challenges because of DHT2

- IO path cannot depend on hierarchy (Ex: quota)
- Quick-read cannot fetch data in lookups
- Anon-fd based operations cannot track dirty inodes
- Others
  - Will changelog play well!
  - EC has to bother with only data?
  - Tier may need a rethink
  - Sharding may accrue cost of missing anon-fd and data/metadata split of shards
- Unknowns!

Where are we with DHT2

- Introduced DHT Version 2 in Barcelona summit, 2015
- Followed up with 2 discussions upstream on core concepts [1] [2]
- Followed up with a POC and some slides/documents to demonstrate the concepts [3]
- In a limbo since then, but not out of the picture yet!
- Targeting an experimental release with 4.0

Questions?

"The treasure you seek shall not be the treasure you find."

References

[1] DHT2 Design Discussion: https://goo.gl/tLpqJO
[2] DHT2 Design Discussion, Round 2: https://goo.gl/dCAO36
[3] POC trail…: http://www.gluster.org/pipermail/gluster-devel/2015-August/046369.html

Other threads of interest:

- http://www.gluster.org/pipermail/gluster-devel/2016-March/048874.html

- http://www.gluster.org/pipermail/gluster-devel/2015-November/047098.html

- http://www.gluster.org/pipermail/gluster-devel/2015-September/046630.html