Openstack swift 果凍

Openstack swift, how does it work?


Page 1: Openstack swift, how does it work?

OpenStack Swift, by 果凍

Page 2: Openstack swift, how does it work?

About the speaker

● Works at 迎棧科技
  ○ Formerly the cloud application R&D center of 迎廣科技
● Python, Django, Linux, OpenStack, Docker
● Dabbles in Scala in spare time

● http://about.me/ya790206

Page 3: Openstack swift, how does it work?

swift

1. Swift is a highly available, distributed, eventually consistent object/blob store.

2. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.

3. Written in Python.

4. Microservice architecture.

Page 4: Openstack swift, how does it work?

HDFS vs. Swift

● Swift is designed with multi-tenancy in mind, whereas HDFS has no notion of multi-tenancy.

● HDFS uses a central system (the Namenode) to maintain file metadata, whereas in Swift the metadata is distributed and replicated across the cluster.

Page 5: Openstack swift, how does it work?

Ceph vs. Swift

Ceph
  ○ Started in 2006
  ○ Written in C++
  ○ Strongly consistent
  ○ Block storage
  ○ Object storage

Swift
  ○ Started in 2008
  ○ Written in Python
  ○ Eventually consistent
  ○ Object storage
  ○ In production on really large public clouds

Page 6: Openstack swift, how does it work?

Strongly consistent vs Eventually consistent

● Eventually consistent
  ○ If no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.

● Strongly consistent
  ○ If you successfully write a value to a key in a strongly consistent system, the next successful read of that key is guaranteed to show that write.

Page 7: Openstack swift, how does it work?
Page 8: Openstack swift, how does it work?

swift data hierarchy

● account
  ○ In the OpenStack environment, an account is synonymous with a project or tenant.
● container
  ○ Namespace for objects.
  ○ Access control list.
● object

Page 9: Openstack swift, how does it work?

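Common Swift process names

● proxy: accepts requests from clients and decides, based on each request, which server to forward it to.

● server: accepts requests from the proxy and handles CRUD. Some server processes also handle requests from replicators.

● updater: processes requests to update the database asynchronously.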

Page 10: Openstack swift, how does it work?

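Common Swift process names (continued)

● auditor: when the current hash of the data differs from the previously recorded hash, moves the data to the quarantine area (quarantined).

● replicator: when the local and remote data hashes differ, pushes the local data to the remote host (without overwriting files that already exist).

● reaper: handles deletion, for example removing the data of a deleted account.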

Page 11: Openstack swift, how does it work?

The ring concept

● Hash the account, container, and object names to produce a value.

● If the value falls between 0 and 33, the data goes to host A.

A: 0~33

B: 33~66

C: 66~99
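The range lookup above can be sketched in a few lines of Python. This is only an illustration of the idea, not Swift's code: the function names, the use of MD5 modulo 100, and the half-open boundaries (33 belongs to B, 66 to C) are assumptions made here to resolve the overlapping numbers on the slide.

```python
import hashlib

# Ranges from the slide: A holds 0~33, B holds 33~66, C holds 66~99.
# Assumption: ranges are half-open, so 33 belongs to B and 66 to C.
RANGES = [("A", 0, 33), ("B", 33, 66), ("C", 66, 100)]


def hash_value(account: str, container: str, obj: str) -> int:
    """Hash the three names into a value in 0..99 (illustrative only)."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    return int(hashlib.md5(path).hexdigest(), 16) % 100


def pick_host(value: int) -> str:
    """Return the host whose range [low, high) contains value."""
    for host, low, high in RANGES:
        if low <= value < high:
            return host
    raise ValueError(f"value {value} is outside 0..99")
```

Calling `pick_host(hash_value("AUTH_test", "photos", "cat.jpg"))` returns one of A, B, or C; the same names always map to the same host.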

Page 12: Openstack swift, how does it work?

The hash algorithm decides which node stores each piece of data

[Figure: data1, data2, data3, and data4 mapped onto node1, node2, and node3]

Page 13: Openstack swift, how does it work?

If a new host joins the ring

● A lot of data has to move so that it ends up on the correct node.

A: 0~25

B: 25~50

C: 50~75

D: 75~99
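To get a feel for how much data moves, the sketch below counts how many of the 100 hash values change owner when the value space is re-split evenly from three hosts to four. The helper names are hypothetical, and the even split puts the three-host boundaries at 34 and 67 rather than exactly at the numbers on the slide.

```python
def owner(value: int, node_count: int) -> int:
    """Index of the node owning value when 0..99 is split into equal ranges."""
    return value * node_count // 100


def moved_values(old_nodes: int, new_nodes: int) -> int:
    """Number of the 100 hash values whose owner changes."""
    return sum(
        1 for value in range(100)
        if owner(value, old_nodes) != owner(value, new_nodes)
    )


# Going from 3 to 4 nodes relocates about half of the value space, although
# the new node only needs a quarter of it.
print(moved_values(3, 4))  # 51
```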

Page 14: Openstack swift, how does it work?

How to avoid moving large amounts of data after a host is added

● All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.

Page 15: Openstack swift, how does it work?

The Swift ring concept

Q: How is it decided which partition a piece of data goes to?
A: By the hash algorithm.

Q: How is it decided which nodes a partition goes to?
A: The ring records which partitions are placed on which node.

[Figure: data1 through data5 mapped onto virtual partitions 1 through 4, which are in turn mapped onto node1, node2, and node3]
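The two questions above describe a two-step lookup, which can be sketched as follows. The partition count, the table contents, and the names `partition_of` and `node_of` are made up for illustration. The point is that adding a node only edits the table; the hash-to-partition step never changes.

```python
import hashlib

PART_COUNT = 8  # hypothetical; a real ring has 2**part_power partitions


def partition_of(name: str) -> int:
    """Step 1: the hash alone decides the partition; nodes are not involved."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16) % PART_COUNT


# Step 2: the ring is a lookup table from partition to node.
ring = {0: "node1", 1: "node2", 2: "node3", 3: "node1",
        4: "node2", 5: "node3", 6: "node1", 7: "node2"}


def node_of(name: str, table: dict) -> str:
    """Resolve a data name to a node through its partition."""
    return table[partition_of(name)]


# Adding node4 reassigns only the chosen partitions; all other entries,
# and therefore all data in those partitions, stay where they are.
new_ring = dict(ring)
new_ring[3] = "node4"
new_ring[7] = "node4"
moved = [p for p in ring if ring[p] != new_ring[p]]
```

Here `moved` is `[3, 7]`: two of the eight partitions relocate, and data in the other six is untouched.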

Page 16: Openstack swift, how does it work?

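Ceph

● Like Swift, Ceph does not place data directly by hash value. Data is first assigned to a placement group (PG), and the PG then determines which node it is stored on.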

Page 17: Openstack swift, how does it work?

ceph

● Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. When a Ceph Client stores objects, CRUSH will map each object to a placement group.

Page 18: Openstack swift, how does it work?

Ceph

● Mapping objects to placement groups creates a layer of indirection between the Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph Client “knew” which Ceph OSD Daemon had which object, that would create a tight coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online. The following diagram depicts how CRUSH maps objects to placement groups, and placement groups to OSDs.

Page 19: Openstack swift, how does it work?

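Creating a ring file

swift-ring-builder object.builder create 10 3 1

swift-ring-builder container.builder create 10 3 1

swift-ring-builder account.builder create 10 3 1

10: 2^10 virtual partitions

3: the data is kept in three replicas

1: at least one hour must pass before a partition can be moved again by a rebalance (min_part_hours)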

Page 20: Openstack swift, how does it work?

● swift-ring-builder object.builder add r1z1-127.0.0.1:6010/sdb1 1

● swift-ring-builder object.builder add r1z2-127.0.0.1:6020/sdb2 1

● swift-ring-builder object.builder add r1z3-127.0.0.1:6030/sdb3 1

● swift-ring-builder object.builder add r1z4-127.0.0.1:6040/sdb4 1

Adding a node (a machine that actually stores data)
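Each add command takes a device spec of the form r<region>z<zone>-<ip>:<port>/<device> followed by a weight. As a reading aid, here is a small parser for that string. The `Device` tuple and `parse_device` function are hypothetical helpers written for this note, not part of swift-ring-builder, and the regular expression only covers the simple IPv4 form used on the slide.

```python
import re
from typing import NamedTuple


class Device(NamedTuple):
    region: int
    zone: int
    ip: str
    port: int
    device: str
    weight: float


# Matches the form used on the slide: r<region>z<zone>-<ip>:<port>/<device>
SPEC = re.compile(r"^r(\d+)z(\d+)-([\d.]+):(\d+)/(\w+)$")


def parse_device(spec: str, weight: str) -> Device:
    """Split a device spec and its weight into named fields."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized device spec: {spec!r}")
    region, zone, ip, port, device = match.groups()
    return Device(int(region), int(zone), ip, int(port), device, float(weight))
```

For example, `parse_device("r1z2-127.0.0.1:6020/sdb2", "1")` yields region 1, zone 2, port 6020, device sdb2, weight 1.0.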

Page 21: Openstack swift, how does it work?

What the Swift ring does

● Determines the number of partitions
● Records which partitions are placed on which nodes

Page 22: Openstack swift, how does it work?

[Figure: a table with one row per replica]

First replica

Second replica

Each column is a partition; 0 and 1 are device ids.

There are 8 columns because 2^3 = 8.
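The table on this slide can be modelled as a list of rows. The values below are invented for illustration (the original figure is not recoverable), and the names `replica2part2dev` and `devices_for` are chosen for this sketch only.

```python
# One row per replica, one column per partition; each cell is a device id.
# The values are made up; a part power of 3 gives 2**3 = 8 columns.
replica2part2dev = [
    [0, 1, 0, 1, 0, 1, 0, 1],  # first replica
    [1, 0, 1, 0, 1, 0, 1, 0],  # second replica
]


def devices_for(partition: int) -> list:
    """Device ids that hold each replica of the given partition."""
    return [row[partition] for row in replica2part2dev]
```

Looking up partition 3 with `devices_for(3)` reads one column of the table and returns `[1, 0]`: the first replica is on device 1 and the second on device 0.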

Page 23: Openstack swift, how does it work?

2^3 = 8

Page 24: Openstack swift, how does it work?

How to get partition from ring?

self._part_shift = 32 - part_power

At most about 4.2 billion (2^32) partitions
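The shift on this slide keeps only the top part_power bits of a 32-bit number taken from the hash, which is why the partition count tops out at 2^32. A minimal sketch of that lookup follows. It assumes MD5 over the plain /account/container/object path; a real deployment also mixes a configured secret prefix or suffix into the hash, which is omitted here, and the function name is only illustrative.

```python
import hashlib
import struct

PART_POWER = 10                # as in "create 10 3 1"
PART_SHIFT = 32 - PART_POWER   # self._part_shift on the slide


def get_part(account: str, container: str, obj: str) -> int:
    """Return the top PART_POWER bits of the MD5 of the object path."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    # Read the first 4 bytes as a big-endian unsigned 32-bit integer.
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> PART_SHIFT
```

With a part power of 10 the result is always in the range 0 to 1023.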

Page 25: Openstack swift, how does it work?

That wraps up the tour of the ring for now.

Page 26: Openstack swift, how does it work?

Proxy server

● When proxy.server receives a request, it picks a controller based on req.path (proxy.controller.obj, proxy.controller.account, or proxy.controller.container).

proxy.server -> proxy.controller

[Figure: several proxy-server processes in front of the account-server, container-server, and object-server processes]

Page 27: Openstack swift, how does it work?

Proxy server

● proxy.server also decides which method of the controller to call.

● The controller then sends requests to the object-server, container-server, or account-server.

[Figure: proxy-server processes sending requests to account-server, container-server, and object-server processes]
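The two dispatch steps described on the proxy slides (choose a controller from the path, then choose a method from the HTTP verb) might look roughly like the sketch below. The controller classes are hypothetical stand-ins that only return strings, and routing by counting path segments after the version is an assumption about the /v1/account/container/object URL layout, not a copy of Swift's code.

```python
class ObjectController:
    def GET(self, path):
        return f"object GET {path}"

    def PUT(self, path):
        return f"object PUT {path}"


class ContainerController:
    def GET(self, path):
        return f"container GET {path}"


class AccountController:
    def GET(self, path):
        return f"account GET {path}"


def pick_controller(path: str):
    """Choose a controller by how many segments the path has."""
    # "/v1/acct/cont/a/b" -> ["v1", "acct", "cont", "a/b"]
    segments = [s for s in path.lstrip("/").split("/", 3) if s]
    if len(segments) == 4:
        return ObjectController()
    if len(segments) == 3:
        return ContainerController()
    if len(segments) == 2:
        return AccountController()
    raise ValueError(f"cannot route {path!r}")


def handle(method: str, path: str) -> str:
    """Look up the controller method named after the HTTP verb."""
    controller = pick_controller(path)
    handler = getattr(controller, method.upper(), None)
    if handler is None:
        return "405 Method Not Allowed"
    return handler(path)
```

In this sketch `handle("GET", "/v1/AUTH_test/photos/cat.jpg")` reaches the object controller, while a PUT on an account path falls through to the 405 branch because the stand-in account controller defines no PUT.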

Page 28: Openstack swift, how does it work?

Proxy server

● How to select an object-server for GET?
  ○ Find all candidate hosts
  ○ Drop the hosts that are already in use
  ○ Use the first host that responds within 0.5 seconds
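The three GET steps can be sketched as a single loop. This is a simulation under stated assumptions: response times come from a dictionary instead of real connections, and `choose_node` is a hypothetical name.

```python
def choose_node(candidates, in_use, response_seconds, timeout=0.5):
    """Return the first candidate that is free and answers within timeout.

    response_seconds maps node name -> simulated response time; a real
    proxy would find this out by actually opening a connection.
    """
    for node in candidates:
        if node in in_use:
            continue  # drop hosts that are already in use
        if response_seconds.get(node, float("inf")) <= timeout:
            return node
    return None  # no candidate answered in time
```

With candidates n1, n2, n3, where n1 is in use and n2 takes 0.9 seconds, the function settles on n3.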

Page 29: Openstack swift, how does it work?

Proxy server

● How to select object-servers for PUT?
  ○ Look up the list of hosts from the ring
  ○ Write the data to every host in the list

Find the hosts and open connections

Upload the data to the object-servers

= (n // 2) + 1
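Assuming the expression (n // 2) + 1 on this slide is the number of successful writes required out of n replicas (a simple majority), the success check can be sketched as follows. Treating any 2xx status as a success and the name `put_succeeded` are choices made for this illustration.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of n replicas: (n // 2) + 1."""
    return (n // 2) + 1


def put_succeeded(statuses: list) -> bool:
    """True when at least a majority of backend responses are 2xx."""
    successes = sum(1 for status in statuses if 200 <= status < 300)
    return successes >= quorum_size(len(statuses))
```

With three replicas the quorum is 2, so `put_succeeded([201, 201, 503])` is true and `put_succeeded([201, 503, 503])` is false.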

Page 30: Openstack swift, how does it work?

object-server

● Operations on objects are handled by the object-server.
  ○ CRUD for objects and object metadata.
  ○ Object metadata is stored in file system xattrs.

● The storage location of an object is determined by the account name, container name, and object name.

Page 31: Openstack swift, how does it work?

sdb1/objects/18/a97/90286a5e5b4aeb7370b1091f23151a97/1426663507.97553.data

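sdb1/: the device name, determined by the account, container, object, and ring. It is already decided at the proxy-server.

objects/: one of accounts, containers, or objects.

18/: the partition, determined by the account, container, object, and ring. It is already decided at the proxy-server.

a97/: the last three characters of hash_path.

90286a5e5b4aeb7370b1091f23151a97/: hash_path, determined by the names of the account, container, and object.

1426663507.97553.data: the file extension determines whether the file exists or has been deleted.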
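Putting the path components together, a sketch of how such a path could be assembled is shown below. It assumes a plain MD5 of /account/container/object for hash_path; real Swift also mixes in a configured secret, so the hash produced here will not match the one on the slide. The function name and arguments are hypothetical.

```python
import hashlib
import os


def object_file_path(device, partition, account, container, obj, timestamp):
    """Assemble <device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data."""
    name = f"/{account}/{container}/{obj}".encode("utf-8")
    name_hash = hashlib.md5(name).hexdigest()  # stands in for hash_path
    suffix = name_hash[-3:]                    # last three characters of the hash
    return os.path.join(device, "objects", str(partition),
                        suffix, name_hash, f"{timestamp}.data")
```

The device and partition are passed in rather than computed, mirroring the slide's note that both are already decided at the proxy-server.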

Page 32: Openstack swift, how does it work?

object-auditor

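● Checks object integrity
  ○ Computes the object's hash and compares it with the recorded one (the record is kept in xattr).
  ○ When the hashes differ, the object is moved to the quarantined directory.

● How does it find the objects stored on the host?
  ○ It enumerates all files under specific directories that match specific path rules.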
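The integrity check can be sketched as below. To keep the example portable, the recorded hash is passed in as an argument instead of being read from an xattr, and the quarantine directory is just a path chosen by the caller; both are simplifications, and `audit_object` is a hypothetical name.

```python
import hashlib
import os
import shutil


def audit_object(data_path: str, recorded_md5: str, quarantine_dir: str) -> bool:
    """Return True if the file is intact; otherwise move it to quarantine.

    recorded_md5 stands in for the hash that is kept in the file's xattr.
    """
    digest = hashlib.md5()
    with open(data_path, "rb") as handle:
        # Hash the file in chunks so large objects do not fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() == recorded_md5:
        return True
    os.makedirs(quarantine_dir, exist_ok=True)
    shutil.move(data_path,
                os.path.join(quarantine_dir, os.path.basename(data_path)))
    return False
```

After a failed audit the file no longer exists at its original path, which is what later lets a remote replicator push a good copy back.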

Page 33: Openstack swift, how does it work?

How does object-auditor find all objects?

Page 34: Openstack swift, how does it work?

First listdir

Second listdir

Third listdir
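The code these three labels annotate is not preserved in the transcript. Presumably the three listdir calls walk the partition, suffix, and hash levels of the objects directory; under that assumption, a minimal sketch looks like this (the function name is hypothetical).

```python
import os


def iter_hash_dirs(objects_dir: str):
    """Yield every objects_dir/<partition>/<suffix>/<hash> directory."""
    for partition in sorted(os.listdir(objects_dir)):          # first listdir
        part_path = os.path.join(objects_dir, partition)
        if not os.path.isdir(part_path):
            continue
        for suffix in sorted(os.listdir(part_path)):           # second listdir
            suffix_path = os.path.join(part_path, suffix)
            if not os.path.isdir(suffix_path):
                continue  # skips plain files such as hashes.pkl
            for name_hash in sorted(os.listdir(suffix_path)):  # third listdir
                hash_path = os.path.join(suffix_path, name_hash)
                if os.path.isdir(hash_path):
                    yield hash_path
```

Because discovery is driven purely by what exists on disk, an object whose files were deleted is simply never seen by this walk.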

Page 35: Openstack swift, how does it work?

Object-replicator

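● "To recognize a mistake and correct it is the greatest good":
  ○ Recognizing the mistake: object-auditor
  ○ Correcting it: object-replicator

● Object-replicator compares the hashes of the local replica and the replica on the remote machine. If they differ, it runs rsync to push the data to the remote.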

Page 36: Openstack swift, how does it work?

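Object-replicator

● If the local copy is the one that is wrong, won't pushing it to the remote spread the error?
  ○ When the local data is wrong, object-replicator still tries to push it to the other hosts through rsync, but because of the --ignore-existing option nothing is actually pushed.
  ○ Object-replicator never pulls data from other hosts to the local host. It can only wait for other hosts to push.
  ○ When object-auditor notices that the local data is wrong, it moves the local data to the quarantine area. When a remote object-replicator compares hashes and finds that they differ, the remote object-replicator pushes its data to the local host.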
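The effect of --ignore-existing can be modelled with two dictionaries standing in for the directories on each host. This is a toy simulation, not an rsync wrapper; `push_ignore_existing` is a hypothetical name.

```python
def push_ignore_existing(local: dict, remote: dict) -> list:
    """Copy files the remote lacks; never overwrite what it already has.

    Both arguments map file name -> content. Returns the names copied.
    """
    copied = []
    for name, content in local.items():
        if name not in remote:
            remote[name] = content
            copied.append(name)
    return copied


# Host A holds a corrupted copy, host B holds a good one.
host_a = {"obj.data": "corrupt"}
host_b = {"obj.data": "good"}

# A pushes to B: nothing is copied, so the corruption does not spread.
pushed_from_a = push_ignore_existing(host_a, host_b)

# The auditor on A quarantines the bad file; now B's push restores it.
del host_a["obj.data"]
pushed_from_b = push_ignore_existing(host_b, host_a)
```

After these steps `pushed_from_a` is empty, `pushed_from_b` contains obj.data, and host A holds the good copy again.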

Page 37: Openstack swift, how does it work?

ObjectReplicator.update

Page 38: Openstack swift, how does it work?

Do not overwrite files on the remote

Synchronize file metadata

Page 39: Openstack swift, how does it work?

Only one os.listdir

rsync works per partition, for example /srv/a/objects/30

Page 40: Openstack swift, how does it work?

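Data recovery: scenario analysis

● The content of a replica is modified:
  ○ The object-auditor (on machine A) compares the file against the attributes stored in its xattr (file length and hash), finds that the content is wrong, moves the data to the quarantine area, and updates hashes.pkl.
  ○ A remote object-replicator finds that machine A's hashes.pkl differs from its own and tries to push its own data to machine A.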

Page 41: Openstack swift, how does it work?

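Data recovery: scenario analysis

● A replica and hashes.pkl are both deleted:
  ○ The object-auditor discovers which objects a machine holds by looking at the files. Because the file has been deleted, the object-auditor does not know the object exists.
  ○ The remote object-replicator wants to compare the hashes in hashes.pkl, but hashes.pkl has been deleted. The hash would have to be recomputed, but since the file has been deleted too, it cannot be. The remote object-replicator therefore runs rsync.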

Page 42: Openstack swift, how does it work?

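Data recovery: scenario analysis

● A replica is deleted:
  ○ The object-auditor discovers which objects a machine holds by looking at the files. Because the file has been deleted, the object-auditor does not know the object exists, and the recorded hash of the object is not changed either.
  ○ Because hashes.pkl was not modified, the remote object-replicator does not notice that the object has been deleted.
  ○ The remote object-replicator may still restore the replica, however, because rsync works per partition rather than per object.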

Page 43: Openstack swift, how does it work?

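Data recovery: scenario analysis

● hashes.pkl is deleted:
  ○ Because hashes.pkl has been deleted, the remote object-replicator fails when comparing hashes and therefore synchronizes the directory.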

Page 44: Openstack swift, how does it work?

container-server

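● Works much like the object-server, except that it stores SQLite files
  ○ Every change to the db is a new record
  ○ The newest data is always the row with the largest primary key
  ○ When merging, the row with the largest primary key is kept

● Also handles ReplicatorRpc
  ○ merge_syncs
  ○ merge_items
  ○ complete_rsync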
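The "every change is a new record, the largest primary key wins" rule can be tried out with an in-memory SQLite database. The table layout below is a deliberately simplified, hypothetical schema, not the one Swift actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object ("
    "row_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name TEXT, size INTEGER, deleted INTEGER)"
)


def record_change(name: str, size: int, deleted: bool = False) -> None:
    """Append a new row for every change; nothing is updated in place."""
    conn.execute(
        "INSERT INTO object (name, size, deleted) VALUES (?, ?, ?)",
        (name, size, int(deleted)),
    )


def latest(name: str):
    """The row with the largest primary key is the newest state."""
    return conn.execute(
        "SELECT size, deleted FROM object WHERE name = ? "
        "ORDER BY row_id DESC LIMIT 1",
        (name,),
    ).fetchone()


record_change("photo.jpg", 100)
record_change("photo.jpg", 250)
```

After the two calls, `latest("photo.jpg")` returns `(250, 0)`: the second insert has the larger primary key, so it is the current state, while the first row remains in the table as history.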

Page 45: Openstack swift, how does it work?

container-auditor

● Only checks whether the file can be opened.

● If the file cannot be opened, it is quarantined.

Page 46: Openstack swift, how does it work?

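container-replicator

● Responsible for synchronizing and merging the db
● Synchronization methods:
  ○ Sending the differences: db_replicator.Replicator._usync_db
  ○ rsync: db_replicator.Replicator._rsync_db
● When merging the db, the row with the largest primary key is kept
  ○ Every change to the db is a new record
  ○ The newest data is always the row with the largest primary key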

Page 47: Openstack swift, how does it work?

container-replicator

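● When a db is deleted:
  ○ The container-replicator on a remote host asks the local container-server for data through an HTTP request.
  ○ Because the db has been deleted, the container-server replies that the file cannot be found.
  ○ The remote host pushes the deleted db back through rsync.

● When it finds a file that cannot be opened, it quarantines the file. (This duplicates what container-auditor does.)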

Page 48: Openstack swift, how does it work?

Summary

● Swift architecture
● Introduction to the Swift ring
  ○ a layer of indirection
● How the proxy-server accesses objects
● The object-server and recovery scenario analysis
● A brief look at the container-server

Page 49: Openstack swift, how does it work?

Summary

Once you see the essence of things, you will no longer be misled by their appearance.