Lessons Learned (and Pitfalls) from Building a Small Hybrid DB System with SMACK — Chih-Hsuan Hsu (Joe), PilotTV / Data Engineer

SMACK Dev Experience



Page 1: SMACK Dev Experience

Lessons learned (and the pitfalls we stepped into) from building a small hybrid DB system with SMACK

Chih-Hsuan Hsu (Joe), PilotTV / Data Engineer

Page 2: SMACK Dev Experience

About Me

● Chih-Hsuan Hsu (Joe)
● PilotTV Data Engineer
● Interested in SMACK/ELK architecture
● Translator of technical books

○ Spark學習手冊

○ Cassandra技術手冊

● LinkedIn: www.linkedin.com/in/joechh
● Mail: [email protected]

Page 3: SMACK Dev Experience

Story: Retrofitting an Existing System

● RDB load growing by the day

● Single point of failure

● Wanted to turn the batch-mode flow into a streaming-based data flow
● To keep phase one simple, we adopted only Spark, Kafka, and Cassandra from SMACK

○ ...because we were not yet familiar with Mesos and Akka... QQ


Page 4: SMACK Dev Experience

System Before / (Expected) After the Retrofit

ETL

New ETL with Kafka Producer


Page 5: SMACK Dev Experience

New ETL

● Implemented in Java

● The ETL produces JSON-format streaming data

● The JSON stream is sent to the Kafka cluster via the Kafka Producer API

Pitfall: a Kafka Producer created with default settings has throughput that is far too low

New ETL with Kafka Producer


Page 6: SMACK Dev Experience

Experimenting with Kafka Producer Parameters (0.8.2)

Parameter | Default | Available options
producer.type | sync | sync, async
compression.codec | none | none, gzip, snappy
batch.num.messages | 200 | unlimited
request.required.acks | 0 | -1, 0, 1
queue.buffering.max.messages | 10000 | unlimited

https://kafka.apache.org/082/documentation.html


Page 7: SMACK Dev Experience

Kafka Producer: producer.type

● Setting producer.type to async enables batch transfer mode

● Batch mode yields better throughput, but data loss is possible if the client suddenly crashes

(chart: ~6.3X throughput)


Page 8: SMACK Dev Experience

Kafka Producer: batch.num.messages

● In async mode, the amount of data sent in one batch

● A batch is sent when:

○ the data volume reaches batch.num.messages
○ the wait exceeds queue.buffering.max.ms

(chart: ~2.55X / ~2.53X)


Page 9: SMACK Dev Experience

Kafka Producer: queue.buffering.max.messages

● The number of messages allowed to sit in the buffer queue

(chart: ~1.05X)


Page 10: SMACK Dev Experience

Kafka Producer: compression.codec

● Supports compressing the output stream

(chart: ~2.14X / ~3.02X)


Page 11: SMACK Dev Experience

Kafka Producer: request.required.acks

● 0: no receipt acknowledgement (ack) from the Kafka cluster
● 1: ack from the replica leader only
● -1: ack from every replica

(chart: ~6.3X / ~2.94X)


Page 12: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = 1 can still lose data

● Being in sync with the leader node does not imply fault tolerance

[Diagram: the producer sends messages 5 and 6 to the replica leader; the leader and both followers currently hold messages 1-4]


Page 13: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = 1 can still lose data

● Being in sync with the leader node does not imply fault tolerance

[Diagram: the leader appends messages 5 and 6 and returns the ack; both followers still hold only messages 1-4]


Page 14: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = 1 can still lose data

● Being in sync with the leader node does not imply fault tolerance

[Diagram: the leader holding messages 5 and 6 fails before the followers replicate them, so those messages are lost]


Page 15: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2

[Diagram: the producer sends messages 5 and 6 to the replica leader; the leader and both followers currently hold messages 1-4]
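The fault-tolerance condition above is set when the topic is created — a sketch with the old ZooKeeper-based CLI (topic name, partition count, and ZooKeeper address are placeholders):

```shell
# replication.factor >= 2 and min.insync.replicas >= 2 means acks=-1
# survives the loss of one replica without losing data
kafka-topics.sh --create --zookeeper zk1:2181 \
  --topic events --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2
```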


Page 16: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2

[Diagram: the followers replicate messages 5 and 6; only after all in-sync replicas have them is the ack returned]

Page 17: SMACK Dev Experience

Kafka Producer: request.required.acks (cont.)

● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2

[Diagram: all replicas now hold messages 5 and 6]

Page 18: SMACK Dev Experience

Java Lambda Streams

● During implementation we found that parallelStream() also helps throughput!

(chart: ~3.47X)
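The idea can be sketched as follows — a minimal example, with the record shape and JSON format invented for illustration (the real ETL builds its own JSON); parallelStream() farms the per-record transformation out to the common fork-join pool:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelEtl {
    // Transform records concurrently; collect() still preserves input order.
    static List<String> toJson(List<String> records) {
        return records.parallelStream()
                .map(r -> "{\"payload\":\"" + r + "\"}")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toJson(Arrays.asList("a", "b")));
        // [{"payload":"a"}, {"payload":"b"}]
    }
}
```

Note that parallelStream() only pays off when the per-element work is CPU-bound enough to amortize the fork-join overhead.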


Page 19: SMACK Dev Experience

About the Kafka Producer object itself.....

● It is thread safe, so a single instance can be shared by all threads
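The sharing pattern looks roughly like this — a sketch in which FakeProducer is a hypothetical stand-in for the real (thread-safe) producer, so the example runs without a Kafka cluster:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedProducerSketch {
    // Hypothetical stand-in for the real Kafka producer, which is thread safe.
    static class FakeProducer {
        final AtomicInteger sent = new AtomicInteger();
        void send(String msg) { sent.incrementAndGet(); }
    }

    // One producer instance shared by every worker thread -- no per-thread copies.
    static int sendAll(int threads, int messages) throws InterruptedException {
        FakeProducer producer = new FakeProducer();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < messages; i++) {
            pool.submit(() -> producer.send("msg"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return producer.sent.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sendAll(4, 100)); // 100 -- nothing lost or duplicated
    }
}
```

Sharing one instance avoids the per-producer connection and buffer overhead that per-thread producers would multiply.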


Page 20: SMACK Dev Experience

Kafka Producer: Experiment Conclusions

Parameter | Adopted value | Available options
producer.type | async | sync, async
compression.codec | snappy | none, gzip, snappy
batch.num.messages | 1000 | unlimited
request.required.acks | 0 | -1, 0, 1
queue.buffering.max.messages | 20000 | unlimited

● You still have to test against your actual workload and your tolerance for data loss

● Latency vs. throughput
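The adopted values above map onto the Properties object handed to the old (0.8.2) producer's ProducerConfig — a minimal sketch; the broker address is a placeholder:

```java
import java.util.Properties;

public class ProducerProps {
    // Build the tuned 0.8.2 producer configuration from the table above.
    static Properties tuned() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");    // placeholder host
        props.put("producer.type", "async");                  // batch transfer mode
        props.put("compression.codec", "snappy");
        props.put("batch.num.messages", "1000");
        props.put("request.required.acks", "0");              // no ack: fastest, may lose data
        props.put("queue.buffering.max.messages", "20000");
        // Old producer API: new kafka.javaapi.producer.Producer<>(new ProducerConfig(props))
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tuned().getProperty("producer.type")); // async
    }
}
```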

Page 21: SMACK Dev Experience

Spark Streaming

● Implemented in Scala

● Client-side join of multiple streams coming from Kafka
● The join result is written (upsert) to SQL Server to preserve the existing architecture

[Diagram: new ETL with Kafka producer -> streaming join -> upsert]


Page 22: SMACK Dev Experience

Spark Streaming at the Beginning

Pitfall: tragic processing throughput!! (join: only 500 msgs/sec!!)

Page 23: SMACK Dev Experience

DB Lock Resources

● Inspecting the DB showed that Spark's upserts were exhausting the lock resources.....


Page 24: SMACK Dev Experience

Solution

● Create a plain base table (with no indexes at all)
● Replace upsert with insert, then run a second-stage aggregation via a stored procedure

[Diagram: append -> aggregation]


Page 25: SMACK Dev Experience

Throughput Improvement (join: 500 -> 13,000 msgs/sec)


Page 26: SMACK Dev Experience

Another Spark-with-RDB Pitfall to Watch

● SQL Server allows at most 32,767 simultaneous connections
● Don't ask how I know............
● Whichever connection pool you use, keep an eye on the total connection count

● Total connections = connection pool size * number of Spark executors
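The arithmetic is trivial but easy to forget, because every executor keeps its own pool — a sketch with hypothetical sizes:

```java
public class ConnectionBudget {
    // Total DB connections when each Spark executor holds its own pool.
    static int totalConnections(int poolSizePerExecutor, int executorCount) {
        return poolSizePerExecutor * executorCount;
    }

    public static void main(String[] args) {
        int total = totalConnections(50, 100);  // hypothetical sizes
        System.out.println(total);               // 5000
        // SQL Server's hard cap is 32,767 -- keep the total well below it.
        System.out.println(total < 32767);       // true
    }
}
```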


Page 27: SMACK Dev Experience

When doing mapping with the stateful mapWithState() API...

● When conditions allow, set a timeout to evict KV pairs and reduce the state table's memory footprint

● The timeout is measured from the last time a KV pair was accessed


Page 28: SMACK Dev Experience

Some Useful Configs for spark-submit

Ref: https://spark.apache.org/docs/2.0.0-preview/configuration.html

● supervise

● spark.streaming.backpressure.enabled

● spark.streaming.backpressure.initialRate

● spark.streaming.kafka.maxRatePerPartition

● spark.executor.extraJavaOptions

○ -XX:+UseConcMarkSweepGC

● spark.cleaner.referenceTracking.cleanCheckpoint
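Put together, these options might be passed to spark-submit roughly as follows — a sketch; the class name, jar, and rate values are placeholders, and --supervise applies to standalone-cluster deploy mode:

```shell
spark-submit \
  --supervise \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=10000 \
  --conf spark.streaming.kafka.maxRatePerPartition=5000 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  --conf spark.cleaner.referenceTracking.cleanCheckpoint=true \
  --class com.example.StreamingJob streaming-job.jar
```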


Page 29: SMACK Dev Experience

Cassandra, the NoSQL Piece

● A query-first schema design philosophy

● Before designing tables, enumerate as many usage scenarios as possible

Ref: https://www.datastax.com/

Step 1. Draw the ER diagram

Step 2. Consider the query scenarios

Step 3. Create tables that satisfy those queries
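A query-first table typically denormalizes around a single access pattern — a hypothetical CQL sketch (table and columns are invented for illustration, not from the project):

```sql
-- Hypothetical: one table per query, partitioned by how it will be read.
-- Target query: "latest events for a device, newest first"
CREATE TABLE events_by_device (
    device_id  text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((device_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
```

The partition key matches the query's lookup term and the clustering order matches its sort, which is why the querying scenarios have to be known up front.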


Page 30: SMACK Dev Experience

However...... Cassandra is out! (in this project)

User: Joe.......... we want to build a dashboard. We need ad-hoc queries that can run arbitrary operations on arbitrary columns, so there's no way we can discuss the possible queries with you~~

Joe:


Page 31: SMACK Dev Experience

Migrate the NoSQL solution to the ELK stack!

New ETL with Kafka Producer


Page 32: SMACK Dev Experience

Logstash ingestion from Kafka

Pitfall: throughput of bulk loading into ES is very low (indexing 8,000 docs/min)


Page 33: SMACK Dev Experience

Studying Logstash Startup Flags

https://www.elastic.co/guide/en/logstash/2.4/command-line-flags.html

Page 34: SMACK Dev Experience

First-Step Improvement

● Since resources were still sufficient, we tried raising the worker count and the batch size
● workers -> 20; batch -> 500

(chart: ~7.5X)
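With Logstash 2.4's command-line flags, that change might look like this — a sketch; the config path is a placeholder:

```shell
# -w: pipeline workers, -b: pipeline batch size (Logstash 2.4 flags)
bin/logstash -w 20 -b 500 -f /etc/logstash/pipeline.conf
```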


Page 35: SMACK Dev Experience

When Logstash Has to Consume Multiple Kafka Topics...

● Version >= 5.0: the topics attribute can subscribe to several topics at once
● Version < 5.0........... each topic must be declared separately in the config file
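On 5.x the kafka input takes an array — a sketch with a placeholder broker address and hypothetical topic names; on older versions you would repeat one kafka input block per topic:

```
# Logstash >= 5.0: one input block, many topics
input {
  kafka {
    bootstrap_servers => "broker1:9092"   # placeholder
    topics => ["orders", "clicks"]        # hypothetical topic names
  }
}
```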


Page 36: SMACK Dev Experience

ES-Side Tuning for Bulk Loading

● Bulk loading involves several trade-offs:

○ Don't care about the latency of the newest data -> raise index.refresh_interval (or disable refresh entirely)
○ Don't care about fault tolerance or query speed -> set the replica count to 0
○ Don't care about the I/O consumed by segment merging (the faster the better) -> don't throttle merge I/O
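The first two knobs are per-index settings — a sketch for ES 2.x; the index name is a placeholder, and these should be reverted once the bulk load finishes:

```shell
# Temporary index settings for bulk loading (ES 2.x)
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}'
```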


Page 37: SMACK Dev Experience

Bulk Load Improvement

● Final result: 400K docs/min, a 50X throughput gain

(chart: 50X)


Page 38: SMACK Dev Experience

When Bulk Loading Is Going Great........

● Too many open files!

One pit after another......


Page 39: SMACK Dev Experience

Changing ES max_file_descriptors

● First check the current max_file_descriptors, then adjust it
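Both steps can be done from the shell — a sketch; the host and the limit value are environment-specific:

```shell
# Check the file-descriptor limit the ES process actually sees
curl 'localhost:9200/_nodes/stats/process?pretty' | grep max_file_descriptors

# Raise the OS limit for the elasticsearch user, e.g. in /etc/security/limits.conf:
#   elasticsearch  -  nofile  65536
# Then verify the shell limit:
ulimit -n
```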


Page 40: SMACK Dev Experience

ES Search Query Tuning

● Optimize (force merge) cold indices, even down to a single segment

● Use two kinds of cache to speed up queries:

○ Filter cache: caches filter results for reuse by later queries

○ Shard cache: caches an entire query result; an identical query returns it directly

● Don't forget to remove the temporary settings made for bulk-load mode
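Force-merging a cold index is a single API call — a sketch for ES 2.x (where _optimize was renamed _forcemerge); the index name is a placeholder:

```shell
# Merge a cold (no-longer-written) index down to one segment
curl -XPOST 'localhost:9200/cold-index/_forcemerge?max_num_segments=1'
```

This is expensive in I/O, so it is only worthwhile on indices that will no longer receive writes.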


Page 41: SMACK Dev Experience

Filter Query vs. Normal Query

Ref: Elasticsearch in Action

Page 42: SMACK Dev Experience

Translate Normal Query to Filter Query
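In ES 2.x, a scoring query can often be moved into a bool filter clause, which skips relevance scoring and makes the result cacheable — a sketch; the field, value, and index name are hypothetical:

```shell
# Scored version (hypothetical field/value):
#   {"query":{"term":{"status":"error"}}}
# Filter version -- same matches, no scoring, cacheable:
curl -XPOST 'localhost:9200/myindex/_search' -d '
{"query":{"bool":{"filter":[{"term":{"status":"error"}}]}}}'
```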


Page 43: SMACK Dev Experience

Adding or Changing ES Fields

● Adding a brand-new field is easy (flexible schema)
● Changing a field's type is painful!! It requires a reindex...


Page 44: SMACK Dev Experience

Kibana..... no pitfalls yet (too early to tell?)

● Kibana 4.2+ ships with the Sense tool, which is great for issuing DSL queries!!
● $ ./bin/kibana plugin --install elastic/sense


Page 45: SMACK Dev Experience

Summary

● Discussed component versions
○ Kafka: 0.8.2
○ Spark: 2.0.2
○ Cassandra: 3.10
○ Elasticsearch: 2.4.5
○ Logstash: 2.4.1
○ Kibana: 4.6.4

New ETL with Kafka Producer
