Upload
raghavendra-prabhu
View
250
Download
3
Embed Size (px)
DESCRIPTION
This is the talk given at Highload++ 2014 in Moscow, Russia. The topic was partition tolerance testing of Galera in a noisy high load environment with NetEm and Docker.
Citation preview
Corpus collapsumPartition tolerance of Galera in a noisy high load
environmentHighload++ 2014
Raghavendra Prabhu [email protected]
Percona [email protected] randomsurfer wnohang.net rdprabhu ronin13
The Title?
Our Cluster
Split brain
Seed quotes..
“ ’Network is reliable’ - a fallacy of the distributed system. ” - PeterDeutsch
“ A distributed system is one in which the failure of a computer you didn’teven know existed can render your own computer unusable. ” - Leslie Lamport
“ Never attribute to malice that which is adequately explained by stupidity.” - Hanlon’s Razor
“ Never attribute to Byzantine failure which can be explained by an illnode(s) ” - Me
5 / 92
Seed quotes..
“ ’Network is reliable’ - a fallacy of the distributed system. ” - PeterDeutsch
“ A distributed system is one in which the failure of a computer you didn’teven know existed can render your own computer unusable. ” - Leslie Lamport
“ Never attribute to malice that which is adequately explained by stupidity.” - Hanlon’s Razor
“ Never attribute to Byzantine failure which can be explained by an illnode(s) ” - Me
6 / 92
Seed quotes..
“ ’Network is reliable’ - a fallacy of the distributed system. ” - PeterDeutsch
“ A distributed system is one in which the failure of a computer you didn’teven know existed can render your own computer unusable. ” - Leslie Lamport
“ Never attribute to malice that which is adequately explained by stupidity.” - Hanlon’s Razor
“ Never attribute to Byzantine failure which can be explained by an illnode(s) ” - Me
7 / 92
Seed quotes..
“ ’Network is reliable’ - a fallacy of the distributed system. ” - PeterDeutsch
“ A distributed system is one in which the failure of a computer you didn’teven know existed can render your own computer unusable. ” - Leslie Lamport
“ Never attribute to malice that which is adequately explained by stupidity.” - Hanlon’s Razor
“ Never attribute to Byzantine failure which can be explained by an illnode(s) ” - Me
8 / 92
The fallacies
▶ The network is reliable▶ Latency is zero▶ Bandwidth is infinite▶ The network is secure
9 / 92
The fallacies
▶ Topology doesn’t change▶ There is one administrator▶ Transport cost is zero▶ The network is homogeneous
10 / 92
20000 feet view
Actors
▶ Database - WSREP/PXC▶ Plugin - Galera▶ Traffic control
♦ Traffic Control - tc♦ NetEm
12 / 92
Actors
▶ Database - WSREP/PXC▶ Plugin - Galera▶ Traffic control
♦ Traffic Control - tc♦ NetEm
13 / 92
Actors
▶ Database - WSREP/PXC▶ Plugin - Galera▶ Traffic control
♦ Traffic Control - tc♦ NetEm
14 / 92
Actors
▶ Containers - Docker▶ Load
♦ Generators - Sysbench, RQG▶ Network
♦ Dnsmasq♦ nsenter
15 / 92
Actors
▶ Containers - Docker▶ Load
♦ Generators - Sysbench, RQG▶ Network
♦ Dnsmasq♦ nsenter
16 / 92
Actors
▶ Jenkins♦ Build flow and CI
▶ Storage♦ Why
17 / 92
But why
▶ The ‘P’ in CAP▶ WAN scalability▶ Real Reason - fun!▶ Tolerance to latency variance
18 / 92
But why
▶ The ‘P’ in CAP▶ WAN scalability▶ Real Reason - fun!▶ Tolerance to latency variance
19 / 92
But why
▶ The ‘P’ in CAP▶ WAN scalability▶ Real Reason - fun!▶ Tolerance to latency variance
20 / 92
But why
▶ The ‘P’ in CAP▶ WAN scalability▶ Real Reason - fun!▶ Tolerance to latency variance
21 / 92
But why
▶ Failures in warehouses.▶ Not quorum, but consensus.▶ Real world networks and synchronous replication
- Delay- Partition
22 / 92
Galera
Galera
▶ Data-centric approach▶ EVS▶ Causality and Synchronous▶ Latency
24 / 92
Where did it start
Where did it start
▶ Bug! https://bugs.launchpad.net/galera/+bug/1274192▶ Loss of PC▶ Crash▶ HA goal
29 / 92
One can bring the whole down
The Flow
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
32 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
33 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
34 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
35 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
36 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
37 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
38 / 92
Basic Flow
Jenkins Build image(s?) Start Dnsmasq Bootstrap
Load/SysbenchSST/OthersPre-sanitynsenter/netem
39 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
40 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
41 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
42 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
43 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
44 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
45 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
46 / 92
Basic Flow
RR sysbench
Detach/Keep
Sanity check Reconciliation
Post sanity Core trace
Cleanup Collect logs
Examine!
47 / 92
Cluster Resilience
48 / 92
Parameters
▶ Sysbench▶ Segments▶ Reconciliation period▶ Number of nodes with loss
49 / 92
Parameters
▶ Sysbench▶ Segments▶ Reconciliation period▶ Number of nodes with loss
50 / 92
Parameters
▶ Sysbench▶ Segments▶ Reconciliation period▶ Number of nodes with loss
51 / 92
Parameters
▶ Sysbench▶ Segments▶ Reconciliation period▶ Number of nodes with loss
52 / 92
Parameters
▶ NetEm▶ Detach qdisc or keep it?▶ Fsync▶ Shutdown
53 / 92
Parameters
▶ NetEm▶ Detach qdisc or keep it?▶ Fsync▶ Shutdown
54 / 92
Parameters
▶ NetEm▶ Detach qdisc or keep it?▶ Fsync▶ Shutdown
55 / 92
Parameters
▶ NetEm▶ Detach qdisc or keep it?▶ Fsync▶ Shutdown
56 / 92
Parameters
Containers!
Docker
▶ Why not virtualize♦ Occam♦ Namespaces
▶ Simplicity♦ Network♦ One application per node
59 / 92
Docker
▶ Portability- See same qualitative behavior that I do.
▶ Reproducibility- Makes it determinstic
▶ Configurable and CI- Byproducts
60 / 92
Docker
▶ QEMU and Docker▶ Scalability
♦ Performance♦ Feature
▶ Abstraction of channels
61 / 92
Container Networking
▶ Linking didn’t help▶ Dnsmasq to rescue!
♦ Hosts file and volumes♦ SIGHUP and refresh
▶ More elegant methods- Swarm
62 / 92
Noise
▶ Initial setup- Bridge- Egress only- IFB
▶ Present state▶ NetEm
- tc qdisc buckets- packet loss, delay, corruption, duplication, reordering- nsenter
▶ Future- Docker exec
63 / 92
Testing methods
Method I
▶ Qdisc is detached after load▶ Objective
- Time to recover of full cluster▶ Done with a larger subset
65 / 92
Method II
▶ Qdisc is kept till the end▶ Objective
- Formation of primary component▶ Comparatively smaller subset
66 / 92
Observations
▶ Post sanity types- Why- Whom
▶ Which method is more pertinent▶ State transfer issues
- Beginning- During re-emergence
67 / 92
Observations
▶ Direct load to affected nodes▶ Logs
- journalctl- Streaming?
68 / 92
Other noises
▶ Aim▶ Fsync
- libeatmydata- Variance
▶ Correlation with network▶ How with Docker
- LD_PRELOAD
69 / 92
System Load
Load generation
▶ Sysbench- Generation- Reconnect on partition
▶ Sockets chosen- Load on affected nodes
▶ Distribution of Load- RR with socat- Native sysbench support- HAProxy?
71 / 92
Load generation
▶ Nature of data/load- DDL
▶ RQG in future- Fuzz testing
72 / 92
The Fix
Strike Out!
Eviction
▶ STONITH▶ Permanent eviction▶ ’N’ strikes & out!
- Timers - evs parameters- wsrep_evs_delayed and wsrep_evs_evict_list
75 / 92
Eviction
▶ Aim▶ Quorum required
- Keep majority of the group live or to avoid eviction.- Why? - Not shoot each other- Non-PC nodes also.
76 / 92
Eviction
▶ Aim▶ Quorum required
- Keep majority of the group live or to avoid eviction.- Why? - Not shoot each other- Non-PC nodes also.
77 / 92
Eviction
▶ EVS version and upgrade▶ TODO!
- Ingress only- Follow here.
▶ Credits to Teemu Ollakka, Yan Zhang and Alex Yurchenko from codership.
78 / 92
Coredumps with Docker
▶ Breakdown of abstraction▶ Lack of isolation▶ What was done
- Volumes- core_pattern & sysctl- suid and ulimit
79 / 92
WAN Segments
▶ How they work▶ Random allocation▶ Joiner starvation▶ Simulates data center▶ Donor selection
80 / 92
The code
▶ Github: https://github.com/percona/pxc-docker▶ Jenkins: http://jenkins.percona.com/job/PXC-5.6-netem/▶ Contributions/testing welcome!▶ Dependencies
- Sysbench
81 / 92
Code: todo
▶ Docker automated builds▶ Orchestration
- Kubernetes, Mesos et.al.▶ Docker
♦ Injection♦ Signal proxying
82 / 92
Code: todo
▶ Poll and exit during reconciliation- Instrument v/s number of nodes
▶ Use actual channels▶ Run it bare - CoreOS etc.▶ Overlay with etcd/fleet/libswarm
83 / 92
Future work
Future work
▶ Fault injection♦ Memory
- Poisoned memory/madvise♦ Disk
- libeatmydata- Opposite- ENOSPC
85 / 92
Fault injection
▶ CPU- NUMA?- Hotplug
▶ More network- corruption, duplication, reordering, rate-limit- Better distribution- Other shaping
86 / 92
More Chaos
Future work
▶ Disturb cluster more!- Membership changes* Manual eviction* Pull the cord!- Corrupt nodes
▶ Consistency voting
88 / 92
Further Reading
▶ Jepsen▶ Byzantine fault tolerance
- Reaching agreement in presence of faults▶ The Network is Reliable▶ NetEm▶ Latency: The New Web Performance Bottleneck▶ Galera Cluster Documentation▶ Auto eviction code▶ Don’t Settle for Eventual Consistency▶ Extended Virtual Synchrony
89 / 92
About
▶ /me: Raghavendra Prabhu, Product Lead, Percona XtraDB Cluster, Percona.▶ Slides will be at slideshare.▶ About.me: raghavendra.prabhu▶ Keybase.io: rdprabhu▶ Presentation under CC BY-SA 4.0
90 / 92
Image Credits▶ http://galeracluster.com/documentation-webpages/▶ http://www.thelastdragontribute.com/40th-anniversary-death-of-bruce-lee/▶ https://upload.wikimedia.org/wikipedia/commons/6/60/Corpus_callosum.png▶ http://www.thebarrow.org/Neurological_Services/Epilepsy/204354▶ https://flic.kr/p/9J6GNu▶ https://secure.flickr.com/photos/brewbooks/7780990192▶ https://www.flickr.com/photos/kwerfeldein/2649294869▶ https://secure.flickr.com/photos/mindmob/51951632▶ https://secure.flickr.com/photos/arenamontanus/2227769907▶ https://www.flickr.com/photos/markop/477199204▶ http://galeracluster.com/wp-content/uploads/2013/10/galera_replication1.png▶ https://www.flickr.com/photos/gcwest/281385801▶ https://www.flickr.com/photos/opethdamna/360934079▶ http://digital-amphetamine.deviantart.com/art/Sky-82555664▶ http://highload.co/i/logo.png▶ https://flic.kr/p/xTT8n▶ https://www.flickr.com/photos/29233640@N07/13466208953▶ https://www.flickr.com/photos/bob_in_thailand/9782777742/
91 / 92
Spaseeba & Da sveedaneeya!