Distributed Postgres: XL, XTM, Multimaster
Stas Kelvich

Cluster group in PgPro

- Started about a year ago.
- Konstantin Knizhnik, Constantin Pan, Stas Kelvich.

Cluster group in PgPro

Started by playing with Postgres-XC. 2ndQuadrant also had a project (now finished) to port XC to 9.5.

- A fork is painful;
- How can we bring the functionality of XC into core?

Distributed postgres

- Distributed transactions: nothing in core;
- Distributed planner: FDW, pg_shard, the Greenplum planner (?);
- HA/autofailover: can be built on top of logical decoding.

Distributed transactions

The goal is proper isolation for transactions that span multiple nodes. Today, when a write transaction starts, Postgres will:

- Acquire an XID;
- Get the list of running transactions;
- Use that info in visibility checks (sketched below).
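
To make the visibility-check step concrete, here is a simplified sketch of the rule that XidInMVCCSnapshot implements, with stand-in types and no wraparound or subtransaction handling; it illustrates the idea and is not the actual Postgres code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct SnapshotData
    {
        TransactionId  xmin;   /* all xids < xmin were finished at snapshot time */
        TransactionId  xmax;   /* all xids >= xmax had not started yet */
        TransactionId *xip;    /* xids that were in progress at snapshot time */
        int            nxip;
    } SnapshotData;

    /* Return true if xid must be treated as still running (and therefore
     * invisible) under this snapshot. */
    static bool
    xid_in_mvcc_snapshot(TransactionId xid, const SnapshotData *snap)
    {
        if (xid >= snap->xmax)
            return true;            /* started after the snapshot was taken */
        if (xid < snap->xmin)
            return false;           /* finished before the snapshot was taken */
        for (int i = 0; i < snap->nxip; i++)
            if (snap->xip[i] == xid)
                return true;        /* was in progress at snapshot time */
        return false;
    }

The list of running transactions gathered in step two becomes the xip array here, and that is exactly the state a distributed transaction manager has to agree on across nodes.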

XTM API: vanilla

MVCC entry points in the vanilla source tree:

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot

XTM API: after patch

The same entry points, now dispatched through a TransactionManager structure:

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot

XTM API: after tm load

Once a DTM extension (pg_dtm.so) is loaded, the TransactionManager routes the same entry points to the extension's implementations (see the sketch below):

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot
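
A minimal sketch of what this indirection might look like, assuming a struct of function pointers named after the entry points above; the signatures are simplified stand-ins, not the actual XTM patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for the Postgres types, to keep the sketch
     * self-contained. */
    typedef uint32_t TransactionId;
    typedef int      XidStatus;      /* committed / aborted / in progress */
    typedef struct SnapshotData *Snapshot;

    typedef struct TransactionManager
    {
        XidStatus     (*GetTransactionStatus) (TransactionId xid);
        void          (*SetTransactionStatus) (TransactionId xid, XidStatus status);
        TransactionId (*GetNewTransactionId)  (void);
        bool          (*TransactionIdIsInProgress) (TransactionId xid);
        TransactionId (*GetOldestXmin)        (void);
        Snapshot      (*GetSnapshotData)      (Snapshot snapshot);
        bool          (*XidInMVCCSnapshot)    (TransactionId xid, Snapshot snapshot);
    } TransactionManager;

    /* Core routes the calls above through this pointer. It starts out
     * pointing at the built-in implementations; a DTM extension loaded
     * at startup can swap in its own at init time. */
    extern TransactionManager *TM;

The point of the API is that everything an extension needs to change for distributed visibility sits behind these few entry points, so pg_dtm.so can replace them without patching core.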

XTM implementations: GTM or snapshot sharing

- Acquire XIDs centrally (DTMd, arbiter);
- No local transactions possible;
- DTMd is a bottleneck.

XTM implementations: Incremental SI

- Paper from the SAP HANA team;
- A central daemon is needed, but only for multi-node transactions;
- Snapshots -> Commit Sequence Numbers (CSN);
- DTMd is still a bottleneck.

XTM implementations: Clock-SI or tsDTM

- XIDs/CSNs are gathered from all nodes that participate in the transaction (sketched below);
- No central service;
- Local transactions are possible;
- Communication can be reduced further by using physical time (Spanner, CockroachDB).
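
Here is an illustrative, self-contained sketch of that commit rule, assuming a simple two-phase exchange in which every participant reports a CSN from its local clock and the coordinator commits everywhere at the maximum; all names (node_prepare, node_commit) are made up for the example, not taken from the tsDTM code:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t csn_t;

    #define NNODES 3

    /* Simulated per-node logical clocks; in reality each node keeps its own. */
    static csn_t node_clock[NNODES] = {100, 104, 97};

    /* Phase 1 on a participant: prepare locally, report the local CSN. */
    static csn_t node_prepare(int node) { return ++node_clock[node]; }

    /* Phase 2 on a participant: commit at the agreed CSN and make sure the
     * local clock never runs behind it. */
    static void node_commit(int node, csn_t csn)
    {
        if (node_clock[node] < csn)
            node_clock[node] = csn;
        printf("node %d committed at CSN %llu\n", node, (unsigned long long) csn);
    }

    int main(void)
    {
        csn_t commit_csn = 0;

        /* Gather local CSNs from every participant of the transaction. */
        for (int i = 0; i < NNODES; i++) {
            csn_t local = node_prepare(i);
            if (local > commit_csn)
                commit_csn = local;   /* commit CSN = max over participants */
        }

        /* Everyone commits at the same maximum CSN, which keeps snapshot
         * ordering consistent across nodes with no central service. */
        for (int i = 0; i < NNODES; i++)
            node_commit(i, commit_csn);

        return 0;
    }

Using loosely synchronized physical time instead of exchanged logical clocks is how Spanner and CockroachDB reduce this communication further.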

XTM implementations: tsDTM scalability

[figure: tsDTM scalability results]

HA/autofailover

The more nodes there are, the higher the probability of a failure somewhere in the system. Possible problems with nodes:

- A node stops (and will not come back);
- A node is down for a short time (and we should bring it back into operation);
- Network partitions (avoid split-brain).

If we want to survive network partitions, then we can tolerate at most ⌈N/2⌉ - 1 failures: for example, a three-node cluster survives one failed node, a five-node cluster survives two.

HA/autofailover

Possible uses of such a system:

- Multimaster replication;
- Tables with meta-information in sharded databases;
- Sharding with redundancy.

Multimaster

By multimaster we mean a strongly coupled cluster that acts as a single database, with proper isolation and no merge conflicts. Ways to build it:

- Impose a global order on the XLOG (Postgres-R, MySQL Galera);
- Wrap each transaction as a distributed one, which allows parallelism while applying transactions.

Multimaster

Our implementation (a configuration sketch follows the list):

- Built on top of pg_logical;
- Makes use of tsDTM;
- Pool of workers for transaction replay;
- Raft-based storage for dealing with failures and for distributed deadlock detection.
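
As a rough idea of the wiring, here is a hypothetical postgresql.conf fragment for such a setup; the multimaster.* parameter names are assumptions for illustration rather than the extension's documented GUCs:

    # Hypothetical configuration sketch, not verbatim from the extension.
    shared_preload_libraries = 'multimaster'
    max_prepared_transactions = 100     # distributed commit relies on 2PC
    multimaster.node_id = 1             # this node's index within the cluster
    multimaster.conn_strings = 'host=node1 dbname=db, host=node2 dbname=db, host=node3 dbname=db'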

Multimaster

Our implementation:

- Approximately half the speed of standalone Postgres for writes;
- The same speed for reads;
- Handles node autorecovery;
- Handles network partitions (being debugged right now);
- Can work as an extension (if the community accepts the XTM API into core).