Distributed Postgres: XL, XTM, Multimaster
Stas Kelvich

Cluster group in PgPro

- Started about a year ago.
- Konstantin Knizhnik, Constantin Pan, Stas Kelvich.

Cluster group in PgPro

Started by playing with Postgres-XC. 2ndQuadrant also had a project (now finished) to port XC to 9.5.

- A fork is painful;
- How can we bring the functionality of XC into core?

Distributed postgres

- Distributed transactions: nothing in core;
- Distributed planner: FDW, pg_shard, the Greenplum planner (?);
- HA/autofailover: can be built on top of logical decoding.

Distributed transactions

The goal is proper isolation for transactions that span multiple nodes. Today, when a write transaction starts, Postgres will:

- Acquire an XID;
- Get the list of running transactions;
- Use that info in visibility checks (sketched below).
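
To make the visibility-check step concrete, here is a simplified sketch of the rule that XidInMVCCSnapshot implements, with stand-in types and no wraparound or subtransaction handling; it illustrates the idea and is not the actual Postgres code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef struct SnapshotData
    {
        TransactionId  xmin;   /* all xids < xmin were finished at snapshot time */
        TransactionId  xmax;   /* all xids >= xmax had not started yet */
        TransactionId *xip;    /* xids that were in progress at snapshot time */
        int            nxip;
    } SnapshotData;

    /* Return true if xid must be treated as still running (and therefore
     * invisible) under this snapshot. */
    static bool
    xid_in_mvcc_snapshot(TransactionId xid, const SnapshotData *snap)
    {
        if (xid >= snap->xmax)
            return true;            /* started after the snapshot was taken */
        if (xid < snap->xmin)
            return false;           /* finished before the snapshot was taken */
        for (int i = 0; i < snap->nxip; i++)
            if (snap->xip[i] == xid)
                return true;        /* was in progress at snapshot time */
        return false;
    }

The list of running transactions gathered in step two becomes the xip array here, and that is exactly the state a distributed transaction manager has to agree on across nodes.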

XTM API: vanilla

MVCC entry points in the vanilla source tree:

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot

XTM API: after patch

The same entry points, now dispatched through a TransactionManager structure:

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot

XTM API: after tm load

Once a DTM extension (pg_dtm.so) is loaded, the TransactionManager routes the same entry points to the extension's implementations (see the sketch below):

- transam/clog.c: GetTransactionStatus, SetTransactionStatus
- transam/varsup.c: GetNewTransactionId
- ipc/procarray.c: TransactionIdIsInProgress, GetOldestXmin, GetSnapshotData
- time/tqual.c: XidInMVCCSnapshot
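
A minimal sketch of what this indirection might look like, assuming a struct of function pointers named after the entry points above; the signatures are simplified stand-ins, not the actual XTM patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-ins for the Postgres types, to keep the sketch
     * self-contained. */
    typedef uint32_t TransactionId;
    typedef int      XidStatus;      /* committed / aborted / in progress */
    typedef struct SnapshotData *Snapshot;

    typedef struct TransactionManager
    {
        XidStatus     (*GetTransactionStatus) (TransactionId xid);
        void          (*SetTransactionStatus) (TransactionId xid, XidStatus status);
        TransactionId (*GetNewTransactionId)  (void);
        bool          (*TransactionIdIsInProgress) (TransactionId xid);
        TransactionId (*GetOldestXmin)        (void);
        Snapshot      (*GetSnapshotData)      (Snapshot snapshot);
        bool          (*XidInMVCCSnapshot)    (TransactionId xid, Snapshot snapshot);
    } TransactionManager;

    /* Core routes the calls above through this pointer. It starts out
     * pointing at the built-in implementations; a DTM extension loaded
     * at startup can swap in its own at init time. */
    extern TransactionManager *TM;

The point of the API is that everything an extension needs to change for distributed visibility sits behind these few entry points, so pg_dtm.so can replace them without patching core.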

XTM implementations: GTM or snapshot sharing

- Acquire XIDs centrally (DTMd, arbiter);
- No local transactions possible;
- DTMd is a bottleneck.

XTM implementations: Incremental SI

- Paper from the SAP HANA team;
- A central daemon is needed, but only for multi-node transactions;
- Snapshots -> Commit Sequence Numbers (CSN);
- DTMd is still a bottleneck.

XTM implementations: Clock-SI or tsDTM

- XIDs/CSNs are gathered from all nodes that participate in the transaction (sketched below);
- No central service;
- Local transactions are possible;
- Communication can be reduced further by using physical time (Spanner, CockroachDB).
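
Here is an illustrative, self-contained sketch of that commit rule, assuming a simple two-phase exchange in which every participant reports a CSN from its local clock and the coordinator commits everywhere at the maximum; all names (node_prepare, node_commit) are made up for the example, not taken from the tsDTM code:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t csn_t;

    #define NNODES 3

    /* Simulated per-node logical clocks; in reality each node keeps its own. */
    static csn_t node_clock[NNODES] = {100, 104, 97};

    /* Phase 1 on a participant: prepare locally, report the local CSN. */
    static csn_t node_prepare(int node) { return ++node_clock[node]; }

    /* Phase 2 on a participant: commit at the agreed CSN and make sure the
     * local clock never runs behind it. */
    static void node_commit(int node, csn_t csn)
    {
        if (node_clock[node] < csn)
            node_clock[node] = csn;
        printf("node %d committed at CSN %llu\n", node, (unsigned long long) csn);
    }

    int main(void)
    {
        csn_t commit_csn = 0;

        /* Gather local CSNs from every participant of the transaction. */
        for (int i = 0; i < NNODES; i++) {
            csn_t local = node_prepare(i);
            if (local > commit_csn)
                commit_csn = local;   /* commit CSN = max over participants */
        }

        /* Everyone commits at the same maximum CSN, which keeps snapshot
         * ordering consistent across nodes with no central service. */
        for (int i = 0; i < NNODES; i++)
            node_commit(i, commit_csn);

        return 0;
    }

Using loosely synchronized physical time instead of exchanged logical clocks is how Spanner and CockroachDB reduce this communication further.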

XTM implementations: tsDTM scalability

[figure: tsDTM scalability results]

HA/autofailover

The more nodes there are, the higher the probability of a failure somewhere in the system. Possible problems with nodes:

- A node stops (and will not come back);
- A node is down for a short time (and we should bring it back into operation);
- Network partitions (avoid split-brain).

If we want to survive network partitions, then we can tolerate at most ⌈N/2⌉ - 1 failures: for example, a three-node cluster survives one failed node, a five-node cluster survives two.

HA/autofailover

Possible uses of such a system:

- Multimaster replication;
- Tables with meta-information in sharded databases;
- Sharding with redundancy.

Multimaster

By multimaster we mean a strongly coupled cluster that acts as a single database, with proper isolation and no merge conflicts. Ways to build it:

- Impose a global order on the XLOG (Postgres-R, MySQL Galera);
- Wrap each transaction as a distributed one, which allows parallelism while applying transactions.

Multimaster

Our implementation (a configuration sketch follows the list):

- Built on top of pg_logical;
- Makes use of tsDTM;
- Pool of workers for transaction replay;
- Raft-based storage for dealing with failures and for distributed deadlock detection.
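
As a rough idea of the wiring, here is a hypothetical postgresql.conf fragment for such a setup; the multimaster.* parameter names are assumptions for illustration rather than the extension's documented GUCs:

    # Hypothetical configuration sketch, not verbatim from the extension.
    shared_preload_libraries = 'multimaster'
    max_prepared_transactions = 100     # distributed commit relies on 2PC
    multimaster.node_id = 1             # this node's index within the cluster
    multimaster.conn_strings = 'host=node1 dbname=db, host=node2 dbname=db, host=node3 dbname=db'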

Multimaster

Our implementation:

- Approximately half the speed of standalone Postgres for writes;
- The same speed for reads;
- Handles node autorecovery;
- Handles network partitions (being debugged right now);
- Can work as an extension (if the community accepts the XTM API into core).