ADRIAN DOZSA DECEMBER 5, 2019 Your account balance is eventually consistent - A Postgres Active-Active story from Banking


Page 1: Your account balance is eventually consistent - A Postgres

ADRIAN DOZSA

DECEMBER 5, 2019

Your account balance is eventually consistent - A Postgres Active-Active story from Banking

Page 2: Your account balance is eventually consistent - A Postgres

High Availability is critical for payment systems: the cost of downtime for mission-critical systems in Banking averages $10M per hour. System availability needs are expressed as “5 9’s”. In practice this requires both single-site and dual-site redundancy of database nodes, multi-master Active-Active processing across Data Centers, and “Continuous Availability” at both Data Centers even when the WAN between them fails. The overall design must strike the right balance between handling extremely write-intensive requests with high throughput and low latency, Availability under a variety of failure modes, and enforcement of both immediate and eventual consistency of the “bank account balance”.

This talk presents our journey to migrate a mission-critical payment system to Postgres in a multi-master Active-Active topology, using Postgres-BDR, in a manner that addresses these challenges. It details the selection of replication modes, the transaction recovery and duplicate transaction avoidance mechanisms used during node failover, and appropriate patterns of use for replication conflict detection-and-resolution mechanisms (particularly Conflict-free Replicated Data Types).

In conclusion we’ll see how Postgres-BDR can satisfy the demanding requirements of a mission-critical banking application.

This is a hidden slide

Abstract

Page 3: Your account balance is eventually consistent - A Postgres

• Payments Systems
• Where we started from
• Single Site
  o High Availability
  o Consistency
• Dual Site
  o High Availability
  o Consistency
• Conclusion

Agenda

Page 4: Your account balance is eventually consistent - A Postgres

• Over a decade of experience building payments systems
• Focus on Non-Functional Requirements (aka -ilities)
• Application Engineer/Architect, not a DBA
• ACI Worldwide

About me

Page 5: Your account balance is eventually consistent - A Postgres

About ACI Worldwide

Customer Segments: Banks, Financial Intermediaries, Merchants, Corporates

Deployment Models: Platform (ACI Data Center), Licensed

Solution Areas: Retail Payments, Real-Time Payments, Merchant Payments, Bill Payments, Digital Channels, Payments Intelligence

Customers by region: Americas 4,600+, EMEA 400+, Asia/Pacific 200+

45 years in payments

5,300+ organizations around the world use ACI solutions

Page 6: Your account balance is eventually consistent - A Postgres

Payments Systems

Page 7: Your account balance is eventually consistent - A Postgres

A few defining characteristics

o Very high cost of failure ($9.3M per hour as per ITIC 2018)

o Very High-Availability needs – 5 9’s (over 86% need 5 9’s as per ITIC 2018)

o No single point of failure

o Strict consistency requirements

o Low latency (tens of ms business transaction)

o High throughput (thousands of business transactions per sec)

o Very high write ratios (up to 100%)

Payments Systems
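As a rough worked example of what these figures imply, the arithmetic can be sketched as below. This is only an illustration: it assumes a 365.25-day year and treats the ITIC cost figure as a flat hourly rate, which is a simplification, and the function names are made up for this sketch.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def downtime_cost_per_year(nines: int, cost_per_hour: float) -> float:
    """Cost of the allowed downtime, assuming a flat hourly cost of failure."""
    return downtime_minutes_per_year(nines) / 60.0 * cost_per_hour


if __name__ == "__main__":
    for n in (3, 4, 5):
        minutes = downtime_minutes_per_year(n)
        cost = downtime_cost_per_year(n, 9.3e6)  # $9.3M per hour, from the slide
        print(f"{n} nines: {minutes:.2f} min/year, about ${cost:,.0f}/year")
```

Under these assumptions, 5 9's leaves roughly 5.26 minutes of downtime per year.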

Page 8: Your account balance is eventually consistent - A Postgres

Not all payments are equal

Low value payments                         | High value payments
Retail payments (aka cards)                | Realtime and Wholesale payments
Your grocery store purchase                | Your grocery store paying their supplier
Payment loss or uncertainty is tolerated   | Payment loss or uncertainty is not tolerated
  on rare occasions                        |
Two message payment:                       | Single message payment:
  Authorization now + Settlement later     |   Authorization and Settlement in one
“Second chance” to fix it                  | No “second chances”

Page 9: Your account balance is eventually consistent - A Postgres

Where we started from

Page 10: Your account balance is eventually consistent - A Postgres

Where we started from

[Diagram: Site A and Site B each run two Application instances on a RAC Cluster. The two RAC Clusters are linked by GoldenGate (async).]

• Oracle deployment
• RAC for single site availability
  o No single point of failure
• GoldenGate for dual-site Active-Active availability
  o Advanced conflict detection and resolution logic
  o Not all data is replicated

• 5 9’s Availability

Page 11: Your account balance is eventually consistent - A Postgres

Single Site Availability

Page 12: Your account balance is eventually consistent - A Postgres

[Diagram: applications connect through HAProxy to a replica pair (Active, Passive). On failure the Passive node must be promoted: temporary outage.]

Page 13: Your account balance is eventually consistent - A Postgres

[Diagram: applications connect through HAProxy to an Active-Active replica pair replicated with BDR. The second node is already active: no downtime.]

Page 14: Your account balance is eventually consistent - A Postgres

• What about in-flight transactions?
• Retry on second node
• No failed transactions
• Failure transparent to business logic

Transparent failover

Page 15: Your account balance is eventually consistent - A Postgres

• Availability ✓

• Consistency ?

Single site

Page 16: Your account balance is eventually consistent - A Postgres

• Oracle RAC

o Shared disks

o Shared memory

o Shared locks (coordination)

• Implications

o Nodes never diverge (not even temporarily)

• Postgres BDR

o Separate disks

o Separate memory

o Independent transactions

o Nodes diverge (for a split second, or during failover)

Shared-everything vs Shared-nothing

Page 17: Your account balance is eventually consistent - A Postgres

Possible consistency failures (e.g. duplicates):

1. during failover and recovery

2. during switch-over

3. during failback

4. use of remote_write

Consistency failures

Page 18: Your account balance is eventually consistent - A Postgres

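Problem 1: failover and recovery

[Diagram: applications connect through HAProxy to a BDR node pair with sync replication. A -$100 debit reaches both nodes; after failover the same debit is applied again and both nodes show -$200: duplicate transaction.]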

Page 19: Your account balance is eventually consistent - A Postgres

• Commit uncertainty (à la Schrödinger's cat)

• No way to find what happened

• Risk of duplicates on recovery

The problem

Page 20: Your account balance is eventually consistent - A Postgres

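Solution: CAMO

[Diagram: applications connect through HAProxy to a BDR node pair with sync replication, committing remote first. The -$100 debit is present on both nodes. After the failure the application asks "Did it commit?":]

App: Do you have the txn?
DB: Yes!
App: Ok. Nothing to do.

No duplicates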

Page 21: Your account balance is eventually consistent - A Postgres

• Commit at Most Once

• Remote first

• Allows safe application retries

• Removes commit uncertainty

• Protects against duplicates

• Needs application involvement

CAMO
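The application-side pattern behind these bullets can be sketched with in-memory stand-ins. This is not the Postgres-BDR API: the `Node` class, `has_txn`, and the transaction id `txn-1` are hypothetical names, used only to illustrate why asking the partner node before retrying avoids a duplicate.

```python
class Node:
    """In-memory stand-in for a database node (hypothetical, not the BDR API)."""

    def __init__(self):
        self.balance = 0
        self.committed = set()  # ids of transactions this node has committed

    def apply(self, txn_id, delta):
        self.balance += delta
        self.committed.add(txn_id)

    def has_txn(self, txn_id):
        return txn_id in self.committed


def debit_with_lost_ack(origin, partner, txn_id, delta):
    """Remote first: the partner commits, then the origin, then the ack is lost."""
    partner.apply(txn_id, delta)
    origin.apply(txn_id, delta)
    raise ConnectionError("origin failed before acknowledging the commit")


def blind_retry(partner, txn_id, delta):
    """Retry without knowing the outcome: risks applying the debit twice."""
    partner.apply(txn_id, delta)


def checked_retry(partner, txn_id, delta):
    """Ask the partner first; only re-run the transaction if it is missing."""
    if not partner.has_txn(txn_id):
        partner.apply(txn_id, delta)


def run(retry):
    origin, partner = Node(), Node()
    try:
        debit_with_lost_ack(origin, partner, "txn-1", -100)
    except ConnectionError:
        retry(partner, "txn-1", -100)  # application fails over to the partner
    return partner.balance


print(run(blind_retry))    # -200: duplicate debit
print(run(checked_retry))  # -100: the debit is applied once
```

The check only works because the application keeps the transaction identifier across the failure, which is the "application involvement" the slide mentions.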

Page 22: Your account balance is eventually consistent - A Postgres

CAMO Performance

[Chart: CAMO vs sync - 8k inserts. Latency [ms] (axis 0 to 60) against TPS (200 to 12,800). Series: CAMO, CAMO remote_write, Sync remote_write.]

[Chart: CAMO vs sync - 32k inserts. Latency [ms] (logarithmic axis, 1 to 1000) against TPS (200 to 4,000). Series: CAMO, CAMO remote_write, Sync remote_write.]

Page 23: Your account balance is eventually consistent - A Postgres

Problem 2: dual node connections

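[Diagram: applications connect through HAProxy to a BDR node pair. During a switch, idle connections move to the 2nd node, so connections exist on both nodes and key1 can be inserted on each: duplicate.]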

Page 24: Your account balance is eventually consistent - A Postgres

Solution: always connect to one node only

[Diagram: applications connect through HAProxy to a BDR node pair. On a switch, all sessions are killed and the killed sessions are replayed, so key1 is inserted once: no duplicates.]

Page 25: Your account balance is eventually consistent - A Postgres

Problem 3: failback

[Diagram: applications connect through HAProxy to a BDR node pair. Labels: failback, not in-sync, replication lag, no CAMO. key1 ends up inserted on both nodes: duplicate.]

Solution: site failover

Page 26: Your account balance is eventually consistent - A Postgres

Problem 4: remote_write

[Diagram: applications connect through HAProxy to a BDR node pair replicating with remote_write. key1 is inserted on one node and then again on the other: duplicate.]

Page 27: Your account balance is eventually consistent - A Postgres

Solution: wait to apply

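[Diagram: applications connect through HAProxy to a BDR node pair replicating with remote_write. The second node waits to apply all received changes, including key1, before the second key1 insert is handled: no duplicates.]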
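The difference between the problem and the solution can be illustrated with a toy model. It assumes that a change acknowledged under remote_write has been received by the second node but is not yet visible to that node's unique-key check; the `Node` class and its methods are hypothetical and do not correspond to any Postgres or BDR interface.

```python
class Node:
    """Toy node: 'rows' are applied keys, 'pending' are received but unapplied keys."""

    def __init__(self):
        self.rows = set()
        self.pending = []

    def receive(self, key):
        """Change arrives from the peer and is acknowledged, but not applied yet."""
        self.pending.append(key)

    def apply_pending(self):
        """Apply everything received so far; return keys that now collide."""
        conflicts = [key for key in self.pending if key in self.rows]
        self.rows.update(self.pending)
        self.pending.clear()
        return conflicts

    def insert(self, key):
        """Local insert with a unique-key check against applied rows only."""
        if key in self.rows:
            return False  # rejected as a duplicate
        self.rows.add(key)
        return True


# Problem: the application switches nodes before key1 has been applied.
node = Node()
node.receive("key1")
accepted_without_wait = node.insert("key1")   # check cannot see key1 yet
late_conflicts = node.apply_pending()         # the collision surfaces too late

# Solution: wait until all received changes are applied, then accept writes.
node = Node()
node.receive("key1")
node.apply_pending()
accepted_after_wait = node.insert("key1")     # duplicate is rejected up front

print(accepted_without_wait, late_conflicts, accepted_after_wait)
```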

Page 28: Your account balance is eventually consistent - A Postgres

Single Site Availability

• No single point of failure

• Transparent failover

• No lost transactions

• No duplicates (or other constraint violations)

• Postgres-BDR can successfully replace Oracle RAC

Single Site

Page 29: Your account balance is eventually consistent - A Postgres

Dual Site Availability

Page 30: Your account balance is eventually consistent - A Postgres

• Need Disaster Recovery → Dual site

• Proven and fast failover → Active site

• Latency → Asynchronous replication

→ Dual site Active-Active Asynchronous replication

Dual site

Page 31: Your account balance is eventually consistent - A Postgres

Dual site topology

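[Diagram: one BDR Group spanning Site A and Site B. Within each site, applications connect through HAProxy to a node pair replicating with Sync (CAMO). Replication between the two sites is Async.]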

Page 32: Your account balance is eventually consistent - A Postgres

• CAP Theorem → Availability vs Consistency

• High-Availability → Eventual consistency

• Conflict detection and resolution (timestamp based, CRDTs)

Active-Active

Page 33: Your account balance is eventually consistent - A Postgres

Timestamp based CDR

Time | Site A                  | Time | Site B
T1   | John Doe, status A, …   | T2   | John Doe, status B, …
T3   | John Doe, status B, …   | T3   | John Doe, status B, …

Resolution on both sites: T1 < T2

Convergence

Page 34: Your account balance is eventually consistent - A Postgres

Disjoint updates

Time | Site A                              | Time | Site B
T1   | John Doe, new status, old address   | T2   | John Doe, old status, new address
T3   | John Doe, old status, new address   | T3   | John Doe, old status, new address

Resolution on both sites: T1 < T2

Discarded update

Page 35: Your account balance is eventually consistent - A Postgres

Column level conflict resolution

Time | Site A                              | Time | Site B
T1   | John Doe, new status, old address   | T2   | John Doe, old status, new address
T3   | John Doe, new status, new address   | T3   | John Doe, new status, new address

Resolution on both sites: apply column update

Both updates retained
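The two outcomes for disjoint updates (one update discarded under row-level timestamps, both retained under column-level resolution) can be reproduced with a small sketch. The data layout is an assumption made for illustration: one timestamp per row in the first case, one `(value, timestamp)` pair per column in the second, with T0 = 0, T1 = 1 and T2 = 2.

```python
def resolve_row_level(row_a, row_b):
    """Row-level last-update-wins: the row with the later timestamp replaces the other."""
    return row_a if row_a["ts"] > row_b["ts"] else row_b


def resolve_column_level(row_a, row_b):
    """Column-level resolution: each column is a (value, timestamp) pair, resolved on its own."""
    return {
        col: row_a[col] if row_a[col][1] > row_b[col][1] else row_b[col]
        for col in row_a
    }


# Row-level view: Site A changed the status at T1, Site B changed the address at T2.
site_a_row = {"status": "new", "address": "old", "ts": 1}
site_b_row = {"status": "old", "address": "new", "ts": 2}
row_result = resolve_row_level(site_a_row, site_b_row)

# Column-level view: untouched columns keep their T0 timestamp.
site_a_cols = {"status": ("new", 1), "address": ("old", 0)}
site_b_cols = {"status": ("old", 0), "address": ("new", 2)}
col_result = resolve_column_level(site_a_cols, site_b_cols)

print(row_result["status"], row_result["address"])        # old new
print(col_result["status"][0], col_result["address"][0])  # new new
```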

Page 36: Your account balance is eventually consistent - A Postgres

Amounts

Time | Site A                | Time | Site B
T0   | John Doe, $1000, …    | T0   | John Doe, $1000, …
     | local update: +$200   |      | local update: -$100
T1   | John Doe, $1200, …    | T2   | John Doe, $900, …
T3   | John Doe, $900, …     | T3   | John Doe, $900, …

Resolution on both sites: T1 < T2

Convergence, but wrong

Where’s my money?

Page 37: Your account balance is eventually consistent - A Postgres

Amounts with CRDTs

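Time | Site A                | Time | Site B
T0   | John Doe, $1000, …    | T0   | John Doe, $1000, …
     | local update: +$200   |      | local update: -$100
T1   | John Doe, $1200, …    | T2   | John Doe, $900, …
     | remote delta: -$100   |      | remote delta: +$200
T3   | John Doe, $1100, …    | T3   | John Doe, $1100, …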

Convergence, and correct
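The contrast between timestamp resolution and a counter-style CRDT on amounts can be checked with a few lines. This is a toy model, not the BDR CRDT implementation: `DeltaBalance` and its methods are hypothetical names, and replication is reduced to handing each site the other site's list of deltas.

```python
def last_update_wins(value_a, ts_a, value_b, ts_b):
    """Timestamp-based resolution on absolute values."""
    return value_a if ts_a > ts_b else value_b


class DeltaBalance:
    """Toy counter-style CRDT: sites exchange deltas, not absolute values."""

    def __init__(self, value):
        self.value = value
        self.outbox = []  # local deltas to be shipped to the peer

    def add(self, delta):
        self.value += delta
        self.outbox.append(delta)

    def sync_from(self, peer_deltas):
        for delta in peer_deltas:
            self.value += delta


# Timestamp resolution: $1200 written at T1, $900 written at T2.
lww_result = last_update_wins(1200, 1, 900, 2)

# Delta-based resolution: both sites start from $1000.
site_a, site_b = DeltaBalance(1000), DeltaBalance(1000)
site_a.add(+200)
site_b.add(-100)
a_deltas, b_deltas = list(site_a.outbox), list(site_b.outbox)
site_a.sync_from(b_deltas)
site_b.sync_from(a_deltas)

print(lww_result)                  # 900: the +$200 is lost
print(site_a.value, site_b.value)  # 1100 1100
```

Because addition is commutative, the order in which the deltas arrive does not change the final balance.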

Page 38: Your account balance is eventually consistent - A Postgres

Balances/Limits

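Time | Site A                | Time | Site B
T0   | John Doe, $200, …     | T0   | John Doe, $200, …
     | local update: -$100   |      | local update: -$200
T1   | John Doe, $100, …     | T2   | John Doe, $0, …
     | remote delta: -$200   |      | remote delta: -$100
T3   | John Doe, -$100, …    | T3   | John Doe, -$100, …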

Convergence, and correct

Compromise
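Why this is a compromise can be seen by replaying the example with a purely local limit check. The sketch assumes each site approves a debit against its own view of the balance and replicates only the delta; `try_debit` is a hypothetical helper, not part of any product.

```python
def try_debit(local_balance, amount):
    """Local check only: approve if this site's view of the balance covers the amount."""
    if amount > local_balance:
        return None   # declined
    return -amount    # approved: the delta that will be replicated


start = 200

# Both sites check against the same T0 balance before replication catches up.
delta_a = try_debit(start, 100)   # Site A approves -$100
delta_b = try_debit(start, 200)   # Site B approves -$200

site_a = start + delta_a          # $100 at T1
site_b = start + delta_b          # $0 at T2

# Asynchronous replication then delivers each site's delta to the other.
site_a += delta_b
site_b += delta_a

print(site_a, site_b)  # -100 -100
```

Both sites converge on the arithmetically correct value, but the balance has gone below zero even though each local check passed.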

Page 39: Your account balance is eventually consistent - A Postgres

• We need strict global constraints

o E.g. bank liquidity

• Hard problem… for some other time… ☺

Strict limits

Page 40: Your account balance is eventually consistent - A Postgres

• Dual site multi-master Active-Active Postgres deployment using BDR

• Fully redundant and fully consistent single site

• Strong eventual consistency across sites

• 5 9’s availability

• Postgres-BDR can successfully replace Oracle RAC + GoldenGate

Conclusion

Page 41: Your account balance is eventually consistent - A Postgres

QUESTIONS?

ADRIAN DOZSA

[email protected]

linkedin.com/in/dozsa

THANK YOU

Takeaway

You can bank on Postgres ☺