47
Troubleshooting streaming replication Alexey Lesovsky [email protected]

Troubleshooting PostgreSQL Streaming Replication

Embed Size (px)

Citation preview

Page 1: Troubleshooting PostgreSQL Streaming Replication

Troubleshootingstreaming replication

Alexey Lesovsky

[email protected]

Page 2: Troubleshooting PostgreSQL Streaming Replication

dataegret.com

Quick introduction• WAL and replication internals.• Replication setup.

Troubleshooting tools• 3rd party tools.• Builtin tools.

Troubleshooting cases• Symptoms and problems.• Detection and solutions.• Lessons learned.

02

03

01

Page 3: Troubleshooting PostgreSQL Streaming Replication

Quick introduction

01

Page 4: Troubleshooting PostgreSQL Streaming Replication

Goals01

dataegret.com

Better understanding of streaming replication.

How to quickly find and fix problems.

https://goo.gl/Mm3ugt

Page 5: Troubleshooting PostgreSQL Streaming Replication

Agenda01

dataegret.com

Write-Ahead Log.

Streaming replication internals.

Streaming replication setup.

Troubleshooting tools overview (3-rd party).

Troubleshooting tools overview (builtin).

Troubleshooting in practice.

Questions.

Page 6: Troubleshooting PostgreSQL Streaming Replication

Write Ahead Log01

dataegret.com

Durability in ACID.

Almost all changes fixed in WAL.

pg_xlog/ (pg_wal/) directory in the DATADIR.

Synchronous WAL write by backends.

Asynchronous WAL write by WAL writer.

Recovery process relies on WAL.

Page 7: Troubleshooting PostgreSQL Streaming Replication

Streaming replication internals01

dataegret.com

WAL Sender process.

WAL Receiver process.

Startup process (recovery).

Streaming replication vs. WAL archiving.

Page 8: Troubleshooting PostgreSQL Streaming Replication

Streaming replication internals01

dataegret.com

WALBuffers Storage

WALSender

Network

StartupProcessStorage

WALReceiver

Page 9: Troubleshooting PostgreSQL Streaming Replication

Streaming replication setup01

dataegret.com

Master:

● postgresql.conf;

● Restart.

Standby:

● pg_basebackup;

● postgresql.conf;

● recovery.conf setup.

Page 10: Troubleshooting PostgreSQL Streaming Replication

Master setup01

dataegret.com

wal_level, max_wal_senders, max_replication_slots.

archive_mode, archive_command.

wal_keep_segments.

wal_sender_timeout.

synchronous_standby_names.

Page 11: Troubleshooting PostgreSQL Streaming Replication

Master setup pitfalls01

dataegret.com

wal_level, archive_mode, max_wal_senders, max_replication_slots – require restart.

wal_keep_segments – requires extra storage space.

wal_sender_timeout – reduce that, if network is bad.

synchronous_standby_names – master freezes if standby fails.

Page 12: Troubleshooting PostgreSQL Streaming Replication

Standby setup01

dataegret.com

hot_standby.

max_standby_streaming_delay, max_standby_archiving_delay.

hot_standby_feedback.

wal_receiver_timeout.

Page 13: Troubleshooting PostgreSQL Streaming Replication

Standby setup pitfalls01

dataegret.com

hot_standby – enables SELECT queries.

max_standby_streaming_delay – increases max possible lag.

hot_standby_feedback:

● postpones vacuum;

● potential tables/indexes bloat.

wal_receiver_timeout – reduce that, if network is bad.

Page 14: Troubleshooting PostgreSQL Streaming Replication

Recovery.conf01

dataegret.com

primary_conninfo and/or restore_command.

standby_mode.

trigger_file.

Recovery target:

● immediate;

● particular point/xid/timestamp;

● recovery_min_apply_delay.

Page 15: Troubleshooting PostgreSQL Streaming Replication

Recovery.conf pitfalls01

dataegret.com

Any changes require restart.

Page 16: Troubleshooting PostgreSQL Streaming Replication

Troubleshooting tools

02

Page 17: Troubleshooting PostgreSQL Streaming Replication

3rd party troubleshooting tools02

dataegret.com

Top (procps).

Iostat (sysstat), iotop.

Nicstat.

pgCenter.

Perf.

Page 18: Troubleshooting PostgreSQL Streaming Replication

3rd party troubleshooting tools02

dataegret.com

Top (procps) – CPU usage, load average, mem/swap usage.

Iostat (sysstat), iotop – storage utilization, process IO.

Nicstat – network interfaces utilization.

pgCenter – replication stats.

Perf – deep investigations.

Page 19: Troubleshooting PostgreSQL Streaming Replication

Builtin troubleshooting tools02

dataegret.com

Statistics views.

Auxiliary functions.

pg_xlogdump utility.

Page 20: Troubleshooting PostgreSQL Streaming Replication

Builtin troubleshooting tools02

dataegret.com

Statistics view:

● pg_stat_replication;

● pg_stat_databases, pg_stat_databases_conflicts;

● pg_stat_activity;

● pg_stat_archiver.

Page 21: Troubleshooting PostgreSQL Streaming Replication

Builtin troubleshooting tools02

dataegret.com

Auxiliary functions:

● pg_current_xlog_location, pg_last_xlog_receive_location;

● pg_xlog_location_dif;

● pg_xlog_replay_pause, pg_xlog_replay_resume;

● pg_is_xlog_replay_paused.

Page 22: Troubleshooting PostgreSQL Streaming Replication

Builtin troubleshooting tools02

dataegret.com

pg_xlogdump:

● Decodes and displays XLOG for debugging;

● Can give wrong results when the server is running.

pg_xlogdump -f -p /xlog_96 \

$(psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())")

Page 23: Troubleshooting PostgreSQL Streaming Replication

Troubleshooting cases

03

Page 24: Troubleshooting PostgreSQL Streaming Replication

Troubleshooting cases03

dataegret.com

Replication lag.

pg_xlog/ bloat.

Long transactions and recovery conflicts.

Recovery process: 100% CPU usage.

Page 25: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

Main symptom – answers differ between master and standbys.

Detection:

● pg_stat_replication and pg_xlog_location_dif();

● pg_last_xact_replay_timestamp().

Page 26: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

# \d pg_stat_replication View "pg_catalog.pg_stat_replication" Column | Type | Modifiers ------------------+--------------------------+----------- pid | integer | usesysid | oid | usename | name | application_name | text | client_addr | inet | client_hostname | text | client_port | integer | backend_start | timestamp with time zone | backend_xmin | xid | state | text | sent_location | pg_lsn | write_location | pg_lsn | flush_location | pg_lsn | replay_location | pg_lsn | sync_priority | integer | sync_state | text |

Page 27: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

# SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lagFROM pg_stat_replication;

сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638

Page 28: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

# SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lagFROM pg_stat_replication;

сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638

Page 29: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

# SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lagFROM pg_stat_replication;

сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638

Page 30: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

# SELECT client_addr AS client, usename AS user, application_name AS name, state, sync_state AS mode, (pg_xlog_location_diff(pg_current_xlog_location(),sent_location) / 1024)::int as pending, (pg_xlog_location_diff(sent_location,write_location) / 1024)::int as write, (pg_xlog_location_diff(write_location,flush_location) / 1024)::int as flush, (pg_xlog_location_diff(flush_location,replay_location) / 1024)::int as replay, (pg_xlog_location_diff(pg_current_xlog_location(),replay_location))::int / 1024 as total_lagFROM pg_stat_replication;

сlient | user | name | state | mode | pending | write | flush | replay | total_lag ----------+--------+-------------+-----------+-------+---------+-------+-------+--------+----------- 10.6.6.9 | repmgr | walreceiver | streaming | async | 0 | 0 | 0 | 410480 | 410480 10.6.6.7 | repmgr | walreceiver | streaming | async | 0 | 2845 | 95628 | 112552 | 211025 10.6.6.6 | repmgr | walreceiver | streaming | async | 0 | 0 | 3056 | 9496 | 12552 10.6.6.8 | repmgr | walreceiver | streaming | async | 847582 | 0 | 0 | 3056 | 850638

Page 31: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

Network problems – nicstat.

Storage problems – iostat, iotop.

Recovery stucks – top, pg_stat_activity.

WAL pressure:

● pg_stat_activity, pg_stat_progress_vacuum;

● pg_xlog_location_dif().

Page 32: Troubleshooting PostgreSQL Streaming Replication

Replication lag03

dataegret.com

Network/storage problems:

● check workload;

● upgrade hardware.

Recovery stucks – wait or cancel queries on standby.

WAL pressure:

● Reduce amount of work;

● Reduce amount of WAL:

● full_page_writes = of, wal_compression = on, wal_log_hints = of;

● expand interval between checkpoints.

Page 33: Troubleshooting PostgreSQL Streaming Replication

pg_xlog/ bloat03

dataegret.com

Main symptoms:

● unexpected increase in the usage of the disk space;

● abnormal size of pg_xlog/ directory.

Page 34: Troubleshooting PostgreSQL Streaming Replication

pg_xlog/ bloat03

dataegret.com

Detection:

● du -csh;

● pg_replication_slots, pg_stat_archiver;

● errors in postgres logs.

Page 35: Troubleshooting PostgreSQL Streaming Replication

pg_xlog/ bloat03

dataegret.com

Problems:

● Massive CRUD.

● Unused slot.

● Broken archive_command.

Page 36: Troubleshooting PostgreSQL Streaming Replication

pg_xlog/ bloat03

dataegret.com

Solutions:

● check replication lag;

● reduce checkpoints_segments/max_wal_size, wal_keep_segments;

● change reserved space ratio (ext filesystems);

● add an extra space (LVM, ZFS, etc);

● drop unused slot or fix slot consumer;

● fix WAL archiving;

● checkpoint, checkpoint, chekpoint.

Page 37: Troubleshooting PostgreSQL Streaming Replication

Recovery conflicts03

dataegret.com

Main symptoms – errors in postgresql or application logs.

postgres.c:errdetail_recovery_conflict():● User was holding shared bufer pin for too long.

● User was holding a relation lock for too long.

● User was or might have been using tablespace that must be dropped.

● User query might have needed to see row versions that must be removed.

● User transaction caused bufer deadlock with recovery.

● User was connected to a database that must be dropped.

Page 38: Troubleshooting PostgreSQL Streaming Replication

Recovery conflicts03

dataegret.com

Detection:

● pg_stat_databases + pg_stat_databases_conflicts;

● postgresql logs.

Page 39: Troubleshooting PostgreSQL Streaming Replication

Recovery conflicts03

dataegret.com

Problems:

● queries are cancelled too often;

● long transactions on a standby – check pg_stat_activity;

● huge apply lag – check pg_stat_replication.

Page 40: Troubleshooting PostgreSQL Streaming Replication

Recovery conflicts03

dataegret.com

Solutions:

● increase streaming delay (potentially causes lag);

● enable hot_standby_feedback (potentially causes bloat);

● rewrite queries;

● setup dedicated standby for long queries.

Page 41: Troubleshooting PostgreSQL Streaming Replication

Recovery 100% CPU usage03

dataegret.com

Main symptoms:

● huge apply lag;

● 100% CPU usage by recovery process.

Page 42: Troubleshooting PostgreSQL Streaming Replication

Recovery 100% CPU usage03

dataegret.com

Detection:

● top – CPU usage;

● pg_stat_replication – amount of lag.

Page 43: Troubleshooting PostgreSQL Streaming Replication

Recovery 100% CPU usage03

dataegret.com

Investigation:

● perf top/record/report (required debug symbols);

● pg_xlogdump.

Page 44: Troubleshooting PostgreSQL Streaming Replication

Recovery 100% CPU usage03

dataegret.com

Solutions:

● depend on investigation' results;

● change problematic workload (if found).

Page 45: Troubleshooting PostgreSQL Streaming Replication

Lessons learned03

dataegret.com

Streaming replication problems are always distributed.

There are many sources of problems:

● system resources, app/queries, workload.

Always use monitoring.

Learn how to use builtin tools.

Page 46: Troubleshooting PostgreSQL Streaming Replication

Links

dataegret.com

PostgreSQL official documentation – The Statistics Collectorhttps://www.postgresql.org/docs/current/static/monitoring-stats.html

PostgreSQL Mailing Lists (general, performance, hackers)https://www.postgresql.org/list/

PostgreSQL-Consulting company bloghttp://blog.postgresql-consulting.com/

Page 47: Troubleshooting PostgreSQL Streaming Replication

Thanks for watching!

dataegret.com [email protected]