Managing replication lag

From: Jonathan Schaeffer <jonathan(dot)schaeffer(at)univ-grenoble-alpes(dot)fr>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Managing replication lag
Date: 2024-04-05 08:07:35
Message-ID: 5c154aaf-7947-4852-a1fe-f50292a755c4@univ-grenoble-alpes.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hello,

I have a postgresql instance in primary/standby mode whith pgbackrest
asynchronous WAL shipping in between.

Last week I did an upgrade do postgresql-16 in debian 12.

Since then, the standby server has problems catching up with the
primary. The lag is increasing.

I published the lag graph here: https://imgur.com/a/csyBWFP

I tried to search following leads without success:
- large transactions : I identified 2 last longing transactions, killed
them, no success
- network issue: both servers are close to each other in the network
topology, no lag detected there
- hardware configuration delta: both servers store they WAL on SSD.
Standy server has a bit less RAM (312G vs 440G) and less CPU (20 vs 40),
but there is no load nor high write rate in my monitoring system.

In the past I already experienced this issue on the same instance
(postgresql 14) which went away by itself after some weeks.

Can you give me some advices on what else to check ?

Here is the postgresql configuration of the standby:

```
restore_command = 'pgbackrest --stanza=production archive-get %f "%p"'
listen_addresses = '*'
max_connections = 200
tcp_keepalives_idle = 300
shared_buffers = 64GB
temp_buffers = 16MB
work_mem = 32MB
maintenance_work_mem = 4GB
dynamic_shared_memory_type = posix
checkpoint_timeout = 900
max_wal_size = 6GB
min_wal_size = 256MB
checkpoint_completion_target = 0.9
max_wal_senders = 10
wal_keep_size = 128
max_replication_slots = 10
random_page_cost = 1.2
effective_cache_size = 48GB
log_destination = 'syslog'
logging_collector = off
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
syslog_sequence_numbers = on
syslog_split_messages = on
log_min_duration_statement = 1000
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_error_verbosity = default
log_line_prefix = 'db=%d,user=%u,app=%a,client=%h '
log_timezone = 'Europe/Paris'
track_activities = on
track_counts = on
track_functions = pl
autovacuum = on
datestyle = 'iso, mdy'
timezone = 'UTC'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
shared_preload_libraries = 'timescaledb, pg_stat_statements'
archive_mode = on
archive_command = 'pgbackrest --stanza=production archive-push %p'
archive_timeout = 30
hot_standby = on
```

Kind regards,
--
Jonathan Schaeffer
Observatoire des Sciences de l'Univers de Grenoble
Responsable technique Résif-DC
🏢 Isterre, bureau 035, 1381 rue de la Piscine 38610 GIERES
✆ +33 4 76 63 51 37

Browse pgsql-admin by date

  From Date Subject
Next Message Daulat 2024-04-05 09:54:51 pgBadger with Pgaudit
Previous Message jaya kumar 2024-04-05 06:53:46 Table DML status check per day