From: Will Roper <will(dot)roper(at)democracyclub(dot)org(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Logical replication fails when adding multiple replicas
Date: 2023-03-20 12:46:19
Message-ID: CA+xc_dvgRA2tA82_=Ek2aM+A7q3d0c+xfki-028HsOVn4ec8jg@mail.gmail.com
Lists: pgsql-general
Hello list,
We’re having some issues with PostgreSQL’s logical replication, specifically
when trying to add several replicas at once. We can add replicas one at a
time, but when we try to add two or more together, some of the table
subscriptions (as described in pg_subscription_rel) fail to reach
‘STATE_READY’ (‘r’).
Here’s a bit more detail.
How the problem manifests/what we observe
When we try to add several replicas at the same time, most of them have at
least some tables/subscriptions (in pg_subscription_rel) that get stuck in
STATE_FINISHEDCOPY and never reach STATE_READY. The tables have the right
number of rows, but the logs show Postgres is in a constant cycle of errors
like:
2023-03-18 16:00:19.974 UTC [4207] LOG: logical replication table
synchronization worker for subscription
"polling_stations_01c7ee00f17021f94", table "uk_geo_utils_onspd" has started
2023-03-18 16:00:19.999 UTC [4207] ERROR: could not start WAL streaming:
ERROR: replication slot "pg_37982_sync_37918_7210774007126708177" does not
exist
This is more likely to happen with larger tables, but can happen with any of
the tables we’re replicating.
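To see which tables are stuck, we run something like the following on the
subscriber (a minimal sketch of the catalog query; the state codes are the
ones documented for pg_subscription_rel):

```sql
-- Run on the subscriber: list relations that have not reached STATE_READY.
-- srsubstate codes: i = initialize, d = data copy, f = finished copy,
--                   s = synchronized, r = ready.
SELECT s.subname, c.relname, sr.srsubstate
FROM pg_subscription_rel sr
JOIN pg_subscription s ON s.oid = sr.srsubid
JOIN pg_class c ON c.oid = sr.srrelid
WHERE sr.srsubstate <> 'r';
```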
An example of the logs we’re seeing on the server at the same time is:
2023-03-18 16:00:20
UTC:172.31.14.73(49884):postgres(at)polling_stations:[5952]:LOG:
00000: received replication command: START_REPLICATION SLOT
"pg_37982_sync_37702_7210774007126708177" LOGICAL 187/E00020E8
(proto_version '2', streaming 'on', publication_names '"alltables"', binary
'true')
2023-03-18 16:00:20
UTC:172.31.14.73(49884):postgres(at)polling_stations:[5952]:LOCATION:
exec_replication_command, walsender.c:1683
2023-03-18 16:00:20
UTC:172.31.14.73(49884):postgres(at)polling_stations:[5952]:STATEMENT:
START_REPLICATION SLOT "pg_37982_sync_37702_7210774007126708177" LOGICAL
187/E00020E8 (proto_version '2', streaming 'on', publication_names
'"alltables"', binary 'true')
2023-03-18 16:00:20
UTC:172.31.14.73(49884):postgres(at)polling_stations:[5952]:ERROR:
42704: replication slot "pg_37982_sync_37702_7210774007126708177" does not
exist
2023-03-18 16:00:20
UTC:172.31.14.73(49884):postgres(at)polling_stations:[5952]:LOCATION:
ReplicationSlotAcquire, slot.c:402
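While the copy is running we can also inspect the per-table sync slots on the
publisher; a sketch of what we check (run against RDS, slot-name pattern taken
from the errors above):

```sql
-- Run on the publisher: the pg_%_sync_% slots are the temporary slots
-- created by the table synchronization workers during the initial copy.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name LIKE 'pg\_%\_sync\_%';
```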
When we add a single replica at a time, the replica starts up without any
problems.
What we’ve tried
Essentially we have tried fiddling with lots of different conf settings. I’ve
included these below, showing each setting, the value we’re using, and a
brief sentence about why.
What we think is going on
We don’t know - hence the question :). However, since it only happens with
concurrent replication, we assume it’s to do with how we have configured the
primary, or with how Postgres handles logical replication, rather than an
issue with how we’re configuring the subscriber.
Some general context and observations
- Postgres 14.7 on both server (RDS) and clients (EC2 instances). Client
servers running Ubuntu 22.04.
- 20-40 GB database with <100 tables that lives in RDS. We spin up EC2
instances to scale our Django application. Since our app is read-heavy we
have a local Postgres install on each EC2 instance, which replicates from
the primary in RDS. Once replication is complete our app routes all reads
to the local db, and all writes (the few that there are) to RDS.
- We drop indexes on the larger tables before the initial copy and rebuild
them after, but the problem is not limited to the tables where we’ve dropped
the indexes.
- No writes to the problem tables over the time that we’re replicating
(i.e., this is static data).
- Logical replication works just fine when we start up a single replica.
- We’ve seen it work at least once when starting up 2 replicas.
- We’ve never seen it work when trying to add more than 2 replicas -
sometimes one gets to STATE_READY, but usually none of them do.
- We are able to add replicas one at a time and this works fine; the issue
is when we’re adding them in parallel.
- We’re using PostGIS, but this affects non-spatial tables.
- This doesn’t happen when we have a small ‘test’ database as the publisher,
so it’s hard to give a locally reproducible example with the data.
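For reference, here’s a sketch of roughly how each replica is set up (the
subscription name and publication name are taken from the logs above; the
connection string is illustrative, and the real commands live in the
bootstrap script linked at the end):

```sql
-- On the publisher (RDS), created once:
CREATE PUBLICATION alltables FOR ALL TABLES;

-- On each new EC2 replica (illustrative connection details):
CREATE SUBSCRIPTION polling_stations_01c7ee00f17021f94
    CONNECTION 'host=rds-primary.example dbname=polling_stations user=postgres'
    PUBLICATION alltables
    WITH (binary = true, streaming = on);
```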
I’d be really grateful for any pointers, or even similar experiences as
we’ve been banging our heads against this for a few days now!
Many thanks
Will
Publisher settings
max_replication_slots: 100
We don’t expect to see more than this, and have never seen more than ~30
when scaling up to 10 instances (2 table sync workers per instance and one
slot created as part of each subscription).
max_connections: 100
Not expecting more connections than this to the RDS; we’ve never seen
problems with too many connections.
work_mem: 512
To make sure we don’t run out of this. System memory isn’t filling up at all.
shared_buffers: {DBInstanceClassMemory/20480}
40% of memory.
max_logical_replication_workers: 200
max_replication_slots: 400
max_worker_processes: 200
Just higher numbers than we expect to reach, to make sure these limits
aren’t what we’re hitting.
max_slot_wal_keep_size: 5000
Safety net to prevent the RDS instance from filling up with unsynced WAL.
We’ve never hit this limit, mainly because we don’t write much at all, and
instances clean up their slots on shutdown.
max_wal_size: 1028
(default postgres value, not sure what the RDS default is)
rds.logical_replication: 1
Enables replication in RDS (RDS-only setting).
shared_preload_libraries
Enables replication in RDS.
checkpoint_timeout: 1200
20 minutes. Set to see if a timeout was happening during the COPY phase
(which takes less than 20 minutes).
deadlock_timeout: 30000
Raised from 1 second to 30 to see if deadlock checking was slowing things
down or making any race conditions worse.
log_connections: 1
log_disconnections: 1
log_duration: 1
log_executor_stats: 1
log_lock_waits: 1
log_replication_commands: 1
log_error_verbosity: verbose
RDS debugging.
max_locks_per_transaction: 128
Doubled to see if this limit was being hit for some reason.
max_slot_wal_keep_size: 40000
max_wal_senders: 400
max_wal_size: 512
wal_keep_size: 8192
wal_sender_timeout: 12000
Trying to change various things with the WAL settings, without really
knowing why 🙁
tcp_keepalives_idle: 60
Thinking the network connection might be getting killed when syncing large
tables. Unlikely, as syncing a single instance takes about the same time and
that works fine.
Subscriber/Instance settings
wal_receiver_create_temp_slot: 1
primary_slot_name: same as the subscription name
wal_receiver_timeout: 1200000
(plus more as per
https://github.com/DemocracyClub/UK-Polling-Stations/blob/master/deploy/files/scripts/db_replication_runner.sh
)