Re: WAL accumulating, Logical Replication pg 13

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Willy-Bas Loos <willybas(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL accumulating, Logical Replication pg 13
Date: 2021-05-29 15:16:44
Message-ID: CAM+6J9702uzkM3i9VVTJiw_L3-aD3XDRuf4V8bJKrEucZaTxgw@mail.gmail.com
Lists: pgsql-general

WAL can build up for reasons like:
1) An inactive replication slot. I mean you had a streaming replica which
was turned off, but you did not remove its slot from the primary.
2) Do you have archiving enabled? Are the archive commands completing
successfully? If only the archiving is broken, you can run archive cleanup
manually, provided replication is fully caught up.

3) Logical replication can break for multiple reasons, like conflicts
where the subscriber already has the data which the primary wants to push. It
will not proceed until the conflicts are resolved.
4) Poor connectivity, or compute/network resources that cannot keep up
with the load, can also result in WAL piling up.
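For points 1, 2, and 4, a few catalog queries usually show where the WAL is
being held. This is only a sketch against PostgreSQL 13's system views (the
column names below match pg 13; older versions differ, and the slot name is
a placeholder):

```sql
-- On the publisher: list slots and how much WAL each one is holding back.
-- pg_wal_lsn_diff() returns the byte distance between the current WAL
-- position and the slot's restart_lsn; a large, growing value means the
-- slot is pinning WAL in pg_wal.
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots;

-- Point 2: check whether WAL archiving is configured and failing.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- If a slot belongs to a consumer that is gone for good, dropping it lets
-- the server recycle the retained WAL (irreversible; that consumer cannot
-- resume from the slot afterwards):
-- SELECT pg_drop_replication_slot('dead_slot_name');
```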

There are many blogs about logical replication issues, but when it was new
in pg10, I read this one:
Recovery use cases for Logical Replication in PostgreSQL 10 | by Konstantin
Evteev | AvitoTech | Medium
<https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>

btw,
how are you doing logical replication with 9.3? Using the pglogical
extension?
I can try many things, but it would be wrong to make assumptions, since I
did not work with 9.3.
For example:
Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
Request #295 · 2ndQuadrant/pglogical (github.com)
<https://github.com/2ndQuadrant/pglogical/pull/295>
There are many issues posted there that may be relevant to your setup.
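To see whether the subscriber simply cannot keep up (point 4), the
publisher's pg_stat_replication view shows per-walsender progress; a logical
subscriber's walsender appears there too. A sketch, assuming pg 13 column
names:

```sql
-- On the publisher: a growing gap between sent_lsn and replay_lsn means
-- the subscriber is applying changes slower than the publisher produces
-- them, so WAL accumulates behind the slot.
SELECT application_name,
       state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS apply_lag
FROM pg_stat_replication;
```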

On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:

> Yeah, indexes could slow things down, thanks. Btw I'm not using logical
> replication for the upgrade, that's not supported for 9.3.
> It was more complicated but that's beside the point.
>
> I could just delete the publication and all that belongs to it and start
> over. But since I'm trying out logical replication, I would like to be more
> in control than that. Is there anything that I can dig into to find out
> why the WAL is accumulating?
>
> Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
> vijaykumarjain(dot)github(at)gmail(dot)com>:
>
>> I am not too sure about 9.3.
>> I tried an upgrade from 9.6 to 11 using logical replication (the pglogical
>> extension).
>>
>> One thing to note:
>> logical replication starts with an initial copy from a snapshot, then
>> streams changes from then on.
>>
>> I had a very high insert rate on my source tables (v9.6), and the
>> destination (v11) could not keep up (it had tons of indexes when I copied
>> the schema); the initial copy took around a day, as the table had around
>> 12 indexes.
>>
>> So at the destination (v11), I dropped all but the primary index for each
>> table, started the subscription, and when it was almost caught up, rebuilt
>> the indexes on the destination concurrently.
>> That completed in 4-5 hours without stopping the source.
>> The migration completed in a few minutes :)
>>
>> not sure if this would help, but just FYI.
>>
>>
>> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:
>>
>>> Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
>>> Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
>>> problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
>>> and my disks are getting full. The oldest WAL file is 18 days old.
>>> I use Logical Replication from the new cluster to another new cluster
>>> with 1 subscriber and 1 subscription.
>>>
>>> pg_stat_subscription tells me all recent timestamps.
>>> and this:
>>> db=# select * from pg_replication_slots;
>>> -[ RECORD 1 ]-------+-------------
>>> slot_name | my_pub1
>>> plugin | pgoutput
>>> slot_type | logical
>>> datoid | 16401
>>> database | db
>>> temporary | f
>>> active | t
>>> active_pid | 9480
>>> xmin |
>>> catalog_xmin | 269168
>>> restart_lsn | D4/908BC268
>>> confirmed_flush_lsn | E1/25BF5710
>>> wal_status | extended
>>> safe_wal_size |
>>>
>>>
>>>
>>> I've had problems with disk space on this server, with postgres crashing
>>> because of it; then I added more disk space and postgres recovered. This
>>> doesn't seem to be a problem now.
>>>
>>> The *publication* has the options publish = 'insert, update, delete,
>>> truncate', publish_via_partition_root = false
>>> The *subscription* has the options connect = true, enabled = true,
>>> create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
>>>
>>> The log on the publisher says:
>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: starting logical
>>> decoding for slot "my_pub1"
>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: Streaming
>>> transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: logical decoding
>>> found consistent point at D4/908BC268
>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: There are no
>>> running transactions.
>>> 2021-05-25 21:29:49.456 CEST [4614] user(at)db ERROR: replication slot
>>> "my_pub1" is active for PID 4584
>>> 2021-05-25 21:29:54.474 CEST [4615] user(at)db ERROR: replication slot
>>> "my_pub1" is active for PID 4584
>>>
>>> And on the subscriber:
>>> 2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
>>> worker for subscription "my_pub1" has started
>>> 2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
>>> streaming: ERROR: replication slot "my_pub1" is active for PID 730
>>> 2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
>>> replication worker" (PID 40039) exited with exit code 1
>>>
>>> The postgres settings on the *publisher* are:
>>> max_connections = 100 # (change requires restart)
>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>> shared_buffers = 50GB # min 128kB
>>> work_mem = 1GB # min 64kB
>>> maintenance_work_mem = 10GB # min 1MB
>>> logical_decoding_work_mem = 5GB # min 64kB
>>> dynamic_shared_memory_type = posix # the default is the first option
>>> max_worker_processes = 20 # (change requires restart)
>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>> wal_level = logical # minimal, replica, or logical
>>> max_wal_size = 1GB
>>> min_wal_size = 80MB
>>> #archive_mode = off
>>> max_wal_senders = 10 # max number of walsender processes
>>> wal_sender_timeout = 1min # in milliseconds; 0 disables
>>> max_replication_slots = 7 # max number of replication slots
>>>
>>> On postgres settings on the *subscriber*:
>>> max_connections = 100 # (change requires restart)
>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>> shared_buffers = 25GB # min 128kB
>>> work_mem = 1GB # min 64kB
>>> maintenance_work_mem = 10GB # min 1MB
>>> logical_decoding_work_mem = 5GB # min 64kB
>>> dynamic_shared_memory_type = posix # the default is the first option
>>> max_worker_processes = 20 # (change requires restart)
>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>> wal_level = logical # minimal, replica, or logical
>>> max_wal_size = 3GB
>>> min_wal_size = 80MB
>>> #archive_mode = off
>>> wal_receiver_timeout = 1min # time that receiver waits for
>>> max_logical_replication_workers = 10 # taken from max_worker_processes
>>> max_sync_workers_per_subscription = 5 # taken from
>>> max_logical_replication_workers
>>>
>>> I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
>>> minutes each, but this had no positive effect.
>>>
>>> Some advice would be helpful
>>> --
>>> Willy-Bas Loos
>>>
>>
>>
>> --
>> Thanks,
>> Vijay
>> Mumbai, India
>>
>

--
Thanks,
Vijay
Mumbai, India
