Re: WAL accumulating, Logical Replication pg 13

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Willy-Bas Loos <willybas(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: WAL accumulating, Logical Replication pg 13
Date: 2021-05-31 14:23:59
Message-ID: CAM+6J97gB_D+J7P34_g_LKiiJh55RY6g=7wVD1tAZWfNfU1npQ@mail.gmail.com
Lists: pgsql-general

So it seems I got it all wrong :)
You upgraded to pg13 fine, but while on pg13 you are having issues with logical
replication?

There is a test suite in the PostgreSQL source tree (src/test/subscription,
iirc) which covers various logical replication scenarios.
That may help you, just in case.

I have tried all the known ways in which replication breaks, as above, and
once I resolve the conflicts it starts fine again.
I'll try to explore more scenarios.
Pardon my link to pglogical. I misunderstood.
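
For the record, when a slot holds back WAL, these are the checks I usually
run; the origin name and LSN in the last statement are just placeholders, and
advancing the origin silently skips (loses) those changes, so it is a last
resort only:

-- on the publisher: how much WAL the slot is pinning
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
       as retained_wal
from pg_replication_slots;

-- on the subscriber: how far each replication origin has applied
select * from pg_replication_origin_status;

-- last resort, skips data:
-- select pg_replication_origin_advance('pg_XXXXX', '<lsn>');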

On Mon, 31 May 2021 at 7:25 PM Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:

> Thank you for elaborating those possible causes and for the suggestions
> you made.
> 1) if you have an inactive replication slot.
> There is only 1 replication slot and it is active. So that is not the
> issue.
>
> 2) Do you have archiving enabled?
> No, I never turned it on, so this is in the settings of both publisher
> and subscriber: #archive_mode = off (and show archive_mode; tells me the
> same).
>
> 3) logical replication can be broken for multiple reasons, like conflicts
> where the subscriber already has the data which primary wants to push. it
> will not proceed until the conflicts are resolved.
> That would have been in the log, but there isn't any conflict in the log.
> Only the messages that i posted with my first message.
>
> 4) poor connectivity or the computer/network resources not able to keep up
> with the load, can result in WAL pile up.
> This would be strange since there is a 10Gbps connection within the same
> rack. But it could theoretically be malfunctioning or the performance on
> the subscriber could be too low.
> If any of this is the case, shouldn't that be visible in
> pg_stat_subscription ?
>
> Thanks for the article, it's interesting to see how they transitioned from
> londiste, even if the article is about pglogical, not logical replication
> in the postgres core.
> I was using Londiste to transfer the data to the new server and minimize
> downtime, so the article might come in handy.
> I prepared by reading the documentation, which is very straightforward.
> >btw,
> >how are you doing logical replication with 9.3 ? using a pglogical
> extension ?
> No, I'm not using logical replication on postgres 9.3, only on postgres
> 13.
> About the link to the bug reports: Thanks for the suggestion. But first
> I'd like to get some better grip on what is going on before searching for
> bugs.
>
> Still, any help will be much appreciated
>
> On Sat, May 29, 2021 at 5:16 PM Vijaykumar Jain <
> vijaykumarjain(dot)github(at)gmail(dot)com> wrote:
>
>> WAL can be built up for reasons like
>> 1) if you have an inactive replication slot. I mean you had a streaming
>> replica which was turned off, but you did not remove the slot from the primary.
>> 2) Do you have archiving enabled? Are the archive commands running fine?
>> If just the archiving is broken, you can manually run archive cleanup,
>> provided replication is all caught up.
>>
>> 3) logical replication can be broken for multiple reasons, like conflicts
>> where the subscriber already has the data which primary wants to push. it
>> will not proceed until the conflicts are resolved.
>> 4) poor connectivity or the computer/network resources not able to keep
>> up with the load, can result in WAL pile up.
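>>
>> a couple of quick checks for (1) and (2), for example (run on the primary):
>>
>> -- (1) inactive slots still holding WAL
>> select slot_name, active, restart_lsn
>> from pg_replication_slots where not active;
>>
>> -- (2) is archiving failing?
>> select archived_count, failed_count, last_failed_wal, last_failed_time
>> from pg_stat_archiver;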
>>
>> there are many blogs around logical replication issues, but when it was
>> new in pg10, I read this.
>> Recovery use cases for Logical Replication in PostgreSQL 10 | by
>> Konstantin Evteev | AvitoTech | Medium
>> <https://medium.com/avitotech/recovery-use-cases-for-logical-replication-in-postgresql-10-a1e6bab03072>
>>
>> btw,
>> how are you doing logical replication with 9.3 ? using a pglogical
>> extension ?
>> I can try many things, but it would be wrong to make assumptions since I
>> did not work with 9.3.
>> for ex.
>> Bug fix: Using ExecCopySlot during multi insert by bdrouvotAWS · Pull
>> Request #295 · 2ndQuadrant/pglogical (github.com)
>> <https://github.com/2ndQuadrant/pglogical/pull/295>
>> there are many issues posted here that may be relevant to your setup.
>>
>>
>> On Sat, 29 May 2021 at 19:22, Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:
>>
>>> Yeah, indexes could slow things down, thanks. Btw I'm not using logical
>>> replication for the upgrade, that's not supported for 9.3.
>>> It was more complicated but that's beside the point.
>>>
>>> I could just delete the publication and all that belongs to it and start
>>> over. But since I'm trying out logical replication, I would like to be more
>>> in control than that. Is there anything that I can dig into to find out
>>> why the WAL is accumulating?
>>>
>>> Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
>>> vijaykumarjain(dot)github(at)gmail(dot)com>:
>>>
>>>> I am not too sure about 9.3.
>>>> I tried an upgrade from 9.6 to 11 using logical replication (the pglogical
>>>> extension).
>>>>
>>>> one thing to note.
>>>> logical replication initiates a copy from a snapshot, then streams the
>>>> changes from then on.
>>>>
>>>> I had a very high insert rate on my source tables (v9.6) and the
>>>> destination (v11) could not keep up (it had tons of indexes when I copied
>>>> the schema) and it took around a day as the table had around 12 indexes.
>>>>
>>>> So at the destination (v11), I dropped all but the primary key index for
>>>> each table, started the subscription, and when it was almost caught up,
>>>> rebuilt the indexes on the destination concurrently.
>>>> That completed in 4-5 hours without stopping the source.
>>>> migration completed in a few mins :)
>>>>
>>>> not sure if this would help, but just FYI.
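>>>>
>>>> roughly, per table, it was something like this (the table and index
>>>> names here are made up):
>>>>
>>>> drop index idx_orders_created_at;  -- keep only the primary key
>>>> -- create the subscription, wait for it to mostly catch up, then:
>>>> create index concurrently idx_orders_created_at on orders (created_at);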
>>>>
>>>>
>>>> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas(at)gmail(dot)com>
>>>> wrote:
>>>>
>>>>> Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13
>>>>> on Debian 10. This is now in an Acceptance stage (DTAP). I have encountered
>>>>> a problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
>>>>> and my disks are getting full. The oldest WAL file is 18 days old.
>>>>> I use Logical Replication from the new cluster to another new cluster
>>>>> with 1 subscriber and 1 subscription.
>>>>>
>>>>> pg_stat_subscription tells me all recent timestamps.
>>>>> and this:
>>>>> db=# select * from pg_replication_slots;
>>>>> -[ RECORD 1 ]-------+-------------
>>>>> slot_name | my_pub1
>>>>> plugin | pgoutput
>>>>> slot_type | logical
>>>>> datoid | 16401
>>>>> database | db
>>>>> temporary | f
>>>>> active | t
>>>>> active_pid | 9480
>>>>> xmin |
>>>>> catalog_xmin | 269168
>>>>> restart_lsn | D4/908BC268
>>>>> confirmed_flush_lsn | E1/25BF5710
>>>>> wal_status | extended
>>>>> safe_wal_size |
>>>>>
>>>>>
>>>>>
>>>>> I've had problems with disk space on this server, with postgres
>>>>> crashing because of it; I then added more disk space and postgres
>>>>> recovered. This doesn't seem to be a problem now.
>>>>>
>>>>> The *publication* has the options publish = 'insert, update, delete,
>>>>> truncate', publish_via_partition_root = false
>>>>> The *subscription* has the options connect = true, enabled = true,
>>>>> create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
>>>>>
>>>>> The log on the publisher says:
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: starting logical
>>>>> decoding for slot "my_pub1"
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: Streaming
>>>>> transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: logical decoding
>>>>> found consistent point at D4/908BC268
>>>>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: There are no
>>>>> running transactions.
>>>>> 2021-05-25 21:29:49.456 CEST [4614] user(at)db ERROR: replication slot
>>>>> "my_pub1" is active for PID 4584
>>>>> 2021-05-25 21:29:54.474 CEST [4615] user(at)db ERROR: replication slot
>>>>> "my_pub1" is active for PID 4584
>>>>>
>>>>> And on the subscriber:
>>>>> 2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
>>>>> worker for subscription "my_pub1" has started
>>>>> 2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
>>>>> streaming: ERROR: replication slot "my_pub1" is active for PID 730
>>>>> 2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
>>>>> replication worker" (PID 40039) exited with exit code 1
>>>>>
>>>>> The postgres settings on the *publisher* are:
>>>>> max_connections = 100 # (change requires restart)
>>>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>>>> shared_buffers = 50GB # min 128kB
>>>>> work_mem = 1GB # min 64kB
>>>>> maintenance_work_mem = 10GB # min 1MB
>>>>> logical_decoding_work_mem = 5GB # min 64kB
>>>>> dynamic_shared_memory_type = posix # the default is the first option
>>>>> max_worker_processes = 20 # (change requires restart)
>>>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>>>> wal_level = logical # minimal, replica, or logical
>>>>> max_wal_size = 1GB
>>>>> min_wal_size = 80MB
>>>>> #archive_mode = off
>>>>> max_wal_senders = 10 # max number of walsender processes
>>>>> wal_sender_timeout = 1min # in milliseconds; 0 disables
>>>>> max_replication_slots = 7 # max number of replication slots
>>>>>
>>>>> On postgres settings on the *subscriber*:
>>>>> max_connections = 100 # (change requires restart)
>>>>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>>>>> shared_buffers = 25GB # min 128kB
>>>>> work_mem = 1GB # min 64kB
>>>>> maintenance_work_mem = 10GB # min 1MB
>>>>> logical_decoding_work_mem = 5GB # min 64kB
>>>>> dynamic_shared_memory_type = posix # the default is the first option
>>>>> max_worker_processes = 20 # (change requires restart)
>>>>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>>>>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>>>>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>>>>> wal_level = logical # minimal, replica, or logical
>>>>> max_wal_size = 3GB
>>>>> min_wal_size = 80MB
>>>>> #archive_mode = off
>>>>> wal_receiver_timeout = 1min # time that receiver waits for
>>>>> max_logical_replication_workers = 10 # taken from max_worker_processes
>>>>> max_sync_workers_per_subscription = 5 # taken from
>>>>> max_logical_replication_workers
>>>>>
>>>>> I've tried increasing wal_sender_timeout and wal_receiver_timeout to
>>>>> 10 minutes each, but this had no positive effect.
>>>>>
>>>>> Some advice would be helpful
>>>>> --
>>>>> Willy-Bas Loos
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Vijay
>>>> Mumbai, India
>>>>
>>>
>>
>> --
>> Thanks,
>> Vijay
>> Mumbai, India
>>
>
>
> --
> Willy-Bas Loos
>
--
Thanks,
Vijay
Mumbai, India
