From: | Willy-Bas Loos <willybas(at)gmail(dot)com> |
---|---|
To: | Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com> |
Cc: | pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: WAL accumulating, Logical Replication pg 13 |
Date: | 2021-05-29 13:52:26 |
Message-ID: | CAHnozTjeBpon39FHGyOZ=1BL=x6yp+5E+pJix+aqFev46YX0Kw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Yeah, indexes could slow things down, thanks. Btw I'm not using logical
replication for the upgrade, that's not supported for 9.3.
It was more complicated but that's beside the point.
I could just delete the publication and all that belongs to it and start
over. But since I'm trying out logical replication, I would like to be more
in control than that. It's there anything that I can dig into to find out
why the WAL is accumulating?
Op vr 28 mei 2021 22:20 schreef Vijaykumar Jain <
vijaykumarjain(dot)github(at)gmail(dot)com>:
> I am not too sure with 9.3
> i tried an upgrade from 9.6 to 11 using logical replication (pg_logical
> extension)
>
> one thing to note.
> logical replication initiates a copy from a snapshot, then changes from
> then on.
>
> I had a very high insert rate on my source tables (v9.6) and the
> destination (v11) could not keep up (it had tons of indexes when I copied
> the schema) and it took around a day as the table had around 12 indexes.
>
> So at the destination(v11), I dropped all but the primary index for each
> table, started subscription and when it was almost caught up, rebuilt the
> index on the destination concurrently.
> it completed in 4-5 hours without stopping the source.
> migration completed in a few mins :)
>
> not sure if this would help, but just FYI.
>
>
> On Sat, 29 May 2021 at 01:36, Willy-Bas Loos <willybas(at)gmail(dot)com> wrote:
>
>> Hi , I'm upgrading a 1.5TB database from postgres 9.3 to postgres 13 on
>> Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a
>> problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal
>> and my disks are getting full. The oldest WAL file is 18 days old.
>> I use Logical Replication from the new cluster to another new cluster
>> with 1 subscriber and 1 subscription.
>>
>> pg_stat_subscription tells me all recent timestamps.
>> and this:
>> db=# select * from pg_replication_slots;
>> -[ RECORD 1 ]-------+-------------
>> slot_name | my_pub1
>> plugin | pgoutput
>> slot_type | logical
>> datoid | 16401
>> database | db
>> temporary | f
>> active | t
>> active_pid | 9480
>> xmin |
>> catalog_xmin | 269168
>> restart_lsn | D4/908BC268
>> confirmed_flush_lsn | E1/25BF5710
>> wal_status | extended
>> safe_wal_size |
>>
>>
>>
>> I've had problems with diskspace on this server, with postgres crashing
>> because of it, then added more diskspace and postgres recovered. This
>> doesn't seem to be a problem now.
>>
>> The *publication* has the options publish = 'insert, update, delete,
>> truncate', publish_via_partition_root = false
>> The *subscription* has the options connect = true, enabled = true,
>> create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
>>
>> The log on the publisher says:
>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: starting logical
>> decoding for slot "my_pub1"
>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: Streaming
>> transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db LOG: logical decoding found
>> consistent point at D4/908BC268
>> 2021-05-25 21:25:18.973 CEST [4584] user(at)db DETAIL: There are no
>> running transactions.
>> 2021-05-25 21:29:49.456 CEST [4614] user(at)db ERROR: replication slot
>> "my_pub1" is active for PID 4584
>> 2021-05-25 21:29:54.474 CEST [4615] user(at)db ERROR: replication slot
>> "my_pub1" is active for PID 4584
>>
>> And on the subscriber:
>> 2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply
>> worker for subscription "my_pub1" has started
>> 2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL
>> streaming: ERROR: replication slot "my_pub1" is active for PID 730
>> 2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical
>> replication worker" (PID 40039) exited with exit code 1
>>
>> The postgres settings on the *publisher* are:
>> max_connections = 100 # (change requires restart)
>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>> shared_buffers = 50GB # min 128kB
>> work_mem = 1GB # min 64kB
>> maintenance_work_mem = 10GB # min 1MB
>> logical_decoding_work_mem = 5GB # min 64kB
>> dynamic_shared_memory_type = posix # the default is the first option
>> max_worker_processes = 20 # (change requires restart)
>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>> wal_level = logical # minimal, replica, or logical
>> max_wal_size = 1GB
>> min_wal_size = 80MB
>> #archive_mode = off
>> max_wal_senders = 10 # max number of walsender processes
>> wal_sender_timeout = 1min # in milliseconds; 0 disables
>> max_replication_slots = 7 # max number of replication slots
>>
>> On postgres settings on the *subscriber*:
>> max_connections = 100 # (change requires restart)
>> tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
>> shared_buffers = 25GB # min 128kB
>> work_mem = 1GB # min 64kB
>> maintenance_work_mem = 10GB # min 1MB
>> logical_decoding_work_mem = 5GB # min 64kB
>> dynamic_shared_memory_type = posix # the default is the first option
>> max_worker_processes = 20 # (change requires restart)
>> max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
>> max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
>> max_parallel_workers = 15 # maximum number of max_worker_processes that
>> wal_level = logical # minimal, replica, or logical
>> max_wal_size = 3GB
>> min_wal_size = 80MB
>> #archive_mode = off
>> wal_receiver_timeout = 1min # time that receiver waits for
>> max_logical_replication_workers = 10 # taken from max_worker_processes
>> max_sync_workers_per_subscription = 5 # taken from
>> max_logical_replication_workers
>>
>> I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10
>> minutes each, but this had no positive effect.
>>
>> Some advice would be helpful
>> --
>> Willy-Bas Loos
>>
>
>
> --
> Thanks,
> Vijay
> Mumbai, India
>
From | Date | Subject | |
---|---|---|---|
Next Message | Laura Smith | 2021-05-29 14:15:29 | Overriding natural order of query results for a subset |
Previous Message | Lionel Bouton | 2021-05-29 10:40:46 | Re: AWS forcing PG upgrade from v9.6 a disaster |