Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing

From: Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing
Date: 2024-06-15 15:56:26
Message-ID: 627b7516-345f-49a8-aa2b-daf591216db2@cloud.gatewaynet.com
Lists: pgsql-admin

Hi, just to close this:
I redid the whole process, eliminating any error/restart, and it all went
smoothly.

So, moral of the story: if the initial COPY fails or is interrupted, either
a VACUUM FULL will be needed to reclaim the lost space, or you just redo
the whole thing.
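
To be concrete, a minimal sketch of that cleanup (table names are the ones
from the thread below; note that VACUUM FULL takes an ACCESS EXCLUSIVE lock
and needs enough free disk space for the rewritten copy):

-- on the subscriber, after an interrupted/failed table sync,
-- reclaim the space left behind by the aborted COPY
VACUUM FULL marinerpapers_atts;
VACUUM FULL marinerpapers_atts_tmp;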

Just a general question: taking into account that we are not the fastest
shop when it comes to upgrades, would you recommend jumping on the 17
wagon entirely, and then aiming for a major upgrade every release, or
every two (hopefully)?

On 13/6/24 06:58, Achilleas Mantzios wrote:
> On 12/6/24 22:46, Achilleas Mantzios wrote:
>> On 12/6/24 21:09, Achilleas Mantzios wrote:
>>> On 12/6/24 20:14, Achilleas Mantzios wrote:
>>>>
>>>> Hello
>>>>
>>>> As the subject says, we are testing a migration plan from 10.23 to
>>>> 16.3 using logical replication, combined with a move to a new OS,
>>>> data checksums, ICU, and an SQL_ASCII -> UTF-8 conversion (some of
>>>> which proved to be a challenge). After solving all of the problems
>>>> and repeating the process, the system now seems to be in a weird
>>>> state: all of the ~930 tables got synced correctly, except the two
>>>> largest ones. The DB in total (in its 10.23 version) takes up 4.6 TB.
>>>> In 16.3 it has already reached 4.8 TB and keeps growing. There are no
>>>> ERRORs or ROLLBACKs that would justify the growth (to my limited
>>>> knowledge).
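>>>>
>>>> For context, the publication/subscription pair was created more or
>>>> less along these lines (only a sketch -- the publication name and
>>>> connection string here are made up; the subscription name is the one
>>>> that shows up in the slot/application names later in the thread):
>>>>
>>>> -- on the 10.23 publisher, database dynacom
>>>> CREATE PUBLICATION pub_dynacom FOR ALL TABLES;
>>>>
>>>> -- on the 16.3 subscriber
>>>> CREATE SUBSCRIPTION data_for_nu_pgsql_dynacom
>>>>     CONNECTION 'host=... dbname=dynacom user=repmgr'
>>>>     PUBLICATION pub_dynacom;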
>>>>
>>>> Those two tables keep growing and growing in the 16.3 (subscriber) DB.
>>>>
>>>> In 10.23 (publisher)
>>>>
>>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>>> pg_class where oid in
>>>> ('marinerpapers_atts'::regclass, 'marinerpapers_atts_tmp'::regclass);
>>>> pg_total_relation_size | pg_size_pretty |          oid
>>>> ------------------------+----------------+------------------------
>>>>          1082956537856 | 1009 GB        | marinerpapers_atts
>>>>           973440409600 | 907 GB         | marinerpapers_atts_tmp
>>>> (2 rows)
>>>>
>>>> postgres(at)[local]/dynacom=#
>>>>
>>>> In 16.3 (subscriber)
>>>>
>>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>>> pg_class where oid in
>>>> ('marinerpapers_atts'::regclass, 'marinerpapers_atts_tmp'::regclass);
>>>> pg_total_relation_size | pg_size_pretty |          oid
>>>> ------------------------+----------------+------------------------
>>>>          1301943779328 | 1213 GB        | marinerpapers_atts
>>>>          1295188762624 | 1206 GB        | marinerpapers_atts_tmp
>>>> (2 rows)
>>>>
>>>> The weird thing is that for every single one of the other big tables,
>>>> the size on the new machine is actually smaller:
>>>>
>>>> 10.23 :
>>>>
>>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>>> pg_class where relkind in ('r','p','m')
>>>> order by 1 DESC LIMIT 10 OFFSET 2;
>>>> pg_total_relation_size | pg_size_pretty |          oid
>>>> ------------------------+----------------+-----------------------
>>>>           634434142208 | 591 GB         | mail_entity
>>>>           568166334464 | 529 GB         | fb_reports_files
>>>>           538761527296 | 502 GB         | vmanews
>>>>           115932397568 | 108 GB         | approval_request_docs
>>>>           107413241856 | 100 GB         | status
>>>>            86767763456 | 81 GB          | fb_reports_dets
>>>>            82120941568 | 76 GB          | items_atts
>>>>            81043136512 | 75 GB          | items_atts_tmp
>>>>            59618181120 | 56 GB          | logging
>>>>            58586382336 | 55 GB          | apm_evidence
>>>> (10 rows)
>>>>
>>>> 16.3
>>>>
>>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>>> pg_class where relkind in ('r','p','m')
>>>> order by 1 DESC LIMIT 10 OFFSET 2;
>>>> pg_total_relation_size | pg_size_pretty |          oid
>>>> ------------------------+----------------+-----------------------
>>>>           542910595072 | 506 GB         | fb_reports_files
>>>>           515205160960 | 480 GB         | vmanews
>>>>           491888680960 | 458 GB         | mail_entity
>>>>           115919945728 | 108 GB         | approval_request_docs
>>>>            82050457600 | 76 GB          | items_atts
>>>>            80889028608 | 75 GB          | items_atts_tmp
>>>>            64453533696 | 60 GB          | fb_reports_dets
>>>>            57055133696 | 53 GB          | apm_evidence
>>>>            55286161408 | 51 GB          | logging
>>>>            52352860160 | 49 GB          | status
>>>> (10 rows)
>>>>
>>>> For what it's worth, I restarted the target (subscriber) 16.3 DB
>>>> today after recompiling it with JIT support. I know 10.23 is EOL,
>>>> but anyway, any clues or advice would be great.
>>>>
>>> The two COPY processes on the publisher have been CPU-active since
>>> about 11:27:00 EEST this morning.
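>>>
>>> Something along these lines on the 10.23 side should show them (just a
>>> sketch, not the exact query I ran; backend_type exists since 10):
>>>
>>> select pid, backend_type, state, query
>>>   from pg_stat_activity
>>>  where backend_type = 'walsender' or query ilike 'copy %';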
>>
>> I guess the initial synchronization, since it is not based on WAL
>> traffic, does not keep track of LSNs; the sync slots on the publisher
>> appear inactive. Here is what I get on the 10.23 publisher:
>>
>>
>> postgres(at)[local]/dynacom=# select * from pg_replication_slots where
>> slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
>> -[ RECORD 1 ]-------+-----------------------------------------------
>> slot_name           | pg_54534123_sync_54522027_7374405793436622715
>> plugin              | pgoutput
>> slot_type           | logical
>> datoid              | 16401
>> database            | dynacom
>> temporary           | f
>> active              | f
>> active_pid          |
>> xmin                |
>> catalog_xmin        | 1478474083
>> restart_lsn         | 4056/C6000060
>> confirmed_flush_lsn | 4056/C6000098
>> -[ RECORD 2 ]-------+-----------------------------------------------
>> slot_name           | pg_54534123_sync_54522019_7374405793436622715
>> plugin              | pgoutput
>> slot_type           | logical
>> datoid              | 16401
>> database            | dynacom
>> temporary           | f
>> active              | f
>> active_pid          |
>> xmin                |
>> catalog_xmin        | 1478474083
>> restart_lsn         | 4056/C6000028
>> confirmed_flush_lsn | 4056/C6000060
>> -[ RECORD 3 ]-------+-----------------------------------------------
>> slot_name           | data_for_nu_pgsql_dynacom
>> plugin              | pgoutput
>> slot_type           | logical
>> datoid              | 16401
>> database            | dynacom
>> temporary           | f
>> active              | t
>> active_pid          | 5530
>> xmin                |
>> catalog_xmin        | 1478474084
>> restart_lsn         | 4056/C7014DF0
>> confirmed_flush_lsn | 4056/C8000000
>> (3 rows)
>>
>> postgres(at)[local]/dynacom=#
>>
>> Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files
>> dating back to 12:30 (maybe created via archive_timeout rather than
>> actual traffic).
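>>
>> To map an LSN to its WAL segment file name, pg_walfile_name() on the
>> publisher does the trick, e.g.:
>>
>> select pg_walfile_name('4056/C6000028'), pg_walfile_name('4056/C6000060');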
>>
>> postgres(at)[local]/dynacom=# select * from pg_stat_replication where
>> application_name = 'data_for_nu_pgsql_dynacom' or application_name ~*
>> '_sync_';
>> -[ RECORD 1 ]----+-----------------------------------------------
>> pid              | 5530
>> usesysid         | 525054105
>> usename          | repmgr
>> application_name | data_for_nu_pgsql_dynacom
>> client_addr      | 10.9.230.24
>> client_hostname  |
>> client_port      | 47204
>> backend_start    | 2024-06-12 11:27:55.919914+03
>> backend_xmin     |
>> state            | streaming
>> sent_lsn         | 4056/C8000000
>> write_lsn        | 4056/C8000000
>> flush_lsn        | 4056/C8000000
>> replay_lsn       | 4056/C8000000
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> -[ RECORD 2 ]----+-----------------------------------------------
>> pid              | 5533
>> usesysid         | 525054105
>> usename          | repmgr
>> application_name | pg_54534123_sync_54522019_7374405793436622715
>> client_addr      | 10.9.230.24
>> client_hostname  |
>> client_port      | 47228
>> backend_start    | 2024-06-12 11:27:55.944498+03
>> backend_xmin     | 1478474083
>> state            | startup
>> sent_lsn         |
>> write_lsn        |
>> flush_lsn        |
>> replay_lsn       |
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> -[ RECORD 3 ]----+-----------------------------------------------
>> pid              | 5534
>> usesysid         | 525054105
>> usename          | repmgr
>> application_name | pg_54534123_sync_54522027_7374405793436622715
>> client_addr      | 10.9.230.24
>> client_hostname  |
>> client_port      | 47242
>> backend_start    | 2024-06-12 11:27:55.95233+03
>> backend_xmin     | 1478474083
>> state            | startup
>> sent_lsn         |
>> write_lsn        |
>> flush_lsn        |
>> replay_lsn       |
>> write_lag        |
>> flush_lag        |
>> replay_lag       |
>> sync_priority    | 0
>> sync_state       | async
>> (3 rows)
>>
>> On the subscriber side I get:
>>
>> postgres(at)[local]/dynacom=# select * from pg_stat_subscription where
>> subname ~* 'dynacom';
>> -[ RECORD 1 ]---------+------------------------------
>> subid                 | 54534123
>> subname               | data_for_nu_pgsql_dynacom
>> pid                   | 3143304
>> leader_pid            |
>> relid                 |
>> received_lsn          | 4056/C8000000
>> last_msg_send_time    | 2024-06-12 22:41:23.801094+03
>> last_msg_receipt_time | 2024-06-12 22:41:57.062591+03
>> latest_end_lsn        | 4056/C8000000
>> latest_end_time       | 2024-06-12 22:41:23.801094+03
>> -[ RECORD 2 ]---------+------------------------------
>> subid                 | 54534123
>> subname               | data_for_nu_pgsql_dynacom
>> pid                   | 3143305
>> leader_pid            |
>> relid                 | 54522019
>> received_lsn          |
>> last_msg_send_time    | 2024-06-12 11:28:28.43723+03
>> last_msg_receipt_time | 2024-06-12 11:28:28.43723+03
>> latest_end_lsn        |
>> latest_end_time       | 2024-06-12 11:28:28.43723+03
>> -[ RECORD 3 ]---------+------------------------------
>> subid                 | 54534123
>> subname               | data_for_nu_pgsql_dynacom
>> pid                   | 3143306
>> leader_pid            |
>> relid                 | 54522027
>> received_lsn          |
>> last_msg_send_time    | 2024-06-12 11:28:28.448293+03
>> last_msg_receipt_time | 2024-06-12 11:28:28.448293+03
>> latest_end_lsn        |
>> latest_end_time       | 2024-06-12 11:28:28.448293+03
>> (3 rows)
>>
>> Both COPY processes appear CPU-active on the publisher, and the
>> corresponding logical replication (table sync) workers on the
>> subscriber appear active as well.
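>>
>> On the 16.3 subscriber the progress of those COPYs can at least be
>> watched via pg_stat_progress_copy (the view exists since 14, so not on
>> the 10.23 publisher), something like:
>>
>> select relid::regclass, command, bytes_processed, tuples_processed
>>   from pg_stat_progress_copy;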
>>
> Hi again, it seems it has finished, with the correct row counts for both
> tables. The sync phase runs in one single transaction; it seems a restart
> of the (subscriber) DB cancels it and causes the COPY to start all over,
> hence the lost space. I will have to try once again from scratch, without
> any interruptions.
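>
> As a side note, the per-table sync state can be followed on the
> subscriber in pg_subscription_rel ('d' = data is being copied,
> 'r' = ready); something like this shows whatever is still not ready:
>
> select srrelid::regclass, srsubstate
>   from pg_subscription_rel
>  where srsubstate <> 'r';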

--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)
