Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing

From: Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing
Date: 2024-06-12 19:46:14
Message-ID: 7662073f-832a-4d5e-9f7c-2cb1287d453b@cloud.gatewaynet.com
Lists: pgsql-admin

On 12/6/24 21:09, Achilleas Mantzios wrote:
> On 12/6/24 20:14, Achilleas Mantzios wrote:
>>
>> Hello
>>
>> As the subject says, we are testing a migration plan from 10.23 to
>> 16.3 using logical replication, moving to a new OS, plus checksums,
>> ICU, and an SQL_ASCII -> UTF-8 conversion (some of which proved to
>> be a challenge). After solving all of the problems and repeating the
>> process, the system now seems to be in a weird state: all of the
>> ~930 tables got synced correctly except the two largest ones. The DB
>> in total (in its 10.23 version) takes up 4.6 TB. In 16.3 it has
>> already reached 4.8 TB and keeps growing, with no ERRORs or
>> ROLLBACKs that would justify the growth (to my limited knowledge).
>>
>> Those two tables keep growing and growing in the 16.3 (subscriber) DB.
>>
>> In 10.23 (publisher)
>>
>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where oid in
>> ('marinerpapers_atts'::regclass, 'marinerpapers_atts_tmp'::regclass);
>> pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+------------------------
>>          1082956537856 | 1009 GB        | marinerpapers_atts
>>           973440409600 | 907 GB         | marinerpapers_atts_tmp
>> (2 rows)
>>
>> postgres(at)[local]/dynacom=#
>>
>> In 16.3 (subscriber)
>>
>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where oid in
>> ('marinerpapers_atts'::regclass, 'marinerpapers_atts_tmp'::regclass);
>> pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+------------------------
>>          1301943779328 | 1213 GB        | marinerpapers_atts
>>          1295188762624 | 1206 GB        | marinerpapers_atts_tmp
>> (2 rows)
>>
>> The weird thing is that for every other big table, the size on the
>> new machine seems smaller:
>>
>> 10.23 :
>>
>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>> pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+-----------------------
>>           634434142208 | 591 GB         | mail_entity
>>           568166334464 | 529 GB         | fb_reports_files
>>           538761527296 | 502 GB         | vmanews
>>           115932397568 | 108 GB         | approval_request_docs
>>           107413241856 | 100 GB         | status
>>            86767763456 | 81 GB          | fb_reports_dets
>>            82120941568 | 76 GB          | items_atts
>>            81043136512 | 75 GB          | items_atts_tmp
>>            59618181120 | 56 GB          | logging
>>            58586382336 | 55 GB          | apm_evidence
>> (10 rows)
>>
>> 16.3
>>
>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>> pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+-----------------------
>>           542910595072 | 506 GB         | fb_reports_files
>>           515205160960 | 480 GB         | vmanews
>>           491888680960 | 458 GB         | mail_entity
>>           115919945728 | 108 GB         | approval_request_docs
>>            82050457600 | 76 GB          | items_atts
>>            80889028608 | 75 GB          | items_atts_tmp
>>            64453533696 | 60 GB          | fb_reports_dets
>>            57055133696 | 53 GB          | apm_evidence
>>            55286161408 | 51 GB          | logging
>>            52352860160 | 49 GB          | status
>> (10 rows)
>>
>> For what it's worth, I restarted the target (subscriber) 16.3 DB
>> today after recompiling with JIT. I know 10.23 is dead, but any
>> clues or advice would be great anyway.
>>
> The two COPY processes on the publisher have been CPU-active since
> about 11:27:00 EEST this morning.

I guess the initial synchronization, since it is not based on WAL
traffic, does not keep track of LSNs; the sync replication slots on the
publisher appear inactive. Here is what I get on the 10.23 publisher:

postgres(at)[local]/dynacom=# select * from pg_replication_slots where
slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
                   slot_name                   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
-----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
 pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000060 | 4056/C6000098
 pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000028 | 4056/C6000060
 data_for_nu_pgsql_dynacom                     | pgoutput | logical   |  16401 | dynacom  | f         | t      |       5530 |      |   1478474084 | 4056/C7014DF0 | 4056/C8000000
(3 rows)

postgres(at)[local]/dynacom=#

Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files
dating back to 12:30 (maybe created via archive_timeout rather than
actual traffic).
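
For reference, a quick way to double-check that (not something I have
run yet, just a sketch using the restart_lsn values from the output
above) would be, on the 10.23 publisher:

-- map a slot's restart_lsn to the WAL segment file it points into
SELECT pg_walfile_name('4056/C6000060'::pg_lsn);

-- how much WAL each of these slots is holding back, relative to the
-- current insert position
SELECT slot_name,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'data_for_nu_pgsql_dynacom' OR slot_name ~* '_sync_';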

postgres(at)[local]/dynacom=# select * from pg_stat_replication where
application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
 pid  | usesysid  | usename |                application_name                | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |   sent_lsn    |   write_lsn   |   flush_lsn   |  replay_lsn   | write_lag | flush_lag | replay_lag | sync_priority | sync_state
------+-----------+---------+------------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
 5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                      | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
 5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715 | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
 5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715 | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
(3 rows)

On the subscriber side I get:

postgres(at)[local]/dynacom=# select * from pg_stat_subscription where
subname ~* 'dynacom';
  subid   |          subname          |   pid   | leader_pid |  relid   | received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
 54534123 | data_for_nu_pgsql_dynacom | 3143304 |            |          | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000  | 2024-06-12 22:41:23.801094+03
 54534123 | data_for_nu_pgsql_dynacom | 3143305 |            | 54522019 |               | 2024-06-12 11:28:28.43723+03  | 2024-06-12 11:28:28.43723+03  |                | 2024-06-12 11:28:28.43723+03
 54534123 | data_for_nu_pgsql_dynacom | 3143306 |            | 54522027 |               | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 |                | 2024-06-12 11:28:28.448293+03
(3 rows)
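
Something I have not captured here, but which should confirm that those
two relations are still in the initial data-copy phase, is the
per-table sync state on the 16.3 subscriber (srsubstate stays at 'd'
while the COPY runs and only reaches 'r' once the table is ready):

-- per-table sync state on the subscriber
-- i = initialize, d = data being copied, f = finished copy,
-- s = synchronized, r = ready (normal replication)
SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel
WHERE srsubstate <> 'r'
ORDER BY 1;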

Both COPY processes appear CPU-active on the publisher, and the
corresponding logical replication (tablesync) workers on the subscriber
appear active as well.
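
Two more checks I plan to run, sketched here rather than already
captured (the pids are taken from the pg_stat_replication output above,
and pg_stat_progress_copy only exists on the 16.3 side, where, if I am
not mistaken, it also covers the tablesync COPY):

-- on the 10.23 publisher: what the two sync backends are doing / waiting on
SELECT pid, state, wait_event_type, wait_event, backend_start, query
FROM pg_stat_activity
WHERE pid IN (5533, 5534);

-- on the 16.3 subscriber: progress of the tablesync COPY (PG 14+)
SELECT pid, relid::regclass AS table_name, bytes_processed, tuples_processed
FROM pg_stat_progress_copy;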

>> --
>> Achilleas Mantzios
>> IT DEV - HEAD
>> IT DEPT
>> Dynacom Tankers Mgmt (as agents only)
> --
> Achilleas Mantzios
> IT DEV - HEAD
> IT DEPT
> Dynacom Tankers Mgmt (as agents only)

--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)
