Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing

From: Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com>
To: pgsql-admin@lists.postgresql.org
Subject: Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing
Date: 2024-06-13 03:58:08
Message-ID: 379e1cc2-9f7c-4841-b561-00462fd3bf6c@cloud.gatewaynet.com
Lists: pgsql-admin

On 12/6/24 22:46, Achilleas Mantzios wrote:
> On 12/6/24 21:09, Achilleas Mantzios wrote:
>> On 12/6/24 20:14, Achilleas Mantzios wrote:
>>>
>>> Hello
>>>
>>> As the subject says, we are testing a migration plan from 10.23 to
>>> 16.3, using logical replication to a new OS, plus checksums, ICU,
>>> and a SQL_ASCII -> UTF-8 conversion (some of which proved to be a
>>> challenge). After solving all of those problems and repeating the
>>> process, the system now seems to be in a weird state: all of the
>>> ~930 tables got synced correctly except the two largest ones. The
>>> whole DB (on 10.23) takes up 4.6 TB; on 16.3 it has already reached
>>> 4.8 TB and keeps growing, with no ERRORs or ROLLBACKs that would
>>> justify the growth (to my limited knowledge).
>>>
>>> Those two tables keep growing and growing in the 16.3 (subscriber) DB.
>>>
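>>> One way to check whether the extra space comes from dead tuples left
>>> behind by aborted transactions (a quick sketch, to be run on the 16.3
>>> subscriber against the two tables named below):
>>>
>>> -- subscriber: dead vs. live tuples and last (auto)vacuum per table
>>> SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
>>> FROM pg_stat_user_tables
>>> WHERE relname IN ('marinerpapers_atts', 'marinerpapers_atts_tmp');
>>>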
>>> In 10.23 (publisher)
>>>
>>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where oid in ('marinerpapers_atts'::regclass,
>>> 'marinerpapers_atts_tmp'::regclass);
>>> pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+------------------------
>>>          1082956537856 | 1009 GB        | marinerpapers_atts
>>>           973440409600 | 907 GB         | marinerpapers_atts_tmp
>>> (2 rows)
>>>
>>> postgres@[local]/dynacom=#
>>>
>>> In 16.3 (subscriber)
>>>
>>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where oid in ('marinerpapers_atts'::regclass,
>>> 'marinerpapers_atts_tmp'::regclass);
>>> pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+------------------------
>>>          1301943779328 | 1213 GB        | marinerpapers_atts
>>>          1295188762624 | 1206 GB        | marinerpapers_atts_tmp
>>> (2 rows)
>>>
>>> The weird thing is that every other big table seems smaller on the
>>> new machine:
>>>
>>> 10.23 :
>>>
>>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where relkind in ('r','p','m')
>>> order by 1 DESC LIMIT 10 OFFSET 2;
>>> pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+-----------------------
>>>           634434142208 | 591 GB         | mail_entity
>>>           568166334464 | 529 GB         | fb_reports_files
>>>           538761527296 | 502 GB         | vmanews
>>>           115932397568 | 108 GB         | approval_request_docs
>>>           107413241856 | 100 GB         | status
>>>            86767763456 | 81 GB          | fb_reports_dets
>>>            82120941568 | 76 GB          | items_atts
>>>            81043136512 | 75 GB          | items_atts_tmp
>>>            59618181120 | 56 GB          | logging
>>>            58586382336 | 55 GB          | apm_evidence
>>> (10 rows)
>>>
>>> 16.3
>>>
>>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where relkind in ('r','p','m')
>>> order by 1 DESC LIMIT 10 OFFSET 2;
>>> pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+-----------------------
>>>           542910595072 | 506 GB         | fb_reports_files
>>>           515205160960 | 480 GB         | vmanews
>>>           491888680960 | 458 GB         | mail_entity
>>>           115919945728 | 108 GB         | approval_request_docs
>>>            82050457600 | 76 GB          | items_atts
>>>            80889028608 | 75 GB          | items_atts_tmp
>>>            64453533696 | 60 GB          | fb_reports_dets
>>>            57055133696 | 53 GB          | apm_evidence
>>>            55286161408 | 51 GB          | logging
>>>            52352860160 | 49 GB          | status
>>> (10 rows)
>>>
>>> For what it's worth, I restarted the target (subscriber) 16.3 DB
>>> today after recompiling with JIT. I know 10.23 is long dead, but
>>> any clues or advice would be great.
>>>
>> The two COPY processes on the publisher have been CPU-active since
>> about 11:27:00 EEST this morning.
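>>
>> To see exactly which backends those are (a minimal sketch on the
>> 10.23 publisher; the ILIKE pattern is just an assumption about how
>> the table-sync COPY appears in pg_stat_activity):
>>
>> -- publisher: backends currently running a COPY
>> SELECT pid, state, backend_start, query_start, left(query, 60) AS query
>> FROM pg_stat_activity
>> WHERE query ILIKE 'copy %';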
>
> I guess the synchronization, since it is not based on WAL traffic,
> does not keep track of LSNs; its replication slots on the publisher
> appear dead. Here is what I get on the 10.23 publisher:
>
>
> postgres@[local]/dynacom=# select * from pg_replication_slots where
> slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
>                    slot_name                   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
> -----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
>  pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000060 | 4056/C6000098
>  pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000028 | 4056/C6000060
>  data_for_nu_pgsql_dynacom                     | pgoutput | logical   |  16401 | dynacom  | f         | t      |       5530 |      |   1478474084 | 4056/C7014DF0 | 4056/C8000000
> (3 rows)
>
> postgres@[local]/dynacom=#
>
> Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files
> dating back to 12:30 (maybe created via archive_timeout rather than
> actual traffic).
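>
> To map those LSNs to the actual WAL segment file names on the
> publisher (a small sketch; pg_walfile_name() is available on 10.x):
>
> -- publisher: which WAL segment holds each sync slot's restart_lsn
> SELECT pg_walfile_name('4056/C6000060'), pg_walfile_name('4056/C6000028');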
>
> postgres@[local]/dynacom=# select * from pg_stat_replication where
> application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
>  pid  | usesysid  | usename |                application_name               | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |   sent_lsn    |   write_lsn   |   flush_lsn   |  replay_lsn   | write_lag | flush_lag | replay_lag | sync_priority | sync_state
> ------+-----------+---------+-----------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
>  5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                     | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
>  5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715 | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
>  5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715 | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
> (3 rows)
>
> On the subscription side I get:
>
> postgres@[local]/dynacom=# select * from pg_stat_subscription where
> subname ~* 'dynacom';
>   subid   |          subname          |   pid   | leader_pid |  relid   | received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
> ----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
>  54534123 | data_for_nu_pgsql_dynacom | 3143304 |            |          | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000  | 2024-06-12 22:41:23.801094+03
>  54534123 | data_for_nu_pgsql_dynacom | 3143305 |            | 54522019 |               | 2024-06-12 11:28:28.43723+03  | 2024-06-12 11:28:28.43723+03  |                | 2024-06-12 11:28:28.43723+03
>  54534123 | data_for_nu_pgsql_dynacom | 3143306 |            | 54522027 |               | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 |                | 2024-06-12 11:28:28.448293+03
> (3 rows)
>
> Both COPY processes seem CPU-active on the publisher, and the
> corresponding logical replication workers on the subscriber seem
> active as well.
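>
> On a 16.3 subscriber the table-sync COPY should also be visible in
> pg_stat_progress_copy (a quick sketch; whether the sync workers show
> up there for this workload is an assumption on my part):
>
> -- subscriber: progress of any running COPY, including table sync
> SELECT pid, relid::regclass AS table_name, bytes_processed, tuples_processed
> FROM pg_stat_progress_copy;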
>
Hi again, it seems it finished, with the correct row count for both
tables. The sync phase runs in one single transaction, and it seems a
DB (subscriber) restart cancels it and makes the COPY start all over,
hence the lost space. I will have to try once again from scratch,
without any interruptions.
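
To confirm on the subscriber which tables have actually reached the
"ready" state (a short sketch; 'r' = ready, 'd' = data is being copied):

-- subscriber: per-table sync state of the subscription
SELECT srrelid::regclass AS table_name, srsubstate, srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate;

A plain VACUUM on the two tables should make the space left behind by the
aborted COPY runs reusable, though only a VACUUM FULL (or a clean re-sync)
would actually return it to the OS.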
>
>>> --
>>> Achilleas Mantzios
>>> IT DEV - HEAD
>>> IT DEPT
>>> Dynacom Tankers Mgmt (as agents only)
>> --
>> Achilleas Mantzios
>> IT DEV - HEAD
>> IT DEPT
>> Dynacom Tankers Mgmt (as agents only)
> --
> Achilleas Mantzios
> IT DEV - HEAD
> IT DEPT
> Dynacom Tankers Mgmt (as agents only)

--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)
