From: Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com>
To: pgsql-admin@lists.postgresql.org
Subject: Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing
Date: 2024-06-12 19:46:14
Message-ID: 7662073f-832a-4d5e-9f7c-2cb1287d453b@cloud.gatewaynet.com
Lists: pgsql-admin
On 12/6/24 21:09, Achilleas Mantzios wrote:
> On 12/6/24 20:14, Achilleas Mantzios wrote:
>>
>> Hello
>>
>> As the subject says, we are testing a migration plan from 10.23 to
>> 16.3, using logical replication to a new OS, plus checksums, ICU, and
>> a SQL_ASCII -> UTF-8 conversion (some of which proved to be a
>> challenge). After solving all of the problems and repeating the
>> process, the system now seems to have gotten into a weird state in
>> which all of the ~930 tables got synced correctly except the two
>> largest ones. The DB in total (in its 10.23 version) takes up 4.6 TB.
>> In 16.3 it has already reached 4.8 TB and keeps growing. No ERRORs or
>> ROLLBACKs that would justify the growth (to my limited knowledge).
>>
>> Those two tables keep growing and growing in the 16.3 (subscriber) DB.
>>
>> In 10.23 (publisher)
>>
>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
>>  pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+------------------------
>>           1082956537856 | 1009 GB        | marinerpapers_atts
>>            973440409600 | 907 GB         | marinerpapers_atts_tmp
>> (2 rows)
>>
>> postgres@[local]/dynacom=#
>>
>> In 16.3 (subscriber)
>>
>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where oid in ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
>>  pg_total_relation_size | pg_size_pretty |          oid
>> ------------------------+----------------+------------------------
>>           1301943779328 | 1213 GB        | marinerpapers_atts
>>           1295188762624 | 1206 GB        | marinerpapers_atts_tmp
>> (2 rows)
>>
>> The weird thing is that for every one of the remaining big tables,
>> the size on the new machine is smaller:
>>
>> 10.23 :
>>
>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>> pg_total_relation_size | pg_size_pretty | oid
>> ------------------------+----------------+-----------------------
>> 634434142208 | 591 GB | mail_entity
>> 568166334464 | 529 GB | fb_reports_files
>> 538761527296 | 502 GB | vmanews
>> 115932397568 | 108 GB | approval_request_docs
>> 107413241856 | 100 GB | status
>> 86767763456 | 81 GB | fb_reports_dets
>> 82120941568 | 76 GB | items_atts
>> 81043136512 | 75 GB | items_atts_tmp
>> 59618181120 | 56 GB | logging
>> 58586382336 | 55 GB | apm_evidence
>> (10 rows)
>>
>> 16.3
>>
>> postgres@[local]/dynacom=# select pg_total_relation_size(oid),
>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>> pg_total_relation_size | pg_size_pretty | oid
>> ------------------------+----------------+-----------------------
>> 542910595072 | 506 GB | fb_reports_files
>> 515205160960 | 480 GB | vmanews
>> 491888680960 | 458 GB | mail_entity
>> 115919945728 | 108 GB | approval_request_docs
>> 82050457600 | 76 GB | items_atts
>> 80889028608 | 75 GB | items_atts_tmp
>> 64453533696 | 60 GB | fb_reports_dets
>> 57055133696 | 53 GB | apm_evidence
>> 55286161408 | 51 GB | logging
>> 52352860160 | 49 GB | status
>> (10 rows)
>>
>> For what it's worth, I restarted the target (subscriber) 16.3 DB
>> today after recompiling with JIT support. I know 10.23 is EOL, but
>> any clues or advice would be great.
>>
> The two COPY processes on the publisher have been CPU-active since
> about 11:27:00 EEST this morning.
I guess the synchronization, since it is not based on WAL traffic, does
not keep track of LSNs; its replication slots on the publisher appear
dead. Here's what I get on the 10.23 publisher:
postgres@[local]/dynacom=# select * from pg_replication_slots where
slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
-[ RECORD 1 ]-------+----------------------------------------------
slot_name           | pg_54534123_sync_54522027_7374405793436622715
plugin              | pgoutput
slot_type           | logical
datoid              | 16401
database            | dynacom
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 1478474083
restart_lsn         | 4056/C6000060
confirmed_flush_lsn | 4056/C6000098
-[ RECORD 2 ]-------+----------------------------------------------
slot_name           | pg_54534123_sync_54522019_7374405793436622715
plugin              | pgoutput
slot_type           | logical
datoid              | 16401
database            | dynacom
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 1478474083
restart_lsn         | 4056/C6000028
confirmed_flush_lsn | 4056/C6000060
-[ RECORD 3 ]-------+----------------------------------------------
slot_name           | data_for_nu_pgsql_dynacom
plugin              | pgoutput
slot_type           | logical
datoid              | 16401
database            | dynacom
temporary           | f
active              | t
active_pid          | 5530
xmin                |
catalog_xmin        | 1478474084
restart_lsn         | 4056/C7014DF0
confirmed_flush_lsn | 4056/C8000000
(3 rows)
postgres@[local]/dynacom=#
Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files dating
back to 12:30 (maybe created via archive_timeout rather than actual
traffic).
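As a cross-check, the WAL segment file name an LSN falls into can be derived directly on the publisher with pg_walfile_name() (available in 10.x), to compare against the timestamps of the files in pg_wal or the archive; a minimal sketch:

```sql
-- Run on the 10.23 publisher: map the stalled sync slots'
-- restart LSNs to their WAL segment file names.
SELECT pg_walfile_name('4056/C6000060');
SELECT pg_walfile_name('4056/C6000028');
```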
postgres@[local]/dynacom=# select * from pg_stat_replication where
application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
-[ RECORD 1 ]----+----------------------------------------------
pid              | 5530
usesysid         | 525054105
usename          | repmgr
application_name | data_for_nu_pgsql_dynacom
client_addr      | 10.9.230.24
client_hostname  |
client_port      | 47204
backend_start    | 2024-06-12 11:27:55.919914+03
backend_xmin     |
state            | streaming
sent_lsn         | 4056/C8000000
write_lsn        | 4056/C8000000
flush_lsn        | 4056/C8000000
replay_lsn       | 4056/C8000000
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+----------------------------------------------
pid              | 5533
usesysid         | 525054105
usename          | repmgr
application_name | pg_54534123_sync_54522019_7374405793436622715
client_addr      | 10.9.230.24
client_hostname  |
client_port      | 47228
backend_start    | 2024-06-12 11:27:55.944498+03
backend_xmin     | 1478474083
state            | startup
sent_lsn         |
write_lsn        |
flush_lsn        |
replay_lsn       |
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
-[ RECORD 3 ]----+----------------------------------------------
pid              | 5534
usesysid         | 525054105
usename          | repmgr
application_name | pg_54534123_sync_54522027_7374405793436622715
client_addr      | 10.9.230.24
client_hostname  |
client_port      | 47242
backend_start    | 2024-06-12 11:27:55.95233+03
backend_xmin     | 1478474083
state            | startup
sent_lsn         |
write_lsn        |
flush_lsn        |
replay_lsn       |
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
(3 rows)
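To see whether those two sync walsenders (pids 5533 and 5534 from pg_stat_replication above) are actually doing work or waiting on something, a query along these lines on the publisher might help (the wait_event columns exist in 10.x):

```sql
-- Run on the 10.23 publisher: what are the sync walsenders doing?
SELECT pid, state, wait_event_type, wait_event, backend_start, query
FROM pg_stat_activity
WHERE pid IN (5533, 5534);
```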
On the subscriber side I get:
postgres@[local]/dynacom=# select * from pg_stat_subscription where
subname ~* 'dynacom';
-[ RECORD 1 ]---------+------------------------------
subid                 | 54534123
subname               | data_for_nu_pgsql_dynacom
pid                   | 3143304
leader_pid            |
relid                 |
received_lsn          | 4056/C8000000
last_msg_send_time    | 2024-06-12 22:41:23.801094+03
last_msg_receipt_time | 2024-06-12 22:41:57.062591+03
latest_end_lsn        | 4056/C8000000
latest_end_time       | 2024-06-12 22:41:23.801094+03
-[ RECORD 2 ]---------+------------------------------
subid                 | 54534123
subname               | data_for_nu_pgsql_dynacom
pid                   | 3143305
leader_pid            |
relid                 | 54522019
received_lsn          |
last_msg_send_time    | 2024-06-12 11:28:28.43723+03
last_msg_receipt_time | 2024-06-12 11:28:28.43723+03
latest_end_lsn        |
latest_end_time       | 2024-06-12 11:28:28.43723+03
-[ RECORD 3 ]---------+------------------------------
subid                 | 54534123
subname               | data_for_nu_pgsql_dynacom
pid                   | 3143306
leader_pid            |
relid                 | 54522027
received_lsn          |
last_msg_send_time    | 2024-06-12 11:28:28.448293+03
last_msg_receipt_time | 2024-06-12 11:28:28.448293+03
latest_end_lsn        |
latest_end_time       | 2024-06-12 11:28:28.448293+03
(3 rows)
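The per-table sync state is also recorded in pg_subscription_rel on the subscriber; a quick sketch (state codes: i = initialize, d = data is being copied, s = synchronized, r = ready):

```sql
-- Run on the 16.3 subscriber: which tables are still not 'ready'?
SELECT srrelid::regclass AS tbl, srsubstate, srsublsn
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
```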
Both COPY processes appear CPU-active on the publisher, and the
corresponding logical replication workers on the subscriber appear
active as well.
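Since the subscriber is 16.3, pg_stat_progress_copy (added in PostgreSQL 14) should show the COPY FROM that each table sync worker is running; if tuples_processed keeps increasing between samples, the initial copy is still making progress rather than being stuck:

```sql
-- Run on the 16.3 subscriber: progress of the sync workers' COPY.
SELECT pid, relid::regclass AS tbl, command,
       tuples_processed, bytes_processed
FROM pg_stat_progress_copy;
```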
--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)