From: Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication from PostgreSQL 10.23 to 16.3, last two big tables (getting synced) just keep growing
Date: 2024-06-13 03:58:08
Message-ID: 379e1cc2-9f7c-4841-b561-00462fd3bf6c@cloud.gatewaynet.com
Lists: pgsql-admin
On 12/6/24 22:46, Achilleas Mantzios wrote:
> On 12/6/24 21:09, Achilleas Mantzios wrote:
>> On 12/6/24 20:14, Achilleas Mantzios wrote:
>>>
>>> Hello
>>>
>>> As the subject says, we are testing a migration plan from 10.23 to
>>> 16.3, using logical replication to a new OS, plus checksums, ICU,
>>> and an SQL_ASCII -> UTF-8 conversion (some of which proved to be a
>>> challenge). After solving all of the problems and repeating the
>>> process, the system now seems to have gotten into a weird state, in
>>> which all of the ~930 tables got synced correctly except the two
>>> largest ones. The DB in total (in its 10.23 version) takes up
>>> 4.6 TB. In 16.3 it has already taken 4.8 TB and keeps growing. No
>>> ERRORs or ROLLBACKs that would justify the growth (to my limited
>>> knowledge).
>>>
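>>> (One way to double-check that on the 16.3 subscriber is the
>>> per-subscription error counters; a sketch, assuming the
>>> pg_stat_subscription_stats view that exists from PG 15 onward:
>>>
>>> -- apply_error_count / sync_error_count should stay at 0 if the
>>> -- growth is not caused by failing and restarting workers
>>> select subname, apply_error_count, sync_error_count
>>> from pg_stat_subscription_stats;
>>> )
>>>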
>>> Those two tables keep growing and growing in the 16.3 (subscriber) DB.
>>>
>>> In 10.23 (publisher)
>>>
>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where oid in
>>> ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
>>>  pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+------------------------
>>>           1082956537856 | 1009 GB        | marinerpapers_atts
>>>            973440409600 | 907 GB         | marinerpapers_atts_tmp
>>> (2 rows)
>>>
>>> postgres(at)[local]/dynacom=#
>>>
>>> In 16.3 (subscriber)
>>>
>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where oid in
>>> ('marinerpapers_atts'::regclass,'marinerpapers_atts_tmp'::regclass);
>>>  pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+------------------------
>>>           1301943779328 | 1213 GB        | marinerpapers_atts
>>>           1295188762624 | 1206 GB        | marinerpapers_atts_tmp
>>> (2 rows)
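>>>
>>> (To see which tables are still in the initial copy phase on the
>>> subscriber, a query along these lines against pg_subscription_rel
>>> should do; srsubstate is 'i' = initialize, 'd' = data copy,
>>> 'f' = finished copy, 's' = synchronized, 'r' = ready:
>>>
>>> select srrelid::regclass as table_name, srsubstate
>>> from pg_subscription_rel
>>> where srsubstate <> 'r';
>>> )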
>>>
>>> The weird thing is that for every other one of the big tables, the
>>> size on the new machine seems smaller:
>>>
>>> 10.23 :
>>>
>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>>>  pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+-----------------------
>>>            634434142208 | 591 GB         | mail_entity
>>>            568166334464 | 529 GB         | fb_reports_files
>>>            538761527296 | 502 GB         | vmanews
>>>            115932397568 | 108 GB         | approval_request_docs
>>>            107413241856 | 100 GB         | status
>>>             86767763456 | 81 GB          | fb_reports_dets
>>>             82120941568 | 76 GB          | items_atts
>>>             81043136512 | 75 GB          | items_atts_tmp
>>>             59618181120 | 56 GB          | logging
>>>             58586382336 | 55 GB          | apm_evidence
>>> (10 rows)
>>>
>>> 16.3
>>>
>>> postgres(at)[local]/dynacom=# select pg_total_relation_size(oid),
>>> pg_size_pretty(pg_total_relation_size(oid)), oid::regclass from
>>> pg_class where relkind in ('r','p','m') order by 1 DESC LIMIT 10 OFFSET 2;
>>>  pg_total_relation_size | pg_size_pretty |          oid
>>> ------------------------+----------------+-----------------------
>>>            542910595072 | 506 GB         | fb_reports_files
>>>            515205160960 | 480 GB         | vmanews
>>>            491888680960 | 458 GB         | mail_entity
>>>            115919945728 | 108 GB         | approval_request_docs
>>>             82050457600 | 76 GB          | items_atts
>>>             80889028608 | 75 GB          | items_atts_tmp
>>>             64453533696 | 60 GB          | fb_reports_dets
>>>             57055133696 | 53 GB          | apm_evidence
>>>             55286161408 | 51 GB          | logging
>>>             52352860160 | 49 GB          | status
>>> (10 rows)
>>>
>>> For what it's worth, I restarted the target (subscriber) 16.3 DB
>>> today after recompiling it with JIT support. I know 10.23 is EOL,
>>> but anyway, any clues or advice would be great.
>>>
>> The two COPY processes on the publisher have been CPU-active since
>> about 11:27:00 EEST this morning.
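>>
>> (Something like the following on the publisher should show them;
>> the LIKE filter is an assumption that the tablesync workers appear
>> as plain COPY statements in pg_stat_activity:
>>
>> select pid, state, wait_event_type, wait_event, query_start,
>>        left(query, 60) as query
>> from pg_stat_activity
>> where query like 'COPY%';
>> )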
>
> I guess the table synchronization, since it is not based on WAL
> traffic, does not keep track of LSNs; its replication slots on the
> publisher appear dead. Here is what I get on the 10.23 publisher:
>
>
> postgres(at)[local]/dynacom=# select * from pg_replication_slots where
> slot_name = 'data_for_nu_pgsql_dynacom' or slot_name ~* '_sync_';
>                    slot_name                    |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
> -----------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
>  pg_54534123_sync_54522027_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000060 | 4056/C6000098
>  pg_54534123_sync_54522019_7374405793436622715 | pgoutput | logical   |  16401 | dynacom  | f         | f      |            |      |   1478474083 | 4056/C6000028 | 4056/C6000060
>  data_for_nu_pgsql_dynacom                     | pgoutput | logical   |  16401 | dynacom  | f         | t      |       5530 |      |   1478474084 | 4056/C7014DF0 | 4056/C8000000
> (3 rows)
>
> postgres(at)[local]/dynacom=#
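>
> (To put a number on how much WAL each slot is pinning, something
> like this should work on 10.23, where pg_current_wal_lsn() and
> pg_wal_lsn_diff() are both available:
>
> select slot_name, active, restart_lsn,
>        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
>          as retained_wal
> from pg_replication_slots;
> )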
>
> Those two LSNs (4056/C6000060, 4056/C6000028) belong to WAL files
> dating back to 12:30 (maybe created via archive_timeout rather than
> actual traffic).
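>
> (The LSN-to-WAL-file mapping itself can be confirmed with
> pg_walfile_name(), e.g.:
>
> select pg_walfile_name('4056/C6000060'), pg_walfile_name('4056/C6000028');
> )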
>
> postgres(at)[local]/dynacom=# select * from pg_stat_replication where
> application_name = 'data_for_nu_pgsql_dynacom' or application_name ~* '_sync_';
>  pid  | usesysid  | usename |               application_name                | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |   sent_lsn    |   write_lsn   |   flush_lsn   |  replay_lsn   | write_lag | flush_lag | replay_lag | sync_priority | sync_state
> ------+-----------+---------+-----------------------------------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+---------------+---------------+---------------+-----------+-----------+------------+---------------+------------
>  5530 | 525054105 | repmgr  | data_for_nu_pgsql_dynacom                     | 10.9.230.24 |                 |       47204 | 2024-06-12 11:27:55.919914+03 |              | streaming | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 | 4056/C8000000 |           |           |            |             0 | async
>  5533 | 525054105 | repmgr  | pg_54534123_sync_54522019_7374405793436622715 | 10.9.230.24 |                 |       47228 | 2024-06-12 11:27:55.944498+03 |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
>  5534 | 525054105 | repmgr  | pg_54534123_sync_54522027_7374405793436622715 | 10.9.230.24 |                 |       47242 | 2024-06-12 11:27:55.95233+03  |   1478474083 | startup   |               |               |               |               |           |           |            |             0 | async
> (3 rows)
>
> On the subscription side I get:
>
> postgres(at)[local]/dynacom=# select * from pg_stat_subscription where
> subname ~* 'dynacom';
>   subid   |          subname          |   pid   | leader_pid |  relid   | received_lsn  |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time
> ----------+---------------------------+---------+------------+----------+---------------+-------------------------------+-------------------------------+----------------+-------------------------------
>  54534123 | data_for_nu_pgsql_dynacom | 3143304 |            |          | 4056/C8000000 | 2024-06-12 22:41:23.801094+03 | 2024-06-12 22:41:57.062591+03 | 4056/C8000000  | 2024-06-12 22:41:23.801094+03
>  54534123 | data_for_nu_pgsql_dynacom | 3143305 |            | 54522019 |               | 2024-06-12 11:28:28.43723+03  | 2024-06-12 11:28:28.43723+03  |                | 2024-06-12 11:28:28.43723+03
>  54534123 | data_for_nu_pgsql_dynacom | 3143306 |            | 54522027 |               | 2024-06-12 11:28:28.448293+03 | 2024-06-12 11:28:28.448293+03 |                | 2024-06-12 11:28:28.448293+03
> (3 rows)
>
> Both COPY processes seem CPU-active on the publisher, and the
> corresponding logical replication workers on the subscriber seem
> active as well.
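>
> (On the subscriber they can be spotted in pg_stat_activity; the
> backend_type filter is an assumption about how the apply and
> tablesync workers are labeled in 16.3:
>
> select pid, backend_type, state, wait_event_type, wait_event
> from pg_stat_activity
> where backend_type like '%logical replication%';
> )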
>
Hi again, it seems the sync finished, with the correct row count for
both tables. The sync phase runs in one single transaction, and it
seems a subscriber DB restart cancels it and makes the COPY start all
over, hence the lost space. I will have to try once again from
scratch, without any interruptions.
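
(Before retrying I will also check how much of the lost space is
actually reclaimable; a sketch, with VACUUM FULL as one option, which
rewrites the table under an exclusive lock, acceptable here only
because the subscriber is not serving anything yet:

-- dead tuples left behind by the cancelled COPYs
select relname, n_live_tup, n_dead_tup
from pg_stat_user_tables
where relname in ('marinerpapers_atts', 'marinerpapers_atts_tmp');

-- rewrite the table to hand the space back to the OS
VACUUM FULL marinerpapers_atts;
)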
--
Achilleas Mantzios
IT DEV - HEAD
IT DEPT
Dynacom Tankers Mgmt (as agents only)