From: | Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com> |
---|---|
To: | pgsql-admin(at)lists(dot)postgresql(dot)org |
Subject: | Re: logical replication problems (10.4) |
Date: | 2018-08-06 08:29:31 |
Message-ID: | d9df1e61-743e-c239-f599-07f32dbc1886@matrix.gatewaynet.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
Hello,
yesterday, I increased both wal_sender_timeout and wal_receiver_timeout , from '1 min' to '5 min' on publisher / subscriber sides, then I dropped the subscription, I truncated all tables on the
subscriber and recreated the subscription.
In less than 9 hours all 1.2T worth of contents had been transferred to the subscriber, with no errors, I checked row counts everything looks good.
Everything in pg_stat_replication, pg_stat_subscription , pg_replication_slots , looks good.
Unfortunately the day before I hadn't yet made myself comfortable with all the monitoring involved, therefore I might have missed something in my first email, or indeed may have done smth wrong. So I
am not sure if only the increase in timeout might have such a tremendous impact on the reliability of the logical replication system. And it case it has an impact, I am sorry that I wasn't savvy
enough to provide more info (or at least trigger an answer in this thread).
Anyway, I am planning to add some tables to the publication. I guess, I should add the tables to the publication, then
ALTER SUBSCRIPTION/||/data_for_testsmadb_pub REFRESH PUBLICATION WITH (refresh=true) ;
On 03/08/2018 14:48, Achilleas Mantzios wrote:
> Hello,
> yesterday we did the first effort to deploy our test database server (aimed at being used by developers and testers) using logical replication, using a publication for selected tables (563 in total
> with some of them being pretty large with bytea atts).
> Expected total DB size : about 1.2 TB
> I started it (created the subscription) at about 2018-08-02 11:35:00. By 2018-08-02 22:02:43.778 I got the last logical replication-related message on the subscriber side:
> 2018-08-02 22:02:43.778 EEST @ line:1 LOG: logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "moc_final_persons" has started
>
> On the publisher side I got :
> 2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr(at)dynacom line:7 LOG: terminating walsender process due to replication timeout
> 2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr(at)dynacom line:8 LOG: disconnection: session time: 0:04:34.968 user=repmgr database=dynacom host=10.9.0.77 port=47820
> and later
> 2018-08-02 22:10:15.153 EEST data_for_testsmadb_pub_33650_sync_22294 repmgr(at)dynacom line:5 LOG: duration: 25428550.886 ms statement: COPY public.marinerpapers_atts TO STDOUT
>
> The nightly barman differential backup took more than twice the time and space than usual, totaling to some 500GB (whereas usually it is about 200 GB).
>
> In the morning I saw no activity. pg_stat_subscription showed last send / receipt time yesterday night (unfortunately I didn't keep this info).
> Then I reacted by restarting the subscriber database and this seemed to "wake up" the logical replication.
> Still this is in progress syncing a large table : marinerpapers_atts (about 400GB) (hopefully the last?).
>
> I don't know if all this is going according to the intended/documented behavior.
>
> Currently I have 2 wal sender / logical replication worker pairs :
>
> one which is doing the streaming and from the looks of it it is pretty close to the LSNs of your upstream physical standby and
> one which still tries to sync this very large table : marinerpapers_atts .
>
> While I was at it I did some tests on a similar table named : marinerpapers_atts_tmp , and here is what I get to my surprise :
>
> Publisher :
> select count(*) from marinerpapers_atts_tmp;
> count
> --------
> 673391
> (1 row)
>
> Subscriber :
> select count(*) from marinerpapers_atts_tmp;
> count
> --------
> 673377
> (1 row)
>
> There is a 14 difference in row count, which seems to be consistent, if retry those counts in a while I'll get again 14 difference in row count. I diffed two times and found out the 14 missing rows.
> I got no errors regarding this table, neither on the publisher nor on the subscriber :
>
> Publisher : egrep -e 'data_for.*22304' data/log/postgresql-2018-08-02.log
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:3 LOG: logical decoding found consistent point at FE5/53827A50
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:4 DETAIL: There are no running transactions.
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.502 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:5 LOG: duration: 23326314.921 ms statement: COPY
> public.marinerpapers_atts_tmp TO STDOUT
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.669 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:6 LOG: disconnection: session time: 6:28:46.565 user=repmgr
> database=dynacom host=10.9.0.77 port=47819
>
> Subscriber : egrep -e 'marinerpapers_atts_tmp' data/log/postgresql-02.log
> [34052] 5b62f9e9.8504 2018-08-02 15:32:41.043 EEST @ line:1 LOG: logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "marinerpapers_atts_tmp" has
> started
> [34052] 5b62f9e9.8504 2018-08-02 22:01:27.663 EEST @ line:2 LOG: logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "marinerpapers_atts_tmp" has
> finished
>
>
> No error at all, but I got those 14 rows missing .
>
--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
From | Date | Subject | |
---|---|---|---|
Next Message | Mariel Cherkassky | 2018-08-06 08:48:04 | Re: PostgreSQL 11 global index |
Previous Message | Jehan-Guillaume (ioguix) de Rorthais | 2018-08-06 08:02:01 | Re: PostgreSQL 11 global index |