Re: logical replication problems (10.4)

From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: logical replication problems (10.4)
Date: 2018-08-06 08:29:31
Message-ID: d9df1e61-743e-c239-f599-07f32dbc1886@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hello,

yesterday, I increased both wal_sender_timeout and wal_receiver_timeout , from '1 min' to '5 min' on publisher / subscriber sides, then I dropped the subscription, I truncated all tables on the
subscriber and recreated the subscription.
In less than 9 hours all 1.2T worth of contents had been transferred to the subscriber, with no errors, I checked row counts everything looks good.

Everything in pg_stat_replication, pg_stat_subscription , pg_replication_slots , looks good.

Unfortunately the day before I hadn't yet made myself comfortable with all the monitoring involved, therefore I might have missed something in my first email, or indeed may have done smth wrong. So I
am not sure if only the increase in timeout might have such a tremendous impact on the reliability of the logical replication system. And it case it has an impact, I am sorry that I wasn't savvy
enough to provide more info (or at least trigger an answer in this thread).

Anyway, I am planning to add some tables to the publication.  I guess, I should add the tables to the publication, then

ALTER SUBSCRIPTION/||/data_for_testsmadb_pub REFRESH PUBLICATION WITH (refresh=true) ;

On 03/08/2018 14:48, Achilleas Mantzios wrote:
> Hello,
> yesterday we did the first effort to deploy our test database server (aimed at being used by developers and testers) using logical replication, using a publication for selected tables (563 in total
> with some of them being pretty large with bytea atts).
> Expected total DB size : about 1.2 TB
> I started it (created the subscription) at about 2018-08-02 11:35:00. By 2018-08-02 22:02:43.778 I got the last logical replication-related message on the subscriber side:
> 2018-08-02 22:02:43.778 EEST  @ line:1 LOG:  logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "moc_final_persons" has started
>
> On the publisher side I got :
> 2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr(at)dynacom line:7 LOG:  terminating walsender process due to replication timeout
> 2018-08-02 22:06:07.726 EEST data_for_testsmadb_pub repmgr(at)dynacom line:8 LOG:  disconnection: session time: 0:04:34.968 user=repmgr database=dynacom host=10.9.0.77 port=47820
> and later
> 2018-08-02 22:10:15.153 EEST data_for_testsmadb_pub_33650_sync_22294 repmgr(at)dynacom line:5 LOG:  duration: 25428550.886 ms  statement: COPY public.marinerpapers_atts TO STDOUT
>
> The nightly barman differential backup took more than twice the time and space than usual, totaling to some 500GB (whereas usually it is about 200 GB).
>
> In the morning I saw no activity. pg_stat_subscription showed last send / receipt time yesterday night (unfortunately I didn't keep this info).
> Then I reacted by restarting the subscriber database and this seemed to "wake up" the logical replication.
> Still this is in progress syncing a large table : marinerpapers_atts (about 400GB) (hopefully the last?).
>
> I don't know if all this is going according to the intended/documented behavior.
>
> Currently I have 2 wal sender / logical replication worker pairs :
>
> one which is doing the streaming and from the looks of it it is pretty close to the LSNs of your upstream physical standby and
> one which still tries to sync this very large table : marinerpapers_atts .
>
> While I was at it I did some tests on a similar table named : marinerpapers_atts_tmp , and here is what I get to my surprise :
>
> Publisher :
> select count(*) from marinerpapers_atts_tmp;
>  count
> --------
>  673391
> (1 row)
>
> Subscriber :
> select count(*) from marinerpapers_atts_tmp;
>  count
> --------
>  673377
> (1 row)
>
> There is a 14 difference in row count, which seems to be consistent, if retry those counts in a while I'll get again 14 difference in row count. I diffed two times and found out the 14 missing rows.
> I got no errors regarding this table, neither on the publisher nor on the subscriber :
>
> Publisher : egrep -e 'data_for.*22304' data/log/postgresql-2018-08-02.log
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:3 LOG:  logical decoding found consistent point at FE5/53827A50
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 15:32:41.181 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:4 DETAIL:  There are no running transactions.
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.502 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:5 LOG:  duration: 23326314.921 ms  statement: COPY
> public.marinerpapers_atts_tmp TO STDOUT
> 10.9.0.77(47819) [113746] 5b62f9e9.1bc52 2018-08-02 22:01:27.669 EEST data_for_testsmadb_pub_33650_sync_22304 repmgr(at)dynacom line:6 LOG:  disconnection: session time: 6:28:46.565 user=repmgr
> database=dynacom host=10.9.0.77 port=47819
>
> Subscriber : egrep -e 'marinerpapers_atts_tmp' data/log/postgresql-02.log
>  [34052] 5b62f9e9.8504 2018-08-02 15:32:41.043 EEST  @ line:1 LOG:  logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "marinerpapers_atts_tmp" has
> started
>  [34052] 5b62f9e9.8504 2018-08-02 22:01:27.663 EEST  @ line:2 LOG:  logical replication table synchronization worker for subscription "data_for_testsmadb_pub", table "marinerpapers_atts_tmp" has
> finished
>
>
> No error at all, but I got those 14 rows missing .
>

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Mariel Cherkassky 2018-08-06 08:48:04 Re: PostgreSQL 11 global index
Previous Message Jehan-Guillaume (ioguix) de Rorthais 2018-08-06 08:02:01 Re: PostgreSQL 11 global index