Re: Logical replication existing data copy

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: Logical replication existing data copy
Date: 2017-03-29 13:06:41
Message-ID: 1f810a1c-d57b-e8af-44cb-2e55defb30f8@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 29/03/17 10:14, Erik Rijkers wrote:
> On 2017-03-09 11:06, Erik Rijkers wrote:
>>>>
>>>> I use three different machines (2 desktop, 1 server) to test logical
>>>> replication, and all three have now at least once failed to correctly
>>>> synchronise a pgbench session (amidst many succesful runs, of course)
>>>
>
>
> (At the moment using tese patches for tests:)
>
>> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
>> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
>> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
>> 0005-Skip-unnecessary-snapshot-builds.patch +
>
>
> The failed tests that I kept seeing (see the
> pgbench-over-logical-replication tests upthread) were never really
> 'solved'.
>
>
> But I have now finally figured out what caused these unexpected failed
> tests: it was wal_sender_timeout or rather, its default of 60 s.
>
> This caused 'terminating walsender process due to replication timeout'
> on the primary (not strictly an error), and the concomittant ERROR on
> the replica: 'could not receive data from WAL stream: server closed the
> connection unexpectedly'.
>
> here is a typical example (primary/replica logs time-intertwined, with
> 'primary'):
>
> [...]
> 2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale
> statistics instead of current ones because stats collector is not
> responding
> 2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating
> walsender process due to
> 2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not
> receive data from WAL stream: server closed the connection unexpectedly
> 2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process:
> logical replication worker for subscription 24864 (PID 14926) exited
> with exit code 1
> 2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical
> replication worker for subscription "sub1"
> 2017-03-24 16:21:49.828 CET [15008] replica LOG: logical
> replication apply for subscription sub1 started
> 2017-03-24 16:21:49.832 CET [15009] primary LOG: received
> replication command: IDENTIFY_SYSTEM
> 2017-03-24 16:21:49.832 CET [15009] primary LOG: received
> replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440
> (proto_version '1', publication_names '"pub1"')
> 2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming
> transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
> 2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical
> decoding for slot "sub1"
> 2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical
> decoding will begin using saved snapshot.
> 2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding
> found consistent point at 3/FC820FC0
> 2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key
> (hid)=(9014) already exists.
> 2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key
> value violates unique constraint "pgbench_history_pkey"
> 2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process:
> logical replication worker for subscription 24864 (PID 15008) exited
> with exit code 1
> 2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical
> replication worker for subscription "sub1"
> [...]
>
> My primary and replica were always on a single machine (making it more
> likely that that timeout is reached?). In my testing it seems that
> reaching the timeout on the primary (and 'closing the connection
> unexpectedly' on the replica) does not necessarily break the logical
> replication. But almost all log-rep failures that I have seen were
> started by this sequence of events.
>
> After setting wal_sender_timeout to 3 minutes there were no more
> failed tests.
>
> Perhaps it warrants setting wal_sender_timeout a bit higher than the
> current default of 60 seconds? After all I also saw the 'replication
> timeout' / 'closed the connection' couple rather often during
> not-failing tests. (These also disappeared, almost completely, with a
> higher setting of wal_sender_timeout)
>
> In any case it would be good to mention the setting (and its potentially
> deteriorating effect) somehere nearer the logical replication treatment.
>
> ( I read about wal_sender_timeout and keepalive ping, perhaps there's
> (still) something amiss there? Just a guess, I don't know )
>
> As I said, I saw no more failures with the higher 3 minute setting, with
> one exception: the one test that straddled the DST change (saterday 24
> march 02:00 h). I am happy to discount that one failure but strictly
> speaking I suppose it should be able to take DST into its stride.
>
>

Hi,

I think what you have seen is because of this:
https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1abb2e(at)2ndquadrant(dot)com#b235fa69-147a-5e09-f8f3-3f780a1abb2e@2ndquadrant.com

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-03-29 13:08:03 Re: Allow interrupts on waiting standby
Previous Message Nikolay Shaplov 2017-03-29 12:51:43 Re: [PATCH v.7] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM