Re: Unexpected data when subscribing to logical replication slot

From: Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
To: Daniel McKenzie <daniel(dot)mckenzie(at)curvedental(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, tomas(dot)vondra(at)enterprisedb(dot)com, pgsql-general(at)postgresql(dot)org
Subject: Re: Unexpected data when subscribing to logical replication slot
Date: 2024-05-09 15:16:45
Message-ID: CAKkG4_nXXXauGatuEgb1xKJZ7omc5HgCnwEfY4b=0zmwLv3stw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Sorry, to correct myself. The pg_xact bit is written with the next
checkpoint. But the COMMIT record in the WAL is there.

On Thu, May 9, 2024 at 5:14 PM Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
wrote:

> I would not find this behavior surprising in particular if you have a
> synchronous replica. According to the documentation of synchronous_commit:
>
> The local behavior of all non-off modes is to wait for local flush of
> WAL to disk.
>
> This is when the logical decoder sees the item. But that does not mean the
> change is visible to other transactions in the MVCC sense. So, if wal2json
> and the rest of your stuff is fast enough, then the enrichment query may
> very well read old data.
>
> A transaction being committed means all the WAL has been written (and
> usually synced) to disk including the bit in the pg_xact directory.
>
> On Thu, May 9, 2024 at 9:32 AM Daniel McKenzie <
> daniel(dot)mckenzie(at)curvedental(dot)com> wrote:
>
>> Asynchronous commit introduces the risk of data loss. There is a short
>>> time window between the report of transaction completion to the client
>>> and the time that the transaction is truly committed.
>>
>>
>> The documentation speaks about synchronous_commit changing how
>> transactions change behaviour for the client. So in this case, my psql
>> terminal is the client, and I would expect a faster commit (from its
>> perspective) and then a period of risk (as a process usually done as part
>> of the commit is now being done in the background) but it's not clear how
>> that affects a replication slot subscriber.
>>
>> What we're struggling to understand is: why are we seeing any updates in
>> the replication slot before they have been "truly committed"?
>>
>> There appears to be a state of limbo between updating data and that data
>> being available to query (and our subscriber is picking up changes in this
>> period of time) but I can't pin down any documentation which describes it.
>>
>> We've had this running in live now for years without a hiccup so we are
>> surprised to learn that we have this massive race condition and it just so
>> happens that the hardware is fast enough to process the transaction before
>> the .NET application can react to replication slot changes.
>>
>> Daniel McKenzie
>> Software Developer
>>
>> Office: +1 403.910.5927 x 251
>> Mobile: +44 7712 159045
>> Website: www.curvedental.com
>>
>> *Curve Dental Confidentiality Notice*
>> This message is intended exclusively for the individual or entity to
>> which it is addressed. This communication may contain information that
>> is proprietary, privileged, confidential, or otherwise legally exempt from
>> disclosure. If you are not the named addressee, you are not authorized
>> to read, print, retain, copy, or disseminate this message or any part of
>> it. If you have received this message in error, please notify the sender
>> immediately by replying to this e-mail and delete all copies of this
>> message.
>>
>>
>> On Wed, May 8, 2024 at 5:28 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
>> wrote:
>>
>>> On 5/8/24 08:24, Daniel McKenzie wrote:
>>> > It's running both (in docker containers) and also quite a few more
>>> > docker containers running various .NET applications.
>>>
>>> I think what you found is that the r7a.medium instance is not capable
>>> enough to do all that it is asked without introducing lag under load.
>>> Answering the questions posed by Tomas Vondra would help get to the
>>> actual cause of the lag.
>>>
>>> In meantime my suspicion is this part:
>>>
>>> "For example, when I use a psql terminal to update a user's last name
>>> from "Jones" to "Smith" then I would expect the enrichment query to find
>>> "Smith" but it will sometimes still find "Jones". It finds the old data
>>> perhaps 1 in 50 times."
>>>
>>> If this is being run against the Postgres server my guess is that
>>> synchronous_commit=on is causing the commit on the server to wait for
>>> the WAL records to be flushed to disk and this is not happening in a
>>> timely manner in the '... 1 in 50 times' you mention. In that case you
>>> see the old values not the new committed values. This seems to be
>>> confirmed when you set synchronous_commit=off and don't see old values.
>>> For completeness per:
>>>
>>> https://www.postgresql.org/docs/current/wal-async-commit.html
>>>
>>> "However, for short transactions this delay is a major component of the
>>> total transaction time. Selecting asynchronous commit mode means that
>>> the server returns success as soon as the transaction is logically
>>> completed, before the WAL records it generated have actually made their
>>> way to disk. This can provide a significant boost in throughput for
>>> small transactions.
>>>
>>> Asynchronous commit introduces the risk of data loss. There is a short
>>> time window between the report of transaction completion to the client
>>> and the time that the transaction is truly committed (that is, it is
>>> guaranteed not to be lost if the server crashes). ...
>>> "
>>>
>>> >
>>> > Daniel McKenzie
>>> > Software Developer
>>>
>>> --
>>> Adrian Klaver
>>> adrian(dot)klaver(at)aklaver(dot)com
>>>
>>>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2024-05-09 16:45:13 Re: Unexpected data when subscribing to logical replication slot
Previous Message Torsten Förtsch 2024-05-09 15:14:59 Re: Unexpected data when subscribing to logical replication slot