Re: logical decoding and replication of sequences

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Petr Jelinek <petr(dot)jelinek(at)enterprisedb(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: logical decoding and replication of sequences
Date: 2022-04-06 14:13:52
Message-ID: d2a1fd50-0878-65f2-1e81-772065914f86@enterprisedb.com
Lists: pgsql-hackers

On 4/5/22 12:06, Amit Kapila wrote:
> On Mon, Apr 4, 2022 at 3:10 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> I did some experiments over the weekend, exploring how to rework the
>> sequence decoding in various ways. Let me share some WIP patches,
>> hopefully that can be useful for trying more stuff and moving this
>> discussion forward.
>>
>> I tried two things - (1) accumulating sequence increments in global
>> array and then doing something with it, and (2) treating all sequence
>> increments as regular changes (in a TXN) and then doing something
>> special during the replay. Attached are two patchsets, one for each
>> approach.
>>
>> Note: It's important to remember decoding of sequences is not the only
>> code affected by this. The logical messages have the same issue,
>> certainly when it comes to transactional vs. non-transactional stuff and
>> handling of snapshots. Even if the sequence decoding ends up being
>> reverted, we still need to fix that, somehow. And my feeling is the
>> solutions ought to be pretty similar in both cases.
>>
>> Now, regarding the two approaches:
>>
>> (1) accumulating sequences in global hash table
>>
>> The main problem with regular sequence increments is that those need to
>> be non-transactional - a transaction may use a sequence without any
>> WAL-logging, if the WAL was written by an earlier transaction. The
>> problem is the earlier transaction might have been rolled back, and thus
>> simply discarded by the logical decoding. But we still need to apply
>> that, in order not to lose the sequence increment.
>>
>> The current code applies those non-transactional increments right after
>> decoding them, but that does not work because we may not have a snapshot
>> at that point. And we only have a snapshot while within a transaction
>> (AFAICS), so this patch queues all the changes and applies them later.
>>
>> The changes need to be shared by all transactions, so queueing them in a
>> global hash table works fairly well - otherwise we'd have to walk all
>> transactions to see if there are relevant sequence increments.
>>
>> But some increments may be transactional, e.g. when the sequence is
>> created or altered in a transaction. To allow tracking this, this uses a
>> hash table, with relfilenode as a key.
>>
>> There's a couple of issues with this, though. Firstly, because the
>> changes are stashed outside transactions, they are not included in memory
>> accounting, not spilled to disk or streamed, etc. I guess fixing this is
>> possible, but it's certainly not straightforward, because we mix
>> increments from many different transactions.
>>
>> A bigger issue is that I'm not sure this actually handles the snapshots
>> correctly either.
>>
>> The non-transactional increments affect all transactions, so when
>> ReorderBufferProcessSequences gets executed, it processes all of them,
>> no matter the source transaction. Can we be sure the snapshot in the
>> applying transaction is the same (or "compatible") as the snapshot in
>> the source transaction?
>>
>
> I don't think we can assume that. I think it is possible that some
> other transaction's WAL can be in-between start/end lsn of txn (which
> we decide to send) which may not finally reach a consistent state.
> Consider a case similar to shown in one of my previous emails:
> Session-2:
> Begin;
> SELECT pg_current_xact_id();
>
> Session-1:
> SELECT 'init' FROM pg_create_logical_replication_slot('test_slot',
> 'test_decoding', false, true);
>
> Session-3:
> Begin;
> SELECT pg_current_xact_id();
>
> Session-2:
> Commit;
> Begin;
> INSERT INTO t1_seq SELECT nextval('seq1') FROM generate_series(1,100);
>
> Session-3:
> Commit;
>
> Session-2:
> Commit;
>
> Here, we send changes (say insert from txn 700) from session-2 because
> session-3's commit happens before it. Now, consider another
> transaction parallel to txn 700 which generates some WAL related to
> sequences but committed before session-3's commit. So, even though its
> changes will be in between the start/end LSN of txn 700, they shouldn't
> be sent.
>
> I have not tried this, and it may be solvable in some way, but
> processing changes from other TXNs sounds risky to me in terms of
> snapshot handling.
>

Yes, I know this can happen. I was only really thinking about what might
happen to the relfilenode of the sequence itself - and I don't think any
concurrent transaction could swoop in and change the relfilenode in any
meaningful way, due to locking.

But of course, if we expect/require a perfect snapshot for that exact
position in the transaction, this won't work. IMO the whole idea that we
can have non-transactional bits in naturally transactional decoding seems
a bit suspicious (at least in hindsight).
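
Just to make the non-transactional nature concrete, here's a minimal psql
sketch (sequence name made up) - a rollback does not undo the increment,
and WAL is only written once per batch of values, so a later nextval()
may produce no sequence WAL at all:

  CREATE SEQUENCE s;

  BEGIN;
  SELECT nextval('s');   -- returns 1, WAL-logs a batch of future values
  ROLLBACK;              -- the increment is NOT undone

  SELECT nextval('s');   -- returns 2, typically without writing new WAL
  SELECT last_value, log_cnt FROM s;   -- log_cnt = values still covered
                                       -- by the already-written WAL record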

No matter what we do for sequences, though, this still affects logical
messages too. Not sure what to do there :-(

>>
>>
>> (2) treating sequence change as regular changes
>>
>> This adopts a different approach - instead of accumulating the sequence
>> increments in a global hash table, it treats them as regular changes.
>> Which solves the snapshot issue, and issues with spilling to disk,
>> streaming and so on.
>>
>> But it has various other issues with handling concurrent transactions,
>> unfortunately, which probably make this approach infeasible:
>>
>> * The non-transactional stuff has to be applied in the first transaction
>> that commits, not in the transaction that generated the WAL. That does
>> not work too well with this approach, because we have to walk changes in
>> all other transactions.
>>
>
> Why do you want to traverse other TXNs in this approach? Is it because
> the current TXN might be using some value of the sequence which was
> actually WAL-logged in the other transaction, but that other
> transaction has not been sent yet? I think if we don't send that, then
> the sequence columns of some tables on the replica may contain values
> that the replicated sequence itself does not yet cover, which sounds
> problematic. Is that correct?
>

Well, how else would you get to sequence changes in the other TXNs?

Consider this:

T1: begin
T2: begin

T2: nextval('s') -> writes WAL for 32 values
T1: nextval('s') -> gets value without WAL

T1: commit
T2: commit

Now, if we commit T1 without "applying" the sequence change from T2, we
lose the sequence state. But we still write/replicate the value generated
from the sequence.
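
For experimenting with that interleaving, a rough sketch using the
test_decoding plugin (slot name made up) - with sequence decoding in
place, the slot output is what shows whether the increment from T2 ever
makes it out:

  SELECT 'init' FROM pg_create_logical_replication_slot('seq_slot',
                                                         'test_decoding');

  -- run the T1/T2 interleaving above, then inspect what got decoded
  SELECT data FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL);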

>> * Another serious issue seems to be streaming - if we already streamed
>> some of the changes, we can't iterate through them anymore.
>>
>> Also, having to walk the transactions over and over for each change, to
>> apply relevant sequence increments, that's mighty expensive. The other
>> approach needs to do that too, but walking the global hash table seems
>> much cheaper.
>>
>> The other issue is the handling of aborted transactions - we need to
>> apply sequence increments even from those transactions, of course. The
>> other approach has this issue too, though.
>>
>>
>> (3) tracking sequences touched by transaction
>>
>> This is the approach proposed by Hannu Krosing. I haven't explored this
>> again yet, but I recall I wrote a PoC patch a couple months back.
>>
>> It seems to me most of the problems stem from trying to derive sequence
>> state from decoded WAL changes, which is problematic because of the
>> non-transactional nature of sequences (i.e. WAL for one transaction
>> affects other transactions in non-obvious ways). And this approach
>> simply works around that entirely - instead of trying to deduce the
>> sequence state from WAL, we'd make sure to write the current sequence
>> state (or maybe just ID of the sequence) at commit time. Which should
>> eliminate most of the complexity / problems, I think.
>>
>
> That sounds promising but I haven't thought in detail about that approach.
>

So, here's a patch doing that. It's a reworked/improved version of the
patch [1] shared in November.

It seems to be working pretty nicely. The behavior is a little bit
different, of course, because we only replicate "committed" changes, so
if you do nextval() in an aborted transaction, that is not replicated.
Which I think is fine, because we make no durability guarantees for
aborted transactions anyway.

But there are a couple issues too:

1) locking

We have to read the sequence state before the commit, but we must not
allow reordering (because then the state might go backwards again). I'm
not sure how serious an impact this could have on performance.

2) dropped sequences

I'm not sure what to do about sequences dropped in the transaction. The
patch simply attempts to read the current sequence state before the
commit, but if the sequence was dropped (in that transaction), that
can't happen. I'm not sure if that's OK or not.
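
The problematic case looks roughly like this (a sketch, assuming an
existing sequence 's'):

  BEGIN;
  SELECT nextval('s');   -- increments the sequence, possibly writing WAL
  DROP SEQUENCE s;       -- gone as far as this transaction is concerned
  COMMIT;                -- nothing left to read the state from at commit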

3) WAL record

To replicate the sequence state the patch uses a LogicalMessage, but I
guess a separate WAL record would be better. That's a technical detail,
though.
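
Conceptually it's close to emitting a transactional logical message with
the sequence state just before commit - something like this SQL-level
sketch (the 'sequence' prefix is made up, and the patch of course does
this internally rather than via SQL):

  SELECT pg_logical_emit_message(
      true,                                  -- transactional
      'sequence',                            -- made-up prefix
      (SELECT last_value::text FROM s));     -- current state of sequence 's'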

regards

[1]
https://www.postgresql.org/message-id/2cd38bab-c874-8e0b-98e7-d9abaaf9806a@enterprisedb.com

>>
>> I'm not really sure what to do about this. All of those reworks seem
>> like an extensive redesign of the patch, and considering the last CF is
>> already over ... not great.
>>
>
> Yeah, I share the same feeling that even if we devise solutions to all
> the known problems, it will require quite some time to ensure everything
> is correct.
>

True. Let's keep working on this for a bit more time and then we can
decide what to do.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: 0001-rework-sequence-decoding.patch (text/x-patch, 56.8 KB)
