From: vignesh C <vignesh21(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Hou, Zhijie/侯 志杰 <houzj(dot)fnst(at)fujitsu(dot)com>, "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
Subject: Re: Logical Replication of sequences
Date: 2024-07-20 15:18:16
Message-ID: CALDaNm3Q-MB-SPCotzncR4Qtqb8_E1YL5VBfNYb-AtuvS53QRA@mail.gmail.com
Lists: pgsql-hackers
On Fri, 12 Jul 2024 at 08:22, Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
>
> Hi Vignesh. Here are the rest of my comments for patch v20240705-0003.
> ======
> src/backend/catalog/pg_subscription.c
>
> 1. GetSubscriptionSequences
>
> +/*
> + * Get the sequences for the subscription.
> + *
> + * The returned list is palloc'ed in the current memory context.
> + */
>
> Is that comment right? The palloc seems to be done in
> CacheMemoryContext, not in the current context.
This function has been removed; GetSubscriptionRelations is used instead.
> ~
>
> 2.
> The code is very similar to the other function
> GetSubscriptionRelations(). In fact I did not understand how the 2
> functions know what they are returning:
>
> E.g. how does GetSubscriptionRelations not return sequences too?
> E.g. how does GetSubscriptionSequences not return relations too?
GetSubscriptionRelations can be used for both, so the
GetSubscriptionSequences function has been removed.
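To illustrate the shared usage (a rough sketch only; the exact
signature in the patch may differ), callers filter the entries
returned by GetSubscriptionRelations by relkind:

List       *rstates = GetSubscriptionRelations(subid);
ListCell   *lc;

foreach(lc, rstates)
{
    SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);

    /* Skip sequences when the caller wants tables only (or vice versa). */
    if (get_rel_relkind(rstate->relid) == RELKIND_SEQUENCE)
        continue;

    /* ... process the table entry ... */
}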
>
> 3. AlterSubscription_refresh
>
> - logicalrep_worker_stop(sub->oid, relid);
> + /* Stop the worker if relation kind is not sequence*/
> + if (relkind != RELKIND_SEQUENCE)
> + logicalrep_worker_stop(sub->oid, relid);
>
> Can you give more reasons in the comment why skip the stop for sequence worker?
>
> ~
>
> nitpick - period and space in the comment
>
> ~~~
>
> 8. logicalrep_sequence_sync_worker_find
>
> +/*
> + * Walks the workers array and searches for one that matches given
> + * subscription id.
> + *
> + * We are only interested in the sequence sync worker.
> + */
> +LogicalRepWorker *
> +logicalrep_sequence_sync_worker_find(Oid subid, bool only_running)
>
> There are other similar functions for walking the workers array to
> search for a worker. Instead of having different functions for
> different cases, wouldn't it be cleaner to combine these into a single
> function, where you pass a parameter (e.g. a mask of worker types that
> you are interested in finding)?
I will address this in a future version once the patch has become more stable.
> ~~~
>
> 11.
> - Assert(is_tablesync_worker == OidIsValid(relid));
> + Assert(is_tablesync_worker == OidIsValid(relid) ||
> is_sequencesync_worker == OidIsValid(relid));
>
> IIUC there is only a single sequence sync worker for handling all the
> sequences. So, what does the 'relid' actually mean here when there are
> multiple sequences?
Sequence sync workers will not have relid, modified the assert.
> ~~~
>
> 12. logicalrep_seqsyncworker_failuretime
> 12b.
> Curious if this had to be a separate exit handler or if maybe this could
> have been handled by the existing logicalrep_worker_onexit handler.
> See also other review comments in this post about this area --
> perhaps all this can be removed?
This function cannot be combined with logicalrep_worker_onexit because
it must be called only on failure: the exit handler is registered when
the worker starts and removed again once the sync succeeds.
It also cannot be removed, for the following reason. Consider a
sequence sync worker that starts and then fails while syncing
sequences, while at the same time a user initiates a "refresh
publication sequences" operation. Given only the start time, it is not
possible to distinguish whether the sequence sync worker failed or
completed successfully, because the "refresh publication sequences"
operation would have re-added the sequences either way.
>
> 14.
> The reason for the addition logic "(last_value + log_cnt)" is not
> obvious. I am guessing it might be related to code from
> 'nextval_internal' (fetch = log = fetch + SEQ_LOG_VALS;) but it is
> complicated. It is unfortunate that the field 'log_cnt' seems hardly
> commented anywhere at all.
>
> Also, I am not 100% sure if I trust the logic in the first place. The
> caller of this function is doing:
> sequence_value = fetch_sequence_data(conn, remoteid, &lsn);
> /* sets the sequence with sequence_value */
> SetSequenceLastValue(RelationGetRelid(rel), sequence_value);
>
> Won't that mean you can get to a situation where subscriber-side
> result of lastval('s') can be *ahead* from lastval('s') on the
> publisher? That doesn't seem good.
Added comments for "last_value + log_cnt".
Yes, the value can be ahead on the subscriber. This happens because
not every change to the sequence is WAL-logged; a record is written
only once per SEQ_LOG_VALS values. This was discussed earlier, and the
sequence value being ahead was considered acceptable:
https://www.postgresql.org/message-id/CA%2BTgmoaVLiKDD5vr1bzL-rxhMA37KCS_2xrqjbKVwGyqK%2BPCXQ%40mail.gmail.com
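To illustrate how the subscriber can get ahead (assuming SEQ_LOG_VALS
is 32, as in sequence.c):

/*
 * The publisher calls nextval('s') once: last_value is 1, but the WAL
 * record pre-logs SEQ_LOG_VALS further values and describes the state
 * at the end of that batch, which is what the sync copies.
 */
int64   last_value = 1;                      /* publisher's current value */
int64   log_cnt = 32;                        /* SEQ_LOG_VALS */
int64   synced_value = last_value + log_cnt; /* 33, ahead of the publisher */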
> 15.
> + /*
> + * Logical replication of sequences is based on decoding WAL records,
> + * describing the "next" state of the sequence the current state in the
> + * relfilenode is yet to reach. But during the initial sync we read the
> + * current state, so we need to reconstruct the WAL record logged when we
> + * started the current batch of sequence values.
> + *
> + * Otherwise we might get duplicate values (on subscriber) if we failed
> + * over right after the sync.
> + */
> + sequence_value = fetch_sequence_data(conn, remoteid, &lsn);
> +
> + /* sets the sequence with sequence_value */
> + SetSequenceLastValue(RelationGetRelid(rel), sequence_value);
>
> (This is related to some earlier review comment #14 above). IMO all
> this tricky commentary belongs in the function header of
> "fetch_sequence_data", where it should be describing that function's
> return value.
Moved it to fetch_sequence_data, where pg_sequence_state is called, to
avoid any confusion.
> 17.
> Also, where does the number 100 come from? Why not 1000? Why not 10?
> Why have batching at all? Maybe there should be some comment to
> describe the reason and the chosen value.
Added a comment for this. I will do one round of testing with a few
values to see whether this value needs to be changed, and will share
the results later.
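The idea of the batching is roughly the following (all names here are
hypothetical; only the value 100 is from the patch):

#define SEQUENCES_PER_BATCH 100

for (int i = 0; i < nsequences; i += SEQUENCES_PER_BATCH)
{
    int     batch_end = Min(i + SEQUENCES_PER_BATCH, nsequences);

    StartTransactionCommand();
    for (int j = i; j < batch_end; j++)
        copy_sequence(seqlist[j]);  /* hypothetical per-sequence sync */

    /* Commit batch by batch, so a failure does not redo earlier batches. */
    CommitTransactionCommand();
}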
> 20.
> + char relkind;
> +
> + if (!started_tx)
> + {
> + StartTransactionCommand();
> + started_tx = true;
> + }
> +
> + relkind = get_rel_relkind(rstate->relid);
> + if (relkind == RELKIND_SEQUENCE)
> + continue;
>
> I am wondering if it is possible to put the relkind check
> *before* the TX code here, because in case there are *only* sequences
> then maybe everything would be skipped and there would have been no need
> for any TX at all in the first place.
We need to start the transaction before calling get_rel_relkind;
otherwise it will hit the Assert in SearchCatCacheInternal. So
skipping this.
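For reference, the ordering constraint comes from the syscache itself:

/*
 * get_rel_relkind() performs a syscache lookup, and
 * SearchCatCacheInternal() does Assert(IsTransactionState()), so a
 * transaction must already be open when the relkind check runs.
 */
StartTransactionCommand();
relkind = get_rel_relkind(rstate->relid);   /* safe only inside a transaction */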
> 21.
> + if (!started_tx)
> + {
> + StartTransactionCommand();
> + started_tx = true;
> + }
> +
> + relkind = get_rel_relkind(rstate->relid);
> + if (relkind != RELKIND_SEQUENCE || rstate->state != SUBREL_STATE_INIT)
> + continue;
>
> Wondering (like in review comment #20) if it is possible to swap those
> because maybe there was no reason for any TX if the other condition
> would always continue.
As a transaction is required before calling get_rel_relkind (see the
response to comment #20), this cannot be changed. So skipping this.
> ~~~
>
> 22.
> + if (nsyncworkers < max_sync_workers_per_subscription)
> + {
> + TimestampTz now = GetCurrentTimestamp();
> + if (!MyLogicalRepWorker->sequencesync_failure_time ||
> + TimestampDifferenceExceeds(MyLogicalRepWorker->sequencesync_failure_time,
> + now, wal_retrieve_retry_interval))
> + {
> + MyLogicalRepWorker->sequencesync_failure_time = 0;
>
> It seems to me that storing 'sequencesync_failure_time' logic may be
> unnecessarily complicated. Can't the same "throttling" be achieved by
> storing the synchronization worker 'start time' instead of 'fail
> time', in which case then you won't have to mess around with
> considering if the sync worker failed or just exited normally etc? You
> might also be able to remove all the
> logicalrep_seqsyncworker_failuretime() exit handler code.
As explained in the response to comment #12b above: given only the
start time, it is not possible to distinguish whether the sequence
sync worker failed or completed successfully, because a concurrent
"refresh publication sequences" operation would have re-added the
sequences either way. Storing the failure time avoids that ambiguity.
> 28b.
> Can you explain why the expected sequence value is 132, because
> AFAICT you only called nextval('s') 100 times, so why isn't it 100?
> My guess is that it seems to be related to code in "nextval_internal"
> (fetch = log = fetch + SEQ_LOG_VALS;) but it kind of defies
> expectations of the test, so if it really is correct then it needs
> commentary.
I felt that adding comments to one of the tests should be enough, so I
did not add the comment to all of the tests.
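For the record, the arithmetic behind the 132 in comment #28b
(assuming SEQ_LOG_VALS = 32):

int64   calls = 100;                /* nextval('s') invocations in the test */
int64   log_cnt = 32;               /* SEQ_LOG_VALS */
int64   expected = calls + log_cnt; /* 132, the value the test expects */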
> Actually, I found other regression test code that deals with this:
> -- log_cnt can be higher if there is a checkpoint just at the right
> -- time, so just test for the expected range
> SELECT last_value, log_cnt IN (31, 32) AS log_cnt_ok, is_called FROM
> foo_seq_new;
>
> Do you have to do something similar? Or is this a bug? See my other
> review comments for function fetch_sequence_data in sequencesync.c
The comment in nextval_internal says:
* If this is the first nextval after a checkpoint, we must force a new
* WAL record to be written anyway, else replay starting from the
* checkpoint would fail to advance the sequence past the logged values.
* In this case we may as well fetch extra values.
I have increased the checkpoint interval for this test, so this issue will not occur.
All the other comments have been fixed, and the changes are available
in the v20240720 version attached at [1].
Regards,
Vignesh