Re: Implement waiting for wal lsn replay: reloaded

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Implement waiting for wal lsn replay: reloaded
Date: 2025-03-13 14:15:01
Message-ID: 09a98dc9-eeb1-471d-b990-072513c3d584@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I did a quick look at this patch. I haven't found any correctness
issues, but I have some general review comments and questions about the
grammar / syntax.

1) The sgml docs don't really show the syntax very nicely, it only shows
this at the beginning of wait_for.sgml:

WAIT FOR ( <replaceable class="parameter">parameter</replaceable>
'<replaceable class="parameter">value</replaceable>' [, ... ] ) ]

I kinda understand this comes from using the generic option list (I'll
get to that shortly), but I think it'd be much better to actually show
the "full" syntax here, instead of leaving the "parameters" to later.

2) The syntax description suggests "(" and ")" are required, but that
does not seem to be the case - in fact, it's not even optional, and when
I try using that, I get syntax error.

3) I have my doubts about using the generic_option_list for this. Yes, I
understand this allows using fewer reserved keywords, but it leads to
some weirdness and I'm not sure it's worth it. Not sure what the right
trade off is here.

Anyway, some examples of the weird stuff implied by this approach:

- it forces "," between the options, which is a clear difference from
what we do for every other command

- it forces everything to be a string, i.e. you can' say "TIMEOUT 10",
it has to be "TIMEOUT '10'"

I don't have a very strong opinion on this, but the result seems a bit
strange to me.

4) I'm not sure I understand the motivation of the "throw false" mode,
and I'm not sure I understand this description in the sgml docs:

On timeout, or if the server is promoted before
<parameter>lsn</parameter> is reached, an error is emitted,
as soon as <parameter>throw</parameter> is not specified or set to
true.
If <parameter>throw</parameter> is set to false, then the command
doesn't throw errors.

I find it a bit confusing. What is the use case for this mode?

5) One place in the docs says:

The target log sequence number to wait for.

Thie is literally the only place using "log sequence number" in our
code base, I'd just use "LSN" just like every other place.

6) The docs for the TIMEOUT parameter say this:

<varlistentry>
<term><replaceable class="parameter">timeout</replaceable></term>
<listitem>
<para>
When specified and greater than zero, the command waits until
<parameter>lsn</parameter> is reached or the specified
<parameter>timeout</parameter> has elapsed. Must be a non-
negative integer, the default is zero.
</para>
</listitem>
</varlistentry>

That doesn't say what unit does the option use. Is is seconds,
milliseconds or what?

In fact, it'd be nice to let users specify that in the value, similar
to other options (e.g. SET statement_timeout = '10s').

7) One place in the docs says this:

That is, after this function execution, the value returned by
<function>pg_last_wal_replay_lsn</function> should be greater ...

I think the reference to "function execution" is obsolete?

8) I find this confusing:

However, if <command>WAIT FOR</command> is
called on primary promoted from standby and <literal>lsn</literal>
was already replayed, then the <command>WAIT FOR</command> command
just exits immediately.

Does this mean running the WAIT command on a primary (after it was
already promoted) will exit immediately? Why does it matter that it
was promoted from a standby? Shouldn't it exit immediately even for
a standalone instance?

9) xlogwait.c

I think this should start with a basic "design" description of how the
wait is implemented, in a comment at the top of the file. That is, what
we keep in the shared memory, what happens during a wait, how it uses
the pairing heap, etc. After reading this comment I should understand
how it all fits together.

10) WaitForLSNReplay / WaitLSNWakeup

I think the function comment should document the important stuff (e.g.
return values for various situations, how it groups waiters into chunks
of 16 elements during wakeup, ...).

11) WaitLSNProcInfo / WaitLSNState

Does this need to be exposed in xlogwait.h? These structs seem private
to xlogwait.c, so maybe declare it there?

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bertrand Drouvot 2025-03-13 14:15:14 Re: Draft for basic NUMA observability
Previous Message vignesh C 2025-03-13 14:10:24 Re: Separate GUC for replication origins