Re: Design of pg_stat_subscription_workers vs pgstats

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design of pg_stat_subscription_workers vs pgstats
Date: 2022-02-02 07:36:08
Message-ID: CAKFQuwYHFkW8fP_a62wk-YBb4o+n9UXG4Ji3E4O9DwZrv0jgQQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> >> wrote:
> >>
> >> >
> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
> >> > feature to pass error-XID or error-LSN information to the worker
> >> > whereas I'm also not sure of the advantages in storing all error
> >> > information in a system catalog. Since what we need to do for this
> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
> >> > the transaction in question. The worker clears the error-XID/LSN after
> >> > successfully applying or skipping the first non-empty transaction.
> >> >
> >>
> >> Where do you propose to store this information?
> >
> >
> > pg_subscription_worker
> >
> > The error message and context is very important. Just make sure it is
> > only non-null when the worker state is "syncing failed" (or whatever term
> > we use).
> >
> >
>
> Sure, but is this the reason you want to store all the error info in
> the system catalog? I agree that providing more error info could be
> useful and also possibly the previously failed (apply) xacts info as
> well but I am not able to see why you want to have that sort of info
> in the catalog. I could see storing info like err_lsn/err_xid that can
> allow to proceed to apply worker automatically or to slow down the
> launch of errored apply worker but not all sort of other error info
> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> why you are insisting to make all the error info persistent via the
> system catalog?
>

I look at the catalog and am informed that the worker has stopped because
of an error. I'd rather simply read the error message right then instead
of having to go look at the log file. And if I am going to take an action
in order to overcome the error I would have to know what that error is; so
the error message is not something I can ignore. The error is an attribute
of system state, and the catalog stores the current state of the (workers)
system.

I already explained that the concept of err_cnt is not useful. The fact
that you include it here makes me think you are still thinking that this
all somehow is meant to keep track of history. It is not. The workers are
state machines and "error" is one of the states - with relevant attributes
to display to the user, and system, while in that state. The state machine
reporting does not care about historical states nor does it report on
them. Things are less clear-cut if we keep the automatic re-launch: now
that I write this, I can see that what you call err_cnt is effectively a
count of how many times the worker re-launched without the underlying
problem being resolved, and thus hit the same error again. If we keep the
re-launch behavior then maybe err_cnt should stay, with its description
being basically the ah-ha! comment I just made. In a world where we do not
typically re-launch and blindly re-try without being told that something
has changed, such a count remains of minimal value.
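
To make the state-machine framing concrete, here is a rough sketch in
Python. It is purely illustrative: WorkerRow, WorkerState and the method
names are hypothetical and do not correspond to any existing catalog or
PostgreSQL code. It assumes the error attributes are non-null only while
the worker is in the error state, and that err_cnt counts consecutive hits
of the same unresolved error.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    RUNNING = "running"
    ERROR = "error"


@dataclass
class WorkerRow:
    """Hypothetical per-worker row: current state only, no history."""

    state: WorkerState = WorkerState.RUNNING
    err_xid: Optional[int] = None      # set only while state is ERROR
    err_message: Optional[str] = None  # set only while state is ERROR
    err_cnt: int = 0                   # consecutive hits of the same error

    def record_error(self, xid: int, message: str) -> None:
        # A re-launched worker failing on the same transaction bumps the
        # counter; a different failure starts the count over at 1.
        same = self.state is WorkerState.ERROR and self.err_xid == xid
        self.err_cnt = self.err_cnt + 1 if same else 1
        self.state = WorkerState.ERROR
        self.err_xid = xid
        self.err_message = message

    def record_success(self) -> None:
        # Leaving the error state clears every error attribute.
        self.state = WorkerState.RUNNING
        self.err_xid = None
        self.err_message = None
        self.err_cnt = 0
```

Under these assumptions, two failed launches on the same transaction leave
err_cnt at 2 with the message readable from the row, and the first
successful apply or skip returns the row to a clean "running" state.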

I don't really understand the confusion here, though. This error data
already exists in the pg_stat_subscription_workers stats collector view;
that I want to keep it around (just changing the reset behavior) doesn't
seem like it should be controversial. I, thinking as a user,
really don't care about all of these implementation details. Whether it is
a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial
to me. The behavior I observe is what matters. As a developer I don't
want to use the statistics collector because these are not statistics and
the collector is unreliable. I don't know enough about the relevant
differences between shared memory IPC and catalog tables to decide between
them. But catalog tables seem like a lower bar to meet and seem like they
can implement the user-facing requirements as I envision them.

David J.
