Re: Design of pg_stat_subscription_workers vs pgstats

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Design of pg_stat_subscription_workers vs pgstats
Date: 2022-02-03 04:33:08
Message-ID: CAD21AoCKxcVB9xh5o_Zm8-q0qukuQncNfBD6LVkY=my8ZJbqkQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston
<david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important. Just make sure it is only non-null when the worker state is "syncing failed" (or whatever term we use).
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped because of an error. I'd rather simply read the error message right then instead of having to go look at the log file. And if I am going to take an action in order to overcome the error I would have to know what that error is; so the error message is not something I can ignore. The error is an attribute of system state, and the catalog stores the current state of the (workers) system.
>
> I already explained that the concept of err_cnt is not useful. The fact that you include it here makes me think you are still thinking that this all somehow is meant to keep track of history. It is not. The workers are state machines and "error" is one of the states - with relevant attributes to display to the user, and system, while in that state. The state machine reporting does not care about historical states nor does it report on them. There is some uncertainty if we continue with the automatic re-launch; which, now that I write this, I can see where what you call err_cnt is effectively a count of how many times the worker re-launched without the underlying problem being resolved and thus encountered the same error. If we persist with the re-launch behavior then maybe err_cnt should be left in place - with the description for it basically being the ah-ha! comment I just made. In a world where we do not typically re-launch and simply re-try without being informed there is a change - such a count remains of minimal value.
>
> I don't really understand the confusion here though - this error data already exists in the pg_stat_subscription_workers stat collector view - the fact that I want to keep it around (just changing the reset behavior) - doesn't seem like it should be controversial. I, thinking as a user, really don't care about all of these implementation details. Whether it is a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial to me. The behavior I observe is what matters. As a developer I don't want to use the statistics collector because these are not statistics and the collector is unreliable. I don't know enough about the relevant differences between shared memory IPC and catalog tables to decide between them. But catalog tables seem like a lower bar to meet and seem like they can implement the user-facing requirements as I envision them.

I see that important information such as the error-XID, which ALTER
SUBSCRIPTION SKIP relies on, needs to be stored reliably, and a system
catalog is a reasonable way to do that. But it's still unclear to me
why all of the error information currently shown in the
pg_stat_subscription_workers view, including the error-XID, the error
message, relation OID, action, etc., needs to be stored in the
catalog. The information other than the error-XID doesn't need to be
as reliable as the error-XID itself. I think we can keep the
error-XID/LSN in the pg_subscription catalog and leave the other error
information in the pg_stat_subscription_workers view. After checking
the current status of logical replication via the error-XID/LSN, the
user can look at pg_stat_subscription_workers for details.
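
To make the proposed split concrete, here is a rough sketch in Python
(a toy model, not PostgreSQL code). All class, field, and function
names are hypothetical and chosen only for illustration:
SubscriptionCatalogRow stands in for the reliable part that would live
in pg_subscription, and the stats dict stands in for the best-effort
details shown in pg_stat_subscription_workers. It assumes the worker
records the error-XID on failure and clears it after the first
non-empty transaction is applied or skipped, as described earlier in
the thread.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SubscriptionCatalogRow:
    """Hypothetical stand-in for the reliable state kept in pg_subscription."""
    error_xid: Optional[int] = None  # XID of the remote transaction that failed
    skip_xid: Optional[int] = None   # XID the user asked the worker to skip


@dataclass
class ApplyWorker:
    """Toy apply worker; only models error-XID bookkeeping."""
    catalog: SubscriptionCatalogRow
    # Hypothetical stand-in for pg_stat_subscription_workers (best effort only).
    stats: Dict[str, object] = field(default_factory=dict)

    def apply(self, xid: int, fails: bool = False) -> str:
        """Process one non-empty remote transaction and report the outcome."""
        if self.catalog.skip_xid == xid:
            # The user requested a skip of exactly this transaction.
            self._clear_error_state()
            return "skipped"
        if fails:
            # Reliable part: only the error-XID goes to the catalog.
            self.catalog.error_xid = xid
            # Unreliable part: descriptive details go to the stats view.
            self.stats = {
                "last_error_xid": xid,
                "last_error_message": "apply failed (placeholder text)",
            }
            return "error"
        self._clear_error_state()
        return "applied"

    def _clear_error_state(self) -> None:
        # Cleared after the first non-empty transaction is applied or skipped.
        self.catalog.error_xid = None
        self.catalog.skip_xid = None


def alter_subscription_skip(catalog: SubscriptionCatalogRow) -> None:
    """Model of ALTER SUBSCRIPTION SKIP: mark the recorded error-XID to be skipped."""
    if catalog.error_xid is None:
        raise ValueError("no failed transaction is recorded for this subscription")
    catalog.skip_xid = catalog.error_xid
```

In this model the user would first read error_xid from the catalog
row, optionally consult stats for the message, and then call
alter_subscription_skip() so that the next attempt at that transaction
is skipped and the catalog state is cleared.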

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
