From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Design of pg_stat_subscription_workers vs pgstats |
Date: | 2022-02-18 08:26:04 |
Message-ID: | CAD21AoA-MVZcMA1iYwbtvTCOHx1SwG8amQUTT8j2ssh2zuc6OQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Feb 16, 2022 at 8:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Wed, Feb 16, 2022 at 5:49 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Tue, Feb 15, 2022 at 11:56 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote:
> > > > I see that important information such as error-XID that can be used
> > > > for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
> > > > using system catalogs is a reasonable way for this purpose. But it's
> > > > still unclear to me why all error information that is currently shown
> > > > in pg_stat_subscription_workers view, including error-XID and the
> > > > error message, relation OID, action, etc., need to be stored in the
> > > > catalog. The information other than error-XID doesn't necessarily need
> > > > to be reliable compared to error-XID. I think we can have
> > > > error-XID/LSN in the pg_subscription catalog and have other error
> > > > information in pg_stat_subscription_workers view. After the user
> > > > checks the current status of logical replication by checking
> > > > error-XID/LSN, they can check pg_stat_subscription_workers for
> > > > details.
> > >
> > > The stats system isn't geared towards storing error messages and
> > > such. Generally it is about storing counts of events etc, not about
> > > information about a single event. Obviously there are a few cases where that
> > > boundary isn't that clear...
> > >
> >
> > True, in the beginning, we discussed this information to be stored in
> > system catalog [1] (See .... Also, I am thinking that instead of a
> > stat view, do we need to consider having a system table .. ) but later
> > discussion led us to store this as stats.
> >
> > > IOW, storing information like:
> > > - subscription oid
> > > - retryable error count
> > > - hard error count
> > > - #replicated inserts
> > > - #replicated updates
> > > - #replicated deletes
> > >
> > > is what pgstats is for.
> > >
> >
> > Some of this and similar ((like error count, last_error_time)) is
> > present in stats and something on the lines of the other information
> > is proposed in [2] (xacts successfully replicated (committed),
> > aborted, etc) to be stored along with it.
> >
> > > But not
> > > - subscription oid
> > > - error message
> > > - xid of error
> > > - ...
> > >
> >
> > I think from the current set of things we are capturing, the other
> > thing in this list will be error_command (insert/update/delete..) and
> > or probably error_code. So, we can keep this information in a system
> > table.
>
> Agreed. Also, we can have also commit-LSN or replace error-XID with error-LSN?
>
> >
> > Based on this discussion, it appears to me what we want here is to
> > store the error info in some persistent way (system table) and the
> > counters (both error and success) of logical replication in stats. If
> > we can't achieve this work (separation) in the next few weeks (before
> > the feature freeze) then I'll revert the work related to stats.
>
> There was an idea to use shmem to store error info but it seems to be
> better to use a system catalog to persist them.
>
> I'll summarize changes we discussed and make a plan of changes and
> feature designs toward the feature freeze (and v16). I think that once
> we get a consensus on them we can start implementation and move it
> forward.
>
Here is the summary of the discussion, changes, and plan.
1. Move some error information such as the error message to a new
system catalog, pg_subscription_error. The pg_subscription_error table
would have the following columns:
* sesubid : subscription Oid.
* serelid : relation Oid (NULL for apply worker).
* seerrlsn : commit-LSN or the error transaction.
* seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction.
* seerrmsg : error message
The tuple is inserted or updated when an apply worker or a tablesync
worker raises an error. If the same error occurs in a row, the update
is skipped. The tuple is removed in the following cases:
* the subscription is dropped.
* the table is dropped.
* the table is removed from the subscription.
* the worker successfully committed a non-empty transaction.
With this change, pg_stat_subscription_workers will be like:
* subid
* subname
* subrelid
* error_count
* last_error_timestamp
This view will be extended by adding transaction statistics proposed
on another thread[1].
2. Fix a bug in pg_stat_subscription_workers. As pointed out by
Andres, there is a bug in pg_stat_subscription_workers; it doesn't
drop entries for already-dropped tables. We need to fix it.
3. Show commit-LSN of the error transaction in errcontext. Currently,
we show XID and commit timestamp in the errcontext. But given that we
use LSN in pg_subscriptoin_error, it's better to show commit-LSN as
well (or instead of error-XID).
4. Skipping the particular conflicted transaction. There are two proposals:
4-1. Use the existing replication_origin_advance() SQL function. We
don't need to add any additional syntax, instead use
replication_origin_advance() with the error-LSN reported in either
pg_subscription_error or server logs to skip the particular
transaction.
4-2. Introduce a new syntax like ALTER SUBSCRIPTION ... SKIP. This
proposal further has two options: (1) the user specifies error-LSN
manually and (2) the user just enables skipping behavior and error-LSN
is automatically fetched from pg_subscription_error. In any way, the
command raises an error when there is no error entry in
pg_subscription_error.
We can discuss this item for details on the original thread.
5. Record skipped data to the system catalog, say
pg_subscription_conflict_history so that there is a chance to analyze
and recover. The pg_subscription_conflict_history I'm thinking is that
we record the following of all skipped changes:
* command (INSERT, UPDATE etc.)
* commit-LSN
* before row (in json format?)
* after row (in json format?)
Given that we end up writing a huge amount of history if the
transaction is very large and I think there are peoples who want to
check what changes will be skipped together before enabling skipping
behavior, I think it could be optional. Therefore I think we can
provide an option for ALTER SUBSCRIPTION ... SKIP to whether the
skipped data is recorded or not and to the
pg_subscription_conflict_history or server logs. We can discuss the
details in a new thread.
4 and 5 might be better introduced together but I think since the user
is able to check what changes will be skipped on the publisher side we
can do 5 for v16. Feedback and comments are very welcome.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
From | Date | Subject | |
---|---|---|---|
Next Message | osumi.takamichi@fujitsu.com | 2022-02-18 08:34:01 | RE: Failed transaction statistics to measure the logical replication progress |
Previous Message | Michael Paquier | 2022-02-18 08:22:36 | Re: shared_preload_libraries = 'pg_stat_statements' failing with installcheck (compute_query_id = auto) |