From: | "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com> |
---|---|
To: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Alexey Lesovsky <lesovsky(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Skipping logical replication transactions on subscriber side |
Date: | 2022-01-24 07:59:54 |
Message-ID: | CAKFQuwZyyN=9J_KGV1j_MiKz59rWXcPkCe5eYCEM4w-PhhhXag@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Jan 23, 2022 at 11:55 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
> On Mon, Jan 24, 2022 at 1:49 PM David G. Johnston
> <david(dot)g(dot)johnston(at)gmail(dot)com> wrote:
> >
> > On Sun, Jan 23, 2022 at 8:35 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >>
> >> > I really dislike the user experience this provides, and given it is
> new in v15 (and right now this table seems to exist solely to support this
> feature) changing this seems within the realm of possibility. I have to
> imagine these workers have a sense of local state that would just be "no
> errors, no need to touch pg_stat_subscription_workers at the end of this
> transaction's commit". It would save a local state of the error_xid and if
> a successfully committed transaction has that xid it would clear the
> error. The skip code path would also check for and see the matching xid
> value and clear the error. Even if the local state thing doesn't work, one
> catalog lookup per transaction seems like potentially reasonable overhead
> to incur here.
> >> >
> >>
> >> Are you telling to update the catalog to save error_xid when an error
> >> occurs? If so, that has many challenges like we are not supposed to
> >> perform any such operations when the transaction is in an error state.
> >> We have discussed this and other ideas in the beginning. I don't find
> >> any of your arguments convincing to change the basic approach here but
> >> I would like to see what others think on this matter?
> >>
> >
> > Then how does the table get updated to that state in the first place
> since it doesn't know the error details until there is an error?
>
> I think your idea is based on storing error information including XID
> is stored in the system catalog. I think that the reasons why we use
> the stats collector
I noticed this dynamic while skimming the patch (and also pondering why the
new worker table was not in a catalog chapter) but am only now fully
beginning to appreciate its impact on this discussion.
> to store error information including
last_error_xid are (1) as Amit mentioned, it would have many
> challenges if updating the catalog when the transaction is in an error
> state, and
I'm going on faith right now that this is a problem. But from my prior
outline I hope you can see why I find it surprising. Don't try to update a
catalog while in an error state. Get out of the error state first. e.g.,
A transient "holding pattern" would seem to work. Upon a server restart
the transient state would be forgotten, it would attempt to reapply the
wal, would see the same error, and would then go back into the transient
holding pattern. I do intend to read the other discussion on this
particular topic so a detailed rebuttal, if warranted, can be withheld.
> (2) we can store more information such as error messages,
> action, etc. other than XID so that users can identify that the
> reported error is a conflict error but not other types of error such
> as OOM error.
I mentioned only XID because of the focus on SKIP. The other data already
present in that table is ok. Whether we use a catalog or the stats
collector seems irrelevant. If anything the catalog makes more sense -
calling an error message a statistic is a bit of a reach.
>Similarly, the same is true
>for clearing subskipxid. I agree that it's useful if
>pg_subscription.subskipxid is automatically set when executing ALTER
>SUBSCRIPTION SKIP but it might not work in some cases because of this
>restriction. For these reasons to me, it makes sense to store
>subscribers' error information by using the stats collector.
I'm confused - pg_subscription is a catalog, not a stat view. Why is it
affected?
I don't see how point 2 prevents using a system catalog. I accept point 1
as true but will need to read some of the prior discussion to really
understand it.
When it comes to reporting a message to the stats collector, we need
> to note that it's not guaranteed that all messages arrive at the stats
> collector. Therefore, last_error_xid doesn't not necessarily get
> updated after the worker reports an error.
You'll forgive me for not considering this due to its apparent lack of
mention in the documentation [*] and it's arguable classification as a POLA
violation.
[*]
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION
What I do read there seems compatible with the desired user experience.
500ms lag, idle transaction oriented, reset upon unclean shutdown, and
consumers seeing a stable transactional view: none of these seem like
show-stoppers.
Anyway, even if subskipxid is automatically set when ALTER
> SUBSCRIPTION SKIP, I think we need to provide a way to clear it as the
> current patch does (setting NONE) just in case.
>
With my suggestion of requiring a matching xid the whole option for
skip_xid = { xid | NONE } remains.
> 5(out). wait for the user to manually restart the replication stream
>
> Do you mean that there always is user intervention after error so the
> replication stream can resume?
>
That is my working assumption. It doesn't seem like the system would
auto-resume without a DBA doing something (I'll attribute a server crash to
the DBA for convenience).
Apparently I need to read more about how the system works today to
understand how this varies from and integrates with today's user experience.
That said, at present my two dislikes:
1) ALTER SYSTEM SKIP accepts any xid value (I need to consider further the
timing of when this resets to zero)
2) pg_stat_subscription_worker.last_error_* fields remain populated even
while the system is in a normal operating state.
are preventing me from preferring this patch over the status quo (yes, I
know the 2nd point is about a committed feature). Regardless of how far
off I may be regarding our technical ability to change them to a more (IMO)
user-friendly design.
David J.
From | Date | Subject | |
---|---|---|---|
Next Message | Noah Misch | 2022-01-24 08:02:43 | Re: XLogReadRecord() error in XlogReadTwoPhaseData() |
Previous Message | Michael Paquier | 2022-01-24 07:55:31 | Re: typos |