Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure with summarize_wal enabled during pg_createsubscriber
Date: 2024-07-29 16:20:45
Message-ID: CA+TgmoZrUjqA0dnLYQ6sunVmBKBAm0OLP49hKvZOuzq7v-0iKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> 0002 could also use pg_indent and pgperltidy. And I have couple other
> notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

> > In the process of fixing these bugs, I realized that the logic to wait
> > for WAL summarization to catch up was spread out in a way that made
> > it difficult to reuse properly, so this code refactors things to make
> > it easier.
>
> It would be nice to split refactoring out of material logic changed.
> This way it would be easier to review and would look cleaner in the
> git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

> > To make this fix work, also teach the WAL summarizer that after a
> > promotion has occurred, no more WAL can appear on the previous
> > timeline: previously, the WAL summarizer wouldn't switch to the new
> > timeline until we actually started writing WAL there, but that meant
> > that when the startup process was waiting for the WAL summarizer, it
> > was waiting for an action that the summarizer wasn't yet prepared to
> > take.
>
> I think this is a pretty long sentence, and I'm not sure I can
> understand it correctly. Does startup process wait for the WAL
> summarizer without this patch? If not, it's not clear for me that
> word "previously" doesn't distribute to this part of sentence.
> Breaking this into multiple sentences could improve the clarity for
> me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-07-29 16:21:08 Re: Incremental backup from a streaming replication standby fails
Previous Message Andres Freund 2024-07-29 16:18:46 Re: pgsql: Fix double-release of spinlock