From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "Shinoda, Noriyoshi (PN Japan FSIP)" <noriyoshi(dot)shinoda(at)hpe(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pg15b3: recovery fails with wal prefetch enabled |
Date: | 2022-09-05 01:28:12 |
Message-ID: | CA+hUKGL=+0nF8o8xG5DDUepG0ZxgDXusF=Jqtd7FmtFvmR1Gmg@mail.gmail.com |
Lists: | pgsql-hackers |

On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> ... The active ingredient here is a setting of
> maintenance_io_concurrency=0, which runs into a dumb accounting problem
> of the fencepost variety and incorrectly concludes it's reached the
> end early. Setting it to 3 or higher allows his system to complete
> recovery. I'm working on a fix ASAP.

The short version is that when tracking the number of IOs in progress,
I had two steps in the wrong order in the algorithm for figuring out
whether IO is saturated. Internally, the effect of
maintenance_io_concurrency is clamped to 2 or more, and that mostly
hides the bug until you try to replay a particular sequence like
Justin's with such a low setting. Without that clamp, and if you set
it to 1, then several of our recovery tests fail.
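
To illustrate why the order of two accounting steps can matter only at very low limits, here is a toy sketch. It is not the PostgreSQL code and does not claim to show which two steps were actually swapped; `IoTracker`, `try_start_wrong_order`, `try_start_right_order` and `count_starts` are hypothetical names invented for this example. It assumes the two steps are "retire finished IOs" and "test for saturation".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy tracker; not PostgreSQL's actual data structure. */
typedef struct
{
    int inflight;       /* IOs started and not yet known to be finished */
    int max_inflight;   /* limit derived from the concurrency setting */
} IoTracker;

static bool
io_saturated(const IoTracker *t)
{
    return t->inflight >= t->max_inflight;
}

/* Wrong order: test for saturation first, retire finished IOs afterwards. */
static bool
try_start_wrong_order(IoTracker *t, int finished)
{
    bool can_start = !io_saturated(t);  /* looks at a stale count */

    t->inflight -= finished;
    if (can_start)
        t->inflight++;
    return can_start;
}

/* Right order: retire finished IOs first, then test for saturation. */
static bool
try_start_right_order(IoTracker *t, int finished)
{
    t->inflight -= finished;
    if (io_saturated(t))
        return false;
    t->inflight++;
    return true;
}

/*
 * Run 'rounds' rounds in which every IO started earlier has finished by the
 * start of the round, and count how many rounds managed to start a new IO.
 */
static int
count_starts(int max_inflight, int rounds, bool wrong_order)
{
    IoTracker t = {0, max_inflight};
    int started = 0;

    assert(max_inflight >= 1);
    for (int i = 0; i < rounds; i++)
    {
        int finished = t.inflight;
        bool ok = wrong_order ? try_start_wrong_order(&t, finished)
                              : try_start_right_order(&t, finished);

        if (ok)
            started++;
    }
    return started;
}
```

In this toy, `count_starts(1, 4, true)` starts an IO in only 2 of 4 rounds, while `count_starts(2, 4, true)` starts one in all 4, so a limit of 2 is enough to mask the misordering.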

That clamp was a bad idea. What I think we really want is for
maintenance_io_concurrency=0 to disable recovery prefetching exactly
as if you'd set recovery_prefetch=off, and any other setting including
1 to work without clamping.
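
The intended semantics could be sketched roughly as follows. This is only an illustration of the behaviour described above, not the patch itself: `RecoverySettings`, `prefetch_active`, `io_limit_clamped` and `io_limit_unclamped` are hypothetical names, and recovery_prefetch is simplified to a plain on/off flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two settings discussed above. */
typedef struct
{
    bool recovery_prefetch;             /* simplified to on/off */
    int  maintenance_io_concurrency;
} RecoverySettings;

/* Prefetching is off if either setting says so. */
static bool
prefetch_active(const RecoverySettings *s)
{
    return s->recovery_prefetch && s->maintenance_io_concurrency > 0;
}

/* The clamp described as a bad idea: the limit is silently raised to 2. */
static int
io_limit_clamped(const RecoverySettings *s)
{
    return s->maintenance_io_concurrency < 2 ? 2 : s->maintenance_io_concurrency;
}

/*
 * The proposed behaviour: use the setting as given. Only meaningful when
 * prefetch_active() is true, since 0 means "no prefetching at all".
 */
static int
io_limit_unclamped(const RecoverySettings *s)
{
    assert(s->maintenance_io_concurrency >= 0);
    return s->maintenance_io_concurrency;
}
```

With these definitions, a setting of 0 behaves the same as turning recovery_prefetch off, and a setting of 1 yields a limit of exactly 1 instead of 2.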

Here's the patch I'm currently testing. It also fixes a related
dangling reference problem with very small maintenance_io_concurrency.

I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl. I
think that failure case should report an error, no? I find it strange
that we end recovery in silence. That was a problem for the new
code in this patch, because it gets confused when XLREAD_FAIL is returned
without an error being queued, and then retries, which clobbers the aborted
recptr state. I'm still looking into that.

Attachment | Content-Type | Size |
---|---|---|
0001-Fix-recovery_prefetch-with-low-maintenance_io_concur.patch | text/x-patch | 5.7 KB |