From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Jakub Wartak <Jakub(dot)Wartak(at)tomtom(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: WIP: WAL prefetch (another approach) |
Date: | 2021-05-04 12:37:22 |
Message-ID: | f2be6caa-5a7a-990b-c56e-a29454ae1cee@enterprisedb.com |
Lists: | pgsql-hackers |
On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes. However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
>
> Oh, wow. There are several surprising results there. Thanks for
> running those tests for so long so that we could see the rarest
> failures.
>
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
>
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
>
> +1
>
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware. I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory. But it doesn't
>> smell like that kind of problem to me. I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware. This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
>
> Hmm, yeah that does seem plausible. It would be nice to see a report
> from any other system though. I'm still trying, and reviewing...
>
FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.
Obviously, it's possible there's something that neither of those (very
different) systems triggers, but I'd say it might also be a hint that
this really is a hw issue on the old ppc macs. Or maybe something very
specific to that arch.
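
For reference, a loop of that kind can be sketched roughly as below. This is only an illustration, not the exact script used for the runs above: the helper name run_rounds and its parameters are made up for the example, and it simply stops at the first round whose command exits with a non-zero status.

```python
import subprocess
import sys


def run_rounds(cmd, max_rounds, cwd=None):
    """Run `cmd` up to `max_rounds` times.

    Returns the 1-based number of the first round that exited with a
    non-zero status, or None if every round passed.
    (Hypothetical helper, named for this example only.)
    """
    for round_no in range(1, max_rounds + 1):
        # subprocess.run waits for the command to finish; output goes
        # to the terminal as usual.
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            return round_no
    return None
```

Calling run_rounds(["make", "installcheck-parallel"], 1000, cwd=<source tree>) against a running server would approximate the ~1000 rounds on the x86 boxes; the source tree path is whatever applies locally.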
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company