From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitry Astapov <dastapov(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #17619: AllocSizeIsValid violation in parallel hash join |
Date: | 2022-09-27 19:15:19 |
Message-ID: | CA+hUKGJV54w8jVqdBcpP7LaCL8PhcEhT97-nfrTcD2rdKCcteA@mail.gmail.com |
Lists: | pgsql-bugs |
On Wed, Sep 28, 2022 at 7:33 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Sep 27, 2022 at 9:44 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Right, the missing piece is the intentional clobber.
>
> That does seem like the best place to start. The attached patch adds
> clobbering that works exactly as you'd expect. This approach is
> obviously correct. It also doesn't require any reasoning about
> Valgrind's treatment of memory mappings for shared memory, which is
> quite complicated given the inconsistent rules about who initializes
> what memory (if it's leader or workers).
>
> I find that the tests pass with this patch -- so it probably won't
> catch the bug that Thomas mentioned via running the tests (at least
> not reliably). However, if I revert parallel VACUUM bugfix commit
> 662ba729 and then run the tests, they fail very reliably, in several
> places. That seems like a big improvement.

The reason it doesn't catch that bug on master is that the npages
shmem variable is only used to prevent further reading once a scan
hits the end of a shared tuplestore chunk and needs to decide whether
to read a new one. If a chunk is only partially filled, we end the
scan sooner, because there's a number-of-items counter in the chunk
header. I noticed because the test module I wrote to study Dmitry's
report fills chunks exactly to the end, so I assume the clobber patch
plus that test module patch would reveal the problem.
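
As a rough illustration of the interaction described above, here is a toy model in C. It is only a sketch of the behaviour as stated in this message, not the actual sharedtuplestore.c code: the DemoChunk and DemoScan types, the demo_* functions and the rule that a partially filled chunk ends the scan are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; NOT the real sharedtuplestore.c structures. */
typedef struct DemoChunk
{
    int ntuples;    /* number-of-items counter from the chunk header */
    int capacity;   /* how many items fit in one chunk */
} DemoChunk;

typedef struct DemoScan
{
    const DemoChunk *chunks;    /* chunks written by one participant */
    int npages;                 /* shared end marker (counted in chunks here) */
    int cur_chunk;              /* index of the chunk being read */
    int cur_item;               /* items already consumed from that chunk */
    int npages_reads;           /* how often the scan looked at npages */
} DemoScan;

/* Returns 1 if another item was consumed, 0 at end of scan. */
static int
demo_scan_next(DemoScan *scan)
{
    assert(scan != NULL);

    for (;;)
    {
        const DemoChunk *chunk = &scan->chunks[scan->cur_chunk];

        /* More items in the current chunk?  The header counter decides. */
        if (scan->cur_item < chunk->ntuples)
        {
            scan->cur_item++;
            return 1;
        }

        /*
         * Assumption of this model: a partially filled chunk is the last
         * one, so the scan ends here without ever consulting npages.
         */
        if (chunk->ntuples < chunk->capacity)
            return 0;

        /* Chunk filled exactly to the end: only now does npages matter. */
        scan->npages_reads++;
        if (scan->cur_chunk + 1 >= scan->npages)
            return 0;

        scan->cur_chunk++;
        scan->cur_item = 0;
    }
}

/* Convenience helper: run the scan to completion and count the items. */
static int
demo_count_items(DemoScan *scan)
{
    int n = 0;

    while (demo_scan_next(scan))
        n++;
    return n;
}
```

In this model a scan over a single half-full chunk finishes correctly even when npages holds garbage, because the value is never read. Only a chunk that is filled exactly to the end forces the scan to look at npages, which matches the observation that the ordinary tests pass while a test that fills chunks completely would be expected to trip over an uninitialized or clobbered value.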

I was assuming it didn't break the case you mentioned because that's
just stats counters (maybe those finish up wrong but that's probably
not a failure), but now it sounds like you've seen another reason.

> I believe that Thomas was going to do something like this anyway. I'm
> happy to leave it up to him, but I can pursue this separately if that
> makes sense.

Why not clobber "lower down" in dsm_create(), as I showed? You don't
have to use the table-of-contents mechanism to use DSM memory.
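
To make the "lower down" idea concrete, here is a minimal, self-contained sketch in C. It does not use or reproduce the real dsm_create() API: DemoSegment, the demo_* functions and the DEMO_CLOBBER_BYTE value are hypothetical stand-ins, and malloc() stands in for a freshly mapped shared memory segment. The point it tries to show is that filling the segment with a nonzero pattern at the lowest creation layer covers every caller, whichever higher-level mechanism it uses to carve up the memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary nonzero fill pattern chosen for this sketch. */
#define DEMO_CLOBBER_BYTE 0x7F

/* Hypothetical stand-in for a shared memory segment. */
typedef struct DemoSegment
{
    size_t         size;
    unsigned char *base;
} DemoSegment;

/*
 * Create a segment and clobber it immediately, so that no caller can
 * accidentally rely on the memory starting out zero-filled.
 */
static DemoSegment *
demo_segment_create(size_t size)
{
    DemoSegment *seg;

    assert(size > 0);

    seg = malloc(sizeof(DemoSegment));
    if (seg == NULL)
        return NULL;

    seg->base = malloc(size);
    if (seg->base == NULL)
    {
        free(seg);
        return NULL;
    }
    seg->size = size;

    /* The clobber happens here, below any layout mechanism built on top. */
    memset(seg->base, DEMO_CLOBBER_BYTE, size);
    return seg;
}

static void
demo_segment_destroy(DemoSegment *seg)
{
    if (seg != NULL)
    {
        free(seg->base);
        free(seg);
    }
}

/* Returns 1 if every byte still holds the clobber pattern, else 0. */
static int
demo_segment_is_clobbered(const DemoSegment *seg)
{
    size_t i;

    for (i = 0; i < seg->size; i++)
    {
        if (seg->base[i] != DEMO_CLOBBER_BYTE)
            return 0;
    }
    return 1;
}

/* Read a 32-bit field at a byte offset, as a forgetful caller might. */
static uint32_t
demo_read_u32(const DemoSegment *seg, size_t offset)
{
    uint32_t value;

    assert(offset + sizeof(value) <= seg->size);
    memcpy(&value, seg->base + offset, sizeof(value));
    return value;
}
```

A field that some code forgot to initialize then reads back as 0x7F7F7F7F rather than as a plausible-looking zero, which is what would let a missing initialization show up as a reliable test failure.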