From: | Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Use read streams in pg_visibility |
Date: | 2024-08-27 07:49:19 |
Message-ID: | CAN55FZ2NuoNdikpo5+OXeYaS+=+4upo4hug4ZV+Pj_nn7comgg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > > The downside of this approach is there are too many "vmbuffer is valid
> > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > > vmbuffer needs to be read again" cases in the read stream version (700
> > > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > > the read stream reading a new vmbuf while getting next block numbers.
> > > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > > have changed while we were acquiring the page lock.
> > >
> > > Did the test that found 700 "read again" use a different procedure than the
> > > one shared as setup.sh down-thread? Using that script alone, none of the vm
> > > bits are set, so I'd not expect any heap page reads.
> >
> > Yes, it is basically the same script but the query is 'SELECT
> > pg_check_visible('${TABLE}'::regclass);'.
>
> I gather the scripts deal exclusively in tables with no vm bits set, so
> pg_visibility did no heap page reads under those scripts. But the '700 "read
> again"' result suggests either I'm guessing wrong, or that result came from a
> different test procedure. Thoughts?
Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached. You can run the script with
[corrupt | visibility] arguments to test the related patches.
> > > The "vmbuffer needs to be read again" regression fits what I would expect the
> > > v1 patch to do with a table having many vm bits set. In general, I think the
> > > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> > > while collect_corrupt_items() works on another vmbuffer. Much of the time,
> > > they'll be the same buffer. It could be as simple as that, but you could
> > > consider further optimizations like these:
> >
> > Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> > "read-again" cases to ~40. I ran the test again and the timing
> > improvement is ~5% now.
>
> > I think the number of "read-again" cases are low enough now.
>
> Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
> lookups, not an increase in I/O.
>
> > Does creating something like
> > pg_general_read_stream_next_block() in read_stream code and exporting
> > it makes sense?
>
> To me, that name isn't conveying the function's behavior, and the words "pg_"
> and "general_" aren't adding information there. How about one of these names
> or similar:
>
> seq_read_stream_cb
> sequential_read_stream_cb
> block_range_read_stream_cb
I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.
> > > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > > is not accurate. I didn't test, but I think the loop's "Recheck to avoid
> > > returning spurious results." looks at the bit for the wrong block. If that's
> > > what v1 does, could you expand the test file to catch that? For example, make
> > > a two-block table with only the second block all-visible.
> >
> > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > checked and confirmed it is working as expected by manually logging
> > blocknos returned from the read stream. I am not sure how to add a
> > test case for this.
>
> VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
> TID, even if rolled back, clears both vm bits for the TID's page. Past tests
> like that had instability problems. One cause is a concurrent session's XID
> or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
> have been one of the countermeasures, but I don't remember clearly. Hence,
> please search the archives or the existing pg_visibility tests for how we
> dealt with that. It may not be problem for this particular test.
Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.
> > There is one behavioral change introduced with the patches. It could
> > happen when collect_corrupt_items() is called with both all_visible
> > and all_frozen being true.
> > -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> > false in callback. Callback returns this block number because
> > VM_ALL_FROZEN() is true.
> > -> At the /* Recheck to avoid returning spurious results. */ part, we
> > should only check VM_ALL_FROZEN() again as this was returning true in
> > the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> > VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> > (vice versa of callback).
> > -> We were skipping this block in the master but the patched version
> > does not skip that.
> >
> > Is this a problem? If this is a problem, a solution might be to
> > callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> > per_buffer_data.
>
> No, I don't consider that a problem. That's not making us do additional I/O,
> just testing more bits within the pages we're already loading. The difference
> in behavior is user-visible, but it's the same behavior change the user would
> get if the timing had been a bit different.
Thanks for the clarification.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Add-general-use-struct-and-callback-to-read-strea.patch | application/x-patch | 2.8 KB |
v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patch | application/x-patch | 1.9 KB |
v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patch | application/x-patch | 2.2 KB |
v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patch | application/x-patch | 2.4 KB |
v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patch | application/x-patch | 5.2 KB |
setup.sh | application/x-sh | 977 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2024-08-27 08:12:24 | Re: Convert sepgsql tests to TAP |
Previous Message | jian he | 2024-08-27 06:53:00 | Re: POC, WIP: OR-clause support for indexes |