Re: Use read streams in pg_visibility

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use read streams in pg_visibility
Date: 2024-08-27 07:49:19
Message-ID: CAN55FZ2NuoNdikpo5+OXeYaS+=+4upo4hug4ZV+Pj_nn7comgg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > > The downside of this approach is there are too many "vmbuffer is valid
> > > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > > vmbuffer needs to be read again" cases in the read stream version (700
> > > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > > the read stream reading a new vmbuf while getting next block numbers.
> > > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > > have changed while we were acquiring the page lock.
> > >
> > > Did the test that found 700 "read again" use a different procedure than the
> > > one shared as setup.sh down-thread? Using that script alone, none of the vm
> > > bits are set, so I'd not expect any heap page reads.
> >
> > Yes, it is basically the same script but the query is 'SELECT
> > pg_check_visible('${TABLE}'::regclass);'.
>
> I gather the scripts deal exclusively in tables with no vm bits set, so
> pg_visibility did no heap page reads under those scripts. But the '700 "read
> again"' result suggests either I'm guessing wrong, or that result came from a
> different test procedure. Thoughts?

Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached. You can run the script with
[corrupt | visibility] arguments to test the related patches.

> > > The "vmbuffer needs to be read again" regression fits what I would expect the
> > > v1 patch to do with a table having many vm bits set. In general, I think the
> > > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> > > while collect_corrupt_items() works on another vmbuffer. Much of the time,
> > > they'll be the same buffer. It could be as simple as that, but you could
> > > consider further optimizations like these:
> >
> > Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> > "read-again" cases to ~40. I ran the test again and the timing
> > improvement is ~5% now.
>
> > I think the number of "read-again" cases are low enough now.
>
> Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
> lookups, not an increase in I/O.
>
> > Does creating something like
> > pg_general_read_stream_next_block() in read_stream code and exporting
> > it makes sense?
>
> To me, that name isn't conveying the function's behavior, and the words "pg_"
> and "general_" aren't adding information there. How about one of these names
> or similar:
>
> seq_read_stream_cb
> sequential_read_stream_cb
> block_range_read_stream_cb

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

> > > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > > is not accurate. I didn't test, but I think the loop's "Recheck to avoid
> > > returning spurious results." looks at the bit for the wrong block. If that's
> > > what v1 does, could you expand the test file to catch that? For example, make
> > > a two-block table with only the second block all-visible.
> >
> > Yes, it was not accurate. I am getting blockno from the buffer now. I
> > checked and confirmed it is working as expected by manually logging
> > blocknos returned from the read stream. I am not sure how to add a
> > test case for this.
>
> VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
> TID, even if rolled back, clears both vm bits for the TID's page. Past tests
> like that had instability problems. One cause is a concurrent session's XID
> or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
> have been one of the countermeasures, but I don't remember clearly. Hence,
> please search the archives or the existing pg_visibility tests for how we
> dealt with that. It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

> > There is one behavioral change introduced with the patches. It could
> > happen when collect_corrupt_items() is called with both all_visible
> > and all_frozen being true.
> > -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> > false in callback. Callback returns this block number because
> > VM_ALL_FROZEN() is true.
> > -> At the /* Recheck to avoid returning spurious results. */ part, we
> > should only check VM_ALL_FROZEN() again as this was returning true in
> > the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> > VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> > (vice versa of callback).
> > -> We were skipping this block in the master but the patched version
> > does not skip that.
> >
> > Is this a problem? If this is a problem, a solution might be to
> > callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> > per_buffer_data.
>
> No, I don't consider that a problem. That's not making us do additional I/O,
> just testing more bits within the pages we're already loading. The difference
> in behavior is user-visible, but it's the same behavior change the user would
> get if the timing had been a bit different.

Thanks for the clarification.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v3-0001-Add-general-use-struct-and-callback-to-read-strea.patch application/x-patch 2.8 KB
v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patch application/x-patch 1.9 KB
v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patch application/x-patch 2.2 KB
v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patch application/x-patch 2.4 KB
v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patch application/x-patch 5.2 KB
setup.sh application/x-sh 977 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2024-08-27 08:12:24 Re: Convert sepgsql tests to TAP
Previous Message jian he 2024-08-27 06:53:00 Re: POC, WIP: OR-clause support for indexes