Re: Use read streams in pg_visibility

From: Noah Misch <noah(at)leadboat(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use read streams in pg_visibility
Date: 2024-08-23 19:01:24
Message-ID: 20240823190124.9e@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > > The downside of this approach is there are too many "vmbuffer is valid
> > > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > > vmbuffer needs to be read again" cases in the read stream version (700
> > > vs 20 for the 6 GB table). This is caused by the callback function of
> > > the read stream reading a new vmbuf while getting next block numbers.
> > > So, vmbuf is wrong when we are checking visibility map bits that might
> > > have changed while we were acquiring the page lock.
> >
> > Did the test that found 700 "read again" use a different procedure than the
> > one shared as setup.sh down-thread? Using that script alone, none of the vm
> > bits are set, so I'd not expect any heap page reads.
>
> Yes, it is basically the same script but the query is 'SELECT
> pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts. But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure. Thoughts?

> > The "vmbuffer needs to be read again" regression fits what I would expect the
> > v1 patch to do with a table having many vm bits set. In general, I think the
> > fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> > while collect_corrupt_items() works on another vmbuffer. Much of the time,
> > they'll be the same buffer. It could be as simple as that, but you could
> > consider further optimizations like these:
>
> Thanks for the suggestion. Keeping two vmbuffers lowered the count of
> "read-again" cases to ~40. I ran the test again and the timing
> improvement is ~5% now.

> I think the number of "read-again" cases are low enough now.

Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

> Does creating something like
> pg_general_read_stream_next_block() in read_stream code and exporting
> it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there. How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

> > The callback doesn't return blocks having zero vm bits, so the blkno variable
> > is not accurate. I didn't test, but I think the loop's "Recheck to avoid
> > returning spurious results." looks at the bit for the wrong block. If that's
> > what v1 does, could you expand the test file to catch that? For example, make
> > a two-block table with only the second block all-visible.
>
> Yes, it was not accurate. I am getting blockno from the buffer now. I
> checked and confirmed it is working as expected by manually logging
> blocknos returned from the read stream. I am not sure how to add a
> test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

> There is one behavioral change introduced with the patches. It could
> happen when collect_corrupt_items() is called with both all_visible
> and all_frozen being true.
> -> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
> false in callback. Callback returns this block number because
> VM_ALL_FROZEN() is true.
> -> At the /* Recheck to avoid returning spurious results. */ part, we
> should only check VM_ALL_FROZEN() again as this was returning true in
> the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
> VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
> (vice versa of callback).
> -> We were skipping this block in the master but the patched version
> does not skip that.
>
> Is this a problem? If this is a problem, a solution might be to
> callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
> per_buffer_data.

No, I don't consider that a problem. That's not making us do additional I/O,
just testing more bits within the pages we're already loading. The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-08-23 19:05:34 Re: On disable_cost
Previous Message Tom Lane 2024-08-23 18:48:35 Re: On disable_cost