Re: Use read streams in pg_visibility

From: Noah Misch <noah(at)leadboat(dot)com>
To: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use read streams in pg_visibility
Date: 2024-08-20 18:47:42
Message-ID: 20240820184742.f2.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> 2- collect_corrupt_items()
>
> This one is more complicated. The read stream callback function loops
> until it finds a suitable block to read. So, if the callback returns
> an InvalidBlockNumber; it means that the stream processed all possible
> blocks and the stream should be finished. There is ~3% timing
> improvement with this change. I started the server with the default
> settings and created a 6 GB table. Then run 100 times
> pg_check_visible() by clearing the OS cache between each run.
>
> The downside of this approach is there are too many "vmbuffer is valid
> but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> vmbuffer needs to be read again" cases in the read stream version (700
> vs 20 for the 6 GB table). This is caused by the callback function of
> the read stream reading a new vmbuf while getting next block numbers.
> So, vmbuf is wrong when we are checking visibility map bits that might
> have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
block. For example, if one block out of every 33000 is all-visible, every
heap block we visit has a different vmbuffer. It's likely not optimal for
the callback to pin and unpin 20 vmbuffers, then have
collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
could notice this trend and request a stop of the callbacks until more of
the heap block work completes. If pg_visibility is going to be the only
place in the code with this use case, it's probably not worth carrying the
extra API just for pg_visibility. However, if we get a stronger use case
later, pg_visibility could be another beneficiary.

> +/*
> + * Callback function to get next block for read stream object used in
> + * collect_visibility_data() function.
> + */
> +static BlockNumber
> +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> + void *callback_private_data,
> + void *per_buffer_data)
> +{
> + struct collect_visibility_data_read_stream_private *p = callback_private_data;
> +
> + if (p->blocknum < p->nblocks)
> + return p->blocknum++;
> +
> + return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
I think it's time to have a general-use callback for block range scans. The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

> +static BlockNumber
> +collect_corrupt_items_read_stream_next_block(ReadStream *stream,
> + void *callback_private_data,
> + void *per_buffer_data)
> +{
> + struct collect_corrupt_items_read_stream_private *p = callback_private_data;
> +
> + for (; p->blocknum < p->nblocks; p->blocknum++)
> + {
> + bool check_frozen = false;
> + bool check_visible = false;
> +
> + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
> + check_frozen = true;
> + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
> + check_visible = true;
> + if (!check_visible && !check_frozen)
> + continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

> @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
> items->count = 64;
> items->tids = palloc(items->count * sizeof(ItemPointerData));
>
> + p.blocknum = 0;
> + p.nblocks = nblocks;
> + p.rel = rel;
> + p.vmbuffer = &vmbuffer;
> + p.all_frozen = all_frozen;
> + p.all_visible = all_visible;
> + stream = read_stream_begin_relation(READ_STREAM_FULL,
> + bstrategy,
> + rel,
> + MAIN_FORKNUM,
> + collect_corrupt_items_read_stream_next_block,
> + &p,
> + 0);
> +
> /* Loop over every block in the relation. */
> for (blkno = 0; blkno < nblocks; ++blkno)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) ...

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-08-20 18:58:13 Re: Adding clarification to description of IPC wait events XactGroupUpdate and ProcArrayGroupUpdate
Previous Message Bruce Momjian 2024-08-20 18:41:18 Re: Partial aggregates pushdown