Re: Use read streams in pg_visibility

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use read streams in pg_visibility
Date: 2024-08-23 11:20:06
Message-ID: CAN55FZ3o1CoykUOVMxdAebPnseZ7vgxnqTjpKe1s4hHVkcyDFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thanks for the review and feedback!

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:
> > 2- collect_corrupt_items()
> >
> > This one is more complicated. The read stream callback function loops
> > until it finds a suitable block to read. So, if the callback returns
> > an InvalidBlockNumber; it means that the stream processed all possible
> > blocks and the stream should be finished. There is ~3% timing
> > improvement with this change. I started the server with the default
> > settings and created a 6 GB table. Then run 100 times
> > pg_check_visible() by clearing the OS cache between each run.
> >
> > The downside of this approach is there are too many "vmbuffer is valid
> > but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
> > vmbuffer needs to be read again" cases in the read stream version (700
> > vs 20 for the 6 GB table). This is caused by the callback function of
> > the read stream reading a new vmbuf while getting next block numbers.
> > So, vmbuf is wrong when we are checking visibility map bits that might
> > have changed while we were acquiring the page lock.
>
> Did the test that found 700 "read again" use a different procedure than the
> one shared as setup.sh down-thread? Using that script alone, none of the vm
> bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

> The "vmbuffer needs to be read again" regression fits what I would expect the
> v1 patch to do with a table having many vm bits set. In general, I think the
> fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
> while collect_corrupt_items() works on another vmbuffer. Much of the time,
> they'll be the same buffer. It could be as simple as that, but you could
> consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

> - Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
> lookup when the other Buffer var already has the block you want.
>
> - [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
> to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
> nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
> block. For example, if one block out of every 33000 is all-visible, every
> heap block we visit has a different vmbuffer. It's likely not optimal for
> the callback to pin and unpin 20 vmbuffers, then have
> collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
> could notice this trend and request a stop of the callbacks until more of
> the heap block work completes. If pg_visibility is going to be the only
> place in the code with this use case, it's probably not worth carrying the
> extra API just for pg_visibility. However, if we get a stronger use case
> later, pg_visibility could be another beneficiary.

I think the number of "read-again" cases are low enough now. I am
working on 'using ReadRecentBuffer() or a similar technique' but that
may take more time, so I attached patches without these optimizations.

> > +/*
> > + * Callback function to get next block for read stream object used in
> > + * collect_visibility_data() function.
> > + */
> > +static BlockNumber
> > +collect_visibility_data_read_stream_next_block(ReadStream *stream,
> > + void *callback_private_data,
> > + void *per_buffer_data)
> > +{
> > + struct collect_visibility_data_read_stream_private *p = callback_private_data;
> > +
> > + if (p->blocknum < p->nblocks)
> > + return p->blocknum++;
> > +
> > + return InvalidBlockNumber;
>
> This is the third callback that just iterates over a block range, after
> pg_prewarm_read_stream_next_block() and
> copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
> I think it's time to have a general-use callback for block range scans. The
> quantity of duplicate code is low, but the existing function names are long
> and less informative than a behavior-based name.

I agree. Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

> > +static BlockNumber
> > +collect_corrupt_items_read_stream_next_block(ReadStream *stream,
> > + void *callback_private_data,
> > + void *per_buffer_data)
> > +{
> > + struct collect_corrupt_items_read_stream_private *p = callback_private_data;
> > +
> > + for (; p->blocknum < p->nblocks; p->blocknum++)
> > + {
> > + bool check_frozen = false;
> > + bool check_visible = false;
> > +
> > + if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
> > + check_frozen = true;
> > + if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
> > + check_visible = true;
> > + if (!check_visible && !check_frozen)
> > + continue;
>
> If a vm has zero bits set, this loop will scan the entire vm before returning.
> Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
> VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

Done. I added CHECK_FOR_INTERRUPTS() to the loop in callback.

> > @@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
> > items->count = 64;
> > items->tids = palloc(items->count * sizeof(ItemPointerData));
> >
> > + p.blocknum = 0;
> > + p.nblocks = nblocks;
> > + p.rel = rel;
> > + p.vmbuffer = &vmbuffer;
> > + p.all_frozen = all_frozen;
> > + p.all_visible = all_visible;
> > + stream = read_stream_begin_relation(READ_STREAM_FULL,
> > + bstrategy,
> > + rel,
> > + MAIN_FORKNUM,
> > + collect_corrupt_items_read_stream_next_block,
> > + &p,
> > + 0);
> > +
> > /* Loop over every block in the relation. */
> > for (blkno = 0; blkno < nblocks; ++blkno)
>
> The callback doesn't return blocks having zero vm bits, so the blkno variable
> is not accurate. I didn't test, but I think the loop's "Recheck to avoid
> returning spurious results." looks at the bit for the wrong block. If that's
> what v1 does, could you expand the test file to catch that? For example, make
> a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

> Since the callback skips blocks, this function should use the
> read_stream_reset() style of looping:

Done.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

v2 of the patches are attached, only 0002 is updated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment Content-Type Size
v2-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patch text/x-patch 3.4 KB
v2-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch text/x-patch 5.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jingtang Zhang 2024-08-23 11:50:22 Redundant assignment of table OID for a HeapTuple?
Previous Message Andreas Karlsson 2024-08-23 11:11:38 Re: pgstattuple: fix free space calculation