Re: Use streaming read API in ANALYZE

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jakub(dot)wartak(at)enterprisedb(dot)com
Subject: Re: Use streaming read API in ANALYZE
Date: 2024-08-22 07:31:06
Message-ID: CA+14425+Ccm07ocG97Fp+FrD9xUXqmBKFvecp0p+gV2YYR258Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, May 20, 2024 at 10:46 PM Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:

> On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
> wrote:
> >
> > On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
> wrote:
> > >
> > > On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
> wrote:
> > > I wanted to discuss what will happen to this patch now that
> > > 27bc1772fc8 is reverted. I am continuing this thread but I can create
> > > another thread if you prefer so.
> >
> > 041b96802ef is discussed in the 'Table AM Interface Enhancements'
> > thread [1]. The main problems discussed about this commit is that the
> > read stream API is not pushed to the heap-specific code and, because
> > of that, the other AM implementations need to use read streams. To
> > push read stream API to the heap-specific code, it is pretty much
> > required to pass BufferAccessStrategy and BlockSamplerData to the
> > initscan().
> >
> > I am sharing the alternative version of this patch. The first patch
> > just reverts 041b96802ef and the second patch is the alternative
> > version.
> >
> > In this alternative version, the read stream API is not pushed to the
> > heap-specific code, but it is controlled by the heap-specific code.
> > The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
> > heap-specific code if the scan type is 'ANALYZE'. This flag is used to
> > decide whether streaming API in ANALYZE will be used or not. If this
> > flag is set, this means heap AMs and read stream API will be used. If
> > it is not set, this means heap AMs will not be used and code falls
> > back to the version before read streams.
>
> Personally, I think the alternative version here is the best option
> other than leaving what is in master. However, I would vote for
> keeping what is in master because 1) where we are in the release
> timeline and 2) the acquire_sample_rows() code, before streaming read,
> was totally block-based anyway.
>
> If we kept what was in master, do we need to document for table AMs
> how to use read_stream_next_buffer() or can we assume they will look
> at the heap AM implementation?
>

Hi all,

I ran into this with the PG17 beta3 and for our use-case we need to set up
another stream (using a different relation and/or fork, but using the same
strategy) in addition to the one that is passed in to the
scan_analyze_next_block(), so to be able to do that it is necessary to have
the block sampler and the strategy from the original stream. Given that
this makes it very difficult (see below) to set up a different ReadStream
inside the TAM unless you have the BlockSampler and the BufferReadStrategy,
and the old interface did not have this problem, I would consider this a
regression.

This would be possible to solve in a few different ways:

1. The alternate version proposed by Nazir allows you to decide which
interface to use.
2. Reverting the patch entirely would also solve the problem.
3. Passing down the block sampler and the strategy to scan_begin() and
move the ReadStream setup in analyze.c into initscan() in heapam.c, but
this requires adding new parameters to this function.
4. Having accessors that allow you to get the block sampler and strategy
from the ReadStream object.

The proposed solution 1 above would still not solve the problem of allowing
a user to set up a different or extra ReadStream if they want to use the
new ReadStream interface. Reverting the ReadStream patch entirely would
also deal with the regression, but I find the ReadStream interface very
elegant since it moves the block sampling into a separate abstraction and
would like to use it, but right now there are some limitations if you want
to use it fully. The third solution above would allow that, but it requires
a change in the signature of scan_begin(), which might not be the best at
this stage of development. Proposal 4 would allow you to construct a new
stream based on the old one and might be a simple alternative solution as
well with less changes to the current code.

It is possible to capture the information in ProcessUtility() and
re-compute all the parameters, but that is quite a lot of work to get
right, especially considering that these computations are all over the
place and part of different functions at different stages (For example,
variable ring_size, needed to set up the buffer access strategy is computed
in ExecVacuum(); variable targrows, used to set up the buffer sampler, is
computed inside acquire_sample_rows(), which in turn requires to decide
what attributes to analyze, which is computed in do_analyze_rel().)

It would be great if this could be fixed before the PG17 release now that
27bc1772fc8 was reverted.
--
Best wishes,
Mats Kindahl, Timescale

Attachment Content-Type Size
0001-Add-accessors-for-ReadStream.v1.patch text/x-patch 1.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jelte Fennema-Nio 2024-08-22 07:31:54 Re: RFC: Additional Directory for Extensions
Previous Message Anthonin Bonnefoy 2024-08-22 07:21:39 Segfault in jit tuple deforming on arm64 due to LLVM issue