Re: Use streaming read API in ANALYZE

From: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jakub(dot)wartak(at)enterprisedb(dot)com
Subject: Re: Use streaming read API in ANALYZE
Date: 2024-04-29 15:41:09
Message-ID: CAN55FZ3Q9FbGtszRXPf=0wH3-A0XWkjscEzEf=V7-rZku6DGMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> Pushed. Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread but I can create
another thread if you prefer so.

After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM
agnostic again. So, read stream changes might have to be pushed down
now but there are a couple of roadblocks like Melanie mentioned [1]
before.

Quote from Melanie [1]:

On Thu, 11 Apr 2024 at 19:19, Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.

If we do not want to add a new table AM callback like Melanie
mentioned, it is pretty much required to pass BufferAccessStrategy and
BlockSamplerData to the initscan().

> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I wonder the same, I could not think of any alternative solution to
this problem.

Another quote from Melanie [2] in the same thread:

On Thu, 11 Apr 2024 at 20:48, Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> I will also say that, had this been 6 months ago, I would probably
> suggest we restructure ANALYZE's table AM interface to accommodate
> read stream setup and to address a few other things I find odd about
> the current code. For example, I think creating a scan descriptor for
> the analyze scan in acquire_sample_rows() is quite odd. It seems like
> it would be better done in the relation_analyze callback. The
> relation_analyze callback saves some state like the callbacks for
> acquire_sample_rows() and the Buffer Access Strategy. But at least in
> the heap implementation, it just saves them in static variables in
> analyze.c. It seems like it would be better to save them in a useful
> data structure that could be accessed later. We have access to pretty
> much everything we need at that point (in the relation_analyze
> callback). I also think heap's implementation of
> table_beginscan_analyze() doesn't need most of
> heap_beginscan()/initscan(), so doing this instead of something
> ANALYZE specific seems more confusing than helpful.

If we want to implement ANALYZE specific counterparts of
heap_beginscan()/initscan(); we may think of passing
BufferAccessStrategy and BlockSamplerData to them.

Also, there is an ongoing(?) discussion about a few problems /
improvements about the acquire_sample_rows() mentioned at the end of
the 'Table AM Interface Enhancements' thread [3]. Should we wait for
these discussions to be resolved or can we resume working on this
patch?

Any kind of feedback would be appreciated.

[1] https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-04-29 16:14:10 Re: Tarball builds in the new world order
Previous Message Heikki Linnakangas 2024-04-29 15:24:04 Direct SSL connection and ALPN loose ends