Re: Use streaming read API in ANALYZE

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Banck <mbanck(at)gmx(dot)net>
Cc: Mats Kindahl <mats(at)timescale(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jakub(dot)wartak(at)enterprisedb(dot)com
Subject: Re: Use streaming read API in ANALYZE
Date: 2024-09-09 22:27:43
Message-ID: CA+hUKGLVZi4TEwYUqc24pN-f9LrOTrfTX4SAbTdyZnwTUqBX4g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 10, 2024 at 3:36 AM Michael Banck <mbanck(at)gmx(dot)net> wrote:
> I am a bit confused about the status of this thread. Robert mentioned
> RC1, so I guess it pertains to v17 but I don't see it on the open item
> wiki list?

Yes, v17. Alright, I'll add an item.

> Does the above mean you are going to revert it for v17, Thomas? And if
> so, what exactly? The ANALYZE changes on top of the streaming read API
> or something else about that API that is being discussed on this thread?

I might have been a little pessimistic in that assessment. Another
workaround that seems an awful lot cleaner and less invasive would be
to offer a new ReadStream API function that provides access to block
numbers and the strategy, ie the arguments of v16's
scan_analyze_next_block() function. Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually work too well for you in v16 anyway, at a guess;
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) Timescale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.
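
To make the shape of that idea concrete, here is a minimal standalone
sketch. It is not the attached patch and does not use PostgreSQL's real
types: ToyStream, ToyStrategy, toy_stream_next_block() and
toy_sample_cb() are all hypothetical names invented for illustration.
The only point it shows is an accessor that hands back the next raw
block number plus the strategy, without performing any read, so a TAM
could feed those into its own v16-style block fetching code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are PostgreSQL's real types. */
typedef uint32_t ToyBlockNumber;
#define TOY_INVALID_BLOCK ((ToyBlockNumber) 0xFFFFFFFF)

typedef struct ToyStrategy
{
    int         ring_size;
} ToyStrategy;

/* Yields the next block number to sample, or TOY_INVALID_BLOCK at the end. */
typedef ToyBlockNumber (*ToyNextBlockCB) (void *private_data);

typedef struct ToyStream
{
    ToyNextBlockCB callback;
    void       *private_data;
    ToyStrategy *strategy;
} ToyStream;

/*
 * Hypothetical accessor: return the next raw block number and, if the
 * caller asks for it, the strategy.  No I/O happens here; the caller
 * decides how (and from which relation) to read the block.
 */
static ToyBlockNumber
toy_stream_next_block(ToyStream *stream, ToyStrategy **strategy)
{
    assert(stream != NULL);
    if (strategy != NULL)
        *strategy = stream->strategy;
    return stream->callback(stream->private_data);
}

/* Toy block sampler: walks a fixed list of pre-chosen block numbers. */
typedef struct ToySample
{
    const ToyBlockNumber *blocks;
    size_t      nblocks;
    size_t      next;
} ToySample;

static ToyBlockNumber
toy_sample_cb(void *private_data)
{
    ToySample  *sample = private_data;

    if (sample->next >= sample->nblocks)
        return TOY_INVALID_BLOCK;
    return sample->blocks[sample->next++];
}
```

A caller would loop on toy_stream_next_block() until it returns
TOY_INVALID_BLOCK, mapping each block number onto whichever underlying
relation it belongs to.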

I looked briefly at another non-heap-like table AM, the Citus Columnar
TAM. I am not familiar with that code and haven't studied it deeply
this morning, but its _next_block() currently just returns true, so I
think it will somehow need to change to counting calls and returning
false when it thinks it's been called enough times (otherwise the loop
in acquire_sample_rows() won't terminate, I think?). I suppose an
easy way to do that without generating extra I/O or having to think
hard about how to preserve the loop count from v16 would be to use
this function.

I think there are broadly three categories of TAMs with respect to
ANALYZE block sampling: those that are very heap-like (blocks of one
SMgrRelation) and can just use the stream directly, those that are not
at all heap-like (doing something completely different to sample
tuples and ignoring the block aspect but using _next_block() to
control the loop), and then Timescale's case which is sort of
somewhere in between: almost heap-like from the point of view of this
sampling code, ie working with blocks, but fudging the meaning of
block numbers, which we didn't anticipate. (I wonder if it fails to
sample fairly across the underlying relation boundary anyway because
their data densities must surely be quite different, but that's not
what we're here to talk about.)

. o O { We need that wiki page listing TAMs with links to the open
source ones... }

Attachment Content-Type Size
0001-Allow-raw-block-numbers-to-be-read-from-ReadStream.patch text/x-patch 1.9 KB
