Re: Use streaming read API in ANALYZE

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Michael Banck <mbanck(at)gmx(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, jakub(dot)wartak(at)enterprisedb(dot)com
Subject: Re: Use streaming read API in ANALYZE
Date: 2024-09-20 06:36:42
Message-ID: CA+14425U9MC9AZEvnNcCoUvTH39v_Y4p4tB3jQheK=_e65RKKQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <mats(at)timescale(dot)com> wrote:
> > I used the combination of your patch and making the computation of
> > vacattrstats for a relation available through the API and managed to
> > implement something that I think does the right thing. (I just sampled a
> > few different statistics to check if they seem reasonable, like most common
> > vals and most common freqs.) See attached patch.
>
> Cool. I went ahead and committed that small new function and will
> mark the open item closed.
>

Thank you Thomas, this will help a lot.

> > I need the vacattrstats to set up the two streams for the internal
> > relations. I can just re-implement them in the same way as is already done,
> > but this seems like a small change that avoids unnecessary code duplication.
>
> Unfortunately we're not in a phase where we can make non-essential
> changes, we're right about to release and we're only committing fixes,
> and it seems like you have a way forward (albeit with some
> duplication). We can keep talking about that for v18.
>

Yes, I can work around this by re-implementing the same code that is
present in PostgreSQL.

>
> From your earlier email:
> > I'll take a look at the thread. I really think the ReadStream
> > abstraction is a good step in the right direction.
>
> Here's something you or your colleagues might be interested in: I was
> looking around for a fun extension to streamify as a demo of the
> technology, and I finished up writing a quick patch to streamify
> pgvector's HNSW index scan, which worked well enough to share[1] (I
> think it should in principle be able to scale with the number of graph
> connections, at least 16x), but then people told me that it's of
> limited interest because everybody knows that HNSW indexes have to fit
> in memory (I think there may also be memory prefetch streaming
> opportunities, unexamined for now). But that made me wonder what the
> people with the REALLY big indexes do for hyperdimensional graph
> search on a scale required to build Skynet, and that led me back to
> Timescale pgvectorscale[2]. I see two obvious signs that this thing
> is eminently and profitably streamifiable: (1) The stated aim is
> optimising for indexes that don't fit in memory, hence "Disk" in the
> name of the research project it is inspired by, (2) I see that
> DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
> (Windows). So now I am waiting patiently for a Rustacean to show up
> with patches for pgvectorscale to use ReadStream, which would already
> get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
> hopefully), and hopefully also provide a nice test case for the AIO
> patch set which redirects buffer reads through io_uring (Linux,
> basically the newer better libaio) or background I/O workers (other
> OSes, which works surprisingly competitively). Just BTW for
> comparison with DiskANN we have also had early POC-quality patches
> that drive AIO with overlapped/IOCP (Windows) which will eventually be
> rebased and proposed (Windows isn't really a primary target but we
> wanted to validate that the stuff we're working on has abstractions
> that will map to the obvious system APIs found in the systems
> PostgreSQL targets). For completeness, I've also had it mostly
> working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
> support for those last two so that was a bit of a dead end).

> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
> [2] https://github.com/timescale/pgvectorscale
> [3] https://github.com/microsoft/DiskANN
>

Thanks Thomas, this looks really interesting. I've forwarded it to the
pgvectorscale team.
--
Best wishes,
Mats Kindahl, Timescale
