Re: Trying out read streams in pgvector (an extension)

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Trying out read streams in pgvector (an extension)
Date: 2024-06-11 15:37:09
Message-ID: d8c3bbe9-690d-4d51-9d80-51f9856ee086@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/11/24 12:53 AM, Thomas Munro wrote:
> Hi,
>
> I was looking around for an exotic index type to try the experience of
> streamifying an extension, ie out-of-core code. I am totally new to
> pgvector, but since everyone keeps talking about it, I could not avoid
> picking up some basic facts in the pgconf.dev hallway track, and
> understood that its scans have some degree of known-order access
> predictability, and then also some degree of fuzzy-predictable
> order-not-yet-determined access too. It's also quite random in the
> I/O sense.

Cool! I happened to be chatting w/Andrew about this yesterday to see if
there could be some benefits for folks who are running pgvector on PG17.

> Here's a toy to streamify the known-order part. I think for the fuzzy
> part that links those parts together, maybe there is some way to guess
> when it's a reasonable time to speculatively prefetch the lowest order
> stuff in the pairing heap, and then deal with it if you're wrong, but
> I didn't try that...

I would suggest submitting this at least as a draft PR to the pgvector
project[1]:

https://github.com/pgvector/pgvector

> Someone involved in that project mentioned that it's probably not a
> great topic to research in practice, because real world users of HNSW
> use fully cached ie prewarmed indexes, because the performance is so
> bad otherwise.

I don't think that was me, at least in those words (and I had noted I'd
love to chat w/you about this, but we didn't find time). Stating it
differently, the "ideal" is to keep the indexes in memory, as that leads
to the best performance, but reality is more complicated. These datasets
are quite large (e.g. the 1536-dim vector is a 6KB payload, excluding
what's in the index) and if you're storing the full vector in the index
(there are now some quantization methods available[4]), you can easily
double your dataset size, and quickly exceed available memory. So I
think in the real world, you're more likely to see swapping pages
between disk and memory. Some of this was addressed in the talk @
PGConf.dev[3] (slides here[2]).

> (Though maybe that argument is a little circular...).
> So although this patch clearly speeds up cold HSNW searches to a
> degree controlled by effective_io_concurrency, I'll probably look for
> something else. Suggestions for interesting index types to look at
> streamifying are very welcome!

Yup, so this makes sense for HNSW particularly at the higher-level
pages. But it may make more sense for IVFFlat, given how it clusters
data. With IVFFlat, you first find your lists/centers, and then you
determine how you index each vector around the lists. When those lists
are stored to disk, they're basically sequential. A lot of the struggles
with IVFFlat is both the long load from disk and ultimately some
comptuational issues for a larger set of vector comparisons (though if
you're able to build small, efficient clusters, it can be much faster
than HNSW!). HNSW behaves more like a (bear with me) typically
"tree-based" index, where you'll have hot spots at the top, but because
of the nature of vector search, the lower levels tend to be more random
in access.

Regardless, the part where this is interesting (at least to me) is that
a lot of these vectors tend to take up a full page anyway, so anything
we can do to read them faster from disk will generally get a thumbs up
from me.

> Hmm. If that's really true about HNSW though, then there may still be
> an opportunity to do automatic memory prefetching[1]. But then in the
> case of index building, "stream" is NULL in this patch anyway. It
> surely must also be possible to find some good places to put
> profitable explicit pg_mem_prefetch() calls given the predictability
> and the need to get only ~60ns ahead for that usage. I didn't look
> into that because I was trying to prove things about read_stream.c,
> not get involved in another project :-D

Well, as alluded to in[2], thinking about how another project uses this
will certainly help, and anything we can do to continue to speed up
vector queries helps PostgreSQL ;) Some of the contributions from folks
who have focused on core have significantly helped pgvector.

> Here ends my science experiment report, which I'm dropping here just
> in case others see useful ideas here. The main thing I learned about
> the read stream API is that it'd be nice to be able to reset the
> stream but preserve the distance (something that came up on the
> streaming sequential scan thread for a different reason), to deal with
> cases where look-ahead opportunities come in bursts but you want a
> longer lived stream than I used here. That is the reason the patch
> creates and destroys temporary streams in a loop; doh. It also
> provides an interesting case study for what speculative random
> look-ahead support might need to look like.

If you're curious, I can fire up some of my more serious benchmarks on
this to do a before/after to see if there's anything interesting. I have
a few large datasets (10s of millions) of larger vectors (1536dim => 6KB
payloads) that could see the net effect here.

> (Make sure you remember to set effective_io_concurrency to an
> interesting number if you want to generate a lot of overlapping
> fadvise calls.)

What would you recommend as an "interesting number?" - particularly
using the data parameters above.

Thanks,

Jonathan

[1] https://github.com/pgvector/pgvector
[2]
https://www.pgevents.ca/events/pgconfdev2024/sessions/session/1/slides/42/pgconfdev-2024-vectors.pdf
[3]
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/1-vectors-how-to-better-support-a-nasty-data-type-in-postgresql/
[4] https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-06-11 16:03:41 Re: use CREATE DATABASE STRATEGY = FILE_COPY in pg_upgrade
Previous Message Dave Page 2024-06-11 15:33:55 Re: Windows: openssl & gssapi dislike each other