From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Trying out read streams in pgvector (an extension) |
Date: | 2024-06-11 07:47:31 |
Message-ID: | e6c01b35-fb3f-4a36-a765-d731752f1b6c@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 11/06/2024 07:53, Thomas Munro wrote:
> Someone involved in that project mentioned that it's probably not a
> great topic to research in practice, because real world users of HNSW
> use fully cached ie prewarmed indexes, because the performance is so
> bad otherwise. (Though maybe that argument is a little circular...).
I think that's true in practice for *building* an HNSW index, but faster
*searching* when the index is not in memory seems quite useful. And of
course, faster is always better, even if it's only in a non-optimal
scenario.
> So although this patch clearly speeds up cold HSNW searches to a
> degree controlled by effective_io_concurrency, I'll probably look for
> something else. Suggestions for interesting index types to look at
> streamifying are very welcome!
GiST and GIN?
> Hmm. If that's really true about HNSW though, then there may still be
> an opportunity to do automatic memory prefetching[1]. But then in the
> case of index building, "stream" is NULL in this patch anyway. It
> surely must also be possible to find some good places to put
> profitable explicit pg_mem_prefetch() calls given the predictability
> and the need to get only ~60ns ahead for that usage. I didn't look
> into that because I was trying to prove things about read_stream.c,
> not get involved in another project :-D
>
> Here ends my science experiment report, which I'm dropping here just
> in case others see useful ideas here. The main thing I learned about
> the read stream API is that it'd be nice to be able to reset the
> stream but preserve the distance (something that came up on the
> streaming sequential scan thread for a different reason), to deal with
> cases where look-ahead opportunities come in bursts but you want a
> longer lived stream than I used here. That is the reason the patch
> creates and destroys temporary streams in a loop; doh. It also
> provides an interesting case study for what speculative random
> look-ahead support might need to look like.
This reminds me of a prototype I wrote earlier, see
https://github.com/pgvector/pgvector/pull/386, 1st commit. It
reorganizes HnswSearchLayer() so that it in iteration, it first collects
all the neighbors to visit, and then visits them, somewhat similar to
your patch.
--
Heikki Linnakangas
Neon (https://neon.tech)
From | Date | Subject | |
---|---|---|---|
Next Message | Tender Wang | 2024-06-11 08:11:26 | Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop() |
Previous Message | Bertrand Drouvot | 2024-06-11 07:25:11 | Re: Track the amount of time waiting due to cost_delay |