From: | Giuseppe Broccolo <g(dot)broccolo(dot)7(at)gmail(dot)com> |
---|---|
To: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, mail(at)joeconway(dot)com |
Subject: | Re: vector search support |
Date: | 2023-05-29 13:18:03 |
Message-ID: | CAFtuf8AzttX4Vzy5AZebNY_PxKzve_aWYf+0YUFfeKJ0xzYm_A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Jonathan,
On 5/26/23 3:38 PM, Jonathan S. Katz <jkatz(at)postgresql(dot)org> wrote:
> On 4/26/23 9:31 AM, Giuseppe Broccolo wrote:
> > We finally opted for ElasticSearch as search engine, considering that it
> > was providing what we needed:
> >
> > * support to store dense vectors
> > * support for kNN searches (last version of ElasticSearch allows this)
>
> I do want to note that we can implement indexing techniques with GiST
> that perform K-NN searches with the "distance" support function[1], so
> adding the fundamental functions to help with this around known vector
> search techniques could add this functionality. We already have this
> today with "cube", but as Nathan mentioned, it's limited to 100 dims.
>
Yes, I was aware of this. It would be enough to define the required support
functions for GiST
indexing (I was a bit in the loop when it was tried to add PG14 presorting
support to GiST indexing
in PostGIS[1]). That would be really helpful indeed. I was just mentioning
it because I know about
other teams using ElasticSearch as a storage of dense vectors only for this.
> > An internal benchmark showed us that we were able to achieve the
> > expected performance, although we are still lacking some points:
> >
> > * clustering of vectors (this has to be done outside the search engine,
> > using DBScan for our use case)
>
> From your experience, have you found any particular clustering
> algorithms better at driving a good performance/recall tradeoff?
>
Nope, it really depends on the use case: the point of using DBScan above
was mainly because it's a way of clustering without knowing a priori the
number
of clusters the algorithm should be able to retrieve, which is actually a
parameter
needed for Kmeans. Depending on the use case, DBScan might have better
performance in noisy datasets (i.e. entries that really do not belong to a
cluster in
particular). Noise in vectors obtained with embedding models is quite
normal,
especially when the embedding model is not properly tuned/trained.
In our use case, DBScan was more or less the best choice, without biasing
the
expected clusters.
Also PostGIS includes an implementation of DBScan for its geometries[2].
> > * concurrency in updating the ElasticSearch indexes storing the dense
> > vectors
>
> I do think concurrent updates of vector-based indexes is one area
> PostgreSQL can ultimately be pretty good at, whether in core or in an
> extension.
Oh, it would save a lot of overhead in updating indexed vectors! It's
something needed
when embedding models are re-trained, vectors are re-generated and indexes
need to
be updated.
Regards,
Giuseppe.
[1]
https://github.com/postgis/postgis/blob/a4f354398e52ad7ed3564c47773701e4b6b87ae8/doc/release_notes.xml#L284
[2]
https://github.com/postgis/postgis/blob/ce75a0e81aec2e8a9fad2649ff7b230327acb64b/postgis/lwgeom_window.c#L117
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2023-05-29 13:35:25 | Re: make_ctags: use -I option to ignore pg_node_attr macro |
Previous Message | vignesh C | 2023-05-29 12:46:22 | Re: Support logical replication of DDLs |