From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | KNNGiST for knn-search |
Date: | 2009-11-23 17:44:01 |
Message-ID: | 4B0AC9E1.5050509@sigaev.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi there,
http://www.sigaev.ru/misc/knngist-0.11.tar.gz
we'd like to present contrib module for CVS HEAD, which contains implementation
of knn (k nearest neighbourhood) search in PostgreSQL, see README.knngist for
details. KNNGiST is an extension of existing GiST, which inherits most of
their code, so it's has recovery (WAL-logged) and concurrency support.
Basically, it introduces a new distance-based priority queue tree traversal
algorithm (instead of depth-first in plain GiST), so index scan returns rows
sorted by closiness to a query. Notice, index returns all rows, so one should
use LIMIT clause to specify k (which is usually small) to get real benefits.
We get 300x times better performance on real-life data (about 1 mln points):
Module requires rbtree and point_ops patches applied.
(http://archives.postgresql.org/message-id/4B0A8DFA.7050009@sigaev.ru and
http://archives.postgresql.org/message-id/4B0A8F0F.3020308@sigaev.ru)
Old way:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
order by dist asc LIMIT 10;
Time: 1024.242 ms
knn-search:
SELECT coordinates, (coordinates <-> '5.0,5.0'::point) AS dist FROM spots
WHERE coordinates >< '5.0,5.0'::point LIMIT 10;
Time: 3.158 ms
We didn't patch existing implementation of GiST for several reasons:
1. KNNGiST is about 5% slower than GiST on non-knn search queries, like
contains or contained by, because of some overhead of new algorithm of
tree traversal
2. KNNGiST can't be used in bitmap index scan, which destroys order of results,
We don't know the way to forbid bitmap index scan only for knn queries.
Current version of KNNGiST doesn't distinguish knn-search and usual search
and postgres doesn't know about ordered output from KNNGiST.
We see several ways to add KNNGiST to PostgreSQL:
1. Patch existing GiST.
Con - see problems above.
Pro - any existing GiST opclasses will work with KNNGiST.
2. add KNNGIST as a contrib module like now.
Con - there is no way in PostgreSQL to test other modules, which depends
on KNNGiST. For example, it's easy to add support for trigrams, but
then we add dependence on contrib/pg_trgm module.
3. add KNNGiST as separate access method into core of PostgreSQL.
We think it the best way, since we don't touch existing GiST and opclasses,
and could use KNNGiST in other contrib modules
Separate problem is query planning.
1. To use KNNGiST we need to use index scan ! To prevent seqscan on table,
operator >< has extremely high cost.
2. To prevent bitmap index scan, KNNGiST doesn't have bitmap index scan
interface at all (no amgetbitmap method).
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
Attachment | Content-Type | Size |
---|---|---|
README.knngist | text/plain | 11.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andrew Dunstan | 2009-11-23 17:44:02 | Re: Re: [COMMITTERS] pgsql: Remove -w (--ignore-all-space) option from pg_regress's diff |
Previous Message | Simon Riggs | 2009-11-23 17:39:16 | Re: Partitioning option for COPY |