Quick Links

Re: scoring differences between bitmasks

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Ben <bench(at)silentmedia(dot)com>
Cc:	Postgresql-General mailing list <pgsql-general(at)postgresql(dot)org>
Subject:	Re: scoring differences between bitmasks
Date:	2005-10-02 16:47:06
Message-ID:	20051002164700.GI30492@svana.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Sun, Oct 02, 2005 at 08:28:53AM -0700, Ben wrote:
> Yes, that's the straightforward way to do it. But given that my
> vectors are 256 bits in length, and that I'm going to eventually have
> about 4 million of them to search through, I was hoping greater minds
> than mine had figured out how to do it faster, or how compute some
> kind of indexing....... somehow.

Ah right, that's another kettle of fish. Well, your problem space is a
form of eulidean space in the sense that the triangle inequality holds,
eg dist(A,C) <= dist(A,B)+dist(B,C). Maybe one of the geometric index
classes could help.

Here comes something I thought up myself, but you may be able to come
up with something better.

1. For each string, store its distance from the strings:
0000000000 etc
0101010101 etc
0011001100 etc
maybe more or less, you'll need to investigate

2. Index each of these columns.
3. For your target string calculate the distances from each of these
4. Do a query like:

SELECT key, * from table
where dist_zero_string > $dist_zero_search-2 and dist_zero_string < $dist_zero_search+2
where dist_0101_string > $dist_0101_search-2 and dist_0101_string < $dist_0101_search+2
where dist_0011_string > $dist_0101_search-2 and dist_0011_string < $dist_0011_search+2
order by dist(key,searchvalue);

PostgreSQL 8.1 has bitmap index scans which could make the above quite
fast. Some theory may help determine the best key strings. For example
all-zeros and all-ones are redundant since you can calculate the
distance for one from the other. You also might need to vary your
distances to match with (the 2 above) depending on the average distance
to the strings and where you search.

You could use functional indexes to avoid storing the actual values in
the tables. Or you could take two or three of these values and make
points and use an rtree index on those to find nearby points (which
will represent nearby strings).

Anyway, this is just some random thoughts, hope it helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

In response to

Re: scoring differences between bitmasks at 2005-10-02 15:28:53 from Ben

Responses

Re: scoring differences between bitmasks at 2005-10-02 17:17:10 from Ben

Browse pgsql-general by date

	From	Date	Subject
Next Message	Wes	2005-10-02 17:15:05	Re: 8.1 'make check' fails
Previous Message	Samik Raychaudhuri	2005-10-02 16:35:01	Portable PostgreSQL