Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching

From: "NTPT" <ntpt(at)centrum(dot)cz>
To: <lapham(at)jandr(dot)org>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Index on TEXT versus CHAR(32)... fast exact TEXT matching
Date: 2004-09-06 22:23:56
Message-ID: 006301c49460$3399df70$74beebd5@wbp1
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

what about to use a CRC32 checksum of the text, computed by client
application an then make index of crc32 data ?

ie

add column crc int4,
add column md5 varchar(255 )
create index blabla on mytable (crc)
or even create index blabla2 on mytable (crc,md5)

and query like ......... where crc='crc32 of your searched text' AND
md5='md5 hash'

i am not shure, but this should work pretty fast.

----- Original Message -----
From: "Jon Lapham" <lapham(at)jandr(dot)org>
To: <pgsql-general(at)postgresql(dot)org>
Sent: Saturday, September 04, 2004 3:04 PM
Subject: [GENERAL] Index on TEXT versus CHAR(32)... fast exact TEXT matching

> I have a table that stores TEXT information. I need query this table to
> find *exact* matches to the TEXT... no regular expressions, no LIKE
> queries, etc. The TEXT could be from 1 to 10000+ characters in length,
> quite variable. If it matters, the TEXT may contain UNICODE characters...
>
> Example:
> CREATE TABLE a (id SERIAL, thetext TEXT);
> SELECT id FROM a WHERE thetext='Some other text';
>
> One way I thought to optimize this process would be to store an MD5 hash
> of the "thetext" column and query on that:
>
> CREATE TABLE a (id SERIAL, thetext TEXT, thetext_md5 CHAR(32));
> SELECT id FROM a WHERE thetext_m5d=md5('Some other text');
>
> Now, obviously I would want to build an INDEX on either "thetext" or
> "thetext_md5", depending on which way I decide to make the table.
>
> My question is, what is the absolute fastest way to find the exact match
> to a TEXT column? Any amount of pre-processing is fine (such as
> calculating the MD5's of all the TEXT tuples), but the query must be
> extremely fast.
>
> Has anyone compared (theoretical or practical) the performance of
> querying a TEXT-based INDEX versus a CHAR(32)-based INDEX? Is my MD5
> idea a waste of time? Is there something better than MD5? Would it be
> better to store the "fingerprint" of the TEXT as an integer somehow, so
> that the INDEX could be based on a INT* column?
>
> Thanks for any help!
>
> -Jon
>
> --
> -**-*-*---*-*---*-*---*-----*-*-----*---*-*---*-----*-----*-*-----*---
> Jon Lapham <lapham(at)jandr(dot)org> Rio de Janeiro, Brasil
> Personal: http://www.jandr.org/
> ***-*--*----*-------*------------*--------------------*---------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Google Mike 2004-09-07 00:11:40 FAQ: Having Trouble Getting PL/pgSQL Going?
Previous Message Alvaro Herrera 2004-09-06 21:23:31 Re: Passing RECORD variable from func1() to func2()