Quick Links

Re: [HACKERS] Index greater than 8k

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Index greater than 8k
Date:	2006-11-01 04:44:01
Message-ID:	20061101044401.GI12008@alvh.no-ip.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-hackers

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> > Darcy Buskermolen wrote:
> >> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> >>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> >>>> a self contained test case directly to Teodor which shows the error.
> >>>>
> >>>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
> >>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> >>> pg_trgm is designed to find similar words and use technique known as
> >>> trigrams. This will work good on small pieces of text such as words or set
> >>> expression. But all big texts (on the same language) will be similar :(.
> >>> So, I didn't take care about guarantee that index tuple's size limitation.
> >>> In principle, it's possible to modify pg_trgm to have such guarantee, but
> >>> index becomes lossy - all tuples gotten from index should be checked by
> >>> table's tuple evaluation.
> >> The problem is some of the data we are working with is not strictly "text" but
> >> bytea that we've run through encode(bytea, 'escape'),
> >
> > I think one good question is why are you storing bytea and then
> > searching like it were text.
>
> We are not storing bytea, a customer is. We are trying to work around
> customer requirements. The data that is being stored is not always text,
> sometimes it is binary (a flash file or jpeg). We are using escaped text
> to be able to search the string contents of that file .

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

I imagine strings(bytea) would be a function that returns the
concatenation of all pure (7 bit) ASCII strings in the byte sequence.

On the other hand, based on Teodor's comment on pg_trgm, maybe this
won't be possible at all.

> > Why not store the text as text, and put
> > the extraneous bytes somewhere else? Certainly you wouldn't expect to
> > be able to find text among the bytes, would you?
>
> Yes we do (and can) expect to find text among the bytes. We have
> searches running, we are just running into the maximum size issues for
> certain rows.

Do you mean you actually find stuff based on text attributes in JPEG
images and the like? I thought those were compressed ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

In response to

Re: [HACKERS] Index greater than 8k at 2006-11-01 04:24:15 from Joshua D. Drake

Responses

Re: [HACKERS] Index greater than 8k at 2006-11-01 04:55:04 from Joshua D. Drake
Re: [HACKERS] Index greater than 8k at 2006-11-01 05:01:23 from Tom Lane

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tom Lane	2006-11-01 04:47:56	Re: Encoding, Unicode, locales, etc.
Previous Message	Gregory S. Williamson	2006-11-01 04:36:55	Re: [HACKERS] Index greater than 8k

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2006-11-01 04:51:40	Re: Extended protocol logging
Previous Message	Gregory S. Williamson	2006-11-01 04:36:55	Re: [HACKERS] Index greater than 8k