From: | Martin Gregorie <martin(at)gregorie(dot)org> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Matching on keyword or phrases within a field that is delimited with an "or" operator "|" |
Date: | 2012-03-13 00:38:04 |
Message-ID: | 1331599085.15175.73.camel@zappa.gregorie.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Mon, 2012-03-12 at 15:57 -0700, Jim Ostler wrote:
> I have a table that is around 20 GB, so I need to optimize as best as
> possible the matching with another table on keywords across multiple
> fields. I have around 10 fields that have keywords or phrases
> delimited with the "or" operator "|". So it would be in the form of
> "a | b | and jack | cd" . There are around 20 keywords or phrases
> per field, and these keywords could be any word.
>
How static is the list of keywords?
Is there any significance in their order? (I'm guessing there isn't).
How many rows are there in the table, IOW how big is each row?
I wonder if putting the keywords in a separate table with a many:many
relationship with your big table would help. This would retain your
ability to add or remove keywords without affecting the schema. The
selection rules used to combine keywords would also run fast.
I'm getting good performance from a mail archive where messages are
selected using:
- a 1:M relationship with a subject table
- a M:M relationship with an address table
- range comparison with the date sent (non-indexed field in the message)
- and ILIKE(%phrase%) comparison with the message text
The query is assembled using WHERE clauses for each of the four
selection possibilities listed above before being prepared and issued. A
where clause is only included if the user has specified a value for it.
There is an AND relationship between the clauses. Currently there are
around 130,000 messages in the database. Speed: it took 0.6 seconds to
find 209 messages to or from an address in the last 5 years and
containing the keyword 'f1a'. The same search without the date range
specified ran in 0.86 seconds. It took 16.9 seconds to find the 12331
messages containing 'f1a' when no address or data range were specified.
This is on a dual core, 3.2 GHz Athlon with 4GB of RAM using PG 9.1.
The other possibility is to use a star schema with each keyword being a
dimension - IOW a traditional data warehouse set-up, but of course this
does require a static list of keywords to be defined.
Martin
From | Date | Subject | |
---|---|---|---|
Next Message | Bosco Rama | 2012-03-13 00:39:27 | Re: Calculated update |
Previous Message | Rob Sargent | 2012-03-13 00:32:21 | Re: Calculated update |