From: | Tomas Vondra <tv(at)fuzzy(dot)cz> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: pg_trgm partial-match |
Date: | 2012-11-19 01:56:29 |
Message-ID: | 50A991CD.5090901@fuzzy.cz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 15.11.2012 20:39, Fujii Masao wrote:
> Hi,
>
> I'd like to propose to extend pg_trgm so that it can compare a partial-match
> query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial'
> GIN method for pg_trgm.
>
> Currently, when the query key is less than three characters, we cannot use
> a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a
> partial-match method. In this case, seq scan or index full scan would be
> executed, and its response time would be very slow. I'd like to alleviate this
> problem.
>
> Note that we cannot do a partial-match if KEEPONLYALNUM is disabled,
> i.e., if query key contains multibyte characters. In this case, byte length of
> the trigram string might be larger than three, and its CRC is used as a
> trigram key instead of the trigram string itself. Because of using CRC, we
> cannot do a partial-match. Attached patch extends pg_trgm so that it
> compares a partial-match query key only when KEEPONLYALNUM is
> enabled.
>
> Attached patch is WIP yet. What I should do next is:
>
> * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc.
> * write the regression test
>
> Comments? Review? Objection?
Hi,
I've done a quick review of the current patch:
(1) It applies cleanly on the current master and builds fine.
(2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented
differently (using tabs instead of spaces).
(3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef
KEEPONLYALNUM, although trgm_pmatch is not used at all.
(4) The patch removes some commented-out variables, but there still
remain various commented-out variables. Will this be cleaned too?
(5) I've done basic functionality of the patch, it really seems to work:
CREATE EXTENSION pg_trgm ;
CREATE TABLE TEST (val TEXT);
INSERT INTO test
SELECT md5(i::text) FROM generate_series(1,1000000) s(i);
CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops);
ANALYZE test;
EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%';
QUERY PLAN
------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=1655.96..11757.63 rows=141414 width=33)
Recheck Cond: (val ~~ '%aa%'::text)
-> Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414
width=0)
Index Cond: (val ~~ '%aa%'::text)
(4 rows)
Without the patch, this gives a seq scan (as expected).
Do you expect to update the docs too? IMHO it's worth mentioning that
the pg_trgm can handle even patterns shorter than 2 chars ...
regards
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2012-11-19 05:38:32 | Re: Proposal for Allow postgresql.conf values to be changed via SQL |
Previous Message | Xi Wang | 2012-11-19 01:15:09 | Re: [RFC] Fix div/mul crash and more undefined behavior |