From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Index on regexes |
Date: | 2013-06-13 20:19:35 |
Message-ID: | CAPpHfduMvS=sGPNh_88bUo3TEg_KhxV7s81Fa+ZV96yadgff2g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hackers,
Attached patch contains opclass which demonstrates advantages of GIN
additional information storing itself without other GIN improvements. It
implements inversed task of regex indexing. It works so: you create index
on regexes and search for regexes matched query string. It introduce two
additional operators |~ and *~ for case-sensetive and case-insensetive
regex to string matching, and gin_regexp_trgm_ops opclass.
Let's consider some example.
At first, generate some regexes.
CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer)
FROM generate_series(1,($1 + random()*$2)::int)), '');
$$
LANGUAGE sql;
CREATE TABLE test AS select '(' || generate_string(1,4) || '|' ||
generate_string(1,4) || '|' || generate_string(1,4) || ')' ||
generate_string(1,2) || '(' || generate_string(1,4) || '|' ||
generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM
generate_series(1,1000000);
I use only 10 characters on alphabet in order to have better chance of
matching. It generate some regexes like so:
postgres=# SELECT * FROM test LIMIT 10;
s
------------------------------------
(g|cij|ah)jg(iei|hfc|eef)
(gbfdb|ehbg|akf)ge(bc|jgee|jidd)
(jedc|kgc|c)bc(ii|bji|iebc)
(aa|eie|bgdb)f(fc|he|f)
(b|ijc|ae)ae(eccb|ie|kjf)
(bib|igf|kdibf)fij(gcbh|efi|fidj)
(bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
(kfb|egccd|iefce)jf(kj|jbef|kbc)
(bhh|c|cd)cb(h|ed|jg)
(id|j|geg)gc(djif|ai|cjjjc)
(10 rows)
Without index search takes about 10 seconds.
postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual
time=172.990..97357.312 rows=438 loops=1)
Filter: (s |~ 'abcdefghijkl'::text)
Rows Removed by Filter: 999562
Total runtime: 97357.490 ms
(4 rows)
And with index it takes only 110 milliseconds.
postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28)
(actual time=68.143..110.663 rows=438 loops=1)
Recheck Cond: (s |~ 'abcdefghijkl'::text)
-> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0)
(actual time=67.929..67.929 rows=438 loops=1)
Index Cond: (s |~ 'abcdefghijkl'::text)
Total runtime: 110.870 ms
(5 rows)
------
With best regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
index_on_regexes.1.patch.gz | application/x-gzip | 7.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2013-06-13 20:56:18 | Re: Adjusting elog behavior in bootstrap/standalone mode |
Previous Message | Alexander Korotkov | 2013-06-13 20:09:52 | GIN improvements part 1: additional information |