Quick Links

Re: indexing elements of a csv ?

From:	Ian Lawrence Barwick <barwick(at)gmail(dot)com>
To:	"Gauthier, Dave" <dave(dot)gauthier(at)intel(dot)com>
Cc:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: indexing elements of a csv ?
Date:	2013-03-12 15:20:16
Message-ID:	CAB8KJ=iJtdtRdw69fcFQLnWBd8OW6CR9PHUndU=C9mAbhCEvfg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

2013/3/13 Ian Lawrence Barwick <barwick(at)gmail(dot)com>:
> 2013/3/12 Gauthier, Dave <dave(dot)gauthier(at)intel(dot)com>:
>> Hi:
>>
>> v9.0.1 on linux.
>>
>> I have a table with a column that is a csv. Users will select records based
>> upon the existence of an element of the csv. There is an index on that
>> column but I'm thinking that it won't be of much use in this situation. Is
>> there a way to facilitate these queries?
>>
>> Example:
>>
>> create table foo (col0 text, col1 text);
>>
>> create index foo_col1 on foo (col1);
>>
>> insert into foo (col0,col1) values
>> ('moe','aa,bbb,c'),('larry','xxxxx,bbb,yyy'),('curly','m,nnnn,oo');
>>
>> now...
>>
>> select col0 from foo where <the csv element 'bbb' exists as a csv element of
>> col1>
>>
>>
>> Some attempts, which get the right answers, but which probably won't be very
>> efficient...
>>
>> select col0 from foo where string_to_array('bbb','') <@
>> string_to_array(col1);
>>
>> select col0 from foo where ','||col1||',' like '%,bbb,%';
>>
>> select col0 from foo where ((col1 like 'bbb,%') or (col1 like '%,bbb,%') or
>> (col1 like '%,bbb'));
>>
>> Long shot, but I thought I'd ask anyway.
>
> A GIN index might do the trick:
>
> CREATE INDEX ix_col1_ix ON foo USING GIN(string_to_array(col1,','));
>
> (This is assuming the CSV values can be cleanly converted to
> an array using "string_to_array()").
>
> You could then query it with:
> SELECT col0 FROM foo WHERE string_to_array(col1,',') @> '{bbb}'::text[];

Just out of interest, I populated the table with around 1,000,000 rows of
randomly generated data (three items of random upper case characters
in col1), results with and without index below (using an untuned 9.2
installation
on a laptop with a slow hard drive).
Note that adding the index doubled the total table size, which might
be something
to watch out for if the table is very big and you have a lot of unique
values in the
"CSV" column.

Regards

Ian Barwick

testdb=# SELECT * from foo where string_to_array(col1,',') @> '{PKRY}'::text[];
col0 | col1
--------+-------------------
ARWC | JIJ,MBDVU,PKRY
FUNWOA | JKEK,PKRY,MQFUQTJ
PJGTD | KSO,HSTB,PKRY
(3 rows)

Time: 1325.536 ms
testdb=# explain SELECT * from foo where string_to_array(col1,',') @>
'{PKRY}'::text[];
QUERY PLAN
------------------------------------------------------------------
Seq Scan on foo (cost=0.00..28400.42 rows=5021 width=76)
Filter: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
(2 rows)

testdb=# CREATE INDEX ix_col1_ix ON foo using gin(string_to_array(col1,','));
CREATE INDEX
Time: 170533.158 ms
testdb=# ANALYZE foo;
ANALYZE
Time: 1431.665 ms
testdb=# SELECT * from foo where string_to_array(col1,',') @> '{PKRY}'::text[];
col0 | col1
--------+-------------------
ARWC | JIJ,MBDVU,PKRY
FUNWOA | JKEK,PKRY,MQFUQTJ
PJGTD | KSO,HSTB,PKRY
(3 rows)

Time: 0.906 ms
testdb=# explain SELECT * from foo where string_to_array(col1,',') @>
'{PKRY}'::text[];
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=20.79..389.58 rows=101 width=24)
Recheck Cond: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
-> Bitmap Index Scan on ix_col1_ix (cost=0.00..20.76 rows=101 width=0)
Index Cond: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
(4 rows)

Time: 0.377 ms

In response to

Re: indexing elements of a csv ? at 2013-03-12 15:02:20 from Ian Lawrence Barwick

Browse pgsql-general by date

	From	Date	Subject
Next Message	Perry Smith	2013-03-12 15:41:46	Testing Technique when using a DB
Previous Message	Steve Erickson	2013-03-12 15:19:17	Re: indexing elements of a csv ?