Re: substring index what is better way to query

From: Tieson Molly <tvmaly(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: substring index what is better way to query
Date: 2017-06-20 14:40:43
Message-ID: CADPSLQbeeAev3aYa4pfmHZF9dJAderXzw_sKywsGzxPjvNn1AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Tom,
is there a different construct than the Similar To that would work?

I know for certain that the first few characters could be different due to
the nature of geohashes. So I may not be able to optimize the prefix
aspect in some cases.

Best regards,

Ty

On Jun 20, 2017 10:19 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Tieson Molly <tvmaly(at)gmail(dot)com> writes:
> > I have an index I created on the prefix of a column:
>
> > create index location_geo_idx ON locations( substring(geohash, 0, 5));
>
> > I currently use a similar to query, but I wanted to know if there is a
> > faster way to query multiple value using this index than this?
>
> > select l.geohash from locations l where l.geohash similar to
> '(dr7g|dr7e)%';
>
> Well, you've got a couple of problems there. The most basic one is that
> that index doesn't match that query at all. You need to arrange things
> so that the lefthand side of the SIMILAR TO operator is exactly the
> indexed value, not something that's related to it. (Yes, in principle
> that index could be used to answer this query, but it would require a
> great deal more intimate knowledge than the planner has about the
> semantics of both substring() and SIMILAR TO.) IOW, you need to write
>
> select l.geohash from locations l
> where substring(l.geohash, 0, 5) similar to '(dr7g|dr7e)%';
>
> The other possible solution would be to just index the geohash strings
> verbatim; unless they are quite long, that's what I'd recommend, usually.
>
> Secondly, if you're using a non-C locale, you're likely not getting an
> indexscan plan anyway; check it with EXPLAIN. To get an indexed prefix
> search out of a pattern match, the index has to use C sorting rules,
> which you can force with a COLLATE or text_pattern_ops option if the
> database's prevailing locale isn't C.
>
> Thirdly, if you experiment with EXPLAIN a little bit, you'll soon realize
> that the planner is not great at extracting common prefix strings out of
> OR'd pattern branches:
>
> regression=# create table loc (f1 text unique);
> CREATE TABLE
> regression=# explain select * from loc where f1 similar to '(dr7g|dr7e)%';
> QUERY PLAN
> -------------------------------------------------------------------------
> Bitmap Heap Scan on loc (cost=4.22..14.37 rows=1 width=32)
> Filter: (f1 ~ '^(?:(?:dr7g|dr7e).*)$'::text)
> -> Bitmap Index Scan on loc_f1_key (cost=0.00..4.22 rows=7 width=0)
> Index Cond: ((f1 >= 'd'::text) AND (f1 < 'e'::text))
> (4 rows)
>
> The useful part of this for speed purposes is the "Index Cond", and
> you can see that it's only enforcing that the first character be "d".
> I don't remember that code very well at the moment, but I'm a bit
> surprised that it's even figured out that the "d" is common to both
> branches. You can get a lot more traction if you factor the common
> prefix manually:
>
> regression=# explain select * from loc where f1 similar to 'dr7(g|e)%';
> QUERY PLAN
> -------------------------------------------------------------------------
> Bitmap Heap Scan on loc (cost=4.22..14.37 rows=1 width=32)
> Filter: (f1 ~ '^(?:dr7(?:g|e).*)$'::text)
> -> Bitmap Index Scan on loc_f1_key (cost=0.00..4.22 rows=7 width=0)
> Index Cond: ((f1 >= 'dr7'::text) AND (f1 < 'dr8'::text))
> (4 rows)
>
> or maybe even
>
> regression=# explain select * from loc where f1 similar to 'dr7g%' or f1
> similar to 'dr7e%';
> QUERY PLAN
> ------------------------------------------------------------
> -------------------
> Bitmap Heap Scan on loc (cost=8.45..19.04 rows=2 width=32)
> Recheck Cond: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~
> '^(?:dr7e.*)$'::text))
> Filter: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))
> -> BitmapOr (cost=8.45..8.45 rows=14 width=0)
> -> Bitmap Index Scan on loc_f1_key (cost=0.00..4.22 rows=7
> width=0)
> Index Cond: ((f1 >= 'dr7g'::text) AND (f1 < 'dr7h'::text))
> -> Bitmap Index Scan on loc_f1_key (cost=0.00..4.22 rows=7
> width=0)
> Index Cond: ((f1 >= 'dr7e'::text) AND (f1 < 'dr7f'::text))
> (8 rows)
>
> Whether this is worth the trouble depends a lot on your data distribution,
> but any of them are probably better than the seqscan you're no doubt
> getting right now.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2017-06-20 14:45:47 Re: substring index what is better way to query
Previous Message Tom Lane 2017-06-20 14:19:09 Re: substring index what is better way to query