Re: Allow ILIKE forward matching to use btree index

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Yugo NAGATA <nagata(at)sraoss(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Allow ILIKE forward matching to use btree index
Date: 2025-01-15 22:34:52
Message-ID: 5aa34780cda992afb7861316ec7b1563db8b9d67.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2025-01-15 at 01:40 +0900, Yugo NAGATA wrote:
> > > For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND
> > > t < '123fop')"
> > > and index scan can be used for this condition. On the other hand,
> > > "t ~~* '123foo'"
> > > cannot be converted and sequential scan is used.
> > >
> > > Even in this case, we can use a bitmap index scan for the
> > > condition
> > > "(t >= '123f' AND t < '123g') OR "(t >= '123F' AND t < '123G')"
> > > followed by
> > > recheck by the original condition "t ~~* '123foo'", and this
> > > could be faster
> > > than seqscan.

In theory, there could be many OR clauses:

(t >= '123foo' AND t < '123fop') OR
(t >= '123foo' AND t < '123fop') OR
(t >= '123foo' AND t < '123fop') OR
(t >= '123foo' AND t < '123fop') OR
(t >= '123foo' AND t < '123fop') OR
(t >= '123foo' AND t < '123fop') OR
> > > The patch 0001 enables get_index_clause_from_support to receive
> > > OR clauses
> > > from support functions and use them to build bitmap index scan
> > > later. OR clauses
> > > returned by support functions are collected in the new argument
> > > 'orclauses" of
> > > match_restriction_clauses_to_index(), and they are passed to
> > > generate_bitmap_or_paths() later to build bitmap scan paths.

If I understand correctly, this

> > > The patch 0002 modifies the support function to return OR clauses
> > > as an example
> > > above when ILIKE's pattern has case-varying characters in forward
> > > matching. The
> > > OR clauses contains two index conditions for the upper and the
> > > lower letter of
> > > the first case-varying character in the pattern respectively.
> > > Although the
> > > subsequent characters are not considered in the index scans, it
> > > could enough be
> > > faster then sequent scan.
> > >
> > > Here is an example.
> > >
> > > 1. Create a table with random text records
> > >
> > > =# CREATE TABLE tbl (t text);
> > > =# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x
> > > END
> > >                   FROM (SELECT i, md5(i::text) x FROM
> > > generate_series(1,5000000) i);
> > >
> > > 2. Create an index
> > > =# CREATE INDEX ON tbl (t text_pattern_ops);
> > >
> > > 3. Before applying patch: Seq Scan was selected. It takes about 4
> > > sec.
> > >
> > > =# EXPLIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
> >
> > I am sorry, the example in my previous main was wrong. I showed the
> > plan
> > with enable_index_scan = off. Indead, if the pattern starts with
> > case-insensitive characters like '12', an index scan can be used
> > even with
> > ILIKE.
> >
> > postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> >                                                        QUERY
> > PLAN                                                       
> > -------------------------------------------------------------------
> > ------------------------------------------------------
> >  Gather  (cost=1000.00..69129.61 rows=500 width=33) (actual
> > time=36.317..4034.770 rows=1188 loops=1)
> >    Workers Planned: 2
> >    Workers Launched: 2
> >    Buffers: shared hit=99 read=41939
> >    ->  Parallel Seq Scan on tbl  (cost=0.00..68079.61 rows=208
> > width=33) (actual time=19.908..4023.668 rows=396 loops=3)
> >          Filter: (t ~~* 'abc%'::text)
> >          Rows Removed by Filter: 1666271
> >          Buffers: shared hit=99 read=41939
> >  Planning Time: 0.726 ms
> >  Execution Time: 4035.101 ms
> > (10 rows)
> >
> > 4. After applying patch: Bitmap Index Scan was selected.
> >
> > postgres=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE 'abc%';
> >                                                                    
> >        QUERY
> > PLAN                                                               
> >           
> > -------------------------------------------------------------------
> > -------------------------------------------------------------------
> > ------------------------
> >  Bitmap Heap Scan on tbl  (cost=12563.66..58314.53 rows=500
> > width=33) (actual time=156.485..1266.789 rows=1188 loops=1)
> >    Recheck Cond: (((t ~>=~ 'A'::text) AND (t ~<~ 'B'::text) AND (t
> > ~~* 'abc%'::text)) OR ((t ~>=~ 'a'::text) AND (t ~<~ 'b'::text) AND
> > (t ~~* 'abc%'::text)))
> >    Filter: (t ~~* 'abc%'::text)
> >    Rows Removed by Filter: 311473
> >    Heap Blocks: exact=42010
> >    Buffers: shared hit=1 read=44600
> >    ->  BitmapOr  (cost=12563.66..12563.66 rows=297029 width=0)
> > (actual time=136.979..136.980 rows=0 loops=1)
> >          Buffers: shared hit=1 read=2590
> >          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71
> > rows=148515 width=0) (actual time=80.548..80.549 rows=158375
> > loops=1)
> >                Index Cond: ((t ~>=~ 'A'::text) AND (t ~<~
> > 'B'::text))
> >                Buffers: shared read=1272
> >          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..6281.71
> > rows=148515 width=0) (actual time=56.427..56.427 rows=157042
> > loops=1)
> >                Index Cond: ((t ~>=~ 'a'::text) AND (t ~<~
> > 'b'::text))
> >                Buffers: shared hit=1 read=1318
> >  Planning Time: 0.592 ms
> >  Execution Time: 1267.162 ms
> > (16 rows)
> >
> >  
> > > What do you think about it?
> > >
> > > I think another application of OR-clause returning support
> > > function might be
> > > allowing to use an index for NOT LIKE (!~~) expression  because,
> > > for example,
> > > "t !~~ '123foo%'" could be converted to "(t < '123foo' OR t >=
> > > '123fooz')".
> > > (The second condition is a bit loose but this would be safe and
> > > not problem
> > > since tuples are filtered by the original condition after the
> > > index scan.)
> > > However, it would not very useful unless the distribution is much
> > > skew so that
> > > NOT LIKE expression's selectivity is enough small.
>
> I added a new patch 0003 that enables ILIKE forward matching to use a
> SP-Gist index.
> Similar to 0002, this generates BitmapOr Index Scan for two index
> clauses that use
> "^@" operator for upper letter and lower letter pattern respectively.
>
> * Before applying the patch:
>
> postgres=# explain analyze select* from tbl where t ilike 'abc%';
>                                                       QUERY
> PLAN                                                     
> ---------------------------------------------------------------------
> -------------------------------------------------
>  Gather  (cost=1000.00..18899.52 rows=101 width=33) (actual
> time=41.799..930.382 rows=253 loops=1)
>    Workers Planned: 2
>    Workers Launched: 2
>    Buffers: shared hit=96 read=12533
>    ->  Parallel Seq Scan on tbl  (cost=0.00..17889.42 rows=42
> width=33) (actual time=26.041..917.570 rows=84 loops=3)
>          Filter: (t ~~* 'abc%'::text)
>          Rows Removed by Filter: 336582
>          Buffers: shared hit=96 read=12533
>  Planning Time: 0.690 ms
>  Execution Time: 930.545 ms
> (10 rows)
>
>
> * After applying the patch:
>
> postgres=# explain analyze select* from tbl where t ilike 'abc%';
>                                                              QUERY
> PLAN                                                             
> ---------------------------------------------------------------------
> ----------------------------------------------------------------
>  Bitmap Heap Scan on tbl  (cost=3307.96..16702.11 rows=101 width=33)
> (actual time=69.449..215.636 rows=253 loops=1)
>    Recheck Cond: (((t ^@ 'A'::text) AND (t ~~* 'abc%'::text)) OR ((t
> ^@ 'a'::text) AND (t ~~* 'abc%'::text)))
>    Filter: (t ~~* 'abc%'::text)
>    Rows Removed by Filter: 62992
>    Heap Blocks: exact=12437
>    Buffers: shared hit=18529
>    ->  BitmapOr  (cost=3307.96..3307.96 rows=61212 width=0) (actual
> time=62.567..62.568 rows=0 loops=1)
>          Buffers: shared hit=6092
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96
> rows=30606 width=0) (actual time=41.893..41.893 rows=31536 loops=1)
>                Index Cond: (t ^@ 'A'::text)
>                Buffers: shared hit=2461
>          ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..1653.96
> rows=30606 width=0) (actual time=20.671..20.671 rows=31709 loops=1)
>                Index Cond: (t ^@ 'a'::text)
>                Buffers: shared hit=3631
>  Planning Time: 1.414 ms
>  Execution Time: 215.903 ms
> (16 rows)
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2025-01-15 22:40:19 Re: Allow ILIKE forward matching to use btree index
Previous Message Peter Eisentraut 2025-01-15 22:20:58 Re: Use Python "Limited API" in PL/Python