Re: Using regexp from table has unpredictable poor performance

From: Vitalii Tymchyshyn <vit(at)tym(dot)im>
To: Jack Christensen <jack(at)jncsoftware(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Using regexp from table has unpredictable poor performance
Date: 2021-08-25 22:16:26
Message-ID: CABWW-d3Zcc2pYiOp66iTiOcFXj2_JhnGm_2rRVX5U=u-xavWtg@mail.gmail.com
Lists: pgsql-performance

Btw: if you still run out of cache later with more regexes, it may make
sense to do prefiltering first by building a single giant regexp as
string_agg('('||name_matches||')','|') and then only post-filtering the
items that match. If PostgreSQL provides capturing groups you may even be
able to explode the result without a post-filter.

On Wed, Aug 25, 2021 at 2:22 PM Jack Christensen <jack(at)jncsoftware(dot)com> wrote:

> The optimizer was a bit too clever. It used the same plan for the LEFT
> JOIN. But that put me on the right track. I tried a LATERAL join. But the
> optimizer saw through that too and used the same plan. So I tried a
> materialized CTE and that finally forced it to use a different plan. That
> made it run in ~70ms -- about 18x faster. Thanks!
>
> explain analyze
> with r as materialized (
> select * from matching_rules
> where id >= 0 and id < 60
> )
> select r.id, i.id
> from r
> join items i on i.name ~ r.name_matches
> ;
>
> QUERY PLAN
>
> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
> Nested Loop (cost=2.78..714.20 rows=230 width=8) (actual time=0.071..69.545 rows=702 loops=1)
>   Join Filter: (i.name ~ r.name_matches)
>   Rows Removed by Join Filter: 45298
>   CTE r
>     ->  Seq Scan on matching_rules (cost=0.00..2.78 rows=46 width=26) (actual time=0.007..0.047 rows=46 loops=1)
>           Filter: ((id >= 0) AND (id < 60))
>           Rows Removed by Filter: 6
>   ->  CTE Scan on r (cost=0.00..0.92 rows=46 width=36) (actual time=0.008..0.090 rows=46 loops=1)
>   ->  Materialize (cost=0.00..23.00 rows=1000 width=27) (actual time=0.000..0.081 rows=1000 loops=46)
>         ->  Seq Scan on items i (cost=0.00..18.00 rows=1000 width=27) (actual time=0.003..0.092 rows=1000 loops=1)
> Planning Time: 0.206 ms
> Execution Time: 69.633 ms
>
>
> On Wed, Aug 25, 2021 at 4:05 PM Justin Pryzby <pryzby(at)telsasoft(dot)com>
> wrote:
>
>> On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
>> > I have items that need to be categorized by user defined matching rules.
>> > Trusted users can create rules that include regular expressions. I've
>> > reduced the problem to this example.
>>
>> > I use the following query to find matches:
>> >
>> > select r.id, i.id
>> > from items i
>> > join matching_rules r on i.name ~ r.name_matches;
>> >
>> > When there are few rules the query runs quickly. But as the number of
>> rules
>> > increases the runtime often increases at a greater than linear rate.
>>
>> Maybe it's because the REs are cached by RE_compile_and_cache(), but if you
>> loop over the REs in the inner loop, then the caching is ineffective.
>>
>> Maybe you can force it to join with the REs on the outer loop by writing it
>> as:
>> | rules LEFT JOIN items WHERE rules.id IS NOT NULL
>> ...to improve performance, or at least to test that theory.
>>
>> --
>> Justin
>>
>
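Justin's caching point can be illustrated outside PostgreSQL. A toy Python sketch (the data is hypothetical, and `re.compile` stands in for PostgreSQL's RE_compile_and_cache()): with the rules on the outer loop, each pattern is compiled once and the compiled form is reused across every item, whereas with the rules on the inner loop the pattern changes on every iteration and a small compile cache is defeated once the rule set outgrows it.

```python
import re

# Hypothetical data mirroring the thread: many items, a few rule patterns.
items = [f"item{n}" for n in range(1000)]
rule_patterns = [r"item1\d\d", r"item9$", r"^item42$"]

def match_rules_outer(patterns, names):
    """Rules on the outer loop: compile each pattern once, scan all items."""
    pairs = []
    for i, pat in enumerate(patterns):
        rx = re.compile(pat)           # compiled once per rule
        for name in names:
            if rx.search(name):
                pairs.append((i, name))
    return pairs

def match_rules_inner(patterns, names):
    """Rules on the inner loop: the pattern changes every iteration, so
    without an (effective) cache each search would recompile its regex."""
    pairs = []
    for name in names:
        for i, pat in enumerate(patterns):
            if re.search(pat, name):
                pairs.append((i, name))
    return pairs
```

Both orderings produce the same (rule, item) pairs; the difference is only how often each pattern must be recompiled, which is the thread's theory for the superlinear slowdown as the rule table grows.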
