pgsql: Fix regex engine to suppress useless concatenation sub-REs.

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix regex engine to suppress useless concatenation sub-REs.
Date: 2021-02-21 00:26:59
Message-ID: E1lDcaR-0004iY-Hb@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix regex engine to suppress useless concatenation sub-REs.

The comment for parsebranch() claims that it avoids generating
unnecessary concatenation nodes in the "subre" tree, but it missed
some significant cases. Once we've decided that a given atom is
"messy" and can't be bundled with the preceding atom(s) of the
current regex branch, parseqatom() always generated two new concat
nodes, one to concat the messy atom to what follows it in the branch,
and an upper node to concatenate the preceding part of the branch
to that one. But one or both of these could be unnecessary, if the
messy atom is the first, last, or only one in the branch. Improve
the code to suppress such useless concat nodes, along with the
no-op child nodes representing empty chunks of a branch.

Reducing the number of subre tree nodes offers significant savings
not only at execution but during compilation, because each subre node
has its own NFA that has to be separately optimized. (Maybe someday
we'll figure out how to share the optimization work across multiple
tree nodes, but it doesn't look easy.) Eliminating upper tree nodes
is especially useful because they tend to have larger NFAs.

This is part of a patch series that in total reduces the regex engine's
runtime by about a factor of four on a large corpus of real-world regexes.

Patch by me, reviewed by Joel Jacobson

Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/cebc1d34e5207c37871708f91be65dd839760b5f

Modified Files
--------------
src/backend/regex/regcomp.c | 83 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 7 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2021-02-22 00:46:55 pgsql: Fix invalid array access in trgm_regexp.c.
Previous Message Michael Paquier 2021-02-20 01:31:58 pgsql: doc: Mention that partitions_{done,total} is 0 for REINDEX progr