Re: Some regular-expression performance hacking

From: "Joel Jacobson" <joel(at)compiler(dot)org>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Some regular-expression performance hacking
Date: 2021-02-18 10:30:09
Message-ID: 5deabab3-a335-4b57-a046-8a9ad8f2f142@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 17, 2021, at 22:00, Tom Lane wrote:
> Attached is an updated patch series; it's rebased over 4e703d671
> which took care of some not-really-related fixes, and I made a
> pass of cleanup and comment improvements. I think this is pretty
> much ready to commit, unless you want to do more testing or
> code-reading.

I've produced a new dataset which now also includes the regex flags (if any) used for each subject applied to a pattern.

The new dataset contains 318364 patterns and 4474520 subjects.
(The old one had 235204 patterns and 1489489 subjects.)

I've tested the new dataset against PostgreSQL 10.16, 11.11, 12.6, 13.2, HEAD (4e703d671) and HEAD+patches.

I based the comparisons on the subjects that didn't cause an error on 13.2:

CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
patterns.flags,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
WHERE tests.error IS NULL
;

I then measured the query below for each PostgreSQL version:

\timing
SELECT version();
SELECT
is_match <> (subject ~ pattern) AS is_match_diff,
captured IS DISTINCT FROM regexp_match(subject, pattern, flags) AS captured_diff,
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All versions produces the same result:

is_match_diff | captured_diff | count
---------------+---------------+---------
f | f | 3254769
(1 row)

Good! Not a single case that differs of over 3 million different regex pattern/subject combinations,
between five major PostgreSQL versions! That's a very stable regex engine.

To get a feeling for the standard deviation of the timings,
I executed the same query above three times for each PostgreSQL version:

PostgreSQL 10.16 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
Time: 795674.830 ms (13:15.675)
Time: 794249.704 ms (13:14.250)
Time: 771036.707 ms (12:51.037)

PostgreSQL 11.11 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 765466.191 ms (12:45.466)
Time: 787135.316 ms (13:07.135)
Time: 779582.635 ms (12:59.583)

PostgreSQL 12.6 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
Time: 785500.516 ms (13:05.501)
Time: 784511.591 ms (13:04.512)
Time: 786727.973 ms (13:06.728)

PostgreSQL 13.2 on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
Time: 758514.703 ms (12:38.515)
Time: 755883.600 ms (12:35.884)
Time: 746522.107 ms (12:26.522)

PostgreSQL 14devel on x86_64-apple-darwin20.3.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
HEAD (4e703d671)
Time: 519620.646 ms (08:39.621)
Time: 518998.366 ms (08:38.998)
Time: 519696.129 ms (08:39.696)

HEAD (4e703d671)+0001+0002+0003+0004+0005
Time: 141290.329 ms (02:21.290)
Time: 141849.709 ms (02:21.850)
Time: 141630.819 ms (02:21.631)

That's a mind-blowing speed-up!

I also ran the more detailed test between 13.2 and HEAD+patches,
that also tests for differences in errors.

Like before, one similar improvement was found,
which previously resulted in an error, but now goes through OK:

SELECT * FROM vdeviations;
-[ RECORD 1 ]----+-------------------------------------------------------------------------------------------------------
pattern | \.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mi ... 100497 chars ... abs\.org|yolasite\.com|za\.net|za\.org)$
flags |
subject | www.aeroexpo.online
count | 1
a_server_version | 13.2
a_duration | 00:00:00.298253
a_is_match |
a_captured |
a_error | invalid regular expression: regular expression is too complex
b_server_version | 14devel
b_duration | 00:00:00.665958
b_is_match | t
b_captured | {online}
b_error |

Very nice.

I've uploaded the new dataset to the same place as before.

The schema for it can be found at https://github.com/truthly/regexes-in-the-wild

If anyone else would like a copy of the 715MB dataset, please let me know.

/Joel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-02-18 10:38:08 Re: 64-bit XIDs in deleted nbtree pages
Previous Message Guillaume Lelarge 2021-02-18 10:13:06 Re: Extensions not dumped when --schema is used