From: | "Joel Jacobson" <joel(at)compiler(dot)org> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Some regular-expression performance hacking |
Date: | 2021-02-15 08:21:21 |
Message-ID: | 99f6cbe0-6e58-4e41-add0-977271a5cf84@www.fastmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034.611 ms (01:35.035) 69% better than HEAD
>
>Since I don't have all the tables used in your query, I can't
>try to reproduce your results exactly. I suspect the reason
>I'm getting a better percentage improvement than you did is
>that the joining/grouping/ordering involved in your query
>creates a higher baseline query cost.
Mind blowing speed-up, wow!
I've tested all 4 patches successfully.
To eliminate the baseline cost of the join,
I first created this table:
CREATE TABLE performance_test AS
SELECT
subjects.subject,
patterns.pattern,
tests.is_match,
tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;
Then I ran this query:
\timing
SELECT
is_match <> (subject ~ pattern),
captured IS DISTINCT FROM regexp_match(subject, pattern),
COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;
All patches gave the same result:
?column? | ?column? | count
----------+----------+---------
f | f | 1448212
(1 row)
I.e., no detected semantic differences.
Timing differences:
HEAD 570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD
I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 running macOS Big Sur 11.1:
SELECT version();
version
----------------------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)
My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.
PostgreSQL was compiled with just ./configure, no parameters, and the only non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'
Amazing work!
I hope to have a new dataset ready soon with regex flags for applied subjects as well.
/Joel
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2021-02-15 08:27:21 | Re: ERROR: invalid spinlock number: 0 |
Previous Message | Greg Nancarrow | 2021-02-15 07:39:36 | Re: Parallel INSERT (INTO ... SELECT ...) |