From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Keep elog(ERROR) and ereport(ERROR) calls in the cold path |
Date: | 2020-06-25 01:50:37 |
Message-ID: | CAApHDvrVpasrEzLL2er7p9iwZFZ=Jj6WisePcFeunwfrV0js_A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Back in [1] I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function rasing the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:
if (<error situation check>)
{
<do stuff>;
elog(ERROR, "...");
}
to add an unlikely() and become;
if (unlikely(<error situation check>)
{
<do stuff>;
elog(ERROR, "...");
}
Per the report in [1] I did see some pretty good gains in performance
from doing this. The problem was, that at the time I couldn't figure
out a way to do this without an invasive patch that changed the code
in the thousands of elog/ereport calls.
In the attached, I've come up with a way that works. Basically I've
just added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep
branches which call this function in a cold path. To make the
compiler properly mark just >= ERROR calls as cold, and only when the
elevel is a constant, I modified the ereport_domain macro to do:
if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
errstart_cold(elevel, domain) : \
errstart(elevel, domain)) \
I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and correctly select the correct version of
the function for the job. (I also tried using __builtin_choose_expr()
but GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)
I sampled a few .s files to inspect what code had changed. Looking at
mcxt.s, fmgr.s and xlog.s, the first two of these because I found in
[1] that elogs were being done from those files quite often and xlog.s
because it's pretty big. As far as I could see, GCC correctly moved
all the error reporting stuff where the elevel was a constant and >=
ERROR into the cold path and left the lower-level or non-consts elevel
calls alone.
For clang, I didn't see any changes in the .s files. I suspect that
they might have a few smarts in there and see the
__builtin_unreachable() call and assume the path is cold already based
on that. That was with clang 10. Perhaps older versions are not as
smart.
Benchmarking:
For benchmarking, I've not done a huge amount to test the impacts of
this change. However, I can say that I am seeing some fairly good
improvements. There seems to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.
For TPCH-Q1:
Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms
Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms
Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.
For pgbench -S:
My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:
Master:
drowley(at)amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)
Master + elog_ereport_attribute_cold.patch
drowley(at)amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)
It would be useful if someone with some server-grade Intel hardware
could run a few tests on this. The above results are all from AMD
hardware.
I've attached the patch for this. I'll add it to the July 'fest.
David
Attachment | Content-Type | Size |
---|---|---|
elog_ereport_attribute_cold.patch | application/octet-stream | 3.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2020-06-25 02:07:33 | Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary |
Previous Message | Michael Paquier | 2020-06-25 01:50:26 | Re: should libpq also require TLSv1.2 by default? |