From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare() |
Date: | 2020-01-26 22:49:06 |
Message-ID: | CAH2-Wzngz5MrkiTaZNny0GzfTxNQE+QWPPaO-C390BgruhgjEw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Andres and I discussed bottlenecks in the nbtree code during the
recent PgDay SF community event. Andres observed that the call to
BTreeTupleGetNAtts() in _bt_compare() can become somewhat of a
bottleneck. I came up with the very simple attached POC-quality
patches, which Andres tested and profiled with his original complaint
in mind yesterday. He reported increased throughput on a memory
bound simple pgbench SELECT-only workload.
He reported that the problem went away with the patches applied. The
following pgbench SELECT-only result was sent to me privately:
before:
single: tps = 30300.202363 (excluding connections establishing)
all cores: tps = 1026853.447047 (excluding connections establishing)
after:
single: tps = 31120.452919 (excluding connections establishing)
all cores: tps = 1032376.361006 (excluding connections establishing)
(Actually, he tested something slightly different -- he inlined
_bt_compare() in his own way instead of using my 0002-*, and didn't
use the 0003-* optimization at all.)
Apparently this was a large multi-socket machine. Those are hard to
come by.
The main idea here is to make _bt_compare() delay
calling BTreeTupleGetNAtts() until the point after the first attribute
turns out to be equal to the _bt_compare() caller's insertion scankey.
Many or most calls to _bt_compare() won't even need to call
BTreeTupleGetNAtts().
This relies on the assumption that any tuple must have at least one
untruncated suffix column in the _bt_compare() loop. It doesn't matter
whether it's a pivot or non-pivot tuple -- the representation of the
first column will be exactly the same.
The negative infinity item on an internal page always has zero
attributes, which might seem like a snag. However, we already treat
that as a special case within _bt_compare(), for historical reasons
(pg_upgrade'd indexes won't have the number of attributes explicitly
set to zero in some cases).
Another separate though related idea in 0003-* is to remove the
negative infinity check. It goes from _bt_compare() to _bt_binsrch(),
since it isn't something that we need to consider more than once per
page -- and only on internal pages. That way, _bt_compare() doesn't
have to look at the page special area to check if it's a leaf page or
an internal page at all. I haven't really profiled this just yet. This is
one of those patches where 95%+ of the work is profiling and
benchmarking.
Andres and I both agree that there is a lot more work to be done in
this area, but that will be a major undertaking. I am quite keen on
the idea of repurposing the redundant-to-nbtree ItemId.lp_len field to
store an abbreviated key. Making that work well is a considerable
undertaking, since you need to use prefix compression to get a high
entropy abbreviated key. It would probably take me the best part of a
whole release cycle to write such a patch. The attached patches get
us a relatively easy win in the short term, though.
Thoughts?
--
Peter Geoghegan
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Avoid-calling-BTreeTupleGetNAtts-in-_bt_compare.patch | application/octet-stream | 2.2 KB |
v1-0003-Remove-negative-infinity-check-from-_bt_compare.patch | application/octet-stream | 2.8 KB |
v1-0002-Inline-_bt_compare.patch | application/octet-stream | 2.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2020-01-26 22:49:10 | Re: Parallel leader process info in EXPLAIN |
Previous Message | Andres Freund | 2020-01-26 22:42:54 | Re: EXPLAIN's handling of output-a-field-or-not decisions |