From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: dmigowski(at)ikoffice(dot)de, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18205: Performance regression with NOT NULL checks.
Date: 2023-11-19 21:12:03
Message-ID: 20231119211203.o5qczelolmbczawt@awork3.anarazel.de
Lists: pgsql-bugs

Hi,
On 2023-11-19 14:08:05 -0500, Tom Lane wrote:
> So that results in not having to deconstruct most of the tuple,
> whereas in the new code we do have to, thanks to b8d7f053c's
> decision to batch all the variable-value-extraction work.
Yea, I think we were aware at the time that this does have downsides - it's
just that the worst-case behaviour of *not* batching is much bigger than the
worst-case downside of batching.
> This is a pretty narrow corner case: it would only matter if the
> column you're testing for null-ness is far past any other column
> the query needs to fetch. So I'm not sure that it's worth doing
> anything about. You could imagine special processing for
> NullTest-on-a-simple-Var: exclude the Var from the ones that we
> extract normally and instead compile it into some sort of "is column k
> NULL" test on the HeapTuple.
We actually did add fastpaths for a few similar cases: ExecJustInnerVar() etc
will just use slot_getattr(). These can be used when the result is just a
single variable. However, the goal there was more to avoid "interpreter
startup" overhead, rather than evaluation overhead.
Those fastpaths don't match in this case though; the generated "program" is:
EEOP_SCAN_FETCHSOME
EEOP_SCAN_VAR
EEOP_NULLTEST_ISNULL
EEOP_QUAL
EEOP_DONE
which matches none of the fastpaths we recognize. In fact, I don't think any
of the fastpaths currently can be applied to quals, which is somewhat sad :(
Right now the "pattern matching" happens in ExecReadyInterpretedExpr(). For
the interpretation-specific portions that makes sense, but the transformation
from EEOP_SCAN_FETCHSOME to just fetching a single column imo should be moved
into execExpr.c, by making expr_setup_walker() a bit smarter. I think that
might help in a fair number of cases - quals on a single column aren't
exactly rare.
In this specific case we also could elide the EEOP_QUAL: the NULLTEST cannot
return a NULL, so we don't need any qual-specific behaviour. For this case
alone it'd probably not be worth tracking whether something can be NULL, but
it could be worthwhile if we made it more generic, e.g. eliding strictness
checks before function calls could be a nice win.
> But that seems messy, and it could be a significant pessimization for
> storage methods that aren't like heapam.
My concern is more that they're pretty narrow - they apply only if a single
column is referenced. If we just need attributes 100 and 101, it's still
expensive to deform leading columns. To address that, I think we could benefit
from a tuple deforming routine that's more aware of which columns are
needed. Even just omitting storing values in tts_values/nulls that aren't
needed yielded decent wins when I experimented in the past. The hard part of
course is to figure out when to use a "selective" deforming routine, which
will have *worse* performance if most columns are used, and when to just
deform everything up to natts.
I think the current deforming logic is actually optimized towards heapam to a
problematic degree - deforming leading columns in a columnar database hurts
really badly.
> On the whole I'm inclined to say "sorry, that's the price of
> progress".
Agreed, we can't really go back to deforming individual columns more widely -
that can get *really* expensive. But I think we can do plenty to improve the
performance of this specific case, both by improving the generated expression
program and:
> But it is a bit sad that a patch intended to be a
> performance win shows a loss this big on a (presumably) real-world
> case.
I think we might be able to micro-optimize that code a bit to reduce the
decrease in performance. Neither gcc nor clang is able to optimize the cost
of the null-check meaningfully, and there are a number of dependent
instructions:
    if (hasnulls && att_isnull(attnum, bp))
    {
        values[attnum] = (Datum) 0;
        isnull[attnum] = true;
        slow = true;        /* can't use attcacheoff anymore */
        continue;
    }

static inline bool
att_isnull(int ATT, const bits8 *BITS)
{
    return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
}
What if we instead load 8 bytes of the bitmap into a uint64 before entering
the loop, and shift an "index" mask into the bitmap by one on each iteration
through the loop? Then we'd only need to access the bitmap every 64
attributes. I think we could also get rid of the hasnulls check in each
iteration that way, by checking hasnulls only whenever we load a portion of
the bitmap into the local variable, and loading "no-nulls" into it when
!hasnulls.
Greetings,
Andres Freund