Re: Shared detoast Datum proposal

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikita Malakhov <hukutoc(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Shared detoast Datum proposal
Date: 2024-05-23 00:27:41
Message-ID: 87ikz5i9m1.fsf@163.com
Lists: pgsql-hackers


Robert Haas <robertmhaas(at)gmail(dot)com> writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
>> One more things I want to highlight it "syscache" is used for metadata
>> and *detoast cache* is used for user data. user data is more
>> likely bigger than metadata, so cache size control (hence eviction logic
>> run more often) is more necessary in this case. The first graph in [1]
>> (including the quota text) highlight this idea.
>
> Yes.
>
>> I think that is same idea as design a.
>
> Sounds similar.
>

This is really great to know!

>> I think the key points includes:
>>
>> 1. Where to put the *detoast value* so that ExprEval system can find out
>> it cheaply. The soluation in A slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>> The soluation in A is:
>>
>> - using a lazy created MemoryContext in slot.
>> - let the detoast system detoast the datum into the designed
>> MemoryContext.
>>
>> 3. How to let Expr evalution run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from slot->tts_values[*],
no matter what kind of TupleTableSlot it is. So if we can put the
"detoast datum" back into the original tts_values, then nothing needs
to be changed.

Aside from efficiency considerations, correctness and design soundness
are also important. So what I think now is:

- tts_values[*] holds the Datums of the slot, and it has its operations
  like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. None of these characteristics depend on
  what kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
  the above semantics, and the ExprEval engine just gets a detoasted
  version, so it doesn't need to detoast it again.

- The key point is memory management and efficiency. For now I think
  every place where slot->tts_nvalid is set to 0 means the
  tts_values[*] are no longer valid, so that is where we should
  release the memory for the "detoast datum". All the other places like
  ExecMaterializeSlot or ExecCopySlot need to take the "detoast datum"
  into account as well.

> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree about the overhead, so I'd prefer to first find out
whether my current strategy has a fundamental issue.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.

I noticed the "if", and it is great to know that :), especially for the
efficiency & memory usage control purposes.

Another big challenge is how to decide whether a Var needs this
pre-detoasting strategy; we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for the CP_SMALL_TLIST (createplan.c) purpose.
- Checking at ExprEval run time whether a Var is toastable is an
  overhead, so it must be decided at ExecInitExpr time.

The code complexity caused by the above 2 factors makes people
(including me) unhappy.

===
Yesterday I did some worst-case testing of the current strategy, and
the first case showed me a ~1% *unexpected* performance regression. I
tried to track it down with perf record/report, with no luck; that's
the reason why I didn't send a complete post yesterday.

For now, since we are talking about the high-level design, I think it
is OK to post the improved design document, the incomplete worst-case
testing data, and my plan.

Current Design
--------------

The high-level design is to let createplan.c and setref.c decide which
Vars can use this feature, and then have the executor save the detoast
datum back into slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads Datums from tts_values[*], so no
  extra work needs to be done there.
- Reuse the lifespan of the TupleTableSlot system to manage memory. It
  is natural to think of the detoast datum as a tts_value that just
  happens to be in a detoasted format. Since we already have a clear
  lifespan for TupleTableSlot, with ExecClearTuple, ExecCopySlot etc.,
  it is easy to reuse them for managing the detoast datum's memory.
- The existing projection method copies the detoasted Datum (a
  pointer-sized value) automatically to the next node's slot while
  keeping the ownership unchanged, so only the slot where the detoast
  really happened takes charge of its lifespan.

Assuming the question of which Vars should use this feature has already
been decided in createplan.c and setref.c, the following strategy is
used to reduce the run-time overhead.

1. Putting the detoast datum into slot->tts_values[*], so the ExprEval
system doesn't need any extra effort to find it.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        if (unlikely(slot->tts_data_mctx == NULL))
            slot->tts_data_mctx = GenerationContextCreate(slot->tts_mcxt,
                                                          "tts_value_ctx",
                                                          ALLOCSET_START_SMALL_SIZES);

        /* Save the detoast value into the slot's dedicated MemoryContext. */
        slot->tts_values[attnum] =
            PointerGetDatum(detoast_attr_ext((struct varlena *) slot->tts_values[attnum],
                                             slot->tts_data_mctx));
    }
}

2. Using a dedicated, lazily created memory context under the
TupleTableSlot so that the memory can be released effectively.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    if (slot->tts_data_mctx)
        MemoryContextResetOnly(slot->tts_data_mctx);
}

3. Control the memory usage by releasing it whenever the
slot->tts_values[*] are deemed invalid, so the lifespan is likely
short.

4. To reduce the run-time overhead of checking whether a Var should use
the pre-detoast feature, 3 new steps are introduced:
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST, so that all other Vars are totally
unaffected.

Now we come to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guidelines are:

1. A detoast would be needed for the given expression under the
   existing logic anyway.
2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoast datum back into tts_values[*], which makes the tuple bigger,
   doing this blindly would be harmful to the ORDER / HASH style nodes.

A high level data flow is:

1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
   information caused by SORT/HASH style nodes, and save it to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of each
   expression, so it is a good time to track the attribute numbers and
   see whether each Var is directly or indirectly accessed. Usually an
   indirectly accessed Var means a detoast will happen, for example in
   an expression like a > 3. However, some known expressions are
   ignored, for example: NullTest and pg_column_compression, which need
   the raw datum, and starts_with/substring, which need a slice
   detoasting. Currently there is some hard-coded logic here; we may
   need a pg_proc.detoasting_requirement flag to make this generic. The
   output is {Scan|Join}.xxx_reference_attrs.

Note that I used '_reference_' rather than '_detoast_' here because at
this stage I still don't know whether it is a toastable attribute,
which is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we calculate the final xxx_pre_detoast_attrs
   in ScanState & JoinState, which are passed into the expression
   engine at the ExecInitExprRec stage, and the
   EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps are generated; finally
   everything is connected via ExecSlotDetoastDatum!

Bad case testing of current design:
====================================

- The extra effort in createplan.c & setref.c & execExpr.c & InitPlan:

CREATE TABLE t(a int, b numeric);

q1:
explain select * from t where a > 3;

q2:
set enable_hashjoin to off;
set enable_mergejoin to off;
set enable_nestloop to on;

explain select * from t join t t2 using(a);

q3:
set enable_hashjoin to off;
set enable_mergejoin to on;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

q4:
set enable_hashjoin to on;
set enable_mergejoin to off;
set enable_nestloop to off;
explain select * from t join t t2 using(a);

pgbench -f x.sql -n postgres -M simple -T10

| query ID | patched (ms) | master (ms) | regression(%) |
|----------+-------------------------------------+-------------------------------------+---------------|
| 1 | {0.088, 0.088, 0.088, 0.087, 0.089} | {0.087, 0.086, 0.086, 0.087, 0.087} | 1.14% |
| 2 | not-done-yet | | |
| 3 | not-done-yet | | |
| 4 | not-done-yet | | |

- The extra effort of ExecInterpExpr

insert into t select i, i FROM generate_series(1, 100000) i;

SELECT count(b) from t WHERE b > 0;

In this query, the extra execution code is run but it doesn't buy anything.

| master | patched | regression |
|--------+---------+------------|
| | | |

What I am planning now is:
- Gather more feedback on the overall strategy.
- Figure out the 1% unexpected regression. I don't have a clear
  direction for now; my current approach is analyzing the perf report,
  but I can't find the root cause, and reading the code hasn't found it
  either.
- Simplify the "planner part" that decides which Vars should use this
  strategy. I don't have a clear direction here either.
- Test and improve more worst cases.

Attached is a version that can be applied and tested against current
master (e87e73245).

--
Best Regards
Andy Fan

Attachment Content-Type Size
v10-0001-shared-detoast-datum.patch text/x-diff 95.7 KB
