From: | Andy Fan <zhihuifan1213(at)163(dot)com> |
---|---|
To: | Andy Fan <zhihuifan1213(at)163(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Shared detoast Datum proposal |
Date: | 2024-02-26 02:04:14 |
Message-ID: | 87jzms0xza.fsf@163.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
>>
>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code, I often not notice a README exists next to it).
>
> Great, I'd try this from tomorrow.
I have made it. Currently I choose README because this feature changed
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high level design to any of them looks inappropriate. So the high level
design is here and detailed design for each steps is in the comments
around the code. Hope this is helpful!
The problem:
-------------
In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree it
immediately or leave it for ResetExprContext. Arguments for which one to
use exists sometimes. More serious problem is detoasting is expensive,
especially for the data types like jsonb or array, which the value might
be very huge. In the blow example, the detoasting happens twice.
SELECT jb_col->'a', jb_col->'b' FROM t;
Within the shared-detoast-datum, we just need to detoast once for each
tuple, and discard it immediately when the tuple is not needed any
more. FWIW this issue may existing for small numeric, text as well
because of SHORT_TOAST feature where the toast's len using 1 byte rather
than 4 bytes.
Current Design
--------------
The high level design is let createplan.c and setref.c decide which
Vars can use this feature, and then the executor save the detoast
datum back slot->to tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons includes:
- The existing expression engine read datum from tts_values[*], no any
extra work need to be done.
- Reuse the lifespan of TupleTableSlot system to manage memory. It
is natural to think the detoast datum is a tts_value just that it is
in a detoast format. Since we have a clear lifespan for TupleTableSlot
already, like ExecClearTuple, ExecCopySlot et. We are easy to reuse
them for managing the datoast datum's memory.
- The existing projection method can copy the datoasted datum (int64)
automatically to the next node's slot, but keeping the ownership
unchanged, so only the slot where the detoast really happen take the
charge of it's lifespan.
So the final data change is adding the below field into TubleTableSlot.
typedef struct TupleTableSlot
{
..
/*
* The attributes whose values are the detoasted version in tts_values[*],
* if so these memory needs some extra clean-up. These memory can't be put
* into ecxt_per_tuple_memory since many of them needs a longer life
* span.
*
* These memory is put into TupleTableSlot.tts_mcxt and be clear
* whenever the tts_values[*] is invalidated.
*/
Bitmapset *pre_detoast_attrs;
};
Assuming which Var should use this feature has been decided in
createplan.c and setref.c already. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST as used. During the evaluating these
steps, the below code is used.
static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
if (!slot->tts_isnull[attnum] &&
VARATT_IS_EXTENDED(slot->tts_values[attnum]))
{
Datum oldDatum;
MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
oldDatum = slot->tts_values[attnum];
slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
(struct varlena *) oldDatum));
Assert(slot->tts_nvalid > attnum);
Assert(oldDatum != slot->tts_values[attnum]);
slot->pre_detoasted_attrs= bms_add_member(slot->pre_detoasted_attrs, attnum);
MemoryContextSwitchTo(old);
}
}
Since I don't want to the run-time extra check to see if is a detoast
should happen, so introducing 3 new steps.
When to free the detoast datum? It depends on when the slot's
tts_values[*] is invalidated, ExecClearTuple is the clear one, but any
TupleTableSlotOps which set the tts_nvalid = 0 tells us no one will use
the datum in tts_values[*] so it is time to release them based on
slot.pre_detoast_attrs as well.
Now comes to the createplan.c/setref.c part, which decides which Vars
should use the shared detoast feature. The guideline of this is:
1. It needs a detoast for a given expression.
2. It should not breaks the CP_SMALL_TLIST design. Since we saved the
detoast datum back to tts_values[*], which make tuple bigger. if we
do this blindly, it would be harmful to the ORDER / HASH style nodes.
A high level data flow is:
1. at the createplan.c, we walk the plan tree go gather the
CP_SMALL_TLIST because of SORT/HASH style nodes, information and save
it to Plan.forbid_pre_detoast_vars via the function
set_plan_forbid_pre_detoast_vars_recurse.
2. at the setrefs.c, fix_{scan|join}_expr will recurse to Var for each
expression, so it is a good time to track the attribute number and
see if the Var is directly or indirectly accessed. Usually the
indirectly access a Var means a detoast would happens, for
example an expression like a > 3. However some known expressions like
VAR is NULL; is ignored. The output is {Scan|Join}.xxx_reference_attrs;
As a result, the final result is added into the plan node of Scan and
Join.
typedef struct Scan
{
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *reference_attrs;
} Scan;
typedef struct Join
{
..
/*
* Records of var's varattno - 1 where the Var is accessed indirectly by
* any expression, like a > 3. However a IS [NOT] NULL is not included
* since it doesn't access the tts_values[*] at all.
*
* This is a essential information to figure out which attrs should use
* the pre-detoast-attrs logic.
*/
Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;
} Join;
Note that here I used '_reference_' rather than '_detoast_' is because
at this part, I still don't know if it is a toastable attrbiute, which
is known at the MakeTupleTableSlot stage.
3. At the InitPlan Stage, we calculate the final xxx_pre_detoast_attrs
in ScanState & JoinState, which will be passed into expression
engine in the ExecInitExprRec stage and EEOP_{INNER|OUTER|SCAN}
_VAR_TOAST steps are generated finally then everything is connected
with ExecSlotDetoastDatum!
Testing
-------
Case 1:
create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;
cat 1.sql
select * from t where a > 0;
In this test, the current master run detoast twice for each datum. one
in numeric_gt, one in numeric_out. this feature makes the detoast once.
pgbench -f 1.sql -n postgres -T 10 -M prepared
master: 30.218 ms
patched(Bitmapset): 30.881ms
Then we can see the perf report as below:
- 7.34% 0.00% postgres postgres [.] ExecSlotDetoastDatum (inlined)
- ExecSlotDetoastDatum (inlined)
- 3.47% bms_add_member
- 3.06% bms_make_singleton (inlined)
- palloc0
1.30% AllocSetAlloc
- 5.99% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
2.64% bms_next_member
1.17% bms_del_members
0.94% AllocSetFree
One of the reasons is because Bitmapset will deallocate its memory when
all the bits are deleted due to commit 00b41463c, then we have to
allocate memory at the next time when adding a member to it. This kind
of action is executed 100000 times in the above workload.
Then I introduce bitset data struct (commit 0002) which is pretty like
the Bitmapset, but it would not deallocate the memory when all the bits
is unset. and use it in this feature (commit 0003). Then the result
became to: 28.715ms
- 5.22% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
- ExecFreePreDetoastDatum (inlined)
- 2.82% bitset_next_member
1.69% bms_next_member_internal (inlined)
0.95% bitset_next_member
0.66% AllocSetFree
Here we can see the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. if we put the detoast datum into
a dedicated memory context, then we can save the cost of
bitset_next_member since can discard all the memory in once and use
MemoryContextReset instead of AllocSetFree (commit 0004). then the
result became to 27.840ms.
So the final result for case 1:
master: 30.218 ms
patched(Bitmapset): 30.881ms
patched(bitset): 28.715ms
latency average(bitset + tts_value_mctx) = 27.840 ms
Big jsonbs test:
create table b(big jsonb);
insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;
explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;
master: 702.224 ms
patched: 133.306 ms
Memory usage test:
I run the workload of tpch scale 10 on against both master and patched
versions, the memory usage looks stable.
In progress work:
I'm still running tpc-h scale 100 to see if anything interesting
finding, that is in progress. As for the scale 10:
master: 55s
patched: 56s
The reason is q9 plan changed a bit, the real reason needs some more
time. Since this patch doesn't impact on the best path generation, so it
should not reasonble for me.
--
Best Regards
Andy Fan
Attachment | Content-Type | Size |
---|---|---|
v8-0001-Shared-detoast-feature.patch | text/x-diff | 74.0 KB |
v8-0002-Introduce-a-Bitset-data-struct.patch | text/x-diff | 16.5 KB |
v8-0003-Use-bitset-instead-of-Bitmapset-for-pre_detoast_a.patch | text/x-diff | 4.7 KB |
v8-0004-Adding-tts_value_mctx-to-TupleTableSlot.patch | text/x-diff | 8.7 KB |
v8-0005-Provide-a-building-option-for-review-purpose.patch | text/x-diff | 2.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Zhijie Hou (Fujitsu) | 2024-02-26 02:18:58 | RE: Synchronizing slots from primary to standby |
Previous Message | Michael Paquier | 2024-02-26 01:08:40 | Re: RangeTblEntry jumble omissions |