Re: Shared detoast Datum proposal

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Shared detoast Datum proposal
Date: 2024-02-21 12:39:29
Message-ID: 7f4c8673-4c5d-4dbf-b8e5-0ec11cd2fba0@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2/20/24 19:38, Andy Fan wrote:
>
> ...
>
>> I think it should be done the other
>> way around, i.e. the patch should introduce the main feature first
>> (using the traditional Bitmapset), and then add Bitset on top of that.
>> That way we could easily measure the impact and see if it's useful.
>
> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
> indicate it is too expensive. and after talk with David at [2], I
> introduced bitset and use it here. the test case I used comes from [1].
> IRCC, there were 5% performance difference because of this.
>
> create table w(a int, b numeric);
> insert into w select i, i from generate_series(1, 1000000)i;
> select b from w where b > 0;
>
> To reproduce the difference, we can replace the bitset_clear() with
>
> bitset_free(slot->pre_detoasted_attrs);
> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
>
> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
>

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

>
>> On the whole, my biggest concern is memory usage & leaks. It's not
>> difficult to already have problems with large detoasted values, and if
>> we start keeping more of them, that may get worse. Or at least that's my
>> intuition - it can't really get better by keeping the values longer, right?
>>
>> The other thing is the risk of leaks (in the sense of keeping detoasted
>> values longer than expected). I see the values are allocated in
>> tts_mcxt, and maybe that's the right solution - not sure.
>
> about the memory usage, first it is kept as the same lifesplan as the
> tts_values[*] which can be released pretty quickly, only if the certain
> values of the tuples is not needed. it is true that we keep the detoast
> version longer than before, but that's something we have to pay I
> think.
>
> Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
> we forget to release the memory when the tts_values[*] is invalidated
> somehow, the memory will be leaked until the end of executor. I think
> that will be enough to cause an issue. Currently besides I release such
> memory at the ExecClearTuple, I also relase such memory whenever we set
> tts_nvalid to 0, the theory used here is:
>
> /*
> * tts_values is treated invalidated since tts_nvalid is set to 0, so
> * let's free the pre-detoast datum.
> */
> ExecFreePreDetoastDatum(slot);
>
> I will do more test on the memory leak stuff, since there are so many
> operation aginst slot like ExecCopySlot etc, I don't know how to test it
> fully. the method in my mind now is use TPCH with 10GB data size, and
> monitor the query runtime memory usage.
>

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

>
>> FWIW while looking at the patch, I couldn't help but to think about
>> expanded datums. There's similarity in what these two features do - keep
>> detoasted values for a while, so that we don't need to do the expensive
>> processing if we access them repeatedly.
>
> Could you provide some keyword or function names for the expanded datum
> here, I probably miss this.
>

see src/include/utils/expandeddatum.h

>> Of course, expanded datums are
>> not meant to be long-lived, while "shared detoasted values" are meant to
>> exist (potentially) for the query duration.
>
> hmm, acutally the "shared detoast value" just live in the
> TupleTableSlot->tts_values[*], rather than the whole query duration. The
> simple case is:
>
> SELECT * FROM t WHERE a_text LIKE 'abc%';
>
> when we scan to the next tuple, the detoast value for the previous tuple
> will be relased.
>

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

>> But maybe there's something
>> we could learn from expanded datums? For example how the varlena pointer
>> is leveraged to point to the expanded object.
>
> maybe. currently I just use detoast_attr to get the desired version. I'm
> pleasure if we have more effective way.
>
> if (!slot->tts_isnull[attnum] &&
> VARATT_IS_EXTENDED(slot->tts_values[attnum]))
> {
> Datum oldDatum;
> MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
>
> oldDatum = slot->tts_values[attnum];
> slot->tts_values[attnum] = PointerGetDatum(detoast_attr( (struct varlena *) oldDatum));
> Assert(slot->tts_nvalid > attnum);
> Assert(oldDatum != slot->tts_values[attnum]);
> bitset_add_member(slot->pre_detoasted_attrs, attnum);
> MemoryContextSwitchTo(old);
> }
>

Right. FWIW I'm not sure if expanded objects are similar enough to be a
useful inspiration.

Unrelated question - could this actually end up being too aggressive?
That is, could it detoast attributes that we end up not needing? For
example, what if the tuple gets eliminated for some reason (e.g. because
of a restriction on the table, or not having a match in a join)? Won't
we detoast the tuple only to throw it away?

>
>> For example, what if we add a "TOAST cache" as a query-level hash table,
>> and modify the detoasting to first check the hash table (with the TOAST
>> pointer as a key)? It'd be fairly trivial to enforce a memory limit on
>> the hash table, evict values from it, etc. And it wouldn't require any
>> of the createplan/setrefs changes, I think ...
>
> Hmm, I am not sure I understand you correctly at this part. In the
> current patch, to avoid the run-time (ExecExprInterp) check if we
> should detoast and save the datum, I defined 3 extra steps so that
> the *extra check itself* is not needed for unnecessary attributes.
> for example an datum for int or a detoast datum should not be saved back
> to tts_values[*] due to the small_tlist reason. However these steps can
> be generated is based on the output of createplan/setrefs changes. take
> the INNER_VAR for example:
>
> In ExecInitExprRec:
>
> switch (variable->varno)
> {
> case INNER_VAR:
> if (is_join_plan(plan) &&
> bms_is_member(attnum,
> ((JoinState *) state->parent)->inner_pre_detoast_attrs))
> {
> scratch.opcode = EEOP_INNER_VAR_TOAST;
> }
> else
> {
> scratch.opcode = EEOP_INNER_VAR;
> }
> }
>
> The data workflow is:
>
> 1. set_plan_forbid_pre_detoast_vars_recurse (in the createplan.c)
> decides which Vars should *not* be pre_detoasted because of small_tlist
> reason and record it in Plan.forbid_pre_detoast_vars.
>
> 2. fix_scan_expr (in the setrefs.c) tracks which Vars should be
> detoasted for the specific plan node and record them in it. Currently
> only Scan and Join nodes support this feature.
>
> typedef struct Scan
> {
> ...
> /*
> * Records of var's varattno - 1 where the Var is accessed indirectly by
> * any expression, like a > 3. However a IS [NOT] NULL is not included
> * since it doesn't access the tts_values[*] at all.
> *
> * This is a essential information to figure out which attrs should use
> * the pre-detoast-attrs logic.
> */
> Bitmapset *reference_attrs;
> } Scan;
>
> typedef struct Join
> {
> ..
> /*
> * Records of var's varattno - 1 where the Var is accessed indirectly by
> * any expression, like a > 3. However a IS [NOT] NULL is not included
> * since it doesn't access the tts_values[*] at all.
> *
> * This is a essential information to figure out which attrs should use
> * the pre-detoast-attrs logic.
> */
> Bitmapset *outer_reference_attrs;
> Bitmapset *inner_reference_attrs;
> } Join;
>

Is it actually necessary to add new fields to these nodes? Also, the
names are not particularly descriptive of the purpose - it'd be better
to have "detoast" in the name, instead of generic "reference".

> 3. during the InitPlan stage, we maintain the
> PlanState.xxx_pre_detoast_attrs and generated different StepOp for them.
>
> 4. At the ExecExprInterp stage, only the new StepOp do the extra check
> to see if the detoast should happen. Other steps doesn't need this
> check at all.
>
> If we avoid the createplan/setref.c changes, probabaly some unrelated
> StepOp needs the extra check as well?
>
> When I worked with the UniqueKey feature, I maintained a
> UniqueKey.README to summaried all the dicussed topics in threads, the
> README is designed to save the effort for more reviewer, I think I
> should apply the same logic for this feature.
>

Good idea. Either that (a separate README), or a comment in a header of
some suitable .c/.h file (I prefer that, because that's kinda obvious
when reading the code, I often not notice a README exists next to it).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2024-02-21 13:20:38 Re: Shared detoast Datum proposal
Previous Message Bertrand Drouvot 2024-02-21 12:25:25 Re: Introduce XID age and inactive timeout based replication slot invalidation