Shared detoast Datum proposal

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>,Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>,Nikita Malakhov <hukutoc(at)gmail(dot)com>
Subject: Shared detoast Datum proposal
Date: 2023-12-27 10:45:25
Message-ID: 87il4jrk1l.fsf@163.com
Lists: pgsql-hackers


Problem:
--------
Toast works well for its claimed purposes. However, the current detoast
infrastructure doesn't share the detoasted datum within a query's
lifespan, which causes inefficiencies like this:

SELECT big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
FROM t;

In the above case, big_jsonb_col will be detoasted 3 times for every
record.

A more common case may be:

CREATE TABLE t(a numeric);
insert into t select i FROM generate_series(1, 10)i;
SELECT * FROM t WHERE a > 3;

Here the datum needs to be detoasted because of VARATT_CAN_MAKE_SHORT,
and it is detoasted twice: once in "a > 3" and once in the target list,
where we need to call numeric_out.

Proposal
--------

When we access a desired toasted column in the EEOP_{SCAN|INNER|OUTER}_VAR
steps, we can detoast it immediately and save the result back to
slot->tts_values. This way, when any other expression seeks the Datum,
it gets the detoasted version. The planner decides which Vars should
use this feature; the executor manages the detoast action and the
memory.

Planner design
---------------

1. This feature only happens at the plan node where the detoast would
happen in the current implementation anyway. For example:

SELECT t1.toast_col, t2.toast_col FROM t1 join t2 USING (toast_col);

toast_col is detoasted only at the Join node's slot, even if we have a
projection on t1 or t2 at the scan node (except for the parameterized
path).

However if

SELECT t1.toast_col, t2.toast_col
FROM t1 join t2
USING (toast_col)
WHERE t1.toast_col > 'a';

the detoast *may* happen at the scan level of t1, since "t1.toast_col >
'a'" accesses the Var within a function call (the '>' operator), which
will cause a detoast. (However, it should not happen if the scan is
under a Sort node; for details see Planner Design section 2.)

On the implementation side, I added "Bitmapset *reference_attrs;" to
the Scan node, which records whether a Var should be accessed in the
pre-detoast way in the expression execution engine. The value is
maintained at the create_plan/set_plan_refs stage.

Two similar fields are added to the Join node:

Bitmapset *outer_reference_attrs;
Bitmapset *inner_reference_attrs;

In both cases I track the depth of the walker/mutator; if the depth is
greater than 1 when we access a Var, 'var->varattno - 1' is added to
the bitmapset. Some special nodes should be ignored; see
increase_level_for_pre_detoast for details.
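
To make this concrete, here is a minimal sketch of such a walker,
assuming the usual expression_tree_walker machinery; the context struct
and function names are illustrative only, not the patch's actual code:

typedef struct
{
    int         level;              /* current depth in the expression tree */
    Bitmapset  *reference_attrs;    /* 0-based attnums to pre-detoast */
} pre_detoast_context;

static bool
pre_detoast_walker(Node *node, pre_detoast_context *ctx)
{
    if (node == NULL)
        return false;

    if (IsA(node, Var))
    {
        Var        *var = (Var *) node;

        /*
         * A Var seen at level > 1 is referenced inside some other
         * expression (e.g. a function call), which would detoast it
         * anyway, so it is a candidate for pre-detoasting.
         */
        if (ctx->level > 1)
            ctx->reference_attrs = bms_add_member(ctx->reference_attrs,
                                                  var->varattno - 1);
        return false;
    }

    /*
     * In the patch, increase_level_for_pre_detoast decides whether a
     * given node type should increase the level; shim nodes such as
     * RelabelType should not.
     */
    ctx->level++;
    expression_tree_walker(node, pre_detoast_walker, (void *) ctx);
    ctx->level--;

    return false;
}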

2. We also need to make sure the detoasted datum does not increase the
work_mem usage for nodes like Sort/Hash; all such nodes can be found
by searching for the 'CP_SMALL_TLIST' flag.

If a node is under a Sort/Hash-like node, we do some extra checking to
see whether a Var is a *direct input* of such a node. If yes, we can't
detoast it in advance; otherwise we know the Var is discarded before it
goes to these nodes, so we can still use the shared detoast feature (a
sketch of this check follows the Query-4 example below).

The simplest cases to show this are:

2.1
Query-1
explain (verbose) select * from t1 where b > 'a';
-- b can be detoasted in advance.

Query-2
explain (verbose) select * from t1 where b > 'a' order by c;
-- b can't be pre-detoasted, since that would make the Sort use more work_mem.

Query-3
explain (verbose) select a, c from t1 where b > 'a' order by c;
-- b can be pre-detoasted, since it is discarded before going to the
Sort node. In this case it doesn't do anything good by itself, but for
a complex case like Query-4 it does.

Query-4
explain (costs off, verbose)
select t3.*
from t1, t2, t3
where t2.c > '999999999999999'
and t2.c = t1.c
and t3.b = t1.b;

                          QUERY PLAN
--------------------------------------------------------------
 Hash Join
   Output: t3.a, t3.b, t3.c
   Hash Cond: (t3.b = t1.b)
   ->  Seq Scan on public.t3
         Output: t3.a, t3.b, t3.c
   ->  Hash
         Output: t1.b
         ->  Nested Loop
               Output: t1.b                              <-- Note here 3
               ->  Seq Scan on public.t2
                     Output: t2.a, t2.b, t2.c
                     Filter: (t2.c > '9999...'::text)    <-- Note here 1
               ->  Index Scan using t1_c_idx on public.t1
                     Output: t1.a, t1.b, t1.c
                     Index Cond: (t1.c = t2.c)           <-- Note here 2
(15 rows)

In this case, the detoasted datum for t2.c can be shared: it benefits
"t2.c = t1.c" and does no harm to the Hash node.
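
As mentioned in section 2, here is a hedged sketch of the *direct
input* check (the function name and traversal are illustrative, not the
patch's actual code): a Var below a CP_SMALL_TLIST node may only be
pre-detoasted if it does not appear bare in that node's input target
list.

static bool
var_is_pre_detoast_safe(Var *var, Plan *small_tlist_plan)
{
    ListCell   *lc;

    foreach(lc, small_tlist_plan->targetlist)
    {
        TargetEntry *tle = (TargetEntry *) lfirst(lc);

        /*
         * If the Var itself survives into the Sort/Hash input tuple,
         * pre-detoasting it would bloat work_mem, so reject it.
         */
        if (IsA(tle->expr, Var) && equal(tle->expr, var))
            return false;
    }

    return true;                /* the Var is discarded before the node */
}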

Execution side
--------------

Once we decide a Var should be pre-detoasted for a given plan node, a
special step named EEOP_{INNER|OUTER|SCAN}_VAR_TOAST is created during
the ExecInitExpr stage. These special steps are introduced to avoid any
impact on the unrelated EEOP_{INNER|OUTER|SCAN}_VAR code path.

slot->tts_values is used to store the detoasted datum so that any other
expression can access it easily.

Because of the above design, the detoasted datum must have the same
lifespan as any other slot->tts_values[*], so the default
ecxt_per_tuple_memory is not suitable for this. In the end I used
slot->tts_mcxt to hold the memory, maintaining these lifecycles in
execTuples.c. To know which datums in slot->tts_values are
pre-detoasted, a "Bitmapset *pre_detoasted_attrs" field is added to the
slot.
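
To illustrate the lifecycle maintenance, a minimal sketch of the
cleanup side, assuming a helper in execTuples.c (the function name is
illustrative): when the slot is cleared, each pre-detoasted datum
allocated in slot->tts_mcxt must be freed and the tracking bitmapset
reset.

static void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    int         attnum = -1;

    /* free every datum we detoasted into slot->tts_mcxt */
    while ((attnum = bms_next_member(slot->pre_detoasted_attrs, attnum)) >= 0)
        pfree(DatumGetPointer(slot->tts_values[attnum]));

    bms_free(slot->pre_detoasted_attrs);
    slot->pre_detoasted_attrs = NULL;
}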

During the execution of these EEOP_*_VAR_TOAST steps, code like the
following is used:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(DatumGetPointer(slot->tts_values[attnum])))
    {
        Datum       oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        /* detoast into slot->tts_mcxt so it lives as long as the slot */
        slot->tts_values[attnum] = PointerGetDatum(
            detoast_attr((struct varlena *) DatumGetPointer(oldDatum)));
        Assert(slot->tts_nvalid > attnum);
        if (oldDatum != slot->tts_values[attnum])
            slot->pre_detoasted_attrs =
                bms_add_member(slot->pre_detoasted_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}
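
For completeness, a hedged sketch of how an EEOP_SCAN_VAR_TOAST step
could invoke this helper inside execExprInterp.c's dispatch loop
(EEO_CASE and EEO_NEXT are the existing interpreter macros; the exact
step body here is a sketch, not the patch's code):

        EEO_CASE(EEOP_SCAN_VAR_TOAST)
        {
            int         attnum = op->d.var.attnum;

            /* detoast in place, then fetch it like a plain Var */
            ExecSlotDetoastDatum(scanslot, attnum);
            *op->resvalue = scanslot->tts_values[attnum];
            *op->resnull = scanslot->tts_isnull[attnum];

            EEO_NEXT();
        }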

Testing
-------
- shared_detoast_slow.sql is used to test the planner-related code changes.

'set jit to off' will enable more INFO logs about which Var is
pre-detoasted at which node level.

- Cases where the patch doesn't help much.

create table w(a int, b numeric);
insert into w select i, i from generate_series(1, 1000000)i;

Q3:
select a from w where a > 0;

Q4:
select b from w where b > 0;

pgbench -n -f 1.sql postgres -T 10 -M prepared

Each query was run 5 times and the average value is reported.

| Qry No | Master  | patched | perf  | comment                                  |
|--------+---------+---------+-------+------------------------------------------|
|      3 | 309.438 | 308.411 |     0 | nearly zero impact on them               |
|      4 | 431.735 | 420.833 | +2.6% | save the detoast effort for numeric_out  |

- good case

setup:

create table b(big jsonb);
insert into b select
jsonb_object_agg( x::text,
random()::text || random()::text || random()::text )
from generate_series(1,600) f(x);
insert into b select (select big from b) from generate_series(1, 1000)i;

workload:
Q1:
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

Q2:
select 1 from b where length(big->>'1') > 0 and length(big->>'2') > 2;

| No | Master | patched |
|----+--------+---------|
|  1 |   1677 |     357 |
|  2 |    638 |     332 |

Some known issues:
------------------

1. Currently only Scan & Join nodes are considered for this feature.
2. JIT is not adapted for this purpose yet.
3. This feature builds a strong relationship between slot->tts_values
and slot->pre_detoasted_attrs. For example, if slot->tts_values[1]
holds a detoasted datum, then 1 is in the slot->pre_detoasted_attrs
bitmapset; however, if someone changes the value directly, like
"slot->tts_values[1] = 2;", without touching the slot's
pre_detoasted_attrs, then trouble comes. The good thing is that the
detoast can only happen on Scan/Join nodes, so no other slot can run
into this issue. I ran the following command to find such cases, and
none of them looks like a slot from a Scan/Join node:

egrep -nri 'slot->tts_values\[[^\]]*\] = *'

Any thoughts?

--
Best Regards
Andy Fan

Attachment Content-Type Size
v1-0001-shared-detoast-feature.patch text/x-diff 72.3 KB
