From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Nikita Malakhov <hukutoc(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Shared detoast Datum proposal
Date: 2024-05-22 01:02:55
Message-ID: 875xv6k3k1.fsf@163.com
Lists: pgsql-hackers
Hi Robert,
> Andy Fan asked me off-list for some feedback about this proposal. I
> have hesitated to comment on it for lack of having studied the matter
> in any detail, but since I've been asked for my input, here goes:
Thanks for doing this! Tomas and I have two totally different designs, and
I think we both understand the trade-offs of each; what still puzzles me is:
- which factors we should care about more;
- the possibility of improving design A or B to address its weaknesses.
For the second point, though, we probably need a preferred design first.
For now let's highlight that there are two totally different methods to
handle this problem. Let's call them:
design a: detoast the datum into slot->tts_values (and manage the
memory with the lifespan of the *slot*). From me.
design b: detoast the datum into a CACHE (and manage the memory with
cache eviction, releasing it at the end of the query). From Tomas.
Currently I prefer design a, but I'm not sure whether that is because it is
really better or just because it is my own design. So I will summarize the
current state to make the discussion easier. The summary means you don't
need to go through the long discussion, and each point links back to the
original post, since I may have misunderstood something and the original
post can serve as a double check.
> I agree that there's a problem here, but I suspect this is not the
> right way to solve it. Let's compare this to something like the
> syscache. Specifically, let's think about the syscache on
> pg_class.relname.
OK.
> In the case of the syscache on pg_class.relname, there are two reasons
> why we might repeatedly look up the same values in the cache. One is
> that the code might do multiple name lookups when it really ought to
> do only one. Doing multiple lookups is bad for security and bad for
> performance, but we have code like that in some places and some of it
> is hard to fix. The other reason is that it is likely that the user
> will mention the same table names repeatedly; each time they do, they
> will trigger a lookup on the same name. By caching the result of the
> lookup, we can make it much faster. An important point to note here is
> that the queries that the user will issue in the future are unknown to
> us. In a certain sense, we cannot even know whether the same table
> name will be mentioned more than once in the same query: when we reach
> the first instance of the table name and look it up, we do not have
> any good way of knowing whether it will be mentioned again later, say
> due to a self-join. Hence, the pattern of cache lookups is
> fundamentally unknowable.
True.
> But that's not the case in the motivating example that started this
> thread. In that case, the target list includes three separate
> references to the same column. We can therefore predict that if the
> column value is toasted, we're going to de-TOAST it exactly three
> times.
True.
> If, on the other hand, the column value were mentioned only
> once, it would be detoasted just once. In that latter case, which is
> probably quite common, this whole cache idea seems like it ought to
> just waste cycles, which makes me suspect that no matter what is done
> to this patch, there will always be cases where it causes significant
> regressions.
I agree.
> In the former case, where the column reference is
> repeated, it will win, but it will also hold onto the detoasted value
> after the third instance of detoasting, even though there will never
> be any more cache hits.
True.
> Here, unlike the syscache case, the cache is
> thrown away at the end of the query, so future queries can't benefit.
> Even if we could find a way to preserve the cache in some cases, it's
> not very likely to pay off. People are much more likely to hit the
> same table more than once than to hit the same values in the same
> table more than once in the same session.
I agree.
One more thing I want to highlight: the syscache is used for metadata,
while the *detoast cache* is used for user data. User data is likely to be
much bigger than metadata, so cache size control (and hence more frequent
runs of the eviction logic) is more necessary in this case. The first
paragraph in [1] (including the quoted text) highlights this idea.
> Suppose we had a design where, when a value got detoasted, the
> detoasted value went into the place where the toasted value had come
> from. Then this problem would be handled pretty much perfectly: the
> detoasted value would last until the scan advanced to the next row,
> and then it would be thrown out. So there would be no query-lifespan
> memory leak.
I think that is the same idea as design a.
> Now, I don't really know how this would work, because the
> TOAST pointer is coming out of a slot and we can't necessarily write
> one attribute of any arbitrary slot type without a bunch of extra
> overhead, but maybe there's some way.
I think the key points include:
1. Where to put the *detoasted value* so that the ExprEval system can find
it cheaply. The solution in A is slot->tts_values[*].
2. How to manage the memory holding the detoasted value.
The solution in A is:
- use a lazily created MemoryContext in the slot;
- let the detoast system detoast the datum into that MemoryContext.
3. How to let expression evaluation run the extra cycles only when needed.
The solution in A is:
an EEOP_XXX_TOAST step, designed to get rid of a *run-time* "if
(condition)". I found the existing ExprEval system already has many such
specialized steps.
(A rough sketch of points 1 and 2 follows below.)
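To make points 1 and 2 concrete, here is a minimal C sketch of the idea, not
the actual patch code: the field name tts_data_mcxt and the function name are
hypothetical, and the real patch may wire this into the slot / ExprEval code
differently. It only shows detoasting into tts_values using a lazily created
per-slot memory context, so the detoasted value lives exactly as long as the
slot's current contents.

    #include "postgres.h"

    #include "access/detoast.h"
    #include "executor/tuptable.h"
    #include "utils/memutils.h"

    /*
     * Sketch only: detoast attribute "attno" (0-based index into tts_values)
     * in place, keeping the result in a per-slot memory context so that it
     * is freed together with the slot's current contents.
     *
     * "tts_data_mcxt" is a hypothetical extra field added to TupleTableSlot;
     * the actual patch may use a different name or placement.
     */
    static void
    slot_detoast_attr_sketch(TupleTableSlot *slot, int attno)
    {
        MemoryContext oldcxt;
        struct varlena *toasted;

        if (slot->tts_isnull[attno])
            return;

        toasted = (struct varlena *) DatumGetPointer(slot->tts_values[attno]);
        if (!VARATT_IS_EXTENDED(toasted))
            return;             /* already a plain value, nothing to do */

        /* lazily create the per-slot context on first use */
        if (slot->tts_data_mcxt == NULL)
            slot->tts_data_mcxt =
                AllocSetContextCreate(slot->tts_mcxt,
                                      "slot detoast data",
                                      ALLOCSET_DEFAULT_SIZES);

        /* detoast into the slot's private context and store it back in place */
        oldcxt = MemoryContextSwitchTo(slot->tts_data_mcxt);
        slot->tts_values[attno] = PointerGetDatum(detoast_attr(toasted));
        MemoryContextSwitchTo(oldcxt);
    }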
I will post a detailed overall design later and explain its known
weaknesses so far.
> If it could be done cheaply enough, it would gain all the benefits of
> this proposal a lot more cheaply and with fewer
> downsides.
I agree.
> Alternatively, maybe there's a way
> to notice the multiple references and introduce some kind of
> intermediate slot or other holding area where the detoasted value can
> be stored.
>
> In other words, instead of computing
>
> big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
>
> Maybe we ought to be trying to compute this:
>
> big_jsonb_col=detoast(big_jsonb_col)
> big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
>
> Or perhaps this:
>
> tmp=detoast(big_jsonb_col)
> tmp->'a', tmp->'b', tmp->'c'
Currently design a doesn't use 'tmp', so that the existing ExprEval
engine can find the detoasted datum easily; it is more like:
> big_jsonb_col=detoast(big_jsonb_col)
> big_jsonb_col->'a', big_jsonb_col->'b', big_jsonb_col->'c'
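As a hedged illustration of that point (again a sketch, not the patch's
code): every reference to big_jsonb_col evaluates to the same
slot->tts_values entry, so the first reference can detoast in place and the
later references simply find the already-detoasted value there, with no
separate 'tmp'. slot_detoast_attr_sketch() refers to the hypothetical helper
sketched earlier in this mail.

    /*
     * Sketch only: how repeated references find the detoasted value without
     * a separate "tmp".  "attno" is a 0-based index into tts_values.
     */
    static Datum
    slot_value_detoasted_sketch(TupleTableSlot *slot, int attno)
    {
        /* first reference detoasts in place; later ones fall through */
        if (!slot->tts_isnull[attno] &&
            VARATT_IS_EXTENDED(DatumGetPointer(slot->tts_values[attno])))
            slot_detoast_attr_sketch(slot, attno);

        /* big_jsonb_col->'a', ->'b', ->'c' all read this same Datum */
        return slot->tts_values[attno];
    }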
[1] https://www.postgresql.org/message-id/87ttllbd00.fsf%40163.com
--
Best Regards
Andy Fan