From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: POC: Cleaning up orphaned files using undo logs |
Date: | 2019-07-12 09:39:51 |
Message-ID: | CAA4eK1LisyNxwxyywzc7XrBL=LGnVswmJe1V9FkcJcY6F7h-Jw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jul 11, 2019 at 1:29 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> After some off-list discussion with Andres ...
>
> Another possible approach here, which I think I like better, is to
> switch from using a binary heap to using an rbtree. That wouldn't
> work well in DSM because of the way it uses pointers, but here we're
> putting data in the shared memory segment so it seems like it should
> work. The idea would be to allocate an array of entries with a
> freelist, and then have allocfunc and freefunc defined to push and pop
> the freelist. Unlike a binary heap, an rbtree lets us (a) do
> peek-ahead in sorted order and (b) delete elements from an arbitrary
> position without rebuilding anything.
>
> If we adopt this approach, then I think a bunch of the problems we've
> been talking about actually get a lot easier. If we pull an item from
> the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can
> remove it from the other one cheaply, because we can store a pointer
> to the RBTNode in the main object. So then we never have any stale
> pointers in any data structure, which means we don't have to have a
> strategy to avoid accidentally following them.
>
> The fact that we can peak-ahead correctly without any new code is also
> very nice. I'm still concerned that peeking ahead isn't the right
> approach in general, but if we're going to do it, peeking ahead to the
> actually-next-highest-priority item is a lot better than peeking ahead
> to some-item-that-may-be-fairly-high-priority.
>
> One problem which Andres spotted is that rbt_delete() can actually
> move content around, so if you just cache the RBTNode returned by
> rbt_insert(), it might not be the right one by the time you
> rbt_delete(), if other stuff has been deleted first. There are
> several possible approaches to that problem, but one that I'm
> wondering about is modifying rbt_delete_node() so that it doesn't rely
> on rbt_copy_data. The idea is that if y != z, instead of copying the
> data from y to z, copy the left/right/parent pointers from z into y,
> and make z's left, right, and parent nodes point to y instead.
>
I am not sure but don't we need to retain the color of z as well?
Apart from this, the duplicate key (ex. for size queues the size of
two requests can be same) handling might need some work. Basically,
either special combine function needs to be written (not sure yet what
we should do there) or we always need to ensure that the key is unique
like (size + start_urec_ptr). If the size is the same, then we can
decide based on start_urec_ptr.
I think we can go by changing the implementation to rbtree by doing
some enhancements instead of the binary heap or alternatively, we can
use one of the two ideas suggested by you in the email above [1] to
simplify the code and keep using the binary heap for now. Especially,
I like the below one.
"2. However, I don't think we should have a separate request object
for each queue anyway. We should insert pointers to the same objects
in all the relevant queue (either size + XID, or else error). So
instead of having three sets of objects, one for each queue, we'd just
have one set of objects and point to them with as many as two
pointers.
We'd therefore need LESS memory than we're using today, because we
wouldn't have separate arrays for XID, size, and error queue
elements."
I think even if we currently go with a binary heap, it will be
possible to change it to rbtree later, but I am fine either way.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Ideriha, Takeshi | 2019-07-12 09:46:15 | RE: Copy data to DSA area |
Previous Message | tushar | 2019-07-12 09:23:21 | Re: Minimal logical decoding on standbys |