Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-16 22:07:22
Message-ID: 20190716220722.byg3u3bxfoqczxkc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-07-13 15:55:51 +0530, Amit Kapila wrote:
> On Fri, Jul 12, 2019 at 7:08 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > I think even if we currently go with a binary heap, it will be
> > > possible to change it to rbtree later, but I am fine either way.
> >
> > Well, I don't see much point in revising all of this logic twice. We
> > should pick the way we want it to work and make it work that way.
> >
>
> Yeah, I agree. So, I am assuming here that as you have discussed this
> idea with Andres offlist, he is on board with changing it as he has
> originally suggested using binary_heap. Andres, do let us know if you
> think differently here. It would be good if anyone else following the
> thread can also weigh in.

Yes, I think using an rbtree makes sense.

I'm not yet sure whether we'd want the rbtree nodes being pointed to
directly by the hashtable, or whether we'd want one indirection.

e.g. either something like:

typedef struct UndoWorkerQueue
{
/* priority ordered tree */
RBTree *tree;
....
}

typedef struct UndoWorkerQueueEntry
{
RBTNode tree_node;

/*
* Reference hashtable via key, not pointers, entries might be
* moved.
*/
RollbackHashKey rollback_key
...
} UndoWorkerQueueEntry;

typedef struct RollbackHashEntry
{
...
UndoWorkerQueueEntry *queue_memb_size;
UndoWorkerQueueEntry *queue_memb_age;
UndoWorkerQueueEntry *queue_memb_error;
}

and call rbt_delete() for any non-NULL queue_memb_* whenever an entry is
dequeued via one of the queues (after setting the one already dequeued
from to NULL, of course). Which requires - as Robert mentioned - that
rbtree pointers remain stable after insertions.

Alternatively we can have a more complicated arrangement without the
"stable pointer" requirement (which'd also similarly work for a binary
heap):

typedef struct UndoWorkerQueue
{
/* information about work needed, not meaningfully ordered */
UndoWorkerQueueEntry *entries;

/*
* Priority ordered references into 0<entries, using
* UndoWorkerQueueTreeEntry as members.
*/
RBTree tree;

/* unused elements in ->entries, UndoWorkerQueueEntry members */
slist_head freelist;

/*
* Number of entries in ->entries and tree that can be pruned by
* doing a scan of both.
*/
int num_prunable_entries;
}

typedef struct UndoWorkerQueueEntry
{
/*
* Reference hashtable via key, not pointers, entries might be
* moved.
*/
RollbackHashKey rollback_key

/*
* As members of UndoWorkerQueue->tree can be moved in memory,
* RollbackHashEntry cannot directly point to them. Instead
*/
bool already_processed;
...

slist_node freelist_node;
} UndoWorkerQueueEntry;

typedef struct UndoWorkerQueueTreeEntry
{
RBTree tree;

/* offset into UndoWorkerQueue->entries */
int off;
} UndoWorkerQueueEntry;

and again

typedef struct RollbackHashEntry
{
RBTNode tree_node;
...
UndoWorkerQueueEntry *queue_memb_size;
UndoWorkerQueueEntry *queue_memb_age;
UndoWorkerQueueEntry *queue_memb_error;
}

Because the tree entries are not members of the tree itself, pointers to
them would be stable, regardless of rbtree (or binary heap) moving them
around. The cost of that would be more complicated datastructures, and
insertion/deletion/dequeing operations:

insertion:
if (slist_is_empty(&queue->freelist))
prune();
if (slist_is_empty(&queue->freelist))
elog(ERROR, "full")

UndoWorkerQueueEntry *entry = slist_pop_head_node(&queue->freelist)
UndoWorkerQueueTreeEntry tree_entry;

entry->already_processed = false;
entry->... = ...;

tree_entry.off = entry - queue->entries; // calculate offset
rbt_insert(queue->tree, &tree_entry, NULL);

prune:
if (queue->num_prunable_entries > 0)
RBTreeIterator iter;
slist_node *pending_freelist;
rbt_begin_iterate(queue->tree, &iter, LeftRightWalk);
while ((tnode = rbt_iterate(&iter)) != 0)
node = (UndoWorkerQueueTreeEntry *) tnode;
if (queue->entries[node->off]->already_processed)
rbt_delete(tnode);
/* XXX: Have to stop here, the iterator is invalid -
* probably should add a rbt_delete_current(iterator);
*/
break;

dequeue:
while (node = rbt_leftmost(queue->tree))
node = (UndoWorkerQueueTreeEntry *) tnode;
entry = &queue->entries[node->off];

rbt_delete(tnode);

/* check if the entry has already been processed via another queue */
if (entry->already_processed)
slist_push(&queue->freelist, &entry->freelist_node);
else
/* found it */
return entry;
return NULL;

delete (i.e. processed in another queue):

/*
* Queue entry will only be reusable when the corresponding tree
* entry has been removed. That'll happen either when new entries
* are needed (cf prune), or when the entry is dequeued (cf dequeue).
*/
entry->already_processed = true;

I think the first approach is clearly preferrable from a simplicity POV,
but the second approach would be a bit more generic (applicable to heap
as well) and wouldn't require adjusting the rbtree code.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2019-07-16 22:18:20 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Morris de Oryx 2019-07-16 21:56:22 Re: Detailed questions about pg_xact_commit_timestamp