From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Subject: | Re: B-tree parent pointer and checkpoints |
Date: | 2010-11-11 18:12:21 |
Message-ID: | 4CDC3205.6030001@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 11.11.2010 17:16, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> GiST is different. When you insert a key to a leaf page, you (sometimes)
>> need to adjust the parent pointer to reflect the new key as well. B-tree
>> tolerates incomplete splits with the 'next page' pointer, but that is
>> not applicable to gist. Teodor described the issue back in 2005 when
>> WAL-logging was added to GiST
>> (http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php)
>> Reading that I wonder: what harm would an incomplete insert cause if we
>> just left it in the tree? Imagine that you insert a key to a leaf page,
>> but crash before updating the parent. If you search for the key starting
>> from the root, you'll fail to find it, because the parent pointer claims
>> that there are no entries with such a key on the child page. But that's
>> OK, the inserting transaction aborted with the crash!
>
> I think it'd be okay as far as that one entry is concerned, since as you
> say it doesn't matter whether a search finds it. (We'd have to be sure
> that VACUUM would still find it to remove it, of course, but that
> doesn't use a normal search.) You're right that it poses a hazard of
> subsequent inserts deciding that they don't need to do work on upper
> levels because the lower ones look OK already. But depending on the
> details of the search algorithm, this might be a non-problem: if you
> remember that the upper level entries didn't cover your key when you
> descended, you'd still know you need to recompute them.
Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.
> Something else I just noticed is that WAL replay isn't capable of
> completely fixing the index anyway:
>
> * To complete insert we can't use basic insertion algorithm because
> * during insertion we can't call user-defined support functions of opclass.
> * So, we insert 'invalid' tuples without real key and do it by separate algorithm.
> * 'invalid' tuple should be updated by vacuum full.
>
> Given that there's no more vacuum full, and nobody has been expected to
> run it routinely for a long time anyway, this fixup approach seems
> pretty completely broken anyhow.
The 'invalid' tuples don't affect correctness, but are a drag on
performance, so they are similar to incomplete b-tree splits. I suspect
the overhead of an invalid gist pointer is much bigger than the overhead
of an incomplete b-tree split, though. I agree we should get rid of
that, it's not comforting to get a stream of messages in the logs saying
you should run VACUUM FULL.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Darren Duncan | 2010-11-11 18:19:08 | Re: MULTISET and additional functions for ARRAY |
Previous Message | David E. Wheeler | 2010-11-11 18:11:24 | Re: MULTISET and additional functions for ARRAY |