From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Subject: | Re: B-tree parent pointer and checkpoints |
Date: | 2011-03-10 20:50:29 |
Message-ID: | 201103102050.p2AKoTr03570@momjian.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Bruce Momjian wrote:
>
> Has this been addressed?
I see we have with this commit:
9de3aa65f01fb51cbc725e8508ea233e4e92c46c
---------------------------------------------------------------------------
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
> > On 13.11.2010 00:34, Greg Stark wrote:
> > > On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
> > > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > >> I think we can work around that with a small modification to the page split
> > >> algorithm. In a nutshell, when the child page is split, put a flag on the
> > >> left half indicating that the rightlink must always be followed, regardless
> > >> of the NSN. When the downlink is inserted to the parent, clear the flag.
> > >> Setting and clearing of these flags need to be performed during WAL replay
> > >> as well.
> > >
> > > Does this not cause duplicate results? Or does GIST already have to be
> > > prepared to deal with duplicate results?
> >
> > The GiST search algorithm avoids duplicate results by remembering the
> > LSN on the parent page when it follows a downlink. The split currently
> > happens like this:
> >
> > 0. (the child page is locked)
> > 1. The parent page is locked.
> > 2. The child page is split. The original page becomes the left half, and
> > new buffers are allocated for the right halves.
> > 3. The downlink is inserted on the parent page (and the original
> > downlink is updated to reflect only the keys that stayed on the left
> > page). While keeping the child pages locked, the NSN field on the
> > children are updated with the new LSN of the parent page.
> >
> > To avoid duplicates, when a scan looks at the child page, it needs to
> > know if it saw the parent page before or after the downlink was
> > inserted. If it saw it before, the scan needs to follow the rightlink to
> > the right half, otherwise it will follow the downlink as usual (if it
> > matched). The scan checks that by comparing the LSN it saw on the parent
> > page with the NSN on the child page. If parent LSN < NSN, we saw the
> > parent before the downlink was inserted.
> >
> > Now, the problem with crash recovery is that the above algorithm depends
> > on the split to keep the parent and child locked until the downlink is
> > inserted in the parent. If you crash between steps 2 and 3, the locks
> > are gone. If a later insert then updates the parent page, because of a
> > split on some unrelated child page, that will bump the LSN of the parent
> > above the NSN on the child. Scans will see that the parent LSN > child
> > NSN, and will no longer follow the rightlink.
> >
> > And the fix for that is to set a flag on the child page indicating that
> > rightlink has to be always followed regardless of the LSN/NSN, because
> > the downlink hasn't been inserted yet. When the downlink is inserted,
> > the flag is cleared and we rely on the existing LSN/NSN mechanism to
> > avoid duplicate results.
> >
> > --
> > Heikki Linnakangas
> > EnterpriseDB http://www.enterprisedb.com
> >
> > --
> > Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + It's impossible for everything to be true. +
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2011-03-10 20:50:48 | Re: Sync Rep v19 |
Previous Message | Andrew Dunstan | 2011-03-10 20:48:06 | Re: configure gaps |