Re: pgsql: Optimize btree insertions for common case of increasing values

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-committers <pgsql-committers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgsql: Optimize btree insertions for common case of increasing values
Date: 2018-04-04 17:16:50
Message-ID: CAH2-Wzm_i6TkxppTsT7ZwkGBfaa9fgZrFaW99909-+qYy2BFUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

On Wed, Apr 4, 2018 at 5:33 AM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> Ok. Adding a check for tree height and setting target block only if it's >=
> 2, as suggested by you and Simon later. Rahila helped me also ran another
> round of tests and this does not lead to any performance regression (we were
> worried about whether calling _bt_getrootheight will be expensive).

Right.

> Also moved RelationSetTargetBlock() to a point where we are not holding any
> locks and are outside the critical section.

Right.

>> * An assertion would make me feel a lot better about depending on not
>> having a page split from a significant distance.
>
>
> Ok. I assume you mean an assertion to check that the target page doesn't
> have an in-complete split. Added that though not sure if it's useful since
> we concluded that right-most page can't have in-complete split.
>
> Let me know if you mean something else.

I meant something else. I was talking about the assertion discussed
below. I don't see too much point in the !P_INCOMPLETE_SPLIT(lpageop)
assertion, though.

>> Your optimization should continue to not be used when it would result
>> in a page split, but only because that would be slow. The comments
>> should say so, IMV.
>
>
> Added.

I think the wording here could use some tweaking:

> /*
> - * Check if the page is still the rightmost leaf page, has enough
> - * free space to accommodate the new tuple, no split is in progress
> - * and the scankey is greater than or equal to the first key on the
> - * page.
> + * Check if the page is still the rightmost valid leaf page, has
> + * enough free space to accommodate the new tuple and the scankey
> + * is strictly greater than the first key on the page.
> + *
> + * NB: We could take the fastpath even when the target block
> + * doesn't have enough free space (but it's the right-most block)
> + * since _bt_insertonpg() is capable of working with a NULL stack
> + * and that's the only additional thing the slow path sets up. But
> + * we don't optimise for that case because while spliting and
> + * inserting into the parent without the stack is relatively slow
> + * operation.
> */

I would cut this down, and just say "We only insert if it definitely
won't result in a pagesplit" -- no need for the second paragraph in
this high-level routine. The full details can go on top of the new
_bt_insertonpg() assertion I talk about later.

> You mean passing "fastpath" to _bt_insertonpg and then checking it's false
> if page split is needed? But isn't page split is only needed if the page
> doesn't have enough free space? If so, we have checked for that before
> setting "fastpath".

That's not exactly what I meant. I meant that if:

1. This is an insertion to the leaf level in _bt_insertonpg().

and

2. We don't have space on the page, and so must do a split (or at
least free LP_DEAD items).

and

3. RelationGetTargetBlock(rel) != InvalidBlockNumber

There should be an assertion failure. This new assertion within
_bt_insertonpg() makes it much less likely that the assumption breaks
later.

This is where you could point out the low-level details that I
suggested be omitted from _bt_doinsert() at the beginning of this
e-mail. You can mention here that it would actually work without a
pagesplit, but that is only intended for crash recovery, and is a much
slower path that would make the optimization totally
counter-productive. We add an assertion because without one it would
be easy to miss a regression where there is a page split with an empty
stack.

Finally, I'd like to see a small paragraph in the nbtree README, about
the high level theory behind this optimization and page recycling. The
assumption is that there can only be one non-ignorable leaf rightmost
page, and so even a RecentGlobalXmin style interlock isn't required.
We cannot fail to detect that our hint was invalidated, because there
can only be one such page in the B-Tree at any time. It's possible
that the page will be deleted and recycled without a backend's cached
page also being detected as invalidated, but only when we happen to
recycle a page that once again becomes the rightmost leaf page.

Once those changes are made, this should be fine to commit.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Magnus Hagander 2018-04-04 17:25:11 Re: pgsql: Validate page level checksums in base backups
Previous Message Andres Freund 2018-04-04 17:10:46 Re: pgsql: New files for MERGE