Re: BTree vacuum before page splitting

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Junji TERAMOTO <teramoto(dot)junji(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-patches(at)postgresql(dot)org
Subject: Re: BTree vacuum before page splitting
Date: 2006-02-01 07:12:07
Message-ID: 1138777928.3096.75.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches

On Tue, 2006-01-31 at 15:18 +0900, Junji TERAMOTO wrote:
> Tom Lane wrote:
> > I think this is quite likely to break things :-(. What sort of
> > conditions have you tested it under? (If this were safe, we'd
> > not have invented the LP_DELETE flag to begin with, but just have
> > deleted known-dead items immediately.)
>
> I apologize for my insufficient description(and bad english).
> I explain the operation of this patch in detail again.
>
> In _bt_insertonpg(), we insert the tupple by the following methods.
>
> (1) Finds the right place(page) to insert the tuple.
> (2) If free space is insufficient, we operate as follows.
> + (2.1) Confirm whether the lock to the found page is considerable
> super-exclusive lock in BufferSuperLockHeldByMe()[new]. We
> check bufHdr->refcount(=pin), and if pin == 1, we know this
> lock is super-exclusive lock.
> If our lock is not super-exclusive lock, goto (2.4).
> + (2.2) If our lock is super-exclusive lock, and the found page is
> leaf, we look for tupples marked as "LP_DELETE" from the found
> page, and remove found tuples in _bt_vacuum_page()[new].
> + (2.3) If free space is sufficient after removal, goto (4).
> (2.4) Step right to next non-dead page.
> (2.5) (2) is repeated until an enough free space is found or it
> reaches a right edge at last.
> (3) When an enough free space is not found by processing (2), splits
> the target page (making sure that the split is equitable as far as
> post-insert free space goes).
> (4) Inserts the tuple.
>
> The numbers from 2.1) to 2.3) are new processing.
>
> The pointer's shifting by removing "LP_DELETE" tuples as shown in the
> above-mentioned becomes a problem, when we resume a stopped indexscan.
> Therefore, I am having it confirm the state of the lock by (2.1), and
> process only at super-exclucive lock, same as btbulkdelete().
>
> However, because own indexscan might be lost because of this removal,
> I also modify _bt_restscan() as follows.
>
> (1) Check for index tuple we stopped on.
> (2) If index tuple is moved, first of all, we search in this page
> right.
> +(3) If not found, we try to look for from the head of this page
> because index tuple may moved left due to removal.
> (4) If still not found, we look for index tuple right.
>
> The number (3) is new processing.
>
> We do not delete the empty page. Therefore, there is no necessity for
> tracing a left page.
>
> I think that no problem occurs by this logic.
> Please point it out if there is a wrong point. Thanks.

Please read comments in nbtree.c p.557-566 which describe the required
locking protocol for btree pages.

Just because you hold the lock does *not* mean you are allowed to delete
tuples marked as deleted. This patch doesn't work in all cases, which is
what is required, since the failure cases don't raise an ERROR - they
just miss data - which is unacceptable. So your patch works 99.9% of the
time, but sometimes gives wrong answers to queries without you knowing.
So, patch rejected, but good thinking all the same.

You will need to argue for a change in the locking protocol before it
could be accepted. That would speed up VACUUM also, so if it were
possible it would be gratefully accepted.

The only help I can give is this: Why does an index scan ever want to
stop on a deleted tuple? Currently, the scan only remembers one tuple
its seen. If the tuple was dead, but wasn't yet marked as dead the index
scan will have read it and stopped there. Some while later, the index
scan will return and begin scanning right for the tuple. If you delete
it, the index scan *may* not be able to work out where to restart --
ERROR. So, if you can work out a way of not restarting an index scan on
a deleted tuple (and yet still marking them as deleted when it sees
them) then you may be able to solve the problem.

I'm looking at the same problem domain also (table bloat), but focusing
on a different approach that avoids this issue.

Best Regards, Simon Riggs

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Dave Page 2006-02-01 08:35:51 FW: PGBuildfarm member snake Branch REL8_0_STABLE Status changed from OK to Make failure
Previous Message David Fetter 2006-02-01 06:56:05 Re: [BUGS] BUG #2221: Bad delimiters allowed in COPY ...