Quick Links

Re: Musings

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Musings
Date:	2002-05-06 03:50:54
Message-ID:	29182.1020657054@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I said:
> On the other hand --- if the index *is* unique, and we are checking
> equality on all columns (a fairly easily checked condition), then we
> know we should retrieve at most one visible tuple. So, without making
> any incorrect assumptions, we could terminate the indexscan after the
> first successful match. Hmm ... you might be right that there's a
> cheap win to be had there.

I tried this out on a quick-hack basis of just teaching IndexNext to
terminate an indexscan once it's gotten a single visible tuple out of
a unique index. It works pretty much as expected, but I didn't see any
noticeable improvement in pgbench speed. Investigation with gprof
led to the following conclusions:

1. pgbench spends an awful lot of its time in _bt_check_unique, which
I hadn't touched. (AFAICS it couldn't be sped up anyway with this
technique, since in the non-error case it won't find any visible
tuples.) I think the only real hope for speeding up _bt_check_unique
is to mark dead index entries so that we can avoid repeated heap_fetches.

2. When I said that new index entries would be visited first because
they're inserted to the left of existing entries of the same key,
I forgot that that's only true when there's room for them there.
The comments in nbtinsert.c give a more complete picture:

* NOTE: if the new key is equal to one or more existing keys, we can
* legitimately place it anywhere in the series of equal keys --- in fact,
* if the new key is equal to the page's "high key" we can place it on
* the next page. If it is equal to the high key, and there's not room
* to insert the new tuple on the current page without splitting, then
* we can move right hoping to find more free space and avoid a split.
* (We should not move right indefinitely, however, since that leads to
* O(N^2) insertion behavior in the presence of many equal keys.)
* Once we have chosen the page to put the key on, we'll insert it before
* any existing equal keys because of the way _bt_binsrch() works.

If we repeatedly update the same tuple (keeping its index key the same),
after awhile the first btree page containing that index key will fill
up, and subsequently we will tend to insert duplicate entries somewhere
in the middle of the multiple-page sequence of duplicates. We could
guarantee that the newest tuple is visited first only if we were
prepared to split the first btree page in the above-described case,
rather than looking for subsequent pages with sufficient room to insert
the index entry. This seems like a bad idea --- it would guarantee
inefficient space usage in the index, because as soon as we had a series
of duplicate keys spanning multiple pages, we would *never* attempt to
insert into the middle of that series, and thus freed space within the
series of pages would go forever unused.

So my conclusion is that this idea is probably worth doing, but it's not
earthshaking by itself.

The major difficulty with doing it in a non-hack fashion is that there
isn't a clean place to insert the logic. index_getnext and subroutines
cannot apply the test, because they don't know anything about tuple
visibility (they don't get passed the snapshot being used). So without
restructuring, we'd have to teach all the couple-dozen callers of
index_getnext what to do.

I have been thinking for awhile that we need to restructure the
indexscan API, however. There is no good reason why index_getnext
*shouldn't* be responsible for time qual checking, and if it were
then it could correctly apply the one-returned-tuple rule. Even
more important, it would then become possible for index_getnext to
also be the place that detects completely-dead tuples and notifies
the index AMs to mark those index entries as uninteresting.

Basically I'd like to make index_getnext have an API essentially the
same as heap_getnext. There are a few callers that actually want
index_getnext's behavior (fetch index tuples but not heap tuples),
but we could create an alternative subroutine for them to use.

A more detailed proposal will follow by and by...

regards, tom lane

In response to

Re: Musings at 2002-05-05 18:15:56 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2002-05-06 04:00:46	Re: Native Windows, Apache Portable Runtime
Previous Message	Marc G. Fournier	2002-05-06 03:43:02	Re: Native Windows, Apache Portable Runtime