From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Mike Mascari <mascarm(at)mascari(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, andrew(at)supernews(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: CONCURRENT INDEXing again (was: Must be owner to
Date: 2005-07-12 21:03:22
Message-ID: 1121202202.5358.0.camel@fuji.krosing.net
Lists: pgsql-hackers
On Tue, 2005-07-12 at 12:20 -0400, Tom Lane wrote:
> Hannu Krosing <hannu(at)skype(dot)net> writes:
> > I try to state my reworked idea of concurrent indexing in a more clear
> > way:
>
> > The index build is done in 2 transactions, and needs one new type of
> > lock and a new system column in pg_class to tell the planner not to use
> > an incomplete index. This is similar to vacuum in that it needs its own
> > transactions and is not rollbackable.
>
> "Not rollbackable" is certainly "not acceptable"... what if there's a
> crash partway through? VACUUM FULL is designed so that there isn't
> anything special that needs to be done to clean up after it if it fails
> partway through, but this seems to be untrue of your proposal. You'd
> have committed-created but useless indexes that have to be got rid of
> somehow.
I meant non-rollbackable in the sense that it can't be rolled back by user command.
If there is an index marked "incomplete" after recovery, it is simply
dropped, as it was logically never there anyway.
If it fails in the first stage, it is the same as what happens with the
current implementation.
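To illustrate (a sketch only - the "incomplete" flag does not exist yet,
so the column name "relcomplete" below is made up), the post-recovery
cleanup could find the leftovers with:

SELECT relname
  FROM pg_class
 WHERE relkind = 'i'      -- index relations
   AND NOT relcomplete;   -- hypothetical marker; false = half-built

and simply drop each index it returns.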
> > To be sure that we cover all the tuples in the range <= MAX_CTID_1, and
> > no new tuples are stored there as the result of an INSERT or UPDATE, we
> > need a new type of lock (let's call it "TupleStoreRangeLock"), which
> > prevents new tuples from being placed below MAX_CTID_1 and which is
> > acquired before starting the initial build.
>
> Checking for such a lock will put a nontrivial distributed cost on every
> insert and update, whether the facility is in use or not.
Do we not take at least some kind of lock at each insert/update now?
I guess we do; how else am I prevented from changing a table during
CREATE INDEX now?
We are currently most likely taking the lock _after_ deciding where to
put the tuple, so taking that lock should inform us about the existence
of a TupleStoreRangeLock on that relation if there is one, and force a
reallocation of tuple space. This way there is very little extra cost
for the case with no TupleStoreRangeLock.
My aim is not making concurrent indexing as fast as current indexing
with an exclusive lock, just making it bearable on a 24/7 system.
And I'd like to run it in the same kind of situation as my VACUUM patch,
so that it won't show up in VACUUM's GetOldestXmin calculations. In this
case it could run for hours/days/weeks ;) if that is needed for doing it
without halting the system.
> > After the initial build of index for tuples below MAX_CTID_1 is
> > finished, it is made visible to the rest of the system by committing the
> > transaction, but marking the index as "incomplete" (probably a new
> > column in pg_class is needed for that), so that it will not be used by
> > the planner, but all new inserts/updates will see and use it.
>
> I can see how this might work for a new index build, but it doesn't work
> for REINDEX --- you get to have other transactions inserting into either
> the old or the new index, not both. You could possibly make it work by
> creating a complete new index with its own OID, and then swapping that
> out for the old index at completion --- but see below.
That's what I meant - create another index, then drop the old one and
rename the new one to the old name. Index names are not _that_ important.
And I see REINDEX really as a shorthand for
BEGIN;
CREATE INDEX newndx ON tbl (col);   -- same definition as oldndx
DROP INDEX oldndx;
ALTER TABLE newndx RENAME TO oldndx;
COMMIT;
without the need to look up the index definition.
Or is there anything more REINDEX provides us ?
And even the BEGIN/COMMIT don't give us any added benefit - any sane
query plan won't use 2 similar indexes in the same query anyway, even if
it sees them. Only for a very brief moment will there be a slight extra
burden on the planner in considering 2 similar indexes.
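This is easy to check on a scratch table (table and index names here are
made up):

CREATE TABLE t (id int);
CREATE INDEX t_idx_old ON t (id);
CREATE INDEX t_idx_new ON t (id);       -- identical definition
EXPLAIN SELECT * FROM t WHERE id = 1;   -- the plan uses at most one of them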
> > After that we need to wait for all other running transactions to
> > complete, so we can be sure that all other backends know about the new
> > index.
>
> That could be a pretty long time ... and presumably the index build is
> pretty long, too, or we'd not be bothering with all this mechanism.
<sarcasm>Of course it would take less time if all the other critical,
"must complete in under 5 sec" transactions were waiting for the create
index to complete.</sarcasm>
I mainly see a use case for concurrent indexing in 24/7 OLTP systems,
which can't allow even scheduled downtime. There are generally no long
transactions, except VACUUM or COPY, and these can be scheduled to run
at a different time from CREATE INDEX.
> All the while, tuples are getting inserted into highly nonoptimal pages
> far from the start of the table. Doesn't this idea conflict rather
> badly with your desire expressed nearby to force tuples to be inserted
> near the front of the table?
They are for two different use-cases, both needed in maintaining 24/7
systems:
1. concurrent indexing is needed for a *big* table, for either replacing
a *bloated* *index* or creating a new index
2. placing new tuples near the beginning is for reducing the size of a
*bloated* *table* (say 1-2% live tuples).
the first one is currently solvable only with slony (not log-shipping)
types of replication and switchovers.
the second one can be done using some extra non-functional updates and
vacuuming.
Both are needed to fight performance degradations - usually sharp ones,
which happen when a significant ratio of index pages no longer fits in
RAM, or when, say, 5000 often-updated records are spread over 50000
pages thanks to the inability to vacuum them often enough while another
vacuum is running on a big table or a backup is being made using pg_dump.
The sparse table problem is not too bad by itself (index access is still
fast), but both vacuums and reindexes are 1) slow, as they use seqscans
internally, and 2) they pollute RAM with empty pages that could be used
for better ends.
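For the second case the manual workaround is roughly this (a sketch only -
"t" and "k" are placeholder names, and it assumes new tuple versions are
placed near the front of the table):

UPDATE t SET k = k;   -- non-functional update: live rows get new versions
                      -- in earlier, emptier pages
VACUUM t;             -- reclaim the old versions and truncate the empty tail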
> > As the final step we need to scan all tuples in range ( MAX_CTID_1 to
> > MAX_CTID_2 ) and insert their corresponding index entries into the new
> > index. If the entry already exists for the exact same ctid, that is ok.
>
> I don't think that's as easy as it sounds; at the very least it requires
> mechanism comparable to the unique-index checking code, which we don't
> have for any index type except btree.
I currently don't use anything besides btree, so I would be happy with
that.
> Also, in an index that's actually
> highly non-unique, that mechanism is *expensive* --- you may have to
> scan many pages of identically-keyed entries to see if any of them match
> the target ctid ... all the while holding a lock that prevents anyone
> else from inserting on the same starting page.
We could consider ctid part of the index key, and place ctids in order
in btrees (and probably consider ctid an additional dimension in
gist/rtrees). That would likely give some of the same indexscan-locality
benefits as bitmap scanning. Now that I think of it, highly non-unique
indexes should have been built like that all along, and more unique
indexes need not care, so this applies to *all* (btree) indexes.
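A rough way to see the locality problem this would address (an
illustrative query only - table "t" and its non-unique column "k" are
made up; the casts pull the heap page number out of ctid):

SELECT k,
       count(*) AS entries,
       count(DISTINCT (ctid::text::point)[0]) AS heap_pages
  FROM t
 GROUP BY k
 ORDER BY heap_pages DESC
 LIMIT 10;
-- with duplicates ordered by ctid inside the index, an indexscan over
-- one key value would visit those heap pages in order, not at random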
> What's more, the pseudo uniqueness check has to be done on both sides
> --- else index build might decide the entry's not there, insert it, only
> to have the original tuple inserter come along right after and insert
> again.
Yes, this surely needs to be avoided.
Waiting for other transactions to finish and then locking the area
between MAX_CTID_1 and MAX_CTID_2 should make sure that this never
happens.
> So this is a second major operational mode that has to affect
> everybody in the system, not only index build. I'm not sure whether
> there are race conditions associated with getting in and out of this
> mode, but it wouldn't surprise me.
Me neither.
> > After reaching MAX_CTID_2, the index is ready for use by the planner as well
> > and can be marked as "complete" in pg_class. In case of REINDEX, the new
> > index can be made to replace the old one at this point.
>
> AFAICS, the "replace" bit requires exclusive lock to make sure that no
> one is in the midst of using the old index. This means that you have a
> situation where you need to upgrade your table lock at the very end of
> the operation --- which means the whole thing is prone to failing at the
> very end because of deadlock.
This is definitely not something I would want.
Why not make the whole REINDEX just a "macro" doing create/drop/rename?
Or, if this is hard, then let the needy DBAs do all three manually.
The main thing is the ability to add indexes to big tables in a 24/7
OLTP system.
--
Hannu Krosing <hannu(at)skype(dot)net>