From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Improving REINDEX for system indexes (long)
Date: 2003-09-21 18:55:01
Message-ID: 13303.1064170501@sss.pgh.pa.us
Lists: pgsql-committers pgsql-hackers pgsql-patches
I've been looking at the issues involved in reindexing system tables,
and I now have what I think is a fairly defensible set of proposals.
We should whenever possible use the same reindexing technique used by
CLUSTER: assign a new relfilenode number, build the new index in that
file, and apply an ordinary heap_update operation to update the index's
pg_class row with the new relfilenode value. This technique is fully
crash-safe and transaction-safe (meaning it can be rolled back even after
completion, in case of failure later in the same transaction). However,
there are several pitfalls for applying it to system indexes:
1. There is a problem for a shared index, because we have no way to
update pg_class.relfilenode in all the other databases besides ours.
I see no good way around this problem.
2. There is a problem for a nailed-in-cache index, because the relcache
isn't prepared to cope with relfilenode updates for such indexes.
However, that is fixable.
3. There is a problem for an index on pg_class itself: doing heap_update
on a pg_class row must make new index entries. We have to be careful
that the new row does appear in the updated index, while not making
entries in old-and-possibly-broken indexes. This is doable.
4. There is a problem with updating indexes that might be used for catalog
fetches executed during the index_build process: if we try to use the
partially-rebuilt index for such a fetch we will fail. In the case of
disaster recovery the problem doesn't exist because we'll have
IsIgnoringSystemIndexes true, but for an ordinary on-line indexing
operation there's definitely a risk. Presently the code relies on
"deactivating indexes" to prevent this, but I think it can be done more
simply, because we only need to suppress access to the target index
locally in the reindexing backend. (Other backends will be locked out by
means of an AccessExclusiveLock on the system catalog in question.)
In short then, we can fix things so that the new-relfilenode path can be
taken in all cases except shared indexes. For shared indexes, we have
little choice but to truncate and rebuild the index in-place. (There is
then no need to update its pg_class row at all.) This is of course
horribly crash-unsafe.
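The difference in crash behavior between the two paths can be illustrated with a toy model (this is not PostgreSQL code; every name below is invented for illustration). The swap path never touches the old file until the new one is complete, so a failed build leaves the index intact; the in-place path truncates first, so a failure leaves it broken:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy storage: an "index" is a filenode slot holding either valid
 * data, garbage, or nothing. */
enum { NFILES = 4 };
static char files[NFILES][16];
static int  next_filenode = 1;          /* filenode 0 is the original */

/* CLUSTER-style reindex: build under a fresh filenode, and flip the
 * catalog pointer (the transactional pg_class update) only if the
 * build succeeds. */
static bool reindex_swap(int *relfilenode, bool build_succeeds)
{
    int newnode = next_filenode++;
    if (!build_succeeds)
        return false;                   /* old file left untouched */
    strcpy(files[newnode], "valid");
    *relfilenode = newnode;             /* atomic from the index's view */
    return true;
}

/* In-place reindex: truncate first, then rebuild.  A crash midway
 * leaves a broken index behind. */
static bool reindex_inplace(int *relfilenode, bool build_succeeds)
{
    strcpy(files[*relfilenode], "empty");   /* truncation up front */
    if (!build_succeeds)
        return false;                       /* index is now broken */
    strcpy(files[*relfilenode], "valid");
    return true;
}
```

A failed build under the swap model leaves the catalog still pointing at the old, valid file; the same failure under the in-place model leaves an empty file where the index used to be.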
The code presently uses a "deactivated indexes" flag (namely, setting
pg_class.relhasindex false) to provide some protection against a crash
midway through an in-place reindex. However, this concept is really quite
useless for a shared index, because only one database could contain the
deactivation flag. Accesses from any other database would still try to
use the broken index.
Based on this analysis, I feel that the "deactivated indexes" mechanism
is of no further value and should be retired. We should instead simply
acknowledge that reindexing shared indexes is dangerous. I propose that
it be allowed only from a standalone backend. Specifically I think that
REINDEX INDEX and REINDEX TABLE should error out if pointed to a shared
index or table, unless running in a standalone backend; REINDEX DATABASE
should skip over the shared indexes unless running in a standalone
backend. (I'm not convinced that either -O or -P needs to be insisted on
by REINDEX, btw, but we can discuss that separately.) We'll have to warn
users not to try to restart the database when reindexing of a shared table
wasn't successfully completed.
Details to back up the above claims:
Part of the code needed to fix the relcache restriction on nailed indexes
is already there, but ifdef'd out; that's the part that re-reads the
index's pg_class row after receiving SI inval for it. There are some
cases not yet handled though. In the first place, if the nailed index
being modified is pg_class_oid_index, ScanPgRelation will try to use that
same index to load the updated row, and will fail because it'll be trying
to use the old relfilenode. We can easily force a seqscan in that case,
however. A more subtle problem is that if an SI buffer overflow occurs,
RelationCacheInvalidate walks through the relation cache in a random
order. I believe that the correct update order has to be first
pg_class_oid_index, then the other nailed indexes, and finally all other
relation cache entries. pg_class_oid_index has to be updated first (with
the aforementioned seqscan), since it will be used by ScanPgRelation to
reread the pg_class rows for the other nailed indexes. Only when we are
sure the nailed indexes are up-to-date can we safely rebuild other
relcache entries.
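The required ordering for the SI-overflow rebuild pass can be sketched as a ranked walk instead of a random hash-table walk. This is a toy model: only pg_class_oid_index is a real name, and the ranking function is invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Rank a relcache entry for the rebuild pass. */
static int rebuild_rank(const char *relname, bool is_nailed_index)
{
    if (strcmp(relname, "pg_class_oid_index") == 0)
        return 0;       /* first: ScanPgRelation needs it (via seqscan) */
    if (is_nailed_index)
        return 1;       /* then the other nailed indexes */
    return 2;           /* only then everything else */
}

/* Rebuild rank-0 entries, then rank-1, then rank-2, rather than
 * visiting cache entries in whatever order the hash table yields. */
static void ordered_rebuild(const char **names, const bool *nailed,
                            int n, const char **out)
{
    int k = 0;
    for (int rank = 0; rank <= 2; rank++)
        for (int i = 0; i < n; i++)
            if (rebuild_rank(names[i], nailed[i]) == rank)
                out[k++] = names[i];
}
```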
Assigning a new relfilenode to indexes of pg_class is not a big deal when
we don't have an index-corruption problem. We simply have to make the new
heap_updated row before we start index_build (which indeed the code does
already); both old and new rows will be indexed in the new index, and all
is well. However, if we suspect index corruption then it is important not
to try to make entries into the old indexes, because we might crash trying
to update a corrupt index. I propose the following behavior:

    REINDEX INDEX: no special action; this is not the thing to use
    when index corruption is suspected anyway.

    REINDEX TABLE: while operating on pg_class, arrange for the
    relcache's list of known indexes of pg_class to contain only the
    indexes already reindexed. In this way, only those indexes will
    be updated by CatalogUpdateIndexes(), and so no untrustworthy
    indexes will be touched while making new pg_class rows.

    REINDEX DATABASE: take care to process pg_class first.
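The REINDEX TABLE arrangement above amounts to growing a trusted-index list as each rebuild finishes. A toy model (invented names, not the actual relcache machinery):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
enum { MAX_INDEXES = 8 };

/* The relcache's list of "known" pg_class indexes starts empty and
 * grows as each index is successfully rebuilt, so catalog updates
 * only ever write into indexes that are already trustworthy. */
static Oid trusted[MAX_INDEXES];
static int ntrusted = 0;

static void mark_index_rebuilt(Oid indexOid)
{
    trusted[ntrusted++] = indexOid;
}

/* Stand-in for CatalogUpdateIndexes(): would an update of a pg_class
 * row make an entry in the given index? */
static bool would_update_index(Oid indexOid)
{
    for (int i = 0; i < ntrusted; i++)
        if (trusted[i] == indexOid)
            return true;
    return false;
}
```

Before any index is rebuilt the list is empty, so a heap_update of a pg_class row touches no (possibly corrupt) index at all.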
To avoid catalog accesses via an index that's being rebuilt, we can simply
generalize the SetReindexProcessing() state to include the OID of the
index currently being rebuilt. Places that presently make checks for
IsIgnoringSystemIndexes can check this as well and fall back to seqscans
when the targeted index would have been used.
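A minimal sketch of what the generalized state might look like (hypothetical signatures, loosely modeled on the existing SetReindexProcessing() flag):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Remember which index is currently being rebuilt in this backend. */
static Oid currentlyReindexedIndex = InvalidOid;

static void SetReindexProcessing(Oid indexOid)
{
    currentlyReindexedIndex = indexOid;
}

static void ResetReindexProcessing(void)
{
    currentlyReindexedIndex = InvalidOid;
}

/* Catalog-fetch sites ask this before choosing an indexscan; a true
 * answer means "fall back to a seqscan for this fetch". */
static bool ReindexIsProcessingIndex(Oid indexOid)
{
    return indexOid == currentlyReindexedIndex;
}
```

Because the state is backend-local, other backends are unaffected; they are held off by the AccessExclusiveLock on the catalog instead.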
One other point: IsIgnoringSystemIndexes seems presently to be taken to
mean not only that we don't *read* the system indexes, but that we don't
*write* them either. I think this is horribly dangerous; it effectively
means that the only thing you can safely do in a backend started with -P
is REINDEX. If you make any other changes to the system catalogs then
you've re-corrupted the indexes. Also, you don't have any protection
against duplicate entries getting made, since the unique indexes are
disabled. It would be a lot safer to define IsIgnoringSystemIndexes as
preventing the system indexes from being used for lookups, while still
updating the indexes on any catalog modification. This will not affect
the safety of REINDEX and will make other operations much less prone to
pilot error.
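The proposed redefinition separates the read side from the write side; as a sketch (hypothetical predicate names, not actual backend code):

```c
#include <assert.h>
#include <stdbool.h>

/* Under the proposal, -P suppresses index *reads* only. */
static bool ignore_system_indexes = false;

/* Lookup sites: may this catalog fetch scan via a system index? */
static bool may_use_index_for_lookup(void)
{
    return !ignore_system_indexes;
}

/* Update sites: must a modified catalog row get index entries?
 * Always yes -- so unique-index enforcement stays active even in a
 * backend started with -P. */
static bool must_update_indexes(void)
{
    return true;
}
```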
Comments?
regards, tom lane