Re: create index concurrently - duplicate index to reduce time without an index

From: Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl>
To: <Gareth(dot)Williams(at)csiro(dot)au>
Cc: <greg(at)2ndquadrant(dot)com>, <mabewlun(at)gmail(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: create index concurrently - duplicate index to reduce time without an index
Date: 2010-06-04 09:32:09
Message-ID: A1D7243C-E3CA-47C1-93C8-A7CF1AF3EF5E@solfertje.student.utwente.nl
Lists: pgsql-general

> Thanks Greg, Alban and others,
>
> This has cleared up a misunderstanding I had about why one should reindex. Re-reading the documentation http://www.postgresql.org/docs/8.4/interactive/sql-reindex.html it is clear now that reindexing or recreating an index should not normally be needed - certainly not to keep an index up-to-date. I would have guessed that VACUUM or VACUUM ANALYSE on the table the index is associated with would have been sufficient to reclaim space for a 'bloated' index (maybe only VACUUM FULL would help). In any case we can leave reindexing or full vacuum for outages where we are interrupting service anyway.

VACUUM FULL actually causes index bloat. It rearranges the data in the table so that any gaps get filled, but every row it moves needs new index entries, so the indices on that table grow even as the table itself shrinks.

Normal VACUUM and VACUUM ANALYSE don't have this problem though; they just mark table space that's no longer in use (because the transaction that deleted those rows has committed) as reusable, so that later INSERTs can put their data there. This is a bit of a simplification of what's really going on - for the details, check the documentation.

Autovacuum runs VACUUM ANALYSE in the background, using multiple worker processes in recent versions. You can (and given your use of the database you probably should) tune how often it vacuums tables through several configuration parameters.
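For example (a minimal sketch - the values are illustrative, not recommendations; these parameters exist in 8.3 and later):

    # postgresql.conf
    autovacuum = on
    autovacuum_max_workers = 3              # number of background worker processes
    autovacuum_naptime = 1min               # delay between autovacuum runs per database
    autovacuum_vacuum_scale_factor = 0.1    # vacuum a table after ~10% of its rows change
    autovacuum_analyze_scale_factor = 0.05  # analyse a table after ~5% of its rows change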

Of course, running ANALYSE when you _know_ the data in a table has changed significantly means you don't have to wait for autovacuum to get around to analysing that table.
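For example, with a hypothetical "files" table that just received a bulk load:

    ANALYSE files;          -- refresh planner statistics only
    VACUUM ANALYSE files;   -- also mark dead row space as reusable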

> I was heartened by the responses and tried further testing (if it could not hurt, why not try and see if it could be faster), but ran into a problem. A few times when I was trying to drop an index (before or after creating a duplicate index with 'concurrently'), the dropping of the index stalled. It seems that this was because of an existing connection:
> postgres: rods ICAT 130.102.163.141(58061) idle in transaction
> And new clients block.
>
> Is that expected? Should we look for 'bugs' in our client that seems to be holding a connection?

I'm not exactly sure why that is (I can't look into your database), but my guess is that the index is locked by a transaction. Apparently the transaction you refer to has uncommitted work that depends on the index at some point.

Keeping transactions open for a long time is usually a bad idea.

You saw that you can't drop an index that's in use by a transaction, for example, but autovacuum runs into similar issues - it can't reclaim space until the transaction finishes, because the transaction holds locks on things that autovacuum wants to touch.
That probably means (I'm not sure it works that way, but it seems likely) that an autovacuum worker gets stuck at such a lock and can't continue until the transaction holding it commits or rolls back.
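You can usually spot the culprit in the statistics views (column names as in 8.4 - newer versions renamed some of these):

    -- sessions sitting idle inside an open transaction
    SELECT procpid, usename, xact_start, current_query
      FROM pg_stat_activity
     WHERE current_query = '<IDLE> in transaction'
     ORDER BY xact_start;

    -- who is waiting on a lock that hasn't been granted
    SELECT l.pid, l.mode, c.relname
      FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
     WHERE NOT l.granted;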

> For the application we have, I'm ready to give up on this train of investigation for optimization and just vacuum analyse key tables regularly, and vacuum and maybe reindex more completely during outages - though we aim for outages to be infrequent. The database holds data representing a virtual filesystem structure with millions of files (and associated access controls, and information on underlying storage resources and replication). There is probably not much update or delete of the main data - at least compared with the total holdings and the new data/files which are regularly being added to the system.

In practice VACUUM FULL and REINDEX are used to reclaim disk space. That by itself doesn't look like it would improve performance much, but using less disk space also means that data gets packed more tightly into your disk cache, for example, and REINDEX can mean an index that didn't fit into RAM now does. They're both rather intrusive operations though, so it's a matter of balancing the costs and benefits. Many databases don't need to bother with VACUUM FULL or REINDEX at all.
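During a maintenance window that could look like this (table name made up; both commands take exclusive locks, so don't run them while clients need the table):

    VACUUM FULL files;    -- compact the table; note this bloats its indexes
    REINDEX TABLE files;  -- rebuild all indexes on the table, fixing that bloat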

> Ps. Greg, I don't understand the issue with 'concurrently rebuild (of) an index that enforces a constraint or unique index'. I don't think I care much right at the moment, but I'm generally interested and others might be too. Would you expect the create index to fail, or to cause locking, or just transient performance degradation?

I think what Greg was getting at is that there's a dependency tree between indexes and constraints. A primary key is implemented using a unique index. You can create a new (unique) index on the same columns concurrently, but you can't swap it in for the primary key's index: you're not allowed to drop that index without dropping the PK constraint, and if you have any FK constraints pointing to that table, you can't drop the PK constraint without also dropping those FK constraints.
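For a plain index that doesn't back a constraint, the swap itself is simple (names made up):

    CREATE INDEX CONCURRENTLY files_owner_idx_new ON files (owner_id);
    DROP INDEX files_owner_idx;    -- brief exclusive lock on the table
    ALTER INDEX files_owner_idx_new RENAME TO files_owner_idx;

Try the same DROP INDEX on a primary key's index and PostgreSQL refuses, telling you to drop the constraint instead - which is where the FK cascade above starts.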

Quite a bit of trouble to go through to replace one index.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

