Re: ToDo List Item - System Table Index Clustering

From: "Simone Aiken" <saiken(at)quietlyCompetent(dot)com>
To: "'pgsql-hackers'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo List Item - System Table Index Clustering
Date: 2011-01-20 21:40:09
Message-ID: 00c101cbb8ea$9c96fcf0$d5c4f6d0$@quietlyCompetent.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

After playing with this in benchmarks and researching the weird results I
got I'm going to advise dropping the todo for now unless something happens
to change how postgres handles clustering. You guys probably already
grokked this so I am just recording it for the list archives.

The primary factor here is that postgres doesn't maintain clustered indexes.
Clustering is a one-time operation that clusters the table at this current
point in time. Basically, there really isn't any such thing in postgres as
a clustered index. There is an operation - Cluster - which takes an index
and a table as input and re-orders the table according to the index. But
it is borderline fiction to call the index used "clustered" because the next
row inserted will pop in at the end of the table instead of slipping into
the middle of the table per the desired ordering.

All the pg_table cluster candidates are candidates because they have a row
per table column and we expect that a query will want to get several of
these rows at once. These rows are naturally clustered because the scripts
that create them insert their information into the catalog contiguously.
When you create a catalog table the pg_attribute rows for its columns are
inserted together. When you then create all its triggers they too are put
into pg_triggers one after the other. So calling the Cluster operation
after dbinit doesn't help anything.

Over time table alterations can fragment this information. If a user loads
a bunch of tables, then alters them over time the columns added later on
will have their metadata stored separately from the columns created
originally.

Which gets us to the down and dirty of how the Cluster function works. It
puts an access exclusive lock on the entire table - blocking all attempts to
read and write to the table - creates a copy of the table in the desired
order, drops the original, and renames the copy. Doing this to a catalog
table that is relevant to queries pretty much brings everything else in the
database to a halt while the system table is locked up. And the brute force
logic makes this time consuming even if the table is perfectly ordered
already. Additionally, snapshots taken of the table during the Cluster
operation make the table appear to be empty which introduces the possibility
of system table corruption if transactions are run concurrently with a
Cluster operation.

So basically, the Cluster operation in its current form is not something you
want running automatically on a bunch of system table as it is currently
implemented. It gives your system the hiccups. You would only want to run
it manually during downtime. And you can do that just as easily with or
without any preparation during dbinit.

Thanks everyone,

-Simone Aiken

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-01-20 21:41:45 Re: Large object corruption during 'piped' pg_restore
Previous Message Tom Lane 2011-01-20 21:36:50 Re: Orphaned statements issue