Re: Support for REINDEX CONCURRENTLY

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support for REINDEX CONCURRENTLY
Date: 2013-01-26 22:54:43
Message-ID: CAB7nPqRrY0ab2FetDU71t-JbTLuZRA19yjG6BDkp-Ph3q=wLuw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 27, 2013 at 1:52 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> On 2013-01-25 13:48:50 +0900, Michael Paquier wrote:
> > All the comments are addressed in version 8 attached, except for the
> > hashtable part, which requires some heavy changes.
> >
> > On Thu, Jan 24, 2013 at 3:41 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >
> > > On 2013-01-15 18:16:59 +0900, Michael Paquier wrote:
> > > > The code path used by REINDEX CONCURRENTLY permits creating an
> > > > index in parallel with an existing one, not a completely new
> > > > index. Shouldn't this work for indexes used by exclusion
> > > > constraints also?
> > >
> > > But that fact might save things. I don't immediately see any reason
> > > that adding a
> > > if (!indisvalid)
> > > return;
> > > to check_exclusion_constraint wouldn't be sufficient if there's another
> > > index with an equivalent definition.
> > >
> > Indeed, this might be enough, as this code path cannot be taken for
> > CREATE INDEX CONCURRENTLY and only indexes created concurrently can
> > be invalid. Hence I am adding that in the patch with a comment
> > explaining why.
>
> I don't really know anything about those mechanics, so some input from
> somebody who does would be very much appreciated.
>
> > > > > > + /*
> > > > > > + * Phase 2 of REINDEX CONCURRENTLY
> > > > > > + */
> > > > > > +
> > > > > > + /* Get the first element of concurrent index list */
> > > > > > + lc2 = list_head(concurrentIndexIds);
> > > > > > +
> > > > > > + foreach(lc, indexIds)
> > > > > > + {
> > > > > > + WaitForVirtualLocks(*heapLockTag, ShareLock);
> > > > >
> > > > > Why do we have to do the WaitForVirtualLocks here? Shouldn't we
> > > > > do this once for all relations after each phase? Otherwise the
> > > > > waiting time will really start to hit when you do this on a
> > > > > somewhat busy server.
> > > > >
> > > > Each new index is built and set as ready in a separate single
> > > > transaction, so doesn't it make sense to wait for the parent
> > > > relation each time? It is possible to wait for a parent relation
> > > > only once during this phase, but in this case all the indexes of
> > > > the same relation need to be set as ready in the same transaction.
> > > > So here the choice is either to wait for the same relation
> > > > multiple times, once per index, or to wait once for a parent
> > > > relation but build all the concurrent indexes within the same
> > > > transaction. Choice 1 makes the code clearer and more robust to
> > > > my mind, as phase 2 is done clearly for each index separately.
> > > > Thoughts?
> > >
> > > As far as I understand that code, its purpose is to enforce that
> > > all potential users have an up-to-date definition available. For
> > > that we acquire a lock on all virtualxids of users using that
> > > table, thus waiting for them to finish.
> > > Consider the scenario where you have a workload where most
> > > transactions are fairly long (say 10min) and use the same tables
> > > (a, b) / indexes (a_1, a_2, b_1, b_2). With the current strategy
> > > you will do:
> > >
> > > WaitForVirtualLocks(a_1) -- wait up to 10min
> > > index_build(a_1)
> > > WaitForVirtualLocks(a_2) -- wait up to 10min
> > > index_build(a_2)
> > > ...
> > >
> > > So instead of waiting up to 10 minutes for that phase you have to
> > > wait up to 40.
> > >
> > This is necessary if you want to process each index in a different
> > transaction, as WaitForVirtualLocks needs to wait for the locks held
> > on the parent table. If you want to do this wait once per
> > transaction, the solution would be to group the index builds in the
> > same transaction for all the indexes of the relation. One index per
> > transaction looks more solid in this case: if there is a failure
> > during the process, only one index will be incorrectly built.
>
> I cannot really follow you here.
>
OK, let's be more explicit...

> The reason why we need to wait here is
> *only* to make sure that nobody still has the old list of indexes
> around (which probably could even be relaxed for reindex concurrently,
> but that's a separate optimization).
>
In order to do that, you need to wait for the *parent relations* and not
the indexes themselves, no?
Based on two facts:
- each index build is done in a single transaction
- a wait needs to be done on the parent relation before each transaction
you need to wait for the parent relation multiple times, depending on the
number of indexes it contains. You could optimize that by building all the
indexes of the *same parent relation* in a single transaction.

So, for example, in the case of this table:
CREATE TABLE tab (col1 int PRIMARY KEY, col2 int);
CREATE INDEX ind ON tab (col2);
If the primary key index and the second index on col2 are built in a single
transaction, you would need to wait for the locks on the parent relation
'tab' only once.
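
To make the grouping concrete, here is a rough catalog query (an
illustration only, not code from the patch) listing the set of indexes
that could share a single transaction, and hence a single wait, for one
parent relation:

-- Illustration only: all indexes of one parent relation, i.e. the
-- batch that could be built in one transaction after a single wait.
SELECT i.indrelid::regclass AS parent,
       array_agg(i.indexrelid::regclass) AS indexes_to_rebuild
FROM pg_index i
WHERE i.indrelid = 'tab'::regclass
GROUP BY i.indrelid;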

> So if we wait for all relevant transactions to end before starting phase
> 2 proper, we are fine, independent of how many indexes we build in a
> single transaction.
>
The reason why each index build is done in its own transaction is that
you mentioned in a previous review (v3?) that we should use a single
transaction for *each* index. That looked fair given that the transaction
time for each index is reduced; the downside is that you wait more on the
parent relation.
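
As a side note, the sessions that WaitForVirtualLocks effectively waits
out are roughly those still holding a lock on the parent relation;
something like this rough query (again just an illustration, reusing the
'tab' example above) shows the virtual transactions concerned:

-- Illustration only: sessions other than ours holding a lock on 'tab',
-- i.e. the virtualxids the wait phase has to outlast.
SELECT l.virtualtransaction, l.mode
FROM pg_locks l
WHERE l.relation = 'tab'::regclass
  AND l.pid <> pg_backend_pid();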

> > > Btw, seeing that we have an indisvalid check on the toast table's
> > > index, do we have any way to clean up such a dead index? I don't
> > > think it's allowed to drop the index of a toast table. I.e. we
> > > possibly need to relax that check for invalid indexes :/.
> > >
> > For the time being, no, I don't think so, except by doing a manual
> > cleanup and removing the invalid pg_class entry from the catalogs. One
> > way to do that cleanly could be to have autovacuum remove the invalid
> > toast indexes automatically, but it is not dedicated to that and this
> > is another discussion.
>
> Hm. Don't think that's acceptable :/
>
> As I mentioned somewhere else, I don't see how to do a concurrent build
> of the toast index at all, given there is exactly one index hardcoded in
> tuptoaster.c, so the second index won't get updated before the switch has
> been made.
>
> Haven't yet looked at the new patch - do you plan to provide an updated
> version addressing some of the remaining issues soon? Don't want to
> review this if you nearly have the next version available.
>
Before putting more effort into coding, I think it is better to be clear
about the strategy to use on the two following points:
1) At the index build phase, is it better to build each index in its own
separate transaction, or to group the builds in one transaction per parent
table? This is solvable, but the strategy should be clear.
2) Find a solution for invalid toast indexes, which is not that easy. One
solution could be to have autovacuum clean up the invalid indexes of toast
tables automatically. Another solution is to skip the reindex for toast
indexes, making the feature less usable. (A catalog query like the sketch
below shows what such leftovers look like.)
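
For reference, here is a rough catalog query (an illustration only, not
something from the patch) to spot invalid indexes left behind on toast
tables:

-- Illustration only: invalid indexes sitting on pg_toast relations.
SELECT i.indexrelid::regclass AS invalid_toast_index,
       i.indrelid::regclass AS toast_table
FROM pg_index i
     JOIN pg_class c ON c.oid = i.indrelid
     JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid
      AND n.nspname = 'pg_toast';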

If a solution or an agreement cannot be found for those two points, I think
it would be fair to simply reject the patch.
It looks like this feature still has too many disadvantages compared to the
advantages it could bring with the current infrastructure (SnapshotNow
problems, what to do with invalid toast indexes, etc.), so I would tend to
agree with Tom and postpone this feature until the infrastructure is more
mature, one of the main blockers being the non-MVCC'ed catalogs.
--
Michael Paquier
http://michael.otacoo.com
