Re: Add parallelism and glibc dependent only options to reindexdb

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Kevin Grittner <kgrittn(at)gmail(dot)com>
Subject: Re: Add parallelism and glibc dependent only options to reindexdb
Date: 2019-07-01 08:54:53
Message-ID: 20190701085453.GA1388@paquier.xyz
Lists: pgsql-hackers

On Sun, Jun 30, 2019 at 11:45:47AM +0200, Julien Rouhaud wrote:
> With glibc 2.28 coming, all users will have to reindex almost
> every index after a glibc upgrade to guarantee the lack of
> corruption. Unfortunately, reindexdb is not ideal for that as it's
> processing everything using a single connection and isn't able to
> discard indexes that don't depend on a glibc collation.

With a schema we work on, on a database of up to 100GB, we have seen
the reindex time drop from 30 minutes to a couple of minutes.
Julien, what were the actual numbers?

> PFA a patchset to add parallelism to reindexdb (reusing the
> infrastructure in vacuumdb with some additions) and an option to
> discard indexes that don't depend on glibc (without any specific
> collation filtering or glibc version detection), with updated
> regression tests. Note that this should be applied on top of the
> existing reindexdb cleanup & refactoring patch
> (https://commitfest.postgresql.org/23/2115/)

Please note that patch 0003 does not seem to apply correctly on HEAD
as of c74d49d4. Here is also a short description of each patch:
- 0001 refactors the connection slot facility from vacuumdb.c into a
new, separate file called parallel.c in src/bin/scripts/. Nothing
really fancy here, as the code is only moved around.
- 0002 extends simple lists so that they can store pointers, with an
interface to append elements to them.
- 0003 is where the actual fancy part begins, with the addition of a
--jobs option to reindexdb. The main point to discuss here is that a
reindex of tables basically never produces conflicts between the
objects manipulated, whereas a reindex of a set of indexes is
trickier: items need to be listed per table so that multiple
connections do not conflict with each other when working on several
indexes of the same table. What this patch does is select the set
of indexes which need to be worked on (see the addition of cell in
ParallelSlot), and then pre-plan the assignment of each item to the
connection slots, so that each connection knows from the beginning
which items it needs to process. This is quite different from
vacuumdb, where a new item from a single list is handed out only
when a connection becomes free. I'd personally prefer that we keep
the facility in parallel.c purely execution-dependent, with no
pre-planning. This would instead require keeping an array of lists
within reindexdb.c, one list per connection, which feels more
natural.
- 0004 is the part where the concurrent additions really matter, as
it applies an extra filter to the selected indexes so that only the
glibc-sensitive indexes are chosen for processing.
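
To make the 0003 discussion above more concrete, here is a rough
Python sketch (not the patch's C code) of grouping indexes per table
and of the two ways those groups could be distributed across
connections. The function names, the (table, index) tuples and the
uniform cost model are all assumptions made for illustration; nothing
here mirrors the actual ParallelSlot structures.

```python
def group_by_table(indexes):
    """Group (table, index) pairs into per-table lists, keeping first-seen order.

    Indexes of the same table must be handled by a single connection so
    that parallel REINDEX commands do not conflict on that table.
    """
    groups = {}
    for table, index in indexes:
        groups.setdefault(table, []).append(index)
    return groups


def preplan(groups, njobs):
    """Pre-planning: assign every per-table group to a slot up front.

    Groups are spread round-robin, so each connection knows its full
    work list before anything runs.
    """
    slots = [[] for _ in range(njobs)]
    for i, (table, idxs) in enumerate(groups.items()):
        slots[i % njobs].extend((table, idx) for idx in idxs)
    return slots


def dispatch_on_free(groups, njobs, cost=lambda table, index: 1):
    """Execution-time dispatch, simulated: hand the next per-table group
    to the slot that becomes free first.

    `cost` is a hypothetical estimate of how long one index takes; a real
    client would simply wait for a connection to become idle.
    """
    slots = [[] for _ in range(njobs)]
    busy_until = [0] * njobs
    for table, idxs in groups.items():
        # Pick the slot that finishes its current work earliest.
        slot = min(range(njobs), key=lambda s: busy_until[s])
        for idx in idxs:
            slots[slot].append((table, idx))
            busy_until[slot] += cost(table, idx)
    return slots


if __name__ == "__main__":
    wanted = [("t1", "i1a"), ("t2", "i2a"), ("t1", "i1b"), ("t3", "i3a")]
    groups = group_by_table(wanted)
    print(preplan(groups, 2))
    print(dispatch_on_free(groups, 2))
```

In both variants all indexes of a given table end up on the same
connection; the difference is only whether the assignment is fixed
before execution or decided as connections become available.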
--
Michael
