Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs

From: Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Maxim Orlov <orlovmg(at)gmail(dot)com>, Svetlana Derevyanko <s(dot)derevyanko(at)postgrespro(dot)ru>
Subject: Re: pgsql: reindexdb: Add the index-level REINDEX with multiple jobs
Date: 2025-03-07 18:20:34
Message-ID: 202503071820.j25zn3lo4hvn@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers pgsql-hackers

On 2024-Mar-25, Alexander Korotkov wrote:

> reindexdb: Add the index-level REINDEX with multiple jobs
>
> Straight-forward index-level REINDEX is not supported with multiple jobs as
> we cannot control the concurrent processing of multiple indexes depending on
> the same relation. Instead, we dedicate the whole table to certain reindex
> job. Thus, if indexes in the lists belong to different tables, that gives us
> a fair level of parallelism.

I tested this, because of a refactoring suggestion [1] and I find that
it's rather completely broken. I ran this to setup a bunch of tables
that I'd want reindexed in parallel:

create table foo (a int);
insert into foo select * from generate_series(1, (10^7)::numeric);
create index foo1 on foo (a);
create index foo2 on foo (a);
create table bar (a int);
insert into bar select * from generate_series(1, (2 * (10^7))::numeric);
create index bar1 on bar (a);
create index bar2 on bar (a);
create table baz (a int);
insert into baz select * from generate_series(1, (5 * (10^6))::numeric);
create index baz1 on baz (a);
create index baz2 on baz (a);
create index baz3 on baz (a);
create index baz4 on baz (a);

I then run this:
reindexdb -j4 --echo -i foo1 -i foo2 -i bar1 -i bar2 -i baz1 -i baz2 -i baz3 -i baz4 | grep REINDEX

Looking at active processes with psql's \watch during the run, I learn
that what happens is that we process the indexes on baz first, without
any other process in parallel, until we get to the last one of baz
table, and we start processing one from baz and one from foo in
parallel. But when the one in baz is done, we only continue with one
process until the list reaches indexes of 'bar', and we process two in
parallel, and then we ran out of indexes in foo so we complete without
any more paralellism.

This is a waste and surely not what was intended: surely what we want is
that given that we have more parallel jobs available than there are
tables, we would start processing the first index of each table at
roughly the same time, namely right at program start. That would keep
three processes occupied until we ran out of indexes on one table, so
we'd keep two processes occupied, and so on. But this is not what
happens.

Am I misunderstanding something?

[1] https://postgr.es/m/CAEudQApP=u5-9PR_fs1DpZToQNrtTFSP+_fjrOgfi73UkrBXKQ@mail.gmail.com

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Andres Freund 2025-03-07 18:22:56 pgsql: tests: Don't fail due to high default timeout in postmaster/003_
Previous Message Robert Haas 2025-03-07 14:12:34 pgsql: doc: Add missing decimal places to example rowcount.

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-03-07 18:27:30 Re: Add column name to error description
Previous Message Navneet Kumar 2025-03-07 18:09:34 Re: support virtual generated column not null constraint