Fwd: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Fwd: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
Date: 2023-03-06 06:46:49
Message-ID: 7c34313b-d285-7df9-0561-540fe8e1b2d0@boeringa.demon.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Well, I can now confirm that if the CLUSTER operation on the 'Planet'
size database is finished, the CLUSTER on the secondary databases runs
fine, no "missing chunk number 0" error.

I wonder if there is a relation with the currently discussed issue here,
as I see in the PostgreSQL documentation, that CLUSTER also requires an
ACCESS EXCLUSIVE lock, that is also referred in this thread?:

https://www.postgresql.org/message-id/flat/17812-206eaec1d133a94a%40postgresql.org

Is my issue possibly the "concrete use case", that David J. is referring
to?:

"I agree this does seem like a poor risk/reward on the fixing side,
especially absent a concrete live use case problem. I am curious what led
to this discovery."

Marco

-------- Doorgestuurd bericht --------
Onderwerp: 'CLUSTER' in one database prevents running it in two others
on the same database cluster (PG15.2)
Datum: Sun, 5 Mar 2023 22:38:40 +0100
Van: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
Aan: pgsql-bugs(at)lists(dot)postgresql(dot)org

Hi all,

I am running into a weird issue I haven't encountered before.

This is PostgreSQL 15.2 (Ubuntu 15.2-1.pgdg22.04+1) running on Ubuntu
22.04.2 LTS, with PostGIS (POSTGIS="3.3.2 4975da8" [EXTENSION]
PGSQL="150" GEOS="3.10.2-CAPI-1.16.0" PROJ="8.2.1" LIBXML="2.9.13"
LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)").

Now the PostgreSQL database cluster contains multiple databases. In one
of these, an OpenStreetMap 'Planet' size database, a CLUSTER operation
is running against a very large table of buildings (> 400M records).
This runs fine, and I can see the actual progress in pgAdmin by
inspecting the 'pg_stat_progress_cluster' view. So far everything seems OK.

However, as part of two secondary Python scripts running that do all
kinds of batch operations against spatial tables (generalization and
such) in two other databases on the same database cluster, at the very
end of the processing, these initialize 'CLUSTER' operations as well.
Both of these now fail with a:

"missing chunk number 0 for toast value X in pg_toast_Y"

type error.

After these errors, I see the CLUSTER command listed as well in the
PostgreSQL main log, but the ERROR line mentioned above is listed just
above the actual CLUSTER statement in the log. No other anomaly / error
is visible there. The CLUSTER command just fails though on the two
secondary databases. The time displayed in the log is exactly the same
for these two log lines one after another, also the number between the
square brackets listed immediately after the "CET" of the time are the
same for the same database. The pg_toast_Y table names mentioned in the
errors differ by the way between the two databases.

If I inspect the table that is mentioned in the CLUSTER command in the
log, the very first one that is going to be CLUSTERed in the secondary
databases, the table can be displayed fine in DBeaver, even spatially.
Checking the geometries with PostGIS's 'ST_IsValid' also shows them to
be fine, no errors in the geometries.

Everything else seems OK, database cluster is still up&running, all
tables in all databases can be viewed and accessed in DBeaver and pgAdmin.

In all other rare cases where I have seen the "missing chunk number 0
for toast value X in pg_toast_Y" type error, the generalization
processing already failed somewhere half-way, never at the end when
CLUSTER is initiated. The error therefore seems weird to occur at that
point.

I have now been able to reproduce this twice, with two tiny database's
filled with just Liechtenstein data, while the Planet database is still
in the stage of CLUSTERing the huge building table.

Obviously, it will be interesting to see what happens if the CLUSTER
operation on the Planet database is finished, but this will still take
many hours, especially since other large tables need to be CLUSTERed in
the same database, and I do not want to break off this process right now
after already gotten this far.

I will report back once it is finished, but does anyone have a clue why
this might happen?

The system has plenty of free space by the way (4x2TB NVMe), and the
database runs on a professional workstation with ECC RAM, and used RAM
also doesn't seem an issue.

Marco

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2023-03-06 07:26:42 BUG #17823: Generated columns not always updated correctly
Previous Message Alexander Lakhin 2023-03-06 06:00:01 Re: BUG #17116: Assert failed in SerialSetActiveSerXmin() on commit of parallelized serializable transaction