'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: 'CLUSTER' in one database prevents running it in two others on the same database cluster (PG15.2)
Date: 2023-03-05 21:38:40
Message-ID: ed211b60-2b76-4d98-4284-328916068036@boeringa.demon.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi all,

I am running into a weird issue I haven't encountered before.

This is PostgreSQL 15.2 (Ubuntu 15.2-1.pgdg22.04+1) running on Ubuntu
22.04.2 LTS, with PostGIS (POSTGIS="3.3.2 4975da8" [EXTENSION]
PGSQL="150" GEOS="3.10.2-CAPI-1.16.0" PROJ="8.2.1" LIBXML="2.9.13"
LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)").

Now the PostgreSQL database cluster contains multiple databases. In one
of these, an OpenStreetMap 'Planet' size database, a CLUSTER operation
is running against a very large table of buildings (> 400M records).
This runs fine, and I can see the actual progress in pgAdmin by
inspecting the 'pg_stat_progress_cluster' view. So far everything seems OK.

However, as part of two secondary Python scripts running that do all
kinds of batch operations against spatial tables (generalization and
such) in two other databases on the same database cluster, at the very
end of the processing, these initialize 'CLUSTER' operations as well.
Both of these now fail with a:

"missing chunk number 0 for toast value X in pg_toast_Y"

type error.

After these errors, I see the CLUSTER command listed as well in the
PostgreSQL main log, but the ERROR line mentioned above is listed just
above the actual CLUSTER statement in the log. No other anomaly / error
is visible there. The CLUSTER command just fails though on the two
secondary databases. The time displayed in the log is exactly the same
for these two log lines one after another, also the number between the
square brackets listed immediately after the "CET" of the time are the
same for the same database. The pg_toast_Y table names mentioned in the
errors differ by the way between the two databases.

If I inspect the table that is mentioned in the CLUSTER command in the
log, the very first one that is going to be CLUSTERed in the secondary
databases, the table can be displayed fine in DBeaver, even spatially.
Checking the geometries with PostGIS's 'ST_IsValid' also shows them to
be fine, no errors in the geometries.

Everything else seems OK, database cluster is still up&running, all
tables in all databases can be viewed and accessed in DBeaver and pgAdmin.

In all other rare cases where I have seen the "missing chunk number 0
for toast value X in pg_toast_Y" type error, the generalization
processing already failed somewhere half-way, never at the end when
CLUSTER is initiated. The error therefore seems weird to occur at that
point.

I have now been able to reproduce this twice, with two tiny database's
filled with just Liechtenstein data, while the Planet database is still
in the stage of CLUSTERing the huge building table.

Obviously, it will be interesting to see what happens if the CLUSTER
operation on the Planet database is finished, but this will still take
many hours, especially since other large tables need to be CLUSTERed in
the same database, and I do not want to break off this process right now
after already gotten this far.

I will report back once it is finished, but does anyone have a clue why
this might happen?

The system has plenty of free space by the way (4x2TB NVMe), and the
database runs on a professional workstation with ECC RAM, and used RAM
also doesn't seem an issue.

Marco

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David G. Johnston 2023-03-05 23:18:43 Re: BUG #17822: Some scripts are blocked by Device Guard
Previous Message Howard A. Chou 2023-03-05 18:45:38 found a possible bug, modulus of an integer on a partition table appears to be wrong