Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?

From: Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Fwd: "SELECT COUNT(*) FROM" still causing issues (deadlock) in PostgreSQL 14.3/4?
Date: 2022-07-25 06:04:52
Message-ID: f6175d33-0e5d-480a-6e08-560ec4d25a0e@boeringa.demon.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

To extend on this, two interesting questions that come to mind are:

- Does running SELECT COUNT(*) create WAL?

- Is it potentially conceivable that there is a kind of cross-database
vulnerability *within one and the same PostgreSQL cluster*, where an
issue in one database causes the WAL in another database to no longer
successfully be written to disk during checkpoints? I have never seen
processing errors where PostgreSQL emitted true PostgreSQL errors with
error numbers cause issues like that and affect a second database in the
same cluster, but since no error is generated here, and there might be
some uncatched error, I wonder?

I am especially asking the second question since, although I wrote there
is no edit activity going on potentially generating WAL in the affected
small database, which is true, there *was* processing on Planet sized
data going on in a second database in the same cluster. That certainly
*is* capable of generating 890GB of WAL if nothing is cleaned up during
checkpoints due to checkpoints failing.

Marco

Op 24-7-2022 om 09:55 schreef Marco Boeringa:
> Hi Tom,
>
> Thanks for the response, but please review the exact conditions I
> already mentioned in my previous mails:
>
> - When this issue happens, there is absolutely no other activity going
> on than the three active sessions I mentioned: the autovacuum worker
> with no wait event, and the two "SELECT COUNT(*) FROM <table>" related
> sessions with both a wait event, see the original post. There are no
> other active sessions doing any kind of editing work, no INSERTS,
> UPDATES, DELETES or whatever in this point of the processing that
> could generate WAL.
>
> - Note that this is a custom written geoprocessing workflow with just
> one user on the database, not a public database with hundreds of users
> emitting whatever unknown queries against the database, so I know
> exactly at what point in my processing flow it fails and what goes on
> then.
>
> - The database affected itself is just a few dozen GBs. While I
> appreciate, if I understand PostgreSQL and the concept of WAL good
> enough (I don't consider myself a PostgreSQL expert), that WAL might
> potentially exceed the size of the database when heavy editing is
> going, 890 GB of WAL being written seems like an anomaly given in the
> context of the first points.
>
> - This problem only first reared its head after the issues starting in
> PG14.2 related to SELECT COUNT(*)
>
> So, does your suggested option (b) still make sense in this context?
>
> If not, and we assume this is a bug needing reporting, what exact
> information will you guys need to pinpoint the issue besides the
> information already given? What is the best course of action? I have
> never before reported a bug for PostgreSQL, so I am slightly at loss
> as to what exact information you will need. E.g., besides your
> suggestion of activating 'log_checkpoints', what other suggestions for
> specific logging?
>
> I fully appreciate the main answer will be to submit the typical
> "smallest reproducible case", but that will be extremely hard in this
> particular case, as the geoprocessing workflow processing
> OpenStreetMap data goes through a whole chain of largely
> auto-generated SQL statements (based on settings in the input tool),
> that are nearly impossible to share. Although it is also again
> questionable if it is actually relevant, as the point where it fails
> only has the mentioned sessions and single SELECT COUNT(*) SQL
> statement going on. The issues is intermittent as well, so there
> wouldn't be guarantees it would reproduce on the first try, even if I
> could share it.
>
> I also appreciate I might need to hire an expert for some remote
> debugging, but before going that way, I appreciate some more insights.
>
> Marco
>
> Op 23-7-2022 om 17:33 schreef Tom Lane:
>> Marco Boeringa <marco(at)boeringa(dot)demon(dot)nl> writes:
>>> Ok, I found it out using 'sudo baobab'. It is the:
>>> 'var/lib/postgresql/14/main/pg_wal'
>>> folder that is filled up with 890 GB of data... causing the file system
>>> root to run out of space and Ubuntu opening the disk usage analyzer and
>>> a warning as a consequence.
>> The most likely explanations for this are
>> (a) misconfiguration of WAL archiving, so that the server thinks
>> it should keep WAL files till they've been archived, only that
>> never happens.
>> (b) inability to complete checkpoints for some reason, preventing
>> WAL files from being recycled.
>>
>> It doesn't look like you have wal_archiving on, so (a) *should*
>> be ruled out, but you never know.  If there are a ton of "nnn.ready"
>> files underneath pg_wal then trouble here would be indicated.
>>
>> As for (b), you might try enabling log_checkpoints and seeing if
>> the log messages give any clue.
>>
>>             regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Kyotaro Horiguchi 2022-07-25 08:11:32 Re: could not link file in wal restore lines
Previous Message Kyotaro Horiguchi 2022-07-25 02:36:08 Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words