Re: Undetected Deadlock

From: Michael Harris <harmic(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Undetected Deadlock
Date: 2022-01-27 00:20:59
Message-ID: CADofcAXSn_oS3MOxS=1epiXirwETOB_BpBmM8w4=F3GaQe+cwA@mail.gmail.com
Lists: pgsql-general

Hi Alvaro

Thanks for the feedback!

> What version were you using previously?

We were previously on 11.4. Another difference is that we were using
inheritance based partitioning before, whereas now we are using
declarative partitioning.

> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything?

There are no triggers, rules or constraints on the table involved in
the DELETE (either the partition or the table that the partition is
part of).
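
For reference, this is the kind of check I mean (a sketch, using the
partition name from the trace; internal FK triggers would show up here
too):

    SELECT tgname, tgisinternal
    FROM pg_trigger
    WHERE tgrelid = 'st.ctr_table_efr_oa_19010'::regclass;

plus the same against the parent st.ctr_table_efr_oa, and \d+ on both
to rule out rules and constraints.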

Even more confusingly, when I reproduce the SQL statements that should
trigger the deadlock, it does not happen: the DELETE does not attempt
to take an AccessShareLock on the parent table, so it does not
deadlock.

Is there any state associated with a transaction or a database
connection that would alter the locks taken out for a DELETE on a
partition? Or can other concurrent transactions in other processes
somehow cause more locks to be needed?
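
To be concrete, this is roughly how I have been inspecting the locks
the DELETE takes (a sketch, run in an otherwise idle session, using the
statement from the trace):

    BEGIN;
    DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;
    SELECT locktype, relation::regclass, mode, granted
    FROM pg_locks
    WHERE pid = pg_backend_pid();
    ROLLBACK;

Run this way, no AccessShareLock on the parent shows up.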

> If you have seen this several times already, maybe a way to investigate deeper is an
> exhaustive log capture of everything that these transactions do

So far it has happened at least twice. There were a couple of other
incidents that may well also have been caused by this, but not enough
data was collected at the time to be sure.

A bit more detail: of the two processes deadlocked here, one is
ingesting new data while the other is removing old data by dropping
partitions. Even before we shifted to 14.1 and native partitioning, we
did get occasional deadlocks between these two processes which we could
not really prevent, so we adopted a retry approach for when they occur.
However, we never had an undetected deadlock in the database.

Since going to 14.1 and native partitioning, we are seeing much more
frequent deadlocks. The increase in frequency may be related to the
extra lock taken by the DELETE that I mentioned above. However, most of
the time the deadlock is detected and resolved by Postgres and the
impact is minimal. Of course it is another story when it is not
detected!

I have enabled `log_statement=all`, but the undetected deadlock hasn't
happened again since. I can easily reproduce the deadlock itself, but
not the undetected case.
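
In case it helps capture the next occurrence, I am also considering
something like the following (a sketch; these are standard settings,
the values are just what I would start with):

    ALTER SYSTEM SET log_lock_waits = on;     -- log any wait longer than deadlock_timeout
    ALTER SYSTEM SET deadlock_timeout = '1s'; -- the default; detection runs after this wait
    SELECT pg_reload_conf();

If the detector fails to fire again, that should at least log which
lock each backend is stuck waiting on.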

Thanks again.

Cheers
Mike

On Wed, 26 Jan 2022 at 10:11, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2022-Jan-25, Michael Harris wrote:
>
> > We've recently updated our application to PG 14.1, and in the test instance we
> > have started to see some alarming undetected deadlocks.
>
> This is indeed suspicious / worrisome / curious.
>
> What version were you using previously?
>
> I reformatted the result sets:
>
> > An example of what we have seen is:
> >
> > locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+-----------------+---------+----------+-------------------------------+--------------------------------
> > relation | 529986 | 1842228045 | | | | | | | | 165/1941408 | 2130531 | AccessShareLock | f | f | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
> > (1 row)
> >
> > locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-----------+--------------------------------
> > relation | 529986 | 1842228045 | | | | | | | | 75/2193719 | 2128603 | AccessExclusiveLock | t | f | | st.ctr_table_efr_oa
> > (1 row)
> >
> > locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | relation
> > ----------+----------+------------+------+-------+------------+---------------+---------+-------+----------+--------------------+---------+---------------------+---------+----------+-------------------------------+-----------
> > relation | 529986 | 1842231489 | | | | | | | | 75/2193719 | 2128603 | AccessExclusiveLock | f | f | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
> > (1 row)
> >
> > locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart | relation
> > ---------------+----------+------------+------+-------+--------------+---------------+---------+-----------+----------+--------------------+---------+-----------------------+---------+----------+-------------------------------+-----------
> > relation | 529986 | 1842231489 | | | | | | | | 165/1941408 | 2130531 | AccessShareLock | t | f | | st.tpd_oa
> >
> > So:
> > pid 2130531 waits for an AccessShareLock on relation 1842228045, blocked by pid 2128603 which holds an AccessExclusiveLock
> > pid 2128603 waits for an AccessExclusiveLock on relation 1842231489, blocked by pid 2130531 which holds an AccessShareLock
> >
> > The queries being executed by these backends are:
> >
> > pid | query_start | state_change | wait_event_type | wait_event | state | query
> > ---------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------------------------------------------------------------------
> > 2128603 | 2022-01-19 00:32:32.924413+01 | 2022-01-19 00:32:32.924413+01 | Lock | relation | active | DROP TABLE st.tpd_oa_18929
> > 2130531 | 2022-01-19 00:32:32.625706+01 | 2022-01-19 00:32:32.625708+01 | Lock | relation | active | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868
> > (2 rows)
>
> I know of no cases in which we fail to detect a deadlock. Perhaps you
> have indeed hit a bug.
>
> > Note that there were a lot of other processes also waiting on relation
> > 1842231489 - could that be confusing the deadlock detection routine?
>
> It shouldn't.
>
> > I am also confused about the locks which are being taken out by the
> > DELETE query.
>
> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything? If you have seen this
> several times already, maybe a way to investigate deeper is an
> exhaustive log capture of everything that these transactions do, from
> the point they begin until they become blocked (log_statement=all).
>
> Perhaps you need to involve other concurrent transactions in order to
> cause the problem.
>
> --
> Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
> "Tiene valor aquel que admite que es un cobarde" (Fernandel)
