Quick Links

Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc:	Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject:	Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date:	2015-10-05 17:10:09
Message-ID:	CAMkU=1zxgAf846Lrjk9z1iBAP3MrwHvr9p-XLvC1-N9_uNoKfw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Mon, Oct 5, 2015 at 1:15 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com>
> > wrote:
> >>
> >>
> >>
> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim(dot)boguk(at)gmail(dot)com>
> wrote:
> >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting
> for
> >> >> something.
> >> >> And query waiting for AccessShareLock on the same relation.
> >> >>
> >> > gdb backtrace from stuck startup process:
> >> > (gdb) bt
> >> > #0 0x00007f04ad862633 in select () from
> /lib/x86_64-linux-gnu/libc.so.6
> >> > #1 0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at
> >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
> >> > #2 0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
> >> > #3 ResolveRecoveryConflictWithVirtualXIDs
> >> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0)
> at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
> >> > #4 ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0,
> >> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
> >> > #5 0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80,
> >> > lsn=107351881751648) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
> >>
> >> This backtrace is not indicating that this process is waiting on a
> >> relation lock, it is resolving a recovery conflict while removing
> tuples,
> >> killing the virtual transaction depending on if
> max_standby_streaming_delay
> >> or max_standby_archive_delay are set if the conflict gets longer. Did
> you
> >> change the default of those parameters, which is 30s, to -1? This would
> mean
> >> that the standby waits indefinitely.
> >
> > While setting it to -1 gives the startup process permission to wait
> > indefinitely for another back-end which is doing something, I don't think
> > that that means it should have permission to deadlock indefinitely. A
> > deadlock is a different kind of thing that "someone started a transaction
> > and then left on vacation for a month without closing it"
>
> Looking at that more closely... We know that a startup process is the
> only process in a standby that can take an exclusive lock on a
> relation, per LockAcquireExtended in lock.c and that this lock is
> taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by
> Maxim indicates that the deadlock is showing up in heap_xlog_clean,
> which may happen in the case of a VACUUM for example. It seems to me
> that we could have a deadlock if some WAL records had their order
> swapped. For example, the EXCLUSIVE lock is replayed first and the
> cleanup of relation pages happens before the lock has been released,
> freezing the other backend using pg_relation_size on a standby waiting
> for the access share lock indefinitely. This would indeed prevent
> recovery conflict to work properly, because we should not hold an
> exclusive lock on the relation while replaying heap_xlog_clean. Hence
> it seems to me that we may have an issue with 68a2e52 (XLogInsert
> multi-slot), and it has been introduced in 9.4.
>

I was assuming that the table being AccessExclusive locked and the table
being VACUUMed were two different tables, which were done in two different
processes on the master, which means that no re-ordering of WAL records
would be necessary to hit this. Although I also haven't yet been able to
reproduce the deadlock under this assumption.

Cheers,

Jeff

In response to

Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. at 2015-10-05 08:15:15 from Michael Paquier

Responses

Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. at 2015-10-06 08:43:54 from Michael Paquier

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Michael Paquier	2015-10-06 08:43:54	Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Previous Message	John R Pierce	2015-10-05 16:49:18	Re: BUG #13663: uuid-ossp in multiple schemas same database