Re: Assertion failure on hot standby

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Assertion failure on hot standby
Date: 2010-11-25 07:59:48
Message-ID: AANLkTikGGKPFPBGSVxF2C2oPsMcopchZjUjLU129vrg8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 1:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Hi,
>
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg01303.php
>
> When I did unusual operations (e.g., suspend bgwriter by gdb,
> pgbench -i and issue txid_current many times) on the master
> in order to try to reproduce the above HS error, I encountered
> the following assertion error.
>
> Since I compiled the standby postgres with WAL_DEBUG and
> ran it with wal_debug = on, all the replayed WAL records were
> logged.
>
> ------------------------
> sby LOG:  REDO @ 0/134C0490; LSN 0/134C04D0: prev 0/134C0450; xid
> 23253; len 32: Transaction - commit: 2010-11-24 12:15:02.315634+09
> sby LOG:  REDO @ 0/134C04D0; LSN 0/134C0510: prev 0/134C0490; xid
> 23254; len 32: Transaction - commit: 2010-11-24 12:15:02.325252+09
> sby LOG:  consistent recovery state reached at 0/134C0510
> sby LOG:  REDO @ 0/134C0510; LSN 0/134C0550: prev 0/134C04D0; xid
> 23255; len 32: Transaction - commit: 2010-11-24 12:15:09.224343+09
> sby LOG:  REDO @ 0/134C0550; LSN 0/134C0580: prev 0/134C0510; xid 0;
> len 16: Standby - AccessExclusive locks: xid 0 db 11910 rel 16409
> sby LOG:  REDO @ 0/134C0580; LSN 0/134C05B8: prev 0/134C0550; xid 0;
> len 20: Standby -  running xacts: nextXid 23256 latestCompletedXid
> 23255 oldestRunningXid 23256
> TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File:
> "twophase.c", Line: 1209)
> sby LOG:  database system is ready to accept read only connections
> sby LOG:  startup process (PID 32666) was terminated by signal 6: Aborted
> sby LOG:  terminating any other active server processes
> ------------------------
>
> Does anyone know what the cause of the problem is?

I was able to reproduce this problem. This happens because CHECKPOINT
can write the WAL record indicating that the transaction with XID = 0 has taken
the AccessExclusive lock. This WAL record causes that assertion failure in the
standby.

Here is the procedure to reproduce the problem:

-------------------
1. Execute "DROP TABLE" and suspend the execution before calling

RemoveRelations -> LockRelationOid -> LockAcquire ->
LockAcquireExtended -> LogAccessExclusiveLock

by, for example, using gdb.

2. While "DROP TABLE" is being suspended, execute CHECKPOINT.
This CHECKPOINT will generate the above-mentioned WAL record.
-------------------

To solve the problem, ISTM that XID should be assigned before the information
about AccessExclusive lock becomes visible to another process. Or CHECKPOINT
(i.e., GetRunningTransactionLocks) should ignore the locks with XID = 0.

Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shigeru HANADA 2010-11-25 08:12:44 SQL/MED - file_fdw
Previous Message Itagaki Takahiro 2010-11-25 07:52:51 Re: format() with embedded to_char() formatter