Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp)
Date: 2018-05-23 21:20:23
Message-ID: 4744cb5b-8962-8f10-f729-7cfeba807fcb@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 9/20/17 02:26, matioli(dot)matheus(at)gmail(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 14820
> ...
> 2017-09-20 03:46:36.598 -03 [50892]: [3-1] LOG: entering standby mode
> 2017-09-20 03:46:36.607 -03 [50892]: [4-1] LOG: redo starts at
> 23AF/47E40620
> 2017-09-20 03:46:36.621 -03 [50892]: [5-1] FATAL: could not access
> status of transaction 1556437428
> 2017-09-20 03:46:36.621 -03 [50892]: [6-1] DETAIL: Could not read from
> file "pg_commit_ts/E7FB" at offset 229376: Success.
> 2017-09-20 03:46:36.621 -03 [50892]: [7-1] CONTEXT: xlog redo at
> 23AF/481CE818 for Transaction/COMMIT: 2017-09-20 03:17:18.239579-03
> 2017-09-20 03:46:36.623 -03 [50889]: [3-1] LOG: startup process (PID
> 50892) exited with exit code 1
> 2017-09-20 03:46:36.623 -03 [50889]: [4-1] LOG: terminating any other
> active server processes
> 2017-09-20 03:46:36.721 -03 [50889]: [5-1] LOG: database system is shut
> down
>
> Now any attempt to start any of the standby servers gives the above errors
> (the transaction and offset are exact the same). Do you think it may be a
> bug? Notice that I loaded `pglogical`, which I don't think is the one
> causing trouble, but it may be an important information. It seems to be
> something related to `track_commit_timestamp` or using
> `wal_level='logical'`, but I'm not sure how to debug it.
>
> I have checked and the file `pg_commit_ts/E7FB` is on the data directory
> with 229376, so looks like it is trying to read the file past its end, hence
> the above error.

I recently saw a problem that looked very similar to this, seems worth
reporting on the same thread.

PostgreSQL version: 10.3
Operating system: Linux

On startup, while attempting recovery, PostgreSQL failed to startup with
the following error:

FATAL: could not access status of transaction 68544569
DETAIL: Could not read from file "pg_commit_ts/0A37" at offset 106496:
Success.
CONTEXT: WAL redo at 70E/E5FE2030 for Transaction/COMMIT: 2018-xx-xx
xx:xx:xx.xxxxxx-xx; subxacts: 68544570
LOG: startup process (PID 100292) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down

This is on a primary, not a standby.  track_commit_timestamp is on,
wal_level is replica and there are no shared_preload_libraries.

The pg_commit_ts directory was actually empty when recovery started and
in the process of recovery PostgreSQL recreated the 0A37 file.  However
it consistently created a 106496 byte file (entirely filled with
0/nulls) and then threw the error above when trying to read beyond the
end.  If we removed the 0A37 file and restarted then it would again
recreate the 106496 byte file and error out in the same way.

Here's the WAL it was trying to apply at failure, with preceding and
following entries:

rmgr: Heap        len (rec/tot):     71/    71, tx:   68544569, lsn:
70E/E5FE1FD0, prev 70E/E5FE1F98, desc: HOT_UPDATE off 2 xmax 68544569 ;
new off 10 xmax 0, blkref #0: rel 1663/3850018/16485 blk 295474
rmgr: Transaction len (rec/tot):     46/    46, tx:   68544569, lsn:
70E/E5FE2030, prev 70E/E5FE1FD0, desc: COMMIT 2018-xx-xx xx:xx:xx.xxxxxx
UTC; subxacts: 68544570
rmgr: Heap        len (rec/tot):     54/    54, tx:   68539960, lsn:
70E/E5FE2060, prev 70E/E5FE2030, desc: LOCK off 12: xid 68539960: flags
0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/11038913/16866 blk 6135

Unfortunately the WAL files from the time of the issue are no longer
available, so I'm unable to go back and do any additional analysis of
the original crash.  (Like looking more at those transactions.)  But I
wanted to at least get a record over here of what happened.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message shao bret 2018-05-24 01:41:32 答复: BUG #15207: In crash recovery, SyncDataDirectory() will fall into recursion trap
Previous Message Tom Lane 2018-05-23 20:39:47 Re: Primary key error in INFORMATION_SCHEMA views