From: | Jeremy Schneider <schnjere(at)amazon(dot)com> |
---|---|
To: | <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #14820: Standby crash with "could not access status of transaction" (for track_commit_timestamp) |
Date: | 2018-06-06 23:27:08 |
Message-ID: | d96ae60a-7d40-c9b4-719d-fbbb61622606@amazon.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On 5/23/18 14:20, Jeremy Schneider wrote:
> On 9/20/17 02:26, matioli(dot)matheus(at)gmail(dot)com wrote:
>> I have checked and the file `pg_commit_ts/E7FB` is on the data directory
>> with 229376, so looks like it is trying to read the file past its end, hence
>> the above error.
>
> I recently saw a problem that looked very similar to this, seems worth
> reporting on the same thread.
>
> PostgreSQL version: 10.3
> Operating system: Linux
>
> On startup, while attempting recovery, PostgreSQL failed to startup with
> the following error:
>
> FATAL: could not access status of transaction 68544569
> DETAIL: Could not read from file "pg_commit_ts/0A37" at offset 106496:
> Success.
> CONTEXT: WAL redo at 70E/E5FE2030 for Transaction/COMMIT: 2018-xx-xx
> xx:xx:xx.xxxxxx-xx; subxacts: 68544570
> LOG: startup process (PID 100292) exited with exit code 1
> LOG: aborting startup due to startup process failure
> LOG: database system is shut down
>
> This is on a primary, not a standby. track_commit_timestamp is on,
> wal_level is replica and there are no shared_preload_libraries.
>
> The pg_commit_ts directory was actually empty when recovery started and
> in the process of recovery PostgreSQL recreated the 0A37 file. However
> it consistently created a 106496 byte file (entirely filled with
> 0/nulls) and then threw the error above when trying to read beyond the
> end. If we removed the 0A37 file and restarted then it would again
> recreate the 106496 byte file and error out in the same way.
>
> Here's the WAL it was trying to apply at failure, with preceding and
> following entries:
>
> rmgr: Heap len (rec/tot): 71/ 71, tx: 68544569, lsn:
> 70E/E5FE1FD0, prev 70E/E5FE1F98, desc: HOT_UPDATE off 2 xmax 68544569 ;
> new off 10 xmax 0, blkref #0: rel 1663/3850018/16485 blk 295474
> rmgr: Transaction len (rec/tot): 46/ 46, tx: 68544569, lsn:
> 70E/E5FE2030, prev 70E/E5FE1FD0, desc: COMMIT 2018-xx-xx xx:xx:xx.xxxxxx
> UTC; subxacts: 68544570
> rmgr: Heap len (rec/tot): 54/ 54, tx: 68539960, lsn:
> 70E/E5FE2060, prev 70E/E5FE2030, desc: LOCK off 12: xid 68539960: flags
> 0 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/11038913/16866 blk 6135
>
> Unfortunately the WAL files from the time of the issue are no longer
> available, so I'm unable to go back and do any additional analysis of
> the original crash. (Like looking more at those transactions.) But I
> wanted to at least get a record over here of what happened.
I was able to get a little more information about this.
We can see that:
* a vacuum was running on TABLE1 at the time of the problem
* there were many transactions containing sub-transactions (inserting
into TABLE1) at the time of the problem
*** vacuum activity was seen on TABLE1 concurrently for quite some time
before the problem
* nearly all DB activity including a stream of inserts into TABLE2
stopped abruptly at TIME1
* at TIME1 there were 9 xacts in flight. 3 of these aborted around this
time.
* after TIME1, one new transaction started, inserting into TABLE3 (with
subxact in TABLE1). that xact committed at 06:04:04.015169 UTC.
* the vacuum on TABLE1 continue just past the new transaction commit. it
was the last observed activity in the DB before crash.
* underlying I/O system looks healthy during this entire period.
Not enough information to get a root cause, but I wanted to record it in
case someone hits something similar in the future.
-Jeremy
--
Jeremy Schneider
Database Engineer
Amazon Web Services
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Munro | 2018-06-07 01:00:22 | Re: BUG #15225: [XX000] ERROR: invalid DSA memory alloc request size 1073741824 / Where: parallel worker |
Previous Message | Alvaro Herrera | 2018-06-06 18:52:52 | Re: BUG #15221: Could not open file "pg_subtrans/0013" |