ERROR could not access transaction/Could not open file pg_commit_ts

From: Jeremy Finzel <finzelj(at)gmail(dot)com>
To: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: ERROR could not access transaction/Could not open file pg_commit_ts
Date: 2018-03-09 16:43:20
Message-ID: CAMa1XUghTLmd7sbEfiJ2HOVZLfLh7LPZZBR+K6PjHwLOfqUrHQ@mail.gmail.com
Lists: pgsql-general

Hello -

Here is our cluster setup:

cluster_a 9.5.11 Ubuntu 16.04.4 LTS
--> cluster_b (streamer) 9.5.11 Ubuntu 16.04.4 LTS
--> cluster_c (streamer) 9.5.11 Ubuntu 16.04.4 LTS

Very recently, we started seeing these errors when running a query on a
specific table on the streamer:

2018-03-09 08:28:16.280
CST,"uname","foo",18692,"0.0.0.0:0",5aa29292.4904,4,"SELECT",2018-03-09
07:56:34 CST,18/15992,0,ERROR,58P01,"could not access status of
transaction 1035047007","Could not open file ""pg_commit_ts/9A45"": No
such file or directory."
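
As a sanity check on the log line above, here is a minimal sketch of how the transaction ID maps to a pg_commit_ts segment file name. The constants are assumptions not stated in this post: the default 8 kB block size, 10-byte commit timestamp entries, and 32 pages per segment file. The function name is hypothetical.

```python
# Assumptions (not stated in the post): default 8 kB block size, 10-byte
# commit timestamp entries, 32 SLRU pages per segment file.
BLOCK_SIZE = 8192
ENTRY_SIZE = 10          # 8-byte timestamp + 2-byte replication origin id
PAGES_PER_SEGMENT = 32

XACTS_PER_PAGE = BLOCK_SIZE // ENTRY_SIZE                # 819
XACTS_PER_SEGMENT = XACTS_PER_PAGE * PAGES_PER_SEGMENT   # 26208


def commit_ts_segment_name(xid: int) -> str:
    """Return the pg_commit_ts segment file name expected to hold xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


if __name__ == "__main__":
    # Transaction ID taken from the error message above.
    print(commit_ts_segment_name(1035047007))
```

Under those assumptions, transaction 1035047007 falls in segment 9A45, which matches the file named in the error. So the missing file is the one that would hold the commit timestamp for that specific transaction.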

A little history on the cluster:

- The most recent change we made was a point release upgrade from 9.5.5
to 9.5.11 on the master, and 9.5.9 to 9.5.11 for the 2 streamers
- It is a very high WAL traffic reporting system.
- We actually have synchronous_commit set to off. It's possible this
could have bitten us and we are just now seeing issues; however, there have
been no crashes since the table in question was created.
- We have run pg_repack on many tables on this cluster, but not in
over a month.
- We had a similar error involving a missing pg_commit_ts file over a year
ago, after an actual crash. We had serious issues getting the cluster to
start, and had to resort to recreating the missing pg_commit_ts file with
null bytes (IIRC, we had a snapshot of the system which still showed the
file). That worked, but left us questioning what really caused the issue.

The table that is causing the error has been in production and used fine
since 2/15/2018 when it was created. It is fed by pglogical replication (v.
2.1.1 on subscriber) from a system running 9.6.1 and pglogical v. 1.2.1.
The point release upgrade from earlier 9.5 did take place *after* this.

However, we *only* just started seeing errors in the past 12 hours. The
table was autovacuumed on master at 2018-03-08 18:18:15.532137-06, which
was about 3 hours before the first user query errored. However, I saw that
2 hours after the autovac, another user query ran successfully on the
table. Not sure if this is related?

Any insight/ideas would be much appreciated!

Thanks,
Jeremy
