Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Mike Yeap <wkk1020(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory
Date: 2021-06-23 19:01:04
Message-ID: CAM+6J95HRkk+kaD=nuDFJHSvedu708jbTXHnHY5BEDyxR5i4Wg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, 22 Jun 2021 at 14:34, Vijaykumar Jain <
vijaykumarjain(dot)github(at)gmail(dot)com> wrote:

>
> On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020(at)gmail(dot)com> wrote:
>
>> Hi all,
>>
>> I have a Postgres version 11.11 configured with both physical replication
>> slots (for repmgr) as well as some logical replication slots (for AWS
>> Database Migration Service (DMS)). This morning, the server went panic with
>> the following messages found in the log file:
>>
>> 2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]"
>> user_name=dms database=** host(port)=**(48360)] PANIC: could not open file
>> "pg_logical/snapshots/969-FD606138.snap": Operation not permitted
>>
>> 2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database=
>> host(port)=] LOG: server process (PID 19457) was terminated by signal 6:
>> Aborted
>>
>> 2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database=
>> host(port)=] LOG: terminating any other active server processes
>>
>
>
I just tried the below case, when a running logical replication is denied
access to the snapshot folder via chattr +i to make the pg crash at
publisher.
Hence I was speculating, something was occured at the filesystem, this may
not be related, but just trying to get a scenario to simulate a repeatable
crash.

A (port 5001 publisher) -> logical replication all tables -> B(port 5002
subscriber), all working fine.

postgres(at)db:~/playground/logical_replication$ psql -p 5001 -c 'select
count(1) from t;'
count
-------
1000
(1 row)

postgres(at)db:~/playground/logical_replication$ psql -p 5002 -c 'select
count(1) from t;'
count
-------
1000
(1 row)

now i change file attributes of pg_logical folder at the publisher

root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical# cd
/home/postgres/playground/logical_replication/db1/pg_logical
root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint
root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical#
chattr -R +i * # do not allow mod
root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
----i---------e----- ./mappings
----i---------e----- ./snapshots
----i---------e----- ./replorigin_checkpoint

psql -p 5001 -c 'delete from t; checkpoint;' # and crash

2021-06-24 00:22:36.998 IST [2899] LOG: could not remove file
"pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG: could not remove file
"pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG: could not remove file
"pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:36.998 IST [2899] LOG: could not remove file
"pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.003 IST [2899] PANIC: could not open file
"pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.092 IST [2897] LOG: checkpointer process (PID 2899)
was terminated by signal 6: Aborted
2021-06-24 00:22:37.092 IST [2897] LOG: terminating any other active
server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
2021-06-24 00:22:37.093 IST [2897] LOG: all server processes terminated;
reinitializing
postgres(at)db:~/playground/logical_replication$ 2021-06-24 00:22:37.100 IST
[2920] LOG: database system was interrupted; last known up at 2021-06-24
00:20:54 IST
2021-06-24 00:22:37.100 IST [2921] FATAL: the database system is in
recovery mode
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/snapshots/0-16EF1E0.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.210 IST [2920] LOG: could not open file
"./pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.212 IST [2920] LOG: database system was not properly
shut down; automatic recovery in progress
2021-06-24 00:22:37.214 IST [2920] LOG: redo starts at 0/16E72B8
2021-06-24 00:22:37.214 IST [2920] LOG: invalid record length at
0/16EF218: wanted 24, got 0
2021-06-24 00:22:37.214 IST [2920] LOG: redo done at 0/16EF1E0 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-06-24 00:22:37.217 IST [2920] LOG: could not remove file
"pg_logical/snapshots/0-16CE9D8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG: could not remove file
"pg_logical/snapshots/0-16E72B8.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG: could not remove file
"pg_logical/snapshots/0-16CE9A0.snap": Operation not permitted
2021-06-24 00:22:37.217 IST [2920] LOG: could not remove file
"pg_logical/snapshots/0-16E7398.snap": Operation not permitted
2021-06-24 00:22:37.219 IST [2920] PANIC: could not open file
"pg_logical/replorigin_checkpoint": Operation not permitted
2021-06-24 00:22:37.309 IST [2897] LOG: startup process (PID 2920) was
terminated by signal 6: Aborted
2021-06-24 00:22:37.309 IST [2897] LOG: aborting startup due to startup
process failure
2021-06-24 00:22:37.310 IST [2897] LOG: database system is shut down

*************************
now i restore the attr back for pg_logical and restart the db

root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical#
chattr -R -i *
root(at)db:/home/postgres/playground/logical_replication/db1/pg_logical# lsattr
--------------e----- ./mappings
--------------e----- ./replorigin_checkpoint.tmp
--------------e----- ./snapshots
--------------e----- ./replorigin_checkpoint

2021-06-24 00:24:00.026 IST [2947] LOG: database system is ready to accept
connections
done
server started
postgres(at)db:~/playground/logical_replication$ 2021-06-24 00:24:02.357 IST
[2957] LOG: starting logical decoding for slot "mysub"
2021-06-24 00:24:02.357 IST [2957] DETAIL: Streaming transactions
committing after 0/16EF1E0, reading WAL from 0/16EF1E0.
2021-06-24 00:24:02.357 IST [2957] STATEMENT: START_REPLICATION SLOT
"mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')
2021-06-24 00:24:02.357 IST [2957] LOG: logical decoding found consistent
point at 0/16EF1E0
2021-06-24 00:24:02.357 IST [2957] DETAIL: There are no running
transactions.
2021-06-24 00:24:02.357 IST [2957] STATEMENT: START_REPLICATION SLOT
"mysub" LOGICAL 0/16E7398 (proto_version '2', publication_names '"mypub"')

I might be wrong, but if you can put up monitoring around your underlying
storage, that may help.
I do not know how it works on the cloud, but I used to do it over the
hypervisor layer on vmware and emit metrics for the same to relate errors
with crash.

--
Thanks,
Vijay
Mumbai, India

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Vijaykumar Jain 2021-06-23 19:06:25 Re: Is there something similar like flashback query from Oracle planned for PostgreSQL
Previous Message Dirk Krautschick 2021-06-23 18:53:56 Is there something similar like flashback query from Oracle planned for PostgreSQL