Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Mike Yeap <wkk1020(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgres PANIC when it could not open file in pg_logical/snapshots directory
Date: 2021-06-22 09:04:15
Message-ID: CAM+6J94d-nr-1kUqaXgttwbski_UCawRsTYheZgZfM7A7j3aPg@mail.gmail.com
Lists: pgsql-general

On Tue, 22 Jun 2021 at 13:32, Mike Yeap <wkk1020(at)gmail(dot)com> wrote:

> Hi all,
>
> I have a Postgres version 11.11 configured with both physical replication
> slots (for repmgr) as well as some logical replication slots (for AWS
> Database Migration Service (DMS)). This morning, the server went panic with
> the following messages found in the log file:
>
> 2021-06-22 04:56:35.314 +08 [PID=19457 application="[unknown]"
> user_name=dms database=** host(port)=**(48360)] PANIC: could not open file
> "pg_logical/snapshots/969-FD606138.snap": Operation not permitted
>
> 2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database=
> host(port)=] LOG: server process (PID 19457) was terminated by signal 6:
> Aborted
>
> 2021-06-22 04:56:35.317 +08 [PID=1752 application="" user_name= database=
> host(port)=] LOG: terminating any other active server processes
>

Are you sure there is nothing else? Do you see anything in
/var/log/kern.log or in the dmesg output?
I just ran a small simulation of logical replication from A -> B. I
deleted one of the snapshot files live and also changed permissions to make
it read-only, and my server did not crash at all (that was on pg14beta,
though). I can try other things at the PG layer, but if something external
happened, it would be difficult to reproduce.
Pardon me for speculating, but:
Is it network storage? Did the underlying storage layer have a blip of some
kind?
Are the mounts fine? Are they read-only, or were they temporarily read-only?
Any bad hardware?
If none of the above, did the server restart solve the issue, or is it
still broken and unable to start?
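
The mount and file checks asked about above can be roughly scripted. This is
only a minimal sketch, assuming Linux (it parses the /proc/mounts format) and
that it runs as the same OS user as the postgres server. The function names
are hypothetical, made up for illustration, and not part of Postgres or repmgr.

```python
import errno
import os
from pathlib import Path


def read_only_mounts(mounts_text):
    """Return mount points whose options include 'ro'.

    mounts_text is expected in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


def unreadable_snapshots(snapshot_dir):
    """Try to open every *.snap file read-only.

    Returns a list of (file name, errno name) for files that fail to open,
    e.g. EPERM for "Operation not permitted" or EACCES for permission denied.
    """
    failures = []
    for path in sorted(Path(snapshot_dir).glob("*.snap")):
        try:
            with open(path, "rb"):
                pass
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((path.name, name))
    return failures
```

To use it, pass the contents of /proc/mounts to read_only_mounts(), and pass
the pg_logical/snapshots directory under your data directory to
unreadable_snapshots(). A clean result now does not rule out a transient
storage problem at the time of the PANIC.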

> The PG server then terminates all existing PG processes.
>
> The process with 19457 is from one of the DMS replication tasks, I have no
> clue why it suddenly couldn't open a snapshot file. I checked the server
> load and file systems and didn't find anything unusual at that time.
>
> Appreciate if you can give me some guidance on troubleshooting this issue
>
> Thanks
>
> Regards,
> Mike Yeap
>

Is it crashing and dumping cores?
Can you strace the postmaster on startup to check what is going on?

I can share my demo setup, but it would be too noisy for the thread; I can
do it later if you want.
The above assumes that repmgr and AWS DMS do not interfere with your
primary server internals and only handle failover and publication.

--
Thanks,
Vijay
Mumbai, India
