Re: BUG #16331: segfault in checkpointer with full disk

From: Julien Rouhaud <rjuju123(at)gmail(dot)com>
To: Jozef Mlich <jmlich83(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16331: segfault in checkpointer with full disk
Date: 2020-04-01 17:25:24
Message-ID: CAOBaU_a0-FkNp4YHO_7nN7=NDN2R_xb-Ya-e3w9bB1SHEstYCQ@mail.gmail.com
Lists: pgsql-bugs

On Wed, Apr 1, 2020 at 11:51 AM Jozef Mlich <jmlich83(at)gmail(dot)com> wrote:
>
> On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> > Hi,
> >
> > On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form
> > wrote:
> > >
> > > I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7
> > > (from yum.postgresql.org). I am using multiple extensions (cstore,
> > > postgres_fdw, pgcrypto, dblink, etc.). The crash seems to be related
> > > to the disk running out of space (I am using separate partitions for
> > > / and for /var/lib/pgsql). It occurs a few times a day. According to
> > > the backtrace, it seems to be related to the checkpointer.
> > > Replication is not configured.
> > >
> > >
> > > [New LWP 26290]
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > Core was generated by `postgres: checkpointer '.
> > > Program terminated with signal 6, Aborted.
> > > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > > 55 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> > >
> > > Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
> > > #0 0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > >         resultvar = 0
> > >         pid = 26290
> > >         selftid = 26290
> > > #1 0x00007fe4604c28f8 in __GI_abort () at abort.c:90
> > >         save_stage = 2
> > >         act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0},
> > >           sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717,
> > >           39808819211026438, 20126216749056, 70394513997832, 9268713, 70403103920719,
> > >           17316096998686159616, 20134806683648, 140618848608704, 140618848592800}},
> > >           sa_flags = 1615828275, sa_restorer = 0x0}
> > >         sigs = {__val = {32, 0 <repeats 15 times>}}
> > > #2 0x000000000087840a in errfinish (dummy=<optimized out>) at elog.c:552
> > >         edata = 0xd47040 <errordata>
> > >         elevel = 22
> > >         oldcontext = 0x171a6d0
> > >         econtext = 0x0
> > >         __func__ = "errfinish"
> > > #3 0x0000000000706b24 in CheckPointReplicationOrigin () at origin.c:562
> > >         tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
> > >         path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
> > >         tmpfd = <optimized out>
> > >         i = <optimized out>
> > >         magic = 307747550
> > >         crc = 4294967295
> > >         __func__ = "CheckPointReplicationOrigin"
> >
> > That's not a bug (nor a segfault) but the expected behavior if the
> > checkpointer is not able to do its work. As data durability can't be
> > guaranteed in such a case, the checkpointer raises a PANIC-level
> > message, which triggers an abort so that the whole instance goes
> > through an emergency restart cycle.
> >
> > Do you have monitoring for this filesystem? Do you see spikes in
> > disk usage or other strange behavior?
>
> Then it is clear. Thanks for the explanation, and I apologize for the
> false bug report.
>
> I probably misunderstood how a segfault is distinguished from an abort.
> I need to fix my kernel.core_pattern script.
>
> Attached is a screenshot from our Grafana monitoring showing the free
> space on the /var/lib/pgsql partition.
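
For reference, the difference between the two cases is visible in the
signal number alone: the backtrace above shows signal 6 (SIGABRT, a
deliberate abort()), whereas a segfault would be SIGSEGV. Below is a
minimal Python sketch of how a crash-handling script might label them. It
assumes the script already has the signal number as an integer (the
kernel's core_pattern can pass it via the %s specifier); the function name
describe_signal is hypothetical and not part of any existing tool.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return a short human-readable label for a terminating signal.

    Hypothetical helper for illustration only.
    """
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        return f"unknown signal {signum}"
    if signum == signal.SIGABRT:
        # abort() was called on purpose, e.g. after a PANIC-level error.
        return f"{name}: deliberate abort"
    if signum == signal.SIGSEGV:
        # Invalid memory access, i.e. an actual segfault.
        return f"{name}: segmentation fault"
    return name


if __name__ == "__main__":
    print(describe_signal(signal.SIGABRT))
    print(describe_signal(signal.SIGSEGV))
```

A script built along these lines would have reported the checkpointer
crash as an abort rather than a segfault.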

The main filesystem is full or almost full most of the time? That's
unfortunately a good way to trigger that kind of outage. Is that
because most of the data is on a different tablespace? Even in that
case you need to ensure that you still have at least a reasonable
amount of free space.
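
As a rough illustration of the kind of check I mean, here is a minimal
Python sketch that reports whether a filesystem is below a free-space
threshold. The 10% default and the function names are assumptions chosen
for the example, not a PostgreSQL recommendation; pick a threshold that
fits your WAL volume and growth rate.

```python
import shutil


def free_space_ratio(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Treat a zero-sized filesystem as having no free space.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path: str, min_free_ratio: float = 0.10) -> bool:
    """Return True if the free fraction is below `min_free_ratio`.

    The 0.10 default is an arbitrary value for this sketch.
    """
    return free_space_ratio(path) < min_free_ratio


if __name__ == "__main__":
    # /var/lib/pgsql is the partition mentioned in the report; adjust as needed.
    data_path = "/var/lib/pgsql"
    try:
        print(f"low on space: {is_low_on_space(data_path)}")
    except FileNotFoundError:
        print(f"{data_path} does not exist on this machine")
```

An alert on such a check gives you time to react before the checkpointer
fails to write its files.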
