From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Subject: | Re: stress test for parallel workers |
Date: | 2019-07-23 17:42:22 |
Message-ID: | 20190723174222.GO22387@telsasoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though. Did you try syslog?)
Nothing in /var/log/messages (nor dmesg ring).
I enabled abrtd while trying to reproduce it last week. Since you asked I
looked again in messages, and found it'd logged 10 hours ago about this:
(gdb) bt
#0 0x000000395be32495 in raise () from /lib64/libc.so.6
#1 0x000000395be33c75 in abort () from /lib64/libc.so.6
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7 0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8 0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9 0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = <value optimized out>
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = <value optimized out>
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
shutdown = false
checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid = 496731439, nextMulti = 1, nextMultiOffset = 0,
oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 2141308}
recptr = <value optimized out>
_logSegNo = <value optimized out>
Insert = <value optimized out>
freespace = <value optimized out>
PriorRedoPtr = <value optimized out>
curInsert = <value optimized out>
last_important_lsn = <value optimized out>
vxids = 0x280afb8
nvxids = 0
__func__ = "CreateCheckPoint"
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
ckpt_performed = false
do_restartpoint = <value optimized out>
flags = 128
do_checkpoint = <value optimized out>
now = 1563781930
elapsed_secs = <value optimized out>
cur_timeout = <value optimized out>
rc = <value optimized out>
local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0, 140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662}, __mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 0, 246358747096, 140728912407296, 140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040, 140447207471460, 0, 140447207471424, 140446084917816, 0,
7864320}}}}
checkpointer_context = 0x27e15d0
__func__ = "CheckpointerMain"
Supposedly it's trying to do this:
| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));
And since there's consistently nothing in logs, I'm guessing there's a
legitimate write error (legitimate from PG perspective). Storage here is ext4
plus zfs tablespace on top of LVM on top of vmware thin volume.
Justin
From | Date | Subject | |
---|---|---|---|
Next Message | Jeevan Ladhe | 2019-07-23 17:48:50 | Re: block-level incremental backup |
Previous Message | Alvaro Herrera | 2019-07-23 17:31:53 | Re: [bug fix] Produce a crash dump before main() on Windows |