Re: stress test for parallel workers

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: stress test for parallel workers
Date: 2019-07-24 00:33:43
Message-ID: 20190724003343.GV22387@telsasoft.com
Lists: pgsql-hackers

On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when this
> > core was dumped (due to having not touched it since scheduling transition to
> > this VM last week).
> >
> > I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> > failing to find log output, I ran df right after the failure.

I meant it wasn't a trivial error on my part of failing to drop the previously
loaded DB instance. It occurred to me to check inodes, since running out of
them can also cause ENOSPC. This is mkfs -T largefile, so running out of inodes
is not an impossibility. But it seems an unlikely culprit, unless something
made tens of thousands of (small) files.

[pryzbyj(at)alextelsasrv01 ~]$ df -i /var/lib/pgsql
Filesystem                Inodes IUsed IFree IUse% Mounted on
/dev/mapper/data-postgres
                           65536  5605 59931    9% /var/lib/pgsql
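
For next time, both kinds of exhaustion (blocks and inodes) could be checked in
one go right after a failure. A minimal sketch, assuming Python's os.statvfs;
the function name fs_headroom is made up for illustration and is not part of
any existing tool:

```python
import os

def fs_headroom(path):
    """Return (free_block_fraction, free_inode_fraction) for the filesystem
    containing `path`. Either value reaching zero can surface as ENOSPC.

    A fraction is None when the filesystem reports a total of zero
    (some filesystems do not expose an inode count).
    """
    st = os.statvfs(path)
    # f_bavail / f_favail: blocks / inodes available to unprivileged users.
    blocks = st.f_bavail / st.f_blocks if st.f_blocks else None
    inodes = st.f_favail / st.f_files if st.f_files else None
    return blocks, inodes
```

Calling fs_headroom("/var/lib/pgsql") would report the same inode headroom as
the df -i output above, alongside the block headroom that plain df shows.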

> Ok, cool, so the ENOSPC thing we understand, and the postmaster death
> thing is probably something entirely different. Which brings us to
> the question: what is killing your postmaster or causing it to exit
> silently and unexpectedly, but leaving no trace in any operating
> system log? You mentioned that you couldn't see any signs of the OOM
> killer. Are you in a situation to test an OOM failure so you can
> confirm what that looks like on your system?

$ command time -v python -c "'x'*4999999999" |wc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
MemoryError
Command exited with non-zero status 1
...
Maximum resident set size (kbytes): 4276

$ dmesg
...
Out of memory: Kill process 10665 (python) score 478 or sacrifice child
Killed process 10665, UID 503, (python) total-vm:4024260kB, anon-rss:3845756kB, file-rss:1624kB
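
To look for the same signature in saved kernel logs, here is a minimal sketch
that extracts the "Killed process" records. The regex is an assumption based
only on the line format shown above; other kernel versions word these messages
differently, and find_oom_kills is a name invented for this example.

```python
import re

# Assumption: matches the "Killed process ..." format shown in the dmesg
# output above. Other kernel versions format OOM-killer records differently.
OOM_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+),? (?:UID \d+, )?\((?P<comm>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)

def find_oom_kills(log_text):
    """Return a list of (pid, command, anon_rss_kb) for each OOM kill found."""
    kills = []
    for line in log_text.splitlines():
        m = OOM_KILL_RE.search(line)
        if m:
            kills.append(
                (int(m.group("pid")), m.group("comm"), int(m.group("anon_rss_kb")))
            )
    return kills
```

Feeding it the text of dmesg (or a saved copy of the kernel log) yields one
tuple per killed process; an empty list means no record in this format was
found, which is not proof that the OOM killer never ran.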

I wouldn't burn too much more time on it until I can reproduce it. The
failures were all during pg_restore, so checkpointer would've been very busy.
It seems possible for it to notice ENOSPC before the workers do: they would be
fsyncing WAL, whereas checkpointer is fsyncing data.

> Admittedly it is quite hard for to distinguish between a web browser
> and a program designed to eat memory as fast as possible...

Browsers making lots of progress here but still clearly 2nd place.

Justin
