Re: Memory leak (possibly connected to postgis) leading to server crash

From: Roman Cervenak <roman(at)cervenak(dot)info>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak (possibly connected to postgis) leading to server crash
Date: 2019-12-06 11:46:44
Message-ID: CAGjExY0sFbx+JytMkY-h8=NWqD_f0DOwB8QXrQ4SqJ+BttoX6g@mail.gmail.com
Lists: pgsql-bugs

Yes, it was killed by oom killer:

[2037990.376427] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/system-postgresql.slice,task=postgres,pid=52059,uid=111
[2037990.376433] Out of memory: Kill process 52059 (postgres) score 294 or sacrifice child
[2037990.384186] Killed process 52059 (postgres) total-vm:17508832kB, anon-rss:4309296kB, file-rss:108kB, shmem-rss:12641580kB
[2037990.516504] oom_reaper: reaped process 52059 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:12641580kB

(full dmesg.log attached in case it is of interest; there are more postgres
backends visible in it, but they were inactive at the time)

I can try the gdb dump the next time I see it. But I cannot imagine providing
you with a reproducible case - it is a 500 GB proprietary database, and
without it the queries would hardly be useful, I presume? I can try to build
a "sample" with generated data and see whether I can reproduce the issue with
my queries that way, but that will be quite time consuming.
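
For reference, a rough sketch of what such a generated-data setup could look
like (hypothetical table name, extent and row count, using only standard
PostGIS functions; not my real schema):

$ psql -d testdb <<'SQL'
CREATE TABLE sample_geoms (id serial PRIMARY KEY, geom geometry(Polygon, 4326));
INSERT INTO sample_geoms (geom)
SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(random()*10, random()*10), 4326),
                 0.001 + random()*0.01)
FROM generate_series(1, 1000000);
CREATE INDEX ON sample_geoms USING gist (geom);
ANALYZE sample_geoms;
SQL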

Would it help to dump the memory of the backend process and deliver the dump
(via some private channel) to somebody who could identify what is consuming
all that memory? (That is the usual drill on Windows.)
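
For reference, on Linux I assume something like gcore (which ships with gdb)
would produce an equivalent dump of a running backend; a rough sketch:

$ gcore -o /tmp/backend_core $PID     # writes /tmp/backend_core.$PID
$ xz /tmp/backend_core.$PID           # compress before transferring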

On Fri, Dec 6, 2019 at 12:02 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On Fri, Dec 06, 2019 at 10:22:42AM +0100, Roman Cervenak wrote:
> >Hello,
> >this may be a serious issue, and I am not very experienced with reporting
> >this kind of thing, but I will try my best. I believe there may be a
> >memory leak somewhere in postgres/postgis which can eat up all the
> >available memory, leading to a postgres crash (signal 9 kill of a backend,
> >postmaster terminating the others, recovery).
> >
> >My setup and workload:
> >Postgres 12.1 (Ubuntu 12.1-1.pgdg18.04+1) installed via apt on Ubuntu
> >Server 18.04 LTS, in a VM in the Azure cloud.
> >PostGIS version: 3.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
> >The VM has 8 cores, 56 GB RAM, and a 7 TB RAID built from managed disks
> >(Azure network storage).
> >
> >shared_buffers = 12GB
> >work_mem = 256MB
> >maintenance_work_mem = 2GB
> >
> >I was running 8 clients (on different VMs, over the network) against the
> >database, batch processing geographic data. Each worker runs one query at
> >a time, so there are pretty much 8 parallel queries 100% of the time. The
> >queries themselves are fairly short (5-60 seconds), SELECTing rows via the
> >geometry index (GiST) and applying functions like
> >ST_SimplifyPreserveTopology, ST_Buffer and ST_Union to them, essentially
> >using all 8 cores at 100%. There is a lot of data to process, so this had
> >been going on for maybe 12 hours when I noticed (in "htop") that memory
> >usage was unusually high, somewhere around 50 GB. That was suspicious,
> >because it is way higher than it should be with 12 GB shared_buffers and
> >256 MB work_mem across 8 clients, and it continued to grow slowly (I could
> >see the numbers in the RES column in htop slowly rise for all backends).
> >Until eventually:
> >
> >2019-12-06 00:04:24 UTC [21487-8] LOG: server process (PID 52059) was
> >terminated by signal 9: Killed
> >2019-12-06 00:04:24 UTC [21487-10] LOG: terminating any other active
> >server processes
> >2019-12-06 00:04:24 UTC [91091-1] WARNING: terminating connection
> because
> >of crash of another server process
> >2019-12-06 00:04:24 UTC [91091-2] DETAIL: The postmaster has commanded
> >this server process to roll back the current transaction and exit, because
> >another server process exited abnormally and possibly corrupted shared
> >memory.
> >...
> >FATAL: the database system is in recovery mode
> >...
> >
> >I realize this is probably not enough for you. But I will be running this
> >workload again soon, so if it repeats, what should I do to help you
> >diagnose the problem?
>
> The kill was initiated by oom killer, I suppose? It might be interesting
> to see the message written to dmesg by it, it usually includes info
> about how much memory was used by the backend etc.
>
> The other thing that might be useful is a dump of the memory contexts - you'll
> have to wait until the memory usage gets excessive (i.e. shortly before
> the oom killer would kick in), attach gdb to a backend and call
> MemoryContextStats on TopMemoryContext. So, something like this:
>
> $ gdb -p $PID
> (gdb) p MemoryContextStats(TopMemoryContext)
> (gdb) q
>
> which writes a bunch of info about memory contexts into the server log.
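>
> If attaching interactively is inconvenient, a non-interactive variant (just
> a sketch) should have the same effect:
>
> $ gdb -batch -ex 'print MemoryContextStats(TopMemoryContext)' -p $PID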
>
> But, I'm not sure this will actually help. Based on what you wrote, the
> memory stays allocated across queries. So either it's allocated in one
> of the long-lived contexts (which queries usually don't do), or it's
> allocated directly through malloc() and not through our memory context
> infrastructure (hence it'll be invisible in the context stats).
>
> I'm not particularly familiar with PostGIS, but AFAIK it's using various
> libraries, and those are naturally using malloc/free directly. So maybe
> it's not freeing the memory in some cases.
>
> What would really help is having some sort of reproducer, and/or running
> the queries with valgrind, which can detect memory leaks.
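>
> A rough sketch of that (exact options are a matter of taste, and the server
> will be very slow under valgrind) would be to start postgres under it, e.g.:
>
> $ valgrind --leak-check=full --trace-children=yes \
>       --log-file=/tmp/valgrind.%p.log postgres -D /path/to/data
>
> run the offending queries against it, stop the server, and then look at the
> per-process log files for "definitely lost" blocks.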
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Attachment: dmesg.log (application/octet-stream, 75.5 KB)
