From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | infinite loop in parallel hash joins / DSA / get_best_segment |
Date: | 2018-09-16 22:38:10 |
Message-ID: | 194c0706-c65b-7d81-ab32-2c248c3e2344@2ndquadrant.com |
Lists: | pgsql-hackers |
Hi,
While running some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:
1) A "parallel worker" process is consuming 100% CPU, with perf
reporting a profile like this:
34.66% postgres [.] get_segment_by_index
29.44% postgres [.] get_best_segment
29.22% postgres [.] unlink_segment.isra.2
6.66% postgres [.] fls
0.02% [unknown] [k] 0xffffffffb10014b0
So all the time seems to be spent within get_best_segment.
2) The backtrace looks like this (full backtrace attached):
#0 0x0000561a748c4f89 in get_segment_by_index
#1 0x0000561a748c5653 in get_best_segment
#2 0x0000561a748c67a9 in dsa_allocate_extended
#3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
#4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
#5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
#6 ExecHashJoinImpl
#7 ExecParallelHashJoin
#8 ExecProcNode
...
3) The infinite loop seems to be pretty obvious - after setting a
breakpoint on get_segment_by_index, we get this:
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
That is, we call the function with the same index over and over.
Why is that? Well:
(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}
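So prev and next of segment 3 both point back at segment 3 itself, which
means following the links never gets us anywhere else. We loop forever.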
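To illustrate why a header with next = 3 stored in segment 3 cannot be
walked to completion, here is a minimal sketch in C. It is a toy model and
not the dsa.c code: the struct, the end-of-list marker, the function name
and the step budget are all hypothetical, and only the prev/next fields
from the gdb output above are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical end-of-list marker; not taken from dsa.c. */
#define TOY_SEGMENT_INDEX_NONE ((size_t) -1)

/* Hypothetical, reduced stand-in for the header printed by gdb. */
typedef struct
{
    size_t      prev;           /* index of previous segment in the list */
    size_t      next;           /* index of next segment in the list */
} toy_segment_header;

/*
 * Follow 'next' links starting at 'start'.
 *
 * Returns true if the end-of-list marker is reached within 'max_steps'
 * link traversals, false if the budget runs out first. A real list walk
 * has no such budget, so the 'false' case corresponds to spinning forever.
 */
bool
toy_walk_terminates(const toy_segment_header *segments,
                    size_t start, size_t max_steps)
{
    size_t      index = start;
    size_t      steps = 0;

    assert(segments != NULL);

    while (index != TOY_SEGMENT_INDEX_NONE)
    {
        if (steps++ >= max_steps)
            return false;       /* still walking: treat as a cycle */
        index = segments[index].next;
    }
    return true;
}
```

With segment 3 set up as in the gdb output (prev = 3, next = 3), a walk
starting at index 3 returns false for any step budget, because every step
lands on index 3 again. This matches the breakpoint being hit with index=3
each time.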
As for the triggering conditions, I've only ever observed the issue on
TPC-H at scale 16GB, with a partitioned lineitem table, work_mem set to
8MB, and query #4. With that setup I can reproduce it pretty reliably.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
explain.log | text/x-log | 19.6 KB |
backtrace.txt | text/plain | 8.4 KB |