BUG #18005: PSQL Process hangs in parallel mode

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: ess(dot)bee59(at)gmail(dot)com
Subject: BUG #18005: PSQL Process hangs in parallel mode
Date: 2023-06-29 06:15:07
Message-ID: 18005-5cd026ae9ac57c17@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18005
Logged by: Serge Bellina
Email address: ess(dot)bee59(at)gmail(dot)com
PostgreSQL version: 14.8
Operating system: Ubuntu-2204-jammy-amd64-base
Description:

Dear postgresql team!

here additional / new informations to BUG #17990 and BUG #17999.

I am using postgresql/postgis to calculate (better "estimate") pseudo-osm
tags to enhance the route-calculation in the "Brouter" for bikers .
Each continent is calculated in a separate database (Europe,
north-america,
etc..)

For the planet the "sequential" calculation runs whithout errors, but it
needs 26 hours.

Unfortunately the "long" SQL´s (joins) used are running as mono-process
(they are only 1 or 2 % of the time running in 3 parallel processes).
That for I tried to start 2 parallel psql-processes (Europe in
psql-process
1 and the rest of the planet in psql-process 2)

The problem:
By the first 3 tests psql-process 2 "hanged" every time for hours / never
ends !!!!!

VERY STRANGE: more than 30 SQL#s are used to process each continent, but
psql-process 2 every time hanged on the same SELECT (nearly 10 selects were
processed bevore)
VERY STRANGE: the time stamps of the hanging select for asia (2023-06-26
16:34:26) is identical with the start-time of the select itself (the time
stamp of psql-process 1 for europe is at this time 2023-06-26 17:22:51.41
!!!!)

No message at all in the postgresql or Linux logs!

Just the CPU load indicated that the process was still "working".

Also after the successful termination of the psql-process 1, psql-process 2
hangs longer and longer .... / same state

Kill -9 on the psql-process 2 in that state ( 1+ 2 processses) had no
effect: just a full database restart (administrator) could stop the loop

Remark 1: the second test was done by adding th config parameter
"max_parallel_workers_per_gather = 0" (no parallel processes whithin a
SQL!), but the "hanging" problem also occured Remark 2: after the DB
restart, the previously hanging sql in psql-process 2 was restarted and
ended successfully after 50 minutes.

Many logs were collected (logs of the SQL´s, top, ps lsof, pg_activity_stat,
memory usage, ..) and could be submitted by need (I do not know the proper
way to do that)

Here some extracts:


My configuration:
==> PostgreSQL 14.8 (Ubuntu 14.8-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, 64-bit ==>
postgres(at)Ubuntu-2204-jammy-amd64-base

-------------------------------
2023-06-26 16:34:26.163755+02 ==> START TIME of psql-process 2 in the
last test
(1 row)

-- create tags for noise
-- create raw data
-- when several highways-segments are producing noise, aggregate the
noises using the "ST_Union" of the segments!
-- (better as using "sum" or "max" that do not deliver good factors)
SELECT
m.osm_id losmid, m.highway lhighway, q.highway as qhighway,
q.maxspeed_class,
case
when q.highway in ('motorway', 'motorway_link','trunk','trunk_link') and
q.maxspeed_class < 1.1 then
st_area(st_intersection(m.way, ST_Union( q.way))) /
st_area(m.way)
when q.highway in ('motorway', 'motorway_link','trunk','trunk_link')
then
st_area(st_intersection(m.way, ST_Union( q.way))) / (1.5 *
st_area(m.way))
when q.highway in ('primary','primary_link')
and
q.maxspeed_class < 2.1 then
st_area(st_intersection(m.way, ST_Union( q.way))) / (2 *
st_area(m.way))
when q.highway in ('primary','primary_link') then
st_area(st_intersection(m.way, ST_Union( q.way))) / (3 *
st_area(m.way))
when q.highway in ('secondary') and
q.maxspeed_class < 2.1 then
st_area(st_intersection(m.way, ST_Union( q.way))) / (3 *
st_area(m.way))
when q.highway in ('secondary') then
st_area(st_intersection(m.way, ST_Union( q.way))) / (5 *
st_area(m.way))
end
as noise_factor
into table noise_tmp
FROM osm_line_buf_50 AS m
INNER JOIN osm_line_buf_50 AS q ON ST_Intersects(m.way, q.way) WHERE
m.highway is not null and q.highway in ('motorway',
'motorway_link','trunk','trunk_link','primary','primary_link','secondary')
GROUP BY losmid, lhighway, m.way, q.highway, q.maxspeed_class order by
noise_factor desc;

EXPLAIN
-----------------
Sort (cost=337741901772.50..337742103465.40 rows=80677160 width=1020)
Sort Key: ((st_area(st_intersection(m.way, st_union(q.way),
'-1'::double
precision)) / st_area(m.way))) DESC
-> GroupAggregate (cost=302522177901.86..337695183065.46
rows=80677160
width=1020)
Group Key: m.osm_id, m.highway, m.way
-> Gather Merge (cost=302522177901.86..317079159018.69
rows=124988590885 width=1675)
Workers Planned: 2
-> Sort (cost=302522176901.84..302652373350.68
rows=52078579535 width=1675)
Sort Key: m.osm_id, m.highway, m.way
-> Nested Loop (cost=0.42..217422937427.56
rows=52078579535 width=1675)
-> Parallel Seq Scan on osm_poly_no_buf q
(cost=0.00..223957561.03 rows=939141 width=663)
Filter: (((landuse = ANY
('{forest,allotments,flowerbed,orchard,vineyard,recreation_ground,village_green}'::text[]))
OR (leisure = ANY ('{garden,park,nature_reserve}'::text[]))) AND
((st_area((st_transform(way, 4326))::geography, true) / '1000000'::double
precision) < '5000'::double precision))
-> Index Scan using osm_line_buf_50_idx on
osm_line_buf_50 m (cost=0.42..231193.41 rows=8068 width=1012)
Index Cond: (way && q.way)
Filter: ((highway IS NOT NULL) AND
st_intersects(way, q.way))
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming
true
(17 rows)

--SELECT datname, pid, state, query, age(clock_timestamp(), query_start)
AS
age
select *
FROM pg_stat_activity
WHERE state <> 'idle'
AND query NOT LIKE '% FROM pg_stat_activity %';
datid | datname | pid | leader_pid | usesysid | usename |
application_name | client_addr | client_hostname | client_port |
backend_start | xact_start |
query_start
| state_change | wait_event_type | wait_event

| state | backend_xid | backend_xmin | query_id |

query

| backend_type

-----------+---------+---------+------------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+----------+----------------------------------------------------------------------------------------------------------+-----------------
105680476 | asia | 1192433 | | 10 | postgres | psql

| | | -1 | 2023-06-26
16:02:05.768052+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.164375+02 | IO |
DataFileRead | active | 289276 | 289208 | | SELECT

+| client backend
| | | | | |

| | | |
| | |

| | | |

| | | m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,
+|
| | | | | |

| | | |
| | |

| | | |

| | | case

+|
| | | | | |

| | | |
| | |

| | | |

| | | when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
..............
103790736 | europe | 1183475 | | 10 | postgres | psql

| | | -1 | 2023-06-26
15:21:15.376492+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.420039+02 | |

| active | 289286 | 289276 | | select xid ,
sum(water_river_see) as river_see

+| client backend
| | | | | |

| | | |
| | |

| | | |

| | | into table river_tmp

+|
| | | | | |

| | | |
| | |

| | | |

| | | from (

+|
......
105680476 | asia | 1197807 | 1192433 | 10 | postgres | psql

| | | | 2023-06-26
16:34:26.169606+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.183272+02 | IPC |
BufferIO | active | | 289208 | | SELECT

+| parallel worker
| | | | | |

| | | |
| | |

| | | |

| | | m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,
+|
| | | | | |

| | | |
| | |

| | | |

| | | case

+|
| | | | | |

| | | |
| | |

| | | |

| | | when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
| | | | | |

| | | |
| | | | | |

| | | |
..........
105680476 | asia | 1197808 | 1192433 | 10 | postgres | psql

| | | | 2023-06-26
16:34:26.169819+02 | 2023-06-26 16:34:26.164375+02 | 2023-06-26
16:34:26.164375+02 | 2023-06-26 16:34:26.183619+02 | IPC |
BufferIO | active | | 289208 | | SELECT

+| parallel worker
| | | | | |

| | | |
| | |

| | | |

| | | m.osm_id losmid, m.highway
lhighway, q.highway as qhighway, q.maxspeed_class,
+|
| | | | | |

| | | |
| | |

| | | |

| | | case

+|
| | | | | |

| | | |
| | |

| | | |

| | | when q.highway in ('motorway',
'motorway_link','trunk','trunk_link') and q.maxspeed_class < 1.1 then+|
| | | | | |

| | | |
............
103790736 | europe | 1208383 | 1183475 | 10 | postgres | psql

| | | | 2023-06-26
17:22:51.749632+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.76465+02 | IPC |
MessageQueueSend | active | | 289276 | | select
xid , sum(water_river_see) as river_see

+| parallel worker
| | | | | |

| | | |
| | |

| | | |

| | | into table river_tmp

+|
| | | | | |

| | | |
| | |

| | | |

| | | from (

+|
| | | | | |

| | | |
| | |

| | | |

| | | SELECT m.osm_id as xid,

+|
| | | | | |

| | | |
| | |

| | | |

| | | st_area(st_intersection(m.way,
ST_Union( q.way))) / st_area(m.way) +|
| | | | | |

| | | |
..............
103790736 | europe | 1208384 | 1183475 | 10 | postgres | psql

| | | | 2023-06-26
17:22:51.749853+02 | 2023-06-26 17:22:51.420038+02 | 2023-06-26
17:22:51.420038+02 | 2023-06-26 17:22:51.764351+02 | IPC |
MessageQueueSend | active | | 289276 | | select
xid , sum(water_river_see) as river_see

+| parallel worker
| | | | | |

| | | |
| | |

| | | |

| | | into table river_tmp

+|
| | | | | |

| | | |
| | |

| | | |

| | | from (

+|
| | | | | |

| | | |
| | |

| | | |

| | | SELECT m.osm_id as xid,

+|
..............

free -h -w
total used free shared buffers
cache available
Mem: 125Gi 35Gi 785Mi 50Gi 7.0Mi

89Gi 38Gi
Swap: 4.0Gi 4.0Gi 0.0Ki

top -n 1 -u postgres
top - 18:55:27 up 22 days, 5:38, 1 user, load average: 3.65, 4.03, 4.14
Tasks: 425 total, 2 running, 423 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.8 us, 2.4 sy, 0.0 ni, 87.5 id, 2.2 wa, 0.0 hi, 0.2 si,
0.0
st
MiB Mem : 128504.4 total, 827.2 free, 36121.8 used, 91555.4
buff/cache
MiB Swap: 4091.0 total, 0.0 free, 4091.0 used. 39010.5 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
1183475 postgres 20 0 62.9g 60.3g 48.7g R 93.8 48.0 190:34.19
postgres
1192433 postgres 20 0 51.4g 38.3g 38.2g D 18.8 30.5 58:38.73
postgres
1197807 postgres 20 0 51.3g 18.7g 18.6g S 18.8 14.9 34:19.57
postgres
1197808 postgres 20 0 51.3g 18.9g 18.9g S 18.8 15.1 32:08.32
postgres
1169290 postgres 20 0 51.2g 877644 876996 S 0.0 0.7 0:00.64
postgres
1169293 postgres 20 0 51.2g 49.7g 49.7g S 0.0 39.6 3:46.63
postgres
1169294 postgres 20 0 51.2g 24.7g 24.7g S 0.0 19.7 0:15.49
postgres
1169295 postgres 20 0 51.2g 17900 17344 S 0.0 0.0 1:06.14
postgres
1169296 postgres 20 0 51.2g 4036 3200 S 0.0 0.0 0:00.12
postgres
1169297 postgres 20 0 73460 3028 2040 S 0.0 0.0 0:02.27
postgres
1169298 postgres 20 0 51.2g 3704 3036 S 0.0 0.0 0:00.00
postgres
1169387 postgres 20 0 2888 836 836 S 0.0 0.0 0:00.00 sh
1169388 postgres 20 0 2888 856 856 S 0.0 0.0 0:00.00 sh
1173213 postgres 20 0 2888 848 848 S 0.0 0.0 0:00.00 sh
1183474 postgres 20 0 21796 4872 4224 S 0.0 0.0 0:00.01
psql
1183699 postgres 20 0 8784 5248 3472 S 0.0 0.0 0:00.01
bash
1187783 postgres 20 0 2888 912 824 S 0.0 0.0 0:00.00 sh
1192431 postgres 20 0 21796 5476 4244 S 0.0 0.0 0:00.01
psql
1208383 postgres 20 0 61.3g 52.4g 42.3g S 0.0 41.8 2:20.11
postgres
1208384 postgres 20 0 61.3g 53.6g 43.6g S 0.0 42.7 2:21.45
postgres
1229692 postgres 20 0 7500 3884 3488 S 0.0 0.0 0:00.00
moni.sh
1229696 postgres 20 0 10896 4164 3300 R 0.0 0.0 0:00.00 top

ps -ef
postgres 1169290 1 0 14:20 ? 00:00:00
/usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/14/main -c
config_file=/etc/postgresql/14/main/postgresql.conf
postgres 1169293 1169290 0 14:20 ? 00:03:46 postgres: 14/main:
checkpointer
postgres 1169294 1169290 0 14:20 ? 00:00:16 postgres: 14/main:
background writer
postgres 1169295 1169290 0 14:20 ? 00:01:06 postgres: 14/main:
walwriter
postgres 1169296 1169290 0 14:20 ? 00:00:00 postgres: 14/main:
autovacuum launcher
postgres 1169297 1169290 0 14:20 ? 00:00:02 postgres: 14/main:
stats
collector
postgres 1169298 1169290 0 14:20 ? 00:00:00 postgres: 14/main:
logical replication launcher
postgres 1169387 1 0 14:20 ? 00:00:00 /bin/sh
./gen-europe.sh
.
postgres 1169388 1 0 14:20 ? 00:00:00 /bin/sh ./gen-rest.sh
.
postgres 1173213 1169387 0 14:35 ? 00:00:00 /bin/sh
./create-tags.sh
europe
postgres 1183474 1173213 0 15:21 ? 00:00:00
/usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d europe
-a
-f all.sql
postgres 1183475 1169290 92 15:21 ? 05:25:22 postgres: 14/main:
postgres europe [local] SELECT
postgres 1187783 1169388 0 15:39 ? 00:00:00 /bin/sh
./create-tags.sh
asia
postgres 1192431 1187783 0 16:02 ? 00:00:00
/usr/lib/postgresql/14/bin/psql -v ON_ERROR_STOP=1 -U postgres -d asia -a
-f
all.sql
postgres 1192433 1169290 29 16:02 ? 01:31:54 postgres: 14/main:
postgres asia [local] SELECT
postgres 1197807 1169290 24 16:34 ? 01:08:08 postgres: 14/main:
parallel worker for PID 1192433
postgres 1197808 1169290 22 16:34 ? 01:02:24 postgres: 14/main:
parallel worker for PID 1192433
postgres 1259514 1169290 34 21:08 ? 00:01:19 postgres: 14/main:
parallel worker for PID 1183475
postgres 1259515 1169290 35 21:08 ? 00:01:20 postgres: 14/main:
parallel worker for PID 1183475

Thank for any help !

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2023-06-29 08:04:21 BUG #18006: recovery_target_action=shutdown triggers automatic recovery on next startup (beyond point in time)
Previous Message Michael Paquier 2023-06-29 02:22:23 Re: BUG #17997: Assert failed in validatePartitionedIndex() when attaching partition index to child of valid index