BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup

From: charles(dot)harwood(at)zuerchertech(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #12718: Apparent LWlock Problem Perhaps With Page Cleanup
Date: 2015-01-30 22:14:45
Message-ID: 20150130221445.2583.8173@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 12718
Logged by: Charles R. Harwood
Email address: charles(dot)harwood(at)zuerchertech(dot)com
PostgreSQL version: 9.3.5
Operating system: Ubuntu Linux 12.04 Kernel 3.13.0
Description:

Many postgres backend processes stopped responding. They remained in state
'active' and not 'waiting' for hours. Each stuck query was against the same
table that a hung autovacuum worker was processing, and that autovacuum was
the oldest active transaction.

Eventually I had to restart the cluster.

The theory is that the VACUUM somehow got the table into a state where
certain queries run against it would hang trying to acquire an LWLock held
on it. Note that the 'waiting' column in pg_stat_activity only reflects
heavyweight lock waits, so a backend stuck on an LWLock still shows as
'active' rather than 'waiting', which matches what is shown below.

This shows a subset of pg_stat_activity many hours into the problem:
# SELECT NOW() - xact_start AS xact_age, query_start, pid, waiting, state,
SUBSTRING(query, 1, 30) FROM pg_stat_activity WHERE state <> 'idle' ORDER BY
NOW() - xact_start DESC;
    xact_age     |          query_start          |  pid  | waiting | state  |           substring
-----------------+-------------------------------+-------+---------+--------+--------------------------------
 20:04:20.469875 | 2015-01-29 17:39:21.520094-06 | 24718 | f       | active | autovacuum: VACUUM public.addr
 19:59:58.066609 | 2015-01-29 17:43:43.92336-06  | 24687 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:40.054033 | 2015-01-29 17:44:01.935936-06 | 21386 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:59:24.117727 | 2015-01-29 17:44:17.872242-06 | 20370 | f       | active | SELECT s0.id AS id, f0, f1, f1
 19:32:51.946155 | 2015-01-29 18:10:50.064425-06 |  2900 | f       | active | UPDATE addresses...
 19:11:38.715295 | 2015-01-29 18:32:03.274674-06 | 13545 | f       | active | SELECT s0.id AS id, f0, f1, f1
 18:53:00.421641 | 2015-01-29 18:50:41.568328-06 |  2367 | f       | active | SELECT '', '', street, city, s
 12:53:32.658331 | 2015-01-30 00:50:09.331638-06 | 22220 | f       | active | SELECT '', '', street, city, s
 11:46:31.322061 | 2015-01-30 01:57:10.667908-06 | 17054 | f       | active | SELECT addressesid FROM ((SEL
  11:36:33.11271 | 2015-01-30 02:07:08.877259-06 | 22378 | f       | active | SELECT addressesid FROM ((SEL
 07:43:06.642865 | 2015-01-30 06:00:37.139222-06 | 26815 | f       | active | SELECT s0.id AS id, f0, f1, f1
 04:03:57.508697 | 2015-01-30 09:39:44.481272-06 | 19351 | f       | active | SELECT addressesid FROM ((SEL
 03:37:39.971883 | 2015-01-30 10:06:02.018086-06 | 10422 | f       | active | SELECT addressesid FROM ((SEL
 02:39:14.576335 | 2015-01-30 11:04:27.413634-06 | 21364 | f       | active | SELECT addressesid FROM ((SEL
 02:24:00.125199 | 2015-01-30 11:19:41.86477-06  | 14600 | f       | active | SELECT addressesid FROM ((SEL
 02:03:32.915133 | 2015-01-30 11:40:09.074836-06 | 22547 | f       | active | SELECT addressesid FROM ((SEL
 01:53:49.639909 | 2015-01-30 11:49:52.35006-06  | 31567 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:44.700417 | 2015-01-30 11:49:57.289552-06 | 31571 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:39.645157 | 2015-01-30 11:50:02.344812-06 | 31623 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:53:20.892623 | 2015-01-30 11:50:21.097346-06 | 31887 | f       | active | SELECT s0.id AS id, f0, f1, f1
 01:08:30.519618 | 2015-01-30 12:35:11.470351-06 | 21167 | f       | active | SELECT namesid, name, type, in
 01:06:58.586233 | 2015-01-30 12:36:43.403736-06 | 24513 | f       | active | SELECT data.f5 AS y_id, data.f
 01:06:18.159009 | 2015-01-30 12:37:23.83096-06  | 21047 | f       | active | SELECT namesid, name, type, in
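
Since waiting is false for every backend, these do not look like ordinary
heavyweight-lock waits; such a wait would also show up as an ungranted entry
in pg_locks, so a quick check along these lines (illustrative only; database
name taken from the process titles below) would confirm that:

# list any lock requests that have not been granted (a heavyweight-lock
# wait would appear here; an LWLock wait will not)
psql -d leds -c "SELECT pid, locktype, relation::regclass, mode FROM pg_locks WHERE NOT granted;"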

Here is what those processes were up to. I have full backtraces, which I
will attach once this reaches the bug tracker.

# ps -o pid,s,cmd,wchan=WIDE-WCHAN-COLUMN -p
2367,2900,10422,13545,14600,17054,19351,20370,21047,21167,21364,21386,22220,22378,22547,24513,24687,24718,26815,31567,31571,31623,31887
PID S CMD WIDE-WCHAN-COLUMN
2367 S postgres: leds leds 164.154 SYSC_semtimedop
2900 S postgres: leds leds [local] SYSC_semtimedop
10422 S postgres: leds leds 164.154 SYSC_semtimedop
13545 S postgres: leds leds 164.154 SYSC_semtimedop
14600 S postgres: leds leds 164.154 SYSC_semtimedop
17054 S postgres: leds leds 164.154 SYSC_semtimedop
19351 S postgres: leds leds 164.154 SYSC_semtimedop
20370 S postgres: leds leds 164.154 SYSC_semtimedop
21047 S postgres: leds leds 164.154 SYSC_semtimedop
21167 S postgres: leds leds 164.154 SYSC_semtimedop
21364 S postgres: leds leds 164.154 SYSC_semtimedop
21386 S postgres: leds leds 164.154 SYSC_semtimedop
22220 S postgres: leds leds 164.154 SYSC_semtimedop
22378 S postgres: leds leds 164.154 SYSC_semtimedop
22547 S postgres: leds leds 164.154 SYSC_semtimedop
24513 R postgres: leds leds 164.154 -
24687 S postgres: leds leds 164.154 SYSC_semtimedop
24718 S postgres: autovacuum worker poll_schedule_timeout
26815 S postgres: leds leds 164.154 SYSC_semtimedop
31567 S postgres: leds leds 164.154 SYSC_semtimedop
31571 S postgres: leds leds 164.154 SYSC_semtimedop
31623 S postgres: leds leds 164.154 SYSC_semtimedop
31887 S postgres: leds leds 164.154 SYSC_semtimedop
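
The wchan values suggest that most of these backends are asleep in a
semaphore wait, which is consistent with blocking inside the LWLock code
(PGSemaphoreLock) rather than busy-looping. For reference, a backtrace of a
live stuck backend can be captured with a stock gdb one-liner like this
(PID from the list above; install the postgresql debug symbols first for
readable frames):

# attach briefly, dump the stack, and detach again
gdb --batch -ex "bt" -p 24687 > /tmp/bt.24687.txt 2>&1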

Not sure if it will help in finding out how this happened, but it has
happened twice, both times on relatively large databases (tens to hundreds
of gigabytes) with several hundred concurrent connections. Both times it
happened while a long-running script was updating each record in the
'addresses' table. That table contains 1-5 million records, each roughly
1 KB wide; it references several other tables and is itself referenced by
about a dozen tables.

The script starts a transaction, runs 20 single-record UPDATEs in quick
succession, COMMITs, and then briefly sleeps. So if nothing else, the txid
is going up fairly rapidly and the database is working reasonably hard.
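
To make the shape of that workload concrete, here is a minimal sketch of
the pattern only (the real script is application code; the column being
updated and the id scheme below are placeholders, not our actual schema):

# one transaction of 20 single-record UPDATEs, then a short pause, repeated
for batch in $(seq 0 99999); do
  {
    echo "BEGIN;"
    for i in $(seq 1 20); do
      # placeholder SET clause and id scheme; the real UPDATEs touch other columns
      echo "UPDATE addresses SET last_touched = now() WHERE id = $(( batch * 20 + i ));"
    done
    echo "COMMIT;"
  } | psql -d leds -q
  sleep 1   # "briefly sleeps"; the exact duration here is arbitrary
done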

Hardware does not appear to be a factor, but both servers on which this
occurred are Dell rackmount machines with two modern Intel CPUs (roughly
E5-2630 class) and 32-128 GB of memory.

Again, stack traces will be attached.

Thank you.
-Charles
