Non-deterministic 100% CPU hang on postgres 9.3

From: Fenn Bailey <fenn(dot)bailey(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Non-deterministic 100% CPU hang on postgres 9.3
Date: 2014-04-15 05:42:44
Message-ID: CAKnSMKmPLH2s1uVExw=QWevyARhOSYvAQ4v8NKKMzOeq1GsCoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

I'm experiencing an issue where certain queries appear to
non-deterministically "hang", with a CPU pinned at 100%.

I say "hang", where really I've given up after ~12 hours execution. The
exact same query can then be terminated and run in <90 seconds, with none
of the underlying data changing.

I can completely reset the DB (drop tables/recreate) and re-run and
sometimes certain queries will appear to hang forever, sometimes they will
execute in minutes.

I've tried official debian (amd64) packages of postgres 9.3.2 - 9.3.4 (same
issues).

The workload is ostensibly analytics - As a disclaimer I absolutely
understand that these queries may not be non-optimized (I didn't build
them/have not reviewed fully). The part I'm trying to work out is why they
sometimes finish easily in minutes and other times hang.

I've straced the process when it's pinned and there appears to be no system
calls (ie: no IO) just a pure CPU loop.

If anyone could explain this behaviour, or even how I might go about
diagnosing, that would be wonderful.

Query/analyze details follow.

Thanks!

Query:
----
UPDATE ad_events e
set ad_user_id = x.ad_user_id, ad_session_id = x.ad_session_id
FROM
(SELECT t.ad_event_id, COALESCE (s.ad_user_id, u.merged_id,u.ad_user_id)
ad_user_id,
case when (name = 'Clickthrough from Email' or properties->('mp_lib')
= 'web' ) then s.ad_session_id
else null
end as ad_session_id,
t.timestamp, name , properties
from ad_events_mv t
INNER JOIN ad_users u ON (t.user_id = u.orig_distinct_id)
LEFT OUTER JOIN ad_sessions s ON (t.timestamp between s.session_start
and s.session_end and (s.ad_user_id = u.ad_user_id or s.ad_user_id =
u.merged_id))
where t.processed = false
order by s.ad_user_id, s.ad_session_id, timestamp) x
WHERE
x.ad_event_id = e.ad_event_id;
----

EXPLAIN ANALYZE follows: QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Update on ad_events e (cost=39730372.92..39765372.92 rows=1000000
width=237) (actual time=130140.450..130140.450 rows=0 loops=1)
-> Hash Join (cost=39730372.92..39765372.92 rows=1000000 width=237)
(actual time=54243.877..66848.448 rows=2000000 loops=1)
Hash Cond: (x.ad_event_id = e.ad_event_id)
-> Subquery Scan on x (cost=39654433.45..39666933.45
rows=1000000 width=144) (actual time=52682.740..57668.998 rows=2000000
loops=1)
-> Sort (cost=39654433.45..39656933.45 rows=1000000
width=108) (actual time=52682.693..55003.467 rows=2000000 loops=1)
Sort Key: s.ad_user_id, s.ad_session_id, t."timestamp"
Sort Method: external merge Disk: 1078104kB
-> Nested Loop Left Join (cost=12054.20..39554775.61
rows=1000000 width=108) (actual time=204.872..43742.396 rows=2000000
loops=1)
Join Filter: ((t."timestamp" >= s.session_start)
AND (t."timestamp" <= s.session_end))
Rows Removed by Join Filter: 18792613
-> Hash Join (cost=12046.09..203878.09
rows=1000000 width=92) (actual time=204.402..4293.175 rows=2000000 loops=1)
Hash Cond: (t.user_id = u.orig_distinct_id)
-> Seq Scan on ad_events_mv t
(cost=0.00..173082.00 rows=1000000 width=112) (actual time=0.022..1813.272
rows=2000000 loops=1)
Filter: (NOT processed)
-> Hash (cost=7932.15..7932.15
rows=329115 width=67) (actual time=204.166..204.166 rows=329115 loops=1)
Buckets: 65536 Batches: 1 Memory
Usage: 30590kB
-> Seq Scan on ad_users u
(cost=0.00..7932.15 rows=329115 width=67) (actual time=0.009..85.231
rows=329115 loops=1)
-> Bitmap Heap Scan on ad_sessions s
(cost=8.11..39.22 rows=8 width=32) (actual time=0.007..0.015 rows=10
loops=2000000)
Recheck Cond: ((ad_user_id = u.ad_user_id)
OR (ad_user_id = u.merged_id))
-> BitmapOr (cost=8.11..8.11 rows=8
width=0) (actual time=0.005..0.005 rows=0 loops=2000000)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0) (actual
time=0.003..0.003 rows=8 loops=2000000)
Index Cond: (ad_user_id =
u.ad_user_id)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0) (actual
time=0.001..0.001 rows=3 loops=2000000)
:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Update on ad_events e (cost=39730372.92..39765372.92 rows=1000000
width=237) (actual time=130140.450..130140.450 rows=0 loops=1)
-> Hash Join (cost=39730372.92..39765372.92 rows=1000000 width=237)
(actual time=54243.877..66848.448 rows=2000000 loops=1)
Hash Cond: (x.ad_event_id = e.ad_event_id)
-> Subquery Scan on x (cost=39654433.45..39666933.45
rows=1000000 width=144) (actual time=52682.740..57668.998 rows=2000000
loops=1)
-> Sort (cost=39654433.45..39656933.45 rows=1000000
width=108) (actual time=52682.693..55003.467 rows=2000000 loops=1)
Sort Key: s.ad_user_id, s.ad_session_id, t."timestamp"
Sort Method: external merge Disk: 1078104kB
-> Nested Loop Left Join (cost=12054.20..39554775.61
rows=1000000 width=108) (actual time=204.872..43742.396 rows=2000000
loops=1)
Join Filter: ((t."timestamp" >= s.session_start)
AND (t."timestamp" <= s.session_end))
Rows Removed by Join Filter: 18792613
-> Hash Join (cost=12046.09..203878.09
rows=1000000 width=92) (actual time=204.402..4293.175 rows=2000000 loops=1)
Hash Cond: (t.user_id = u.orig_distinct_id)
-> Seq Scan on ad_events_mv t
(cost=0.00..173082.00 rows=1000000 width=112) (actual time=0.022..1813.272
rows=2000000 loops=1)
Filter: (NOT processed)
-> Hash (cost=7932.15..7932.15
rows=329115 width=67) (actual time=204.166..204.166 rows=329115 loops=1)
Buckets: 65536 Batches: 1 Memory
Usage: 30590kB
-> Seq Scan on ad_users u
(cost=0.00..7932.15 rows=329115 width=67) (actual time=0.009..85.231
rows=329115 loops=1)
-> Bitmap Heap Scan on ad_sessions s
(cost=8.11..39.22 rows=8 width=32) (actual time=0.007..0.015 rows=10
loops=2000000)
Recheck Cond: ((ad_user_id = u.ad_user_id)
OR (ad_user_id = u.merged_id))
-> BitmapOr (cost=8.11..8.11 rows=8
width=0) (actual time=0.005..0.005 rows=0 loops=2000000)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0) (actual
time=0.003..0.003 rows=8 loops=2000000)
Index Cond: (ad_user_id =
u.ad_user_id)
-> Bitmap Index Scan on
ad_sessions_ad_user_id_idx (cost=0.00..4.05 rows=4 width=0) (actual
time=0.001..0.001 rows=3 loops=2000000)
Index Cond: (ad_user_id =
u.merged_id)
-> Hash (cost=50938.65..50938.65 rows=2000065 width=101) (actual
time=1560.037..1560.037 rows=2000000 loops=1)
Buckets: 262144 Batches: 1 Memory Usage: 253047kB
-> Seq Scan on ad_events e (cost=0.00..50938.65
rows=2000065 width=101) (actual time=0.014..633.555 rows=2000000 loops=1)

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2014-04-15 05:49:41 Re: Non-deterministic 100% CPU hang on postgres 9.3
Previous Message Adrian Klaver 2014-04-15 04:08:51 Re: Unrecognized service