Re: Bad planning data resulting in OOM killing of postgres

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: David Hinkle <hinkle(at)cipafilter(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Bad planning data resulting in OOM killing of postgres
Date: 2017-02-21 20:46:12
Message-ID: CAMkU=1zdL_04dNCZMgxRYs9w4-tJHumj+o+_3-uGVf584f7wWw@mail.gmail.com
Lists: pgsql-general

On Thu, Feb 16, 2017 at 3:27 PM, David Hinkle <hinkle(at)cipafilter(dot)com> wrote:

> I guess this doesn't work, latest test run crashed. It still uses the
> bad plan for the hostid column even after n_distinct is updated.
>
> cipafilter=# select attname, n_distinct from pg_stats where tablename
> cipafilter-# = 'log_raw' and (attname = 'urlid' or attname = 'titleid'
> or attname =
> cipafilter(# 'hostid');
> attname | n_distinct
> ---------+-------------
> urlid | 1.51625e+08
> hostid | 304710
> titleid | 886499
> (3 rows)
>
> cipafilter=# explain DELETE FROM hosts WHERE NOT EXISTS ( SELECT
> log_raw.hostid FROM log_raw WHERE log_raw.hostid = hosts.hostid );
>                                        QUERY PLAN
> -----------------------------------------------------------------------------------------
>  Delete on hosts  (cost=22249475.67..74844813.47 rows=1 width=12)
>    ->  Hash Anti Join  (cost=22249475.67..74844813.47 rows=1 width=12)
>          Hash Cond: (hosts.hostid = log_raw.hostid)
>          ->  Seq Scan on hosts  (cost=0.00..5017.10 rows=304710 width=10)
>          ->  Hash  (cost=12799395.52..12799395.52 rows=543645052 width=10)
>                ->  Seq Scan on log_raw  (cost=0.00..12799395.52 rows=543645052 width=10)
> (6 rows)
>
> I guess I will also try throwing in 'set enable_hashjoin = false;' and
> see if that gets these purges to go.
>
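[For reference, the per-column n_distinct override mentioned above is normally applied as shown below, and only takes effect after the next ANALYZE. This is standard PostgreSQL syntax rather than a command quoted from the thread; the table and column names are taken from the messages above.]

```sql
-- Override the planner's n_distinct estimate for log_raw.hostid
-- (value taken from the pg_stats output quoted earlier).
ALTER TABLE log_raw ALTER COLUMN hostid SET (n_distinct = 304710);

-- The override is only picked up when statistics are regathered.
ANALYZE log_raw;
```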

Another option would be to force the de-dup to happen, with:

explain with t as (select distinct hostid from log_raw) delete from hosts
where not exists (select 1 from t where t.hostid=hosts.hostid)

That way you can still use a hash join without running out of memory, in case
the hash join really is faster than the merge join. It also seems cleaner
than fiddling with enable_* parameters and then having to remember to reset
them when done.
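If you do go the enable_hashjoin route anyway, one way to avoid the
"remember to reset it" problem is SET LOCAL, which scopes the setting to a
single transaction. A minimal sketch (standard PostgreSQL behavior, not a
command from this thread):

```sql
BEGIN;
-- SET LOCAL lasts only until COMMIT/ROLLBACK, so the planner
-- setting cannot leak into later work in this session.
SET LOCAL enable_hashjoin = off;

DELETE FROM hosts
WHERE NOT EXISTS (
    SELECT 1 FROM log_raw WHERE log_raw.hostid = hosts.hostid
);
COMMIT;
```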

Cheers,

Jeff
