Re: Delete tables difference involves seq scan

From: Danylo Hlynskyi <abcz2(dot)uprola(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Delete tables difference involves seq scan
Date: 2017-12-01 14:52:54
Message-ID: CANZg+yfVLZ7edD8OjDoz3+soWsG=NQOe7DBZvo=ZXr-O1FFquw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I was able to speedup original query a lot by using CTE. It still uses seq
scan on `diff` table, but looks like it does this once:

explain
with
diff as (select id from drug_refills_diff)
delete from drug_refills
where id in (select id from diff);
QUERY
PLAN
---------------------------------------------------------------------------------------------------
Delete on drug_refills (cost=989844.94..990366.86 rows=456888836 width=46)
CTE diff
-> Seq Scan on drug_refills_diff (cost=0.00..720404.88 rows=11975088
width=16)
-> Nested Loop (cost=269440.05..269961.98 rows=456888836 width=46)
-> HashAggregate (cost=269439.48..269441.48 rows=200 width=56)
Group Key: diff.id
-> CTE Scan on diff (cost=0.00..239501.76 rows=11975088
width=56)
-> Index Scan using drug_refills_pkey on drug_refills
(cost=0.57..2.59 rows=1 width=22)
Index Cond: (id = diff.id)
(9 rows)

2017-12-01 15:17 GMT+02:00 Danylo Hlynskyi <abcz2(dot)uprola(at)gmail(dot)com>:

> Oh, sorry, this happens on Postgresql 9.6.6. I've checked that on
> Postgresql 10.0 query plan from setup (1)
> now uses two seq scans, like in setup (2).
>
>
> 2017-12-01 15:03 GMT+02:00 Danylo Hlynskyi <abcz2(dot)uprola(at)gmail(dot)com>:
>
>> Hello. I want to remove rows from first table, that exist in second
>> (equality is done using PK). However I experience seq scan on second table,
>> which counters my intuition - I think it should be index-only. Because
>> tables are large, performance of query is very bad.
>>
>> However I got mixed results when trying to reproduce this behavior on
>> syntetic tables. Here I'll show 3 different plans, which I got for the same
>> query.
>>
>> 1. Setup is:
>> ---------------------------
>> create table diff (id uuid constraint diff_pkey primary key);
>> create table origin (id uuid constraint origin_pkey primary key);
>> ---------------------------
>>
>> The query generates correct plan, which performs only index scans:
>>
>> explain delete from origin where exists (select id from diff where
>> origin.id = diff.id);
>> QUERY PLAN
>> ------------------------------------------------------------
>> -------------------------------
>> Delete on origin (cost=0.30..105.56 rows=1850 width=12)
>> -> Merge Semi Join (cost=0.30..105.56 rows=1850 width=12)
>> Merge Cond: (origin.id = diff.id)
>> -> Index Scan using origin_pkey on origin (cost=0.15..38.90
>> rows=1850 width=22)
>> -> Index Scan using diff_pkey on diff (cost=0.15..38.90
>> rows=1850 width=22)
>> (5 rows)
>>
>> 2. Setup is:
>> --------------------------------
>> create table origin (id uuid constraint origin_pkey primary key, data
>> jsonb);
>> create table diff (id uuid constraint diff_pkey primary key, data jsonb);
>> --------------------------------
>>
>> The query generates plan with two seq scans:
>>
>> explain delete from origin where exists (select id from diff where
>> origin.id = diff.id);
>> QUERY PLAN
>> ------------------------------------------------------------
>> ---------------
>> Delete on origin (cost=34.08..69.49 rows=1070 width=12)
>> -> Hash Semi Join (cost=34.08..69.49 rows=1070 width=12)
>> Hash Cond: (origin.id = diff.id)
>> -> Seq Scan on origin (cost=0.00..20.70 rows=1070 width=22)
>> -> Hash (cost=20.70..20.70 rows=1070 width=22)
>> -> Seq Scan on diff (cost=0.00..20.70 rows=1070 width=22)
>> (6 rows)
>>
>> 3. My real `origin` table has 26 fields and 800 billion rows, real `diff`
>> table has 12 million rows and the query generates plan with nested loop and
>> seq scan on `diff` table:
>>
>> explain delete from drug_refills origin where exists (select id from
>> drug_refills_diff diff where origin.id = diff.id);
>> QUERY PLAN
>> ------------------------------------------------------------
>> ----------------------------------------------
>> Delete on drug_refills origin (cost=0.57..22049570.11 rows=11975161
>> width=12)
>> -> Nested Loop (cost=0.57..22049570.11 rows=11975161 width=12)
>> -> Seq Scan on drug_refills_diff diff (cost=0.00..720405.61
>> rows=11975161 width=22)
>> -> Index Scan using drug_refills_pkey on drug_refills origin
>> (cost=0.57..1.77 rows=1 width=22)
>> Index Cond: (id = diff.id)
>> (5 rows)
>>
>> I have run ANALYZE on both tables, but it didn't help. Here are column
>> types in origin and diff (same schema), if that matters:
>>
>> uuid
>> timestamp with time zone
>> timestamp with time zone
>> character varying(255)
>> character varying(255)
>> character varying(1024)
>> numeric(10,4)
>> integer
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> date
>> jsonb
>> text[]
>> uuid
>> uuid
>> uuid
>> uuid
>>
>>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Roman Konoval 2017-12-01 17:20:28 Bad plan for ltree predicate <@
Previous Message Danylo Hlynskyi 2017-12-01 13:17:30 Re: Delete tables difference involves seq scan