From: Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: anti-join with small table via text/varchar cannot estimate rows correctly
Date: 2017-03-01 22:00:06
Message-ID: 4de52c93-68e1-7c4b-86ff-4dc17ae400bb@synedra.com
Lists: pgsql-performance
Hello,

I have encountered a strange problem when doing an anti-join with a very
small table via a varchar or text field, as opposed to an integer field.
The Postgres version is 9.5.3.

I did some experiments to extract the problem in a simple form. First I
generate two tables containing a series of numbers - once as integers,
once as text. The first table has 10,000 rows; the second table just one:
=# select generate_series(1, 10000) as id, generate_series(1,10000)::text as text into table tmp_san_1;
SELECT 10000
=# select generate_series(1, 1) as id, generate_series(1,1)::text as text into table tmp_san_2;
SELECT 1
=# analyze tmp_san_1;
ANALYZE
=# analyze tmp_san_2;
ANALYZE
=# \d tmp_san_*
Table "public.tmp_san_1"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
text | text |
Table "public.tmp_san_2"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
text | text |
Now I do an anti-join between the two tables via the id field (integer).
The number of resulting rows is estimated correctly as 9,999:
=# explain analyze
select tmp_san_1.id
from tmp_san_1
left join tmp_san_2 on tmp_san_1.id = tmp_san_2.id
where tmp_san_2.id is null;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=1.02..281.26 rows=9999 width=4) (actual time=0.019..2.743 rows=9999 loops=1)
Hash Cond: (tmp_san_1.id = tmp_san_2.id)
-> Seq Scan on tmp_san_1 (cost=0.00..154.00 rows=10000 width=4) (actual time=0.007..1.023 rows=10000 loops=1)
-> Hash (cost=1.01..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on tmp_san_2 (cost=0.00..1.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
Planning time: 0.138 ms
Execution time: 3.218 ms
(8 rows)
The same anti-join using the text fields, however, estimates just 1
resulting row, although there are of course still 9,999 of them:
=# explain analyze
select tmp_san_1.id
from tmp_san_1
left join tmp_san_2 on tmp_san_1.text = tmp_san_2.text
where tmp_san_2.id is null;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.02..192.53 rows=1 width=4) (actual time=0.020..3.091 rows=9999 loops=1)
Hash Cond: (tmp_san_1.text = tmp_san_2.text)
Filter: (tmp_san_2.id IS NULL)
Rows Removed by Filter: 1
-> Seq Scan on tmp_san_1 (cost=0.00..154.00 rows=10000 width=8) (actual time=0.008..0.983 rows=10000 loops=1)
-> Hash (cost=1.01..1.01 rows=1 width=6) (actual time=0.004..0.004 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on tmp_san_2 (cost=0.00..1.01 rows=1 width=6) (actual time=0.002..0.002 rows=1 loops=1)
Planning time: 0.173 ms
Execution time: 3.546 ms
(10 rows)
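One detail that may be relevant: the second plan is a "Hash Left Join" with
a separate Filter step, not a "Hash Anti Join" - so perhaps the planner only
recognizes the anti-join pattern when the IS NULL test refers to the join
column itself. Two rewrites I would expect to produce an anti-join plan
(a sketch against the same tables, not verified here):

=# explain analyze
select tmp_san_1.id
from tmp_san_1
left join tmp_san_2 on tmp_san_1.text = tmp_san_2.text
where tmp_san_2.text is null;          -- NULL test on the join column

=# explain analyze
select tmp_san_1.id
from tmp_san_1
where not exists (select 1 from tmp_san_2
                  where tmp_san_2.text = tmp_san_1.text);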
I cannot explain this behavior, much less think of a fix or workaround.
Unfortunately, my real-world case has to join via a varchar column.

Thanks for any help,
Stefan