Re: BUG #13908: Query returns too few rows

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: seth-p(at)outlook(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13908: Query returns too few rows
Date: 2016-02-02 22:31:35
Message-ID: 20160202223135.GA157331@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

seth-p(at)outlook(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 13908
> Logged by: Seth
> Email address: seth-p(at)outlook(dot)com
> PostgreSQL version: 9.5.0
> Operating system: Windows 10
> Description:
>
> Below is a copy of a long-ish email I wrote detailing the problem. If it
> doesn't come through in a reasonable format, please let me know where I can
> email it. Thanks --Seth

Here's the copy Seth sent with better formatting.

--------------------------

I've only recently started using Postgresql, and have encountered what I believe to be a bug in Postgresql 9.5 on Windows 10.
Below are 6 queries that are identical except for (a) the date range in the WHERE clause, and (b) whether or not they include DISTINCT. This is as simple as I could get the queries while still reproducing the bug. The following table summarizes the number of rows returned by each query:

-----------------------------------------------------------------------------------------------------------------------------------
-- Date Range | # rows without DISTINCT | # rows with DISTINCT | Notes |
------------------------------------------------------------------------------|----------------------------------------------------
-- 2005-03-01 - 2005-07-30 | (A) 415,983 | (A-D) 416,075 | DISTINCT *increases* the number of rows returned! |
-- 2005-03-01 - 2005-04-30 | (B) 168,886 | (B-D) 168,886 | DISTINCT has no effect |
-- 2005-05-01 - 2005-07-30 | (C) 247,189 | (C-D) 247,189 | DISTINCT has no effect |
-- sum of sub-range queries | (B) + (C) 416,075 | (B-D) + (C-D) 416,075 | In both cases, sum of 2 queries equals (A-D) |
-----------------------------------------------------------------------------------------------------------------------------------

Query (A-D) (with DISTINCT) should not return more rows than query (A) (the identical query without DISTINCT), so clearly something is wrong there.
Looking at the results of the queries over the two sub-ranges strongly suggests that it is query (A) that is returning too few rows.

Beneath the queries below I also show the EXPLAIN results for queries (A), (A-D), (B), and (B-D). Notice that the plan for query (A) differs from the other three -- it appears to have an extra "Hash" (highlighted with asterisks) that the others do not.

I'm afraid I am unable to share the data, but can provide table statistics, index descriptions, or anything else that might be useful. Any suggestions for how to proceed?

Seth

-- (A) 2005-03-01 - 2005-07-30 returns 415,983 rows
select count(*) from
(
SELECT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-03-01' AND '2005-07-30'
) a

-- (A-D) 2005-03-01 - 2005-07-30 DISTNICT returns 416,075 rows
select count(*) from
(
SELECT DISTINCT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-03-01' AND '2005-07-30'
) ad

-- (B) 2005-03-01 - 2005-04-30 returns 168,886 rows
select count(*) from
(
SELECT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-03-01' AND '2005-04-30'
) b

-- (B-D) 2005-03-01 - 2005-04-30 DISTINCT returns 168,886 rows
select count(*) from
(
SELECT DISTINCT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-03-01' AND '2005-04-30'
) bd

-- (C) 2005-05-01 - 2005-07-30 returns 247,189 rows
select count(*) from
(
SELECT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-05-01' AND '2005-07-30'
) c

-- (C-D) 2005-05-01 - 2005-07-30 DISTINCT returns 247,189 rows
select count(*) from
(
SELECT DISTINCT
rrr.*
FROM files.models_direct_row_asset_data rrr
INNER JOIN files.models_direct_file fff
ON fff.idx = rrr.barra_file_idx
INNER JOIN files.temp_universe_instruments uuu
ON uuu.universe_hash = 5188205190738336870 AND
uuu.barrid = rrr.barrid
WHERE
fff.file_name_model_ver = '100' AND
fff.file_name_date BETWEEN '2005-05-01' AND '2005-07-30'
) cd

EPXLAIN of queries

(A)
"Aggregate (cost=1821781.18..1821781.19 rows=1 width=0)"
" -> Hash Join (cost=1811365.12..1821096.53 rows=273861 width=0)"
" Hash Cond: ((uuu.barrid)::text = (rrr.barrid)::text)"
" -> Bitmap Heap Scan on temp_universe_instruments uuu (cost=231.25..932.60 rows=8108 width=8)"
" Recheck Cond: (universe_hash = '5188205190738336870'::bigint)"
" -> Bitmap Index Scan on pk_temp_universe_instruments (cost=0.00..229.23 rows=8108 width=0)"
" Index Cond: (universe_hash = '5188205190738336870'::bigint)"
" -> Hash (cost=1796474.21..1796474.21 rows=893492 width=8)" **********************************************************
" -> Hash Join (cost=2341.15..1796474.21 rows=893492 width=8)"
" Hash Cond: (rrr.barra_file_idx = fff.idx)"
" -> Seq Scan on models_direct_row_asset_data rrr (cost=0.00..1518763.74 rows=71049174 width=12)"
" -> Hash (cost=2329.00..2329.00 rows=972 width=4)"
" -> Bitmap Heap Scan on models_direct_file fff (cost=35.33..2329.00 rows=972 width=4)"
" Recheck Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-07-30'::date))"
" Filter: ((file_name_model_ver)::text = '100'::text)"
" -> Bitmap Index Scan on ix_models_direct_file_file_name_date (cost=0.00..35.08 rows=1479 width=0)"
" Index Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-07-30'::date))"

(A-D)
"Aggregate (cost=1886439.23..1886439.24 rows=1 width=0)"
" -> Unique (cost=1875484.79..1883015.97 rows=273861 width=64)"
" -> Sort (cost=1875484.79..1876169.44 rows=273861 width=64)"
" Sort Key: rrr.idx, rrr.row_number, rrr.barrid, rrr.yield_pct, rrr.total_risk_pct, rrr.spec_risk_pct, rrr.hist_beta, rrr.pred_beta, rrr.data_date, rrr.barra_file_idx"
" -> Hash Join (cost=3375.10..1840453.92 rows=273861 width=64)"
" Hash Cond: ((rrr.barrid)::text = (uuu.barrid)::text)"
" -> Hash Join (cost=2341.15..1796474.21 rows=893492 width=64)"
" Hash Cond: (rrr.barra_file_idx = fff.idx)"
" -> Seq Scan on models_direct_row_asset_data rrr (cost=0.00..1518763.74 rows=71049174 width=64)"
" -> Hash (cost=2329.00..2329.00 rows=972 width=4)"
" -> Bitmap Heap Scan on models_direct_file fff (cost=35.33..2329.00 rows=972 width=4)"
" Recheck Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-07-30'::date))"
" Filter: ((file_name_model_ver)::text = '100'::text)"
" -> Bitmap Index Scan on ix_models_direct_file_file_name_date (cost=0.00..35.08 rows=1479 width=0)"
" Index Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-07-30'::date))"
" -> Hash (cost=932.60..932.60 rows=8108 width=8)"
" -> Bitmap Heap Scan on temp_universe_instruments uuu (cost=231.25..932.60 rows=8108 width=8)"
" Recheck Cond: (universe_hash = '5188205190738336870'::bigint)"
" -> Bitmap Index Scan on pk_temp_universe_instruments (cost=0.00..229.23 rows=8108 width=0)"
" Index Cond: (universe_hash = '5188205190738336870'::bigint)"

(B)
"Aggregate (cost=1809368.84..1809368.85 rows=1 width=0)"
" -> Hash Join (cost=2484.83..1809086.39 rows=112981 width=0)"
" Hash Cond: ((rrr.barrid)::text = (uuu.barrid)::text)"
" -> Hash Join (cost=1450.88..1790335.13 rows=368611 width=8)"
" Hash Cond: (rrr.barra_file_idx = fff.idx)"
" -> Seq Scan on models_direct_row_asset_data rrr (cost=0.00..1518763.74 rows=71049174 width=12)"
" -> Hash (cost=1445.87..1445.87 rows=401 width=4)"
" -> Bitmap Heap Scan on models_direct_file fff (cost=14.49..1445.87 rows=401 width=4)"
" Recheck Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-04-30'::date))"
" Filter: ((file_name_model_ver)::text = '100'::text)"
" -> Bitmap Index Scan on ix_models_direct_file_file_name_date (cost=0.00..14.39 rows=610 width=0)"
" Index Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-04-30'::date))"
" -> Hash (cost=932.60..932.60 rows=8108 width=8)"
" -> Bitmap Heap Scan on temp_universe_instruments uuu (cost=231.25..932.60 rows=8108 width=8)"
" Recheck Cond: (universe_hash = '5188205190738336870'::bigint)"
" -> Bitmap Index Scan on pk_temp_universe_instruments (cost=0.00..229.23 rows=8108 width=0)"
" Index Cond: (universe_hash = '5188205190738336870'::bigint)"

(B-D)
"Aggregate (cost=1827336.97..1827336.98 rows=1 width=0)"
" -> Unique (cost=1822817.73..1825924.71 rows=112981 width=64)"
" -> Sort (cost=1822817.73..1823100.18 rows=112981 width=64)"
" Sort Key: rrr.idx, rrr.row_number, rrr.barrid, rrr.yield_pct, rrr.total_risk_pct, rrr.spec_risk_pct, rrr.hist_beta, rrr.pred_beta, rrr.data_date, rrr.barra_file_idx"
" -> Hash Join (cost=2484.83..1809086.39 rows=112981 width=64)"
" Hash Cond: ((rrr.barrid)::text = (uuu.barrid)::text)"
" -> Hash Join (cost=1450.88..1790335.13 rows=368611 width=64)"
" Hash Cond: (rrr.barra_file_idx = fff.idx)"
" -> Seq Scan on models_direct_row_asset_data rrr (cost=0.00..1518763.74 rows=71049174 width=64)"
" -> Hash (cost=1445.87..1445.87 rows=401 width=4)"
" -> Bitmap Heap Scan on models_direct_file fff (cost=14.49..1445.87 rows=401 width=4)"
" Recheck Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-04-30'::date))"
" Filter: ((file_name_model_ver)::text = '100'::text)"
" -> Bitmap Index Scan on ix_models_direct_file_file_name_date (cost=0.00..14.39 rows=610 width=0)"
" Index Cond: ((file_name_date >= '2005-03-01'::date) AND (file_name_date <= '2005-04-30'::date))"
" -> Hash (cost=932.60..932.60 rows=8108 width=8)"
" -> Bitmap Heap Scan on temp_universe_instruments uuu (cost=231.25..932.60 rows=8108 width=8)"
" Recheck Cond: (universe_hash = '5188205190738336870'::bigint)"
" -> Bitmap Index Scan on pk_temp_universe_instruments (cost=0.00..229.23 rows=8108 width=0)"
" Index Cond: (universe_hash = '5188205190738336870'::bigint)"

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2016-02-02 22:32:22 Re: BUG #13888: pg_dump write error
Previous Message Andres Freund 2016-02-02 21:56:33 Re: BUG #13888: pg_dump write error