Quick Links

Re: New hashed IN code ignores distinctiveness of subquery

From:	Bradley Baetz <bbaetz(at)acm(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: New hashed IN code ignores distinctiveness of subquery
Date:	2003-01-26 22:37:39
Message-ID:	20030126223739.GA1193@mango.home
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs pgsql-hackers

On Sun, Jan 26, 2003 at 02:09:49PM -0500, Tom Lane wrote:
>
> This isn't really anything to do with the new IN code, but is a
> long-standing problem: cost_mergejoin doesn't apply any penalty factor
> for the case where there are lots of duplicates in both inner and outer
> relation (causing rescans of big chunks of the inner relation). You can
> see the rescanning happening in the EXPLAIN ANALYZE output:
>
> > -> Merge Join (cost=0.00..3485661.38 rows=5570 width=8) (actual
> > time=0.15..1429696.69 rows=50000 loops=1)
> > Merge Cond: ("outer".product_id = "inner".product_id)
> > -> Index Scan using bugs_product_id_idx on bugs
> > (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
> > rows=50000 loops=1)
> > -> Index Scan using bugs_product_id_idx on bugs
> > (cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
> > rows=277884160 loops=1)
> ^^^^^^^^^^^^^^
>
> 277884160 rows pulled from a 50000-row relation means a heck of a lot of
> rescanning went on :-(

Hmm. I'm not sure that that is the entire story. For the non-DISTINCT
version, the top of the tree has an estimated cost of 3485675.30, whilst
for the DISTINCT case, the estimated cost is 3033.30. If the planner
tried to treat ... IN (SELECT foo ...) as ... IN (SELECT DISTINCT foo
...) then presumably the second plan would have been chosen. IOW, the
cost estimate is roughly right (although its actually probably low,
because the number of rows is wrong - see below), but the
DISTINCTiveness of IN isn't being considered.

>
> The good news is that the system *is* aware of the small number of
> distinct values in the table (note the dead-on estimate of the number of
> distinct rows in your other query; which I think is from new-for-7.4
> code, though the required stats have been available since 7.2).
>
> I think it'd probably be possible to make some reasonable estimate of
> the amount of rescanning required, and then inflate the mergejoin cost
> estimate proportionally. I have not gotten around to looking at the
> problem though. Care to take a shot at it?

See above - the estimated merge_join cost appears to be being inflated
already (to 3485661.38). That may be off by an order of magnitude, I
guess, but its higher than the hash_join version. What appears not to be
happening is that the optimiser doesn't realise that the real cost is
less because of:

/*
* If we're doing an IN join, we want to return at most one row per
* outer tuple; so we can stop scanning the inner scan if we matched on
* the previous try.
*/
if (node->js.jointype == JOIN_IN &&
node->hj_MatchedOuter)
node->hj_NeedNewOuter = true;

Actually, from the last plan in my previous mail:

-> Hash (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)

Why is the actual number of rows 0? Would it be easier to have the
hashing stage discard duplicates for an IN hash? I guess thats the same
as doing a hash-based DISTINCT, though, although the DISTINCT plan has:

-> Hash
-> Subquery Scan "IN_subquery"
-> Unique
-> Index scan using bugs_product_id_idx

so I guess the unique and the hash aren't combined (Not that the unique
is needed - the indexscan should only be returning unique values anyway)

While I'm going through the stats, the number of rows for a JOIN_IN
merge is wrong - its currently |outer_rel->rows * selec;|, but should
probably be |outer_rel->rows * selec * inner_rel->num_distinct_rows;| -
notice the estimated-vs-actual for the final join, of 5580 vs 50000.
5580 would be correct if the IN only returned one value, but thats not
the case. IOW, should the JOIN_IN and JOIN_REVERSE_IN be using the same
cost calcs as JOIN_INNER_UNIQUE and JOIN_OUTER_UNIQUE in
set_joinrel_size_estimates ?

>
> regards, tom lane

Thanks,

Bradley

In response to

Re: New hashed IN code ignores distinctiveness of subquery at 2003-01-26 19:09:49 from Tom Lane

Responses

Re: New hashed IN code ignores distinctiveness of subquery at 2003-01-26 23:51:18 from Tom Lane

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Bruce Momjian	2003-01-26 23:02:08	Re: Bug #880: COMMENT ON DATABASE depends on current database
Previous Message	Neil Conway	2003-01-26 22:16:54	Re: Bug #883: explain analyze causes postgres to die

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Curt Sampson	2003-01-26 22:48:22	Re: Call for objections: put back OIDs in CREATE TABLE
Previous Message	Bruce Momjian	2003-01-26 20:41:25	Re: Call for objections: put back OIDs in CREATE TABLE AS/SELECT