From: Vijaykumar Jain <vjain(at)opentable(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [External] Re: simple query on why a merge join plan got selected
Date: 2018-12-17 14:32:42
Message-ID: CAE7uO5i0p4wuCb-QrkLb8RbUqMCHhT+9zUF3xfBGadtF_gy7yw@mail.gmail.com
Lists: pgsql-general pgsql-hackers
Thanks a lot Tom, as always :)
We generally do not have so many duplicates in production, so maybe this is
an edge case, but I am happy with the explanation and the code reference for
the analysis.
I'll also play with default_statistics_target to see what changes when I
increase the value.
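Roughly what I plan to try (just a sketch against the same toy t1/t2 tables
from the example below; nothing beyond default_statistics_target, ANALYZE and
EXPLAIN):

    set default_statistics_target = 1000;   -- up from the default of 100
    analyze t1;
    analyze t2;
    explain analyze select * from t1 join t2 using (a);
    reset default_statistics_target;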
On Sun, 16 Dec 2018 at 5:52 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Vijaykumar Jain <vjain(at)opentable(dot)com> writes:
> > I was just playing with exploring joins and plans i came across this
> > create table t1(a int);
> > create table t2(a int);
> > insert into t1 select (x % 10) from generate_series(1, 100000) x;
> > insert into t2 select (x % 100) from generate_series(1, 100000) x;
> > ...
> > select * from t1 join t2 using (a);
>
> Hm. This is a fairly extreme case for mergejoining. In the first place,
> because of the disparity in the key ranges (t1.a goes from 0..9, t2.a
> from 0..99) the planner can figure out that a merge join can stop after
> scanning only 10% of t2. That doesn't help much here, since we still
> have to sort all of t2, but nonetheless the planner is going to take
> that into account. In the second place, because you have so many
> duplicate values, most rows in t1 will require "rescanning" 1000 rows
> that were already read and joined to the previous row of t1 (assuming
> t1 is on the left of the join; it's worse if t2 is on the left).
>
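(Interjecting just to convince myself of the two effects described above, the
key-range disparity and the duplicate counts; this is how I checked the data
in my sandbox, using only the t1/t2 tables from the example:)

    select min(a), max(a), count(*) / count(distinct a) as rows_per_value from t1;
    -- expect 0, 9, 10000
    select min(a), max(a), count(*) / count(distinct a) as rows_per_value from t2;
    -- expect 0, 99, 1000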
> The planner estimates each of those situations properly, but it looks
> to me like it is not handling the combination of both effects correctly.
> In costsize.c we've got
>
> /*
>  * The number of tuple comparisons needed is approximately number of outer
>  * rows plus number of inner rows plus number of rescanned tuples (can we
>  * refine this?).  At each one, we need to evaluate the mergejoin quals.
>  */
> startup_cost += merge_qual_cost.startup;
> startup_cost += merge_qual_cost.per_tuple *
>     (outer_skip_rows + inner_skip_rows * rescanratio);
> run_cost += merge_qual_cost.per_tuple *
>     ((outer_rows - outer_skip_rows) +
>      (inner_rows - inner_skip_rows) * rescanratio);
>
> where outer_rows and inner_rows are the numbers of rows we're predicting
> to actually read from each input, the xxx_skip_rows values are zero for
> this example, and rescanratio was previously computed as
>
> /* We'll inflate various costs this much to account for rescanning */
> rescanratio = 1.0 + (rescannedtuples / inner_path_rows);
>
> where inner_path_rows is the *total* size of the inner relation,
> including rows that we're predicting won't get read because of the
> stop-short effect.
>
> As far as I can tell, that comment's claim about the number of tuple
> comparisons needed is on-target ... but the code is computing a number
> of tuple comparisons 10x less than that. The reason is that rescanratio
> is wrong: it should be
>
> rescanratio = 1.0 + (rescannedtuples / inner_rows);
>
> instead, so that it's something that makes sense to multiply inner_rows
> by. In the existing uses of rescanratio, one multiplies it by
> inner_path_rows and needs to be changed to inner_rows to agree with
> this definition, but the other uses are already consistent with this.
>
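Plugging my example's numbers into that, if I follow: rescannedtuples is
roughly 100000 outer rows * 1000 re-read inner rows, i.e. about 1e8,
inner_path_rows is 1e5 (all of t2), and inner_rows is about 1e4 (the 10% that
is actually read). The old formula gives rescanratio of about 1 + 1e8/1e5, or
1001, so roughly 1e5 + 1e4 * 1001 = 1e7 charged comparisons, while the
corrected one gives about 1 + 1e8/1e4, or 10001, so roughly 1e8 comparisons,
which matches the comment's "outer rows plus inner rows plus rescanned
tuples" and the 10x gap you describe.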
> This doesn't make a significant difference if either rescannedtuples
> is small, or inner_rows isn't much less than inner_path_rows. But
> when neither is true, we can greatly underestimate the number of tuple
> comparisons we'll have to do, as well as the number of re-fetches from
> the inner plan node. I think in practice it doesn't matter that often,
> because in such situations we'd usually not have picked a mergejoin
> anyway. But in your example the buggy mergejoin cost estimate is about
> 10% less than the hashjoin cost estimate, so we go with mergejoin.
>
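(For anyone who wants to see the two estimates side by side, this is roughly
how I compared them; enable_mergejoin and enable_hashjoin are just the
standard planner GUCs, and only the top-line estimated costs matter here:)

    set enable_hashjoin = off;
    explain select * from t1 join t2 using (a);   -- merge join estimate
    reset enable_hashjoin;
    set enable_mergejoin = off;
    explain select * from t1 join t2 using (a);   -- hash join estimate
    reset enable_mergejoin;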
> The attached proposed patch fixes this, raising the mergejoin cost
> estimate to about 35% more than the hashjoin estimate, which seems
> a lot closer to reality. It doesn't seem to change any results in
> the regression tests, which I find unsurprising: there are cases
> like this in the tests, but as I just said, they pick hashjoins
> already.
>
> Also interesting is that after this fix, the estimated costs of a
> mergejoin for this example are about the same whether t1 or t2 is on
> the left. I think that's right: t2-on-the-left has 10x more rescanning
> to do per outer tuple, but it stops after scanning only 10% of the
> outer relation, canceling that out.
>
> I'm not sure whether to back-patch this. It's a pretty clear thinko,
> but there's the question of whether we'd risk destabilizing plan
> choices that are working OK in the real world.
>
> regards, tom lane
>
> --
Regards,
Vijay