Re: BUG #13614: Optimizer choosing

From: Matthew Bellew <matthewb(at)labkey(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13614: Optimizer choosing
Date: 2015-10-14 04:08:16
Message-ID: loom.20151014T055051-633@post.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

This issue has been affecting a lot of our customers lately, (perhaps it has
become even more common to use nested loop joins in 9.4.5?). I have created
a repro to show the problem.

Given two table A joined to B and a filter of say selectivity Asel% on A and
Bsel% on B. It is very optimistic to assume that the selectivity of the rows
in A join B is Asel * Bsel. But this seems to be what Postgres assumes. A
very conservative estimate would be that the join selectivity is
min(Asel,Bsel). The conservative estimate is probably not ideal, but is less
likely to lead to a pathological plan (one that is orders of magnitude worse
than if no filters had been applied at all), than using the very optimistic
estimate.

Just guessing that the geometric mean of the optimistic and pessimistic
estimate. sqrt(min(Asel,Bsel) * (Asel*Bsel)) might be close to a good
compromise.

Regardless, or the exact estimation it seems clear that

a) the current estimate for joins over filtered tables it way too optmistic b)
the estimate should not be allowed to float down to 1, 1 should only be used
to mean exactly 1, any other estimate should be capped at some sort of
minimum to help avoid choosing joins that degrade badly if the estimate is
wrong.

SAMPLE 1 - no UNIQUE indexes, on my machine despite estimates, the optimizer
chooses MERGE JOINS and results are very fast.

DROP TABLE IF EXISTS A;
DROP TABLE IF EXISTS B;
DROP TABLE IF EXISTS C;
DROP TABLE IF EXISTS D;
DROP TABLE IF EXISTS E;

SELECT * INTO A
FROM
(
SELECT seq, 100000-seq as rev, (seq+111) % 100000 as rot,
rand, rand%10 as ten, rand%100 as hundred, rand%1000 as thousand,
md5(random()::text) AS descr
FROM (SELECT generate_series(1,100000) AS seq, cast(floor(random()*10000) as
int4) as rand) _
) _;
--CREATE UNIQUE INDEX ON A (seq);
--CREATE UNIQUE INDEX ON A (rev);
--CREATE UNIQUE INDEX ON A (rot);
CREATE INDEX ON A (hundred);

SELECT * INTO B FROM A;
--CREATE UNIQUE INDEX ON B (seq);
--CREATE UNIQUE INDEX ON B (rev);
--CREATE UNIQUE INDEX ON B (rot);
CREATE INDEX ON B (hundred);

SELECT * INTO C FROM A;
--CREATE UNIQUE INDEX ON C (seq);
--CREATE UNIQUE INDEX ON C (rev);
--CREATE UNIQUE INDEX ON C (rot);
CREATE INDEX ON C (hundred);

SELECT * INTO D FROM A;
--CREATE UNIQUE INDEX ON D (seq);
--CREATE UNIQUE INDEX ON D (rev);
--CREATE UNIQUE INDEX ON D (rot);
CREATE INDEX ON D (hundred);

SELECT * INTO E FROM A;
--CREATE UNIQUE INDEX ON E (seq);
--CREATE UNIQUE INDEX ON E (rev);
--CREATE UNIQUE INDEX ON E (rot);
CREATE INDEX ON E (hundred);

SELECT _a.seq, _b.rand, _c.ten, _d.hundred, _e.descr
FROM
(SELECT * FROM A WHERE hundred in (1,2,3,4,5)) _a
JOIN (SELECT * FROM B WHERE hundred in (1,2,3,4,5)) _b ON _a.seq = _b.seq
JOIN (SELECT * FROM C WHERE hundred in (1,2,3,4,5)) _c ON _b.rev = _c.rev
JOIN (SELECT * FROM D WHERE hundred in (1,2,3,4,5)) _d ON _c.seq = _d.seq
JOIN (SELECT * FROM E WHERE hundred in (1,2,3,4,5)) _e ON _d.rot = _e.rot

SAMPLE 2- with UNIQUE indexes, on my machine the optimizer starts using NESTED
LOOPS once the row estimate gets to 1. This case isn't tragically bad (worse
than SAMPLE 1), but we hit pathological cases in the wild that are
exceedingly hard to avoid.

DROP TABLE IF EXISTS A;
DROP TABLE IF EXISTS B;
DROP TABLE IF EXISTS C;
DROP TABLE IF EXISTS D;
DROP TABLE IF EXISTS E;

SELECT * INTO A
FROM
(
SELECT seq, 100000-seq as rev, (seq+111) % 100000 as rot,
rand, rand%10 as ten, rand%100 as hundred, rand%1000 as thousand,
md5(random()::text) AS descr
FROM (SELECT generate_series(1,100000) AS seq, cast(floor(random()*10000) as
int4) as rand) _
) _;
CREATE UNIQUE INDEX ON A (seq);
CREATE UNIQUE INDEX ON A (rev);
CREATE UNIQUE INDEX ON A (rot);
CREATE INDEX ON A (hundred);

SELECT * INTO B FROM A;
CREATE UNIQUE INDEX ON B (seq);
CREATE UNIQUE INDEX ON B (rev);
CREATE UNIQUE INDEX ON B (rot);
CREATE INDEX ON B (hundred);

SELECT * INTO C FROM A;
CREATE UNIQUE INDEX ON C (seq);
CREATE UNIQUE INDEX ON C (rev);
CREATE UNIQUE INDEX ON C (rot);
CREATE INDEX ON C (hundred);

SELECT * INTO D FROM A;
CREATE UNIQUE INDEX ON D (seq);
CREATE UNIQUE INDEX ON D (rev);
CREATE UNIQUE INDEX ON D (rot);
CREATE INDEX ON D (hundred);

SELECT * INTO E FROM A;
CREATE UNIQUE INDEX ON E (seq);
CREATE UNIQUE INDEX ON E (rev);
CREATE UNIQUE INDEX ON E (rot);
CREATE INDEX ON E (hundred);

SELECT _a.seq, _b.rand, _c.ten, _d.hundred, _e.descr
FROM
(SELECT * FROM A WHERE hundred in (1,2,3,4,5)) _a
JOIN (SELECT * FROM B WHERE hundred in (1,2,3,4,5)) _b ON _a.seq = _b.seq
JOIN (SELECT * FROM C WHERE hundred in (1,2,3,4,5)) _c ON _b.rev = _c.rev
JOIN (SELECT * FROM D WHERE hundred in (1,2,3,4,5)) _d ON _c.seq = _d.seq
JOIN (SELECT * FROM E WHERE hundred in (1,2,3,4,5)) _e ON _d.rot = _e.rot

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message John R Pierce 2015-10-14 04:25:47 Re: postgresql database limit check
Previous Message 許耀彰 2015-10-14 03:01:31 postgresql database limit check