Improve statistics estimation considering GROUP-BY as a 'uniqueiser'

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Improve statistics estimation considering GROUP-BY as a 'uniqueiser'
Date: 2024-09-19 07:55:19
Message-ID: 50fe6779-ee2d-4256-bc64-cd661bc4029a@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,
One database app developer complied that He constantly has a problem
with row estimations in joins over groupings and provided the demo example:

CREATE TABLE line (id int PRIMARY KEY, docId int, amount numeric);
CREATE INDEX line_doc ON line (docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 10) AS id,
generate_series(1, 25000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 20) AS id,
generate_series(25001, 50000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 50) AS id,
generate_series(50001, 75000) AS docid);
INSERT INTO line (id, docId, amount)
(SELECT docId*100 + id AS id, docId, random() AS amount
FROM generate_series(1, 100) AS id,
generate_series(75001, 100000) AS docid);
CREATE TABLE tmp (id int PRIMARY KEY);
INSERT INTO tmp (id) SELECT * FROM generate_series(1, 50);
ANALYZE line, tmp;

EXPLAIN
SELECT tmp.id, sq.amount FROM tmp
LEFT JOIN
(SELECT docid, SUM(amount) AS amount FROM line
JOIN tmp ON tmp.id = docid GROUP BY 1) sq
ON sq.docid = tmp.id;

with this query we have bad estimation of the top JOIN:

Hash Right Join (rows=855)
Hash Cond: (line.docid = tmp.id)
-> GroupAggregate (cost=3.49..117.25 rows=3420 width=36)
Group Key: line.docid
-> Merge Join (cost=3.49..57.40 rows=3420 width=15)
Merge Cond: (line.docid = tmp_1.id)
...

This wrong prediction makes things much worse if the query has more
upper query blocks.
His question was: Why not consider the grouping column unique in the
upper query block? It could improve estimations.
After a thorough investigation, I discovered that in commit 4767bc8ff2
most of the work was already done for DISTINCT clauses. So, why not do
the same for grouping? A sketch of the patch is attached.
As I see it, grouping in this sense works quite similarly to DISTINCT,
and we have no reason to ignore it. After applying the patch, you can
see that prediction has been improved:

Hash Right Join (cost=5.62..162.56 rows=50 width=36)

--
regards, Andrei Lepikhov

Attachment Content-Type Size
0001-Improve-statistics-estimation-considering-GROUP-BY-a.patch text/plain 5.5 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-09-19 08:03:09 Re: Partitioned tables and [un]loggedness
Previous Message Bertrand Drouvot 2024-09-19 07:54:21 Re: Track the amount of time waiting due to cost_delay