RE: Postgres12 looking for possible HashAggregate issue workarounds?

From: João Paulo Luís <joao(dot)luis(at)pdmfc(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: RE: Postgres12 looking for possible HashAggregate issue workarounds?
Date: 2022-12-19 16:50:56
Message-ID: HE1PR08MB2921CDBEC9B5296D0A1F8BD9EFE59@HE1PR08MB2921.eurprd08.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thank you David Rowley (best peformance fix so far)!

nsoamt=> select attname,n_distinct from pg_Stats where tablename = 'sentencesource' and attname = 'sentence';
attname | n_distinct
----------+------------
sentence | 255349
(1 row)

select count(*), count(distinct sentence) from sentencesource;
count | count
------------+-----------
1150174041 | 885946963
(1 row)

-- Seems badly estimated to me.

-- I expect +/-80% of rows to have a distinct value. Manual says -1 is for all rows being distinct.
nsoamt=> ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = -1);

nsoamt=> ANALYZE VERBOSE sentencesource ;
INFO: analyzing "public.sentencesource"
INFO: "sentencesource": scanned 30000 of 9028500 pages, containing 3819977 live rows and 260307 dead rows; 30000 rows in sample, 1149622078 estimated total rows
ANALYZE

nsoamt=> select attname,n_distinct from pg_Stats where tablename = 'sentencesource' and attname = 'sentence';
attname | n_distinct
----------+------------
sentence | -1
(1 row)

nsoamt=> EXPLAIN SELECT COUNT(*), COUNT(NULLIF(Stchk.haserrors,'f'))
FROM SentenceToolCheck Stchk
WHERE EXISTS (SELECT SSrc.sentence
FROM SentenceSource SSrc, Document Doc
WHERE SSrc.sentence = Stchk.id
AND Doc.id = SSrc.document
AND Doc.source ILIKE '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=275199757.84..275199757.85 rows=1 width=16)
-> Gather (cost=275199757.62..275199757.83 rows=2 width=16)
Workers Planned: 2
-> Partial Aggregate (cost=275198757.62..275198757.63 rows=1 width=16)
-> Hash Join (cost=228004096.84..273527643.59 rows=222815204 width=1)
Hash Cond: (stchk.id = ssrc.sentence)
-> Parallel Seq Scan on sentencetoolcheck stchk (cost=0.00..35858393.80 rows=232196880 width=9)
-> Hash (cost=209905168.81..209905168.81 rows=1103172722 width=8)
-> Unique (cost=204389305.20..209905168.81 rows=1103172722 width=8)
-> Sort (cost=204389305.20..207147237.01 rows=1103172722 width=8)
Sort Key: ssrc.sentence
-> Hash Join (cost=73287.01..23615773.05 rows=1103172722 width=8)
Hash Cond: (ssrc.document = doc.id)
-> Seq Scan on sentencesource ssrc (cost=0.00..20524720.16 rows=1149622016 width=16)
-> Hash (cost=54327.65..54327.65 rows=1516749 width=4)
-> Seq Scan on document doc (cost=0.00..54327.65 rows=1516749 width=4)
Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
JIT:
Functions: 25
Options: Inlining true, Optimization true, Expressions true, Deforming true
(20 rows)

And the query finished in Time: 2637891.352 ms (43:57.891) (the best performance so far, although sentencesource fits in RAM :-)

Curious, now that I've manually set it to -1, who/what will change that setting in the future (not ANALYZE?) ?
It will stay that way until some one else (human user) changes it ? (How do I set it back to "automatic"?)

Hope that there is a way that this poor estimation is fixed in the future releases...

João Luís

Senior Developer

<mailto:%%Email%%>joao(dot)luis(at)pdmfc(dot)com<mailto:joao(dot)luis(at)pdmfc(dot)com>

+351 210 337 700

[https://dlnk.bio/wp-content/uploads/2022/11/assinaturaPDM-Natal-1-1.gif]
[https://www.pdmfc.com/images/email-signature/28-04.png]<https://pdmfc.com> [https://www.pdmfc.com/images/email-signature/28-06.png] <https://www.facebook.com/PDMFC> [https://www.pdmfc.com/images/email-signature/28-05.png] <https://www.linkedin.com/company/pdmfc> [https://www.pdmfc.com/images/email-signature/28-07.png] <https://www.instagram.com/pdmfc.tech> [https://www.pdmfc.com/images/email-signature/28-08.png] <https://www.youtube.com/channel/UCFiu8g5wv10TfMB-OfOaJUA>

Confidentiality
The information in this message is confidential and privileged. It is intended solely for the addressee. If you are not the intended recipient, any disclosure, copying, or distribution of the message, or any action or omission taken by you in reliance on it is prohibited.
Please contact the sender immediately if you have received this message by mistake.
Thank you for your cooperation.

________________________________
De: David Rowley <dgrowleyml(at)gmail(dot)com>
Enviado: 18 de dezembro de 2022 11:06
Para: João Paulo Luís <joao(dot)luis(at)pdmfc(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>; pgsql-performance(at)lists(dot)postgresql(dot)org <pgsql-performance(at)lists(dot)postgresql(dot)org>
Assunto: Re: Postgres12 looking for possible HashAggregate issue workarounds?

[Não costuma receber e-mails de dgrowleyml(at)gmail(dot)com(dot) Saiba por que motivo isto é importante em https://aka.ms/LearnAboutSenderIdentification. ]

CAUTION: External E-mail

On Sun, 18 Dec 2022 at 23:44, João Paulo Luís <joao(dot)luis(at)pdmfc(dot)com> wrote:
> Meanwhile, as a one-time workaround I've disabled the hashagg algorithm,

The way the query planner determines if Hash Aggregate's hash table
will fit in work_mem or not is based on the n_distinct estimate of the
columns being grouped on. You may want to review what analyze set
n_distinct to on this table. That can be done by looking at:

select attname,n_distinct from pg_Stats where tablename =
'sentencesource' and attname = 'sentence';

If what that's set to does not seem realistic, then you can overwrite this with:

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = N);

Please see the paragraph in [1] about n_distinct. Using an absolute
value is likely not a great idea if the table is going to grow. You
could maybe give it a better estimate about how many times values are
repeated by setting some negative value, as described in the
documents. You'll need to analyze the table again after changing this
setting.

David

[1] https://www.postgresql.org/docs/12/sql-altertable.html

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Frits Jalvingh 2022-12-19 16:57:42 Re: Fwd: temp_file_limit?
Previous Message Frits Jalvingh 2022-12-19 15:01:21 Re: Fwd: temp_file_limit?