From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Dimitrios Apostolou <jimis(at)gmx(dot)net> |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: array_agg() does not stop aggregating according to HAVING clause |
Date: | 2024-08-17 15:15:11 |
Message-ID: | 954894.1723907711@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
> I have a query that goes through *billions* of rows and for the columns
> that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it
> selects all the IDs of the entries (array_agg(run_n)). Here is the full
> query:
> INSERT INTO infrequent_datatags_in_this_chunk
> SELECT datatag, datatags.datatag_n, array_agg(run_n)
> FROM runs_raw
> JOIN datatags USING(datatag_n)
> WHERE workitem_n >= 295
> AND workitem_n < 714218
> AND datatag IS NOT NULL
> GROUP BY datatags.datatag_n
> HAVING count(datatag_n) < 10
> AND count(datatag_n) > 0 -- Not really needed because of the JOIN above
> ;
> The problem is that this is extremely slow (5 hours), most likely because
> it creates tens of gigabytes of temporary files as I see in the logs. I
> suspect that it is writing to disk the array_agg(run_n) of all entries and
> not only those HAVING count(datatag_n)<10.
Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN. There's no way to apply the HAVING filter until
after the aggregation is finished.
I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.
> How do I tell postgres to stop aggregating when count>=10?
The only way to do this would be to do two separate passes of
aggregation in separate sub-queries. Perhaps like
WITH rare AS (
SELECT datatag_n
FROM runs_raw
WHERE workitem_n >= 295
AND workitem_n < 714218
AND datatag IS NOT NULL
GROUP BY datatag_n
HAVING count(datatag_n) < 10
AND count(datatag_n) > 0
)
INSERT INTO infrequent_datatags_in_this_chunk
SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING(datatag_n)
JOIN rare USING(datatag_n)
GROUP BY datatags.datatag_n
;
I can't tell from what you said which level the workitem_n and
datatag conditions go at, so this is just a draft-quality
query. But I think the structure is basically okay, given
that you said datatag_n is unique in datatags (so there's no
need to join it in the WITH step).
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitrios Apostolou | 2024-08-17 15:35:39 | Re: array_agg() does not stop aggregating according to HAVING clause |
Previous Message | Dimitrios Apostolou | 2024-08-17 14:37:25 | array_agg() does not stop aggregating according to HAVING clause |