| From: | Dimitrios Apostolou <jimis(at)gmx(dot)net> |
|---|---|
| To: | David Rowley <dgrowleyml(at)gmail(dot)com> |
| Cc: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Query on partitioned table needs memory n_partitions * work_mem |
| Date: | 2024-07-19 13:31:34 |
| Message-ID: | e5ae2d08-75e1-3726-8905-1328e86453cd@gmx.net |
| Lists: | pgsql-general |
On Thu, 18 Jul 2024, David Rowley wrote:
> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis(at)gmx(dot)net> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>> are actually aggregated. I know this because I/O is slow on this device
>> and the table sizes are huge, it's simply not possible that postgres
>> went through all partitions and blew up the memory. That would take
>> hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small. Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?
I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy; the behaviour is a bit unstable. So far I see high memory
usage (8-10GB), but I haven't managed to trigger the OOM with that much
(it did OOM once, but so unpredictably that I couldn't measure anything
or reproduce it later).
-- I create a table with 2k partitions.
CREATE TABLE partitioned_table1(
    run_n bigint GENERATED ALWAYS AS IDENTITY,
    workitem_n integer NOT NULL,
    label_n smallint,
    result smallint NOT NULL,
    PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);
DO $$
for i in range(0, 2000):
    stmt = f'''
        CREATE TABLE part_max{i+1}M
        PARTITION OF partitioned_table1
        FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
    '''
    plpy.execute(stmt)
$$ LANGUAGE plpython3u;
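-- (Optional sanity check: count the attached partitions through
-- pg_inherits; it should return 2000.)
SELECT count(*)
FROM pg_inherits
WHERE inhparent = 'partitioned_table1'::regclass;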
-- I insert random data. First I insert into all partitions, 1M rows each:
DO $$
for i in range(0, 2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
        SELECT
            j - j%4,
            CAST(random()*1000 AS INTEGER),
            CAST(random()*3 AS INTEGER)
        FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) AS j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;
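-- (Optional, assuming PostgreSQL 12 or newer: check the total on-disk
-- size of the whole partition tree, to confirm the table really is huge.)
SELECT pg_size_pretty(sum(pg_total_relation_size(relid)))
FROM pg_partition_tree('partitioned_table1');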
-- Disable parallel execution and incremental sort, to steer the planner
-- away from a GroupAggregate plan:
SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;
SET SESSION work_mem TO '8MB';
-- Now the following query should do a HashAggregate:
SELECT
    workitem_n, label_n, bool_or(result IN (2,3))
FROM
    partitioned_table1
GROUP BY
    workitem_n, label_n
LIMIT 10;
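-- (To verify the plan shape first, one can prepend EXPLAIN; with
-- enable_partitionwise_aggregate turned on, which is the setting this
-- thread is about, each partition should get its own HashAggregate node.)
EXPLAIN (COSTS OFF)
SELECT workitem_n, label_n, bool_or(result IN (2,3))
FROM partitioned_table1
GROUP BY workitem_n, label_n
LIMIT 10;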
-- How high was the backend's RSS while the previous query was running?
-- Not that high. But if we insert a few million more rows into the
-- 1st partition, it gets much higher.
DO $$
for i in range(0, 2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
        SELECT
            j%1000000,
            CAST(random()*20000 AS INTEGER),
            CAST(random()*4 AS INTEGER)
        FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) AS j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;
-- Now that same query consumes between 8GB and 10GB of RSS. The more
-- data I insert (into all partitions?), the more memory the query
-- takes.
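-- (For watching the RSS on Linux: grab the backend PID from the session
-- and, while the query runs, read /proc from another terminal, e.g.
-- grep VmRSS /proc/<pid>/status.)
SELECT pg_backend_pid();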
Overall:
* RSS doesn't grow proportionally to work_mem the way I expected. Going
  from work_mem = 4MB to 8MB doubles the RSS usage (from ~4GB to ~8GB),
  but increasing it further makes only a minuscule difference, and no
  OOM happens.
* Instead, I notice RSS growing slowly as I insert more and more data
  into the table (especially into the 1st partition, I think).
* Finally, I don't see the backend freeing the memory after the SELECT
  finishes. The system is relieved only when I disconnect psql and the
  backend exits. Not sure whether that's by design (the memory-context
  query below might show where the memory sits).
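In case it helps the investigation: on PostgreSQL 14 or newer the
backend can report its own memory contexts from the same session, which
should show whether the aggregation contexts are still allocated after
the query finishes:

SELECT name, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;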
>
>> Having wasted a long time on that, the minimum I can do is submit a
>> documentation patch. At enable_partitionwise_aggregate something like
>> "WARNING it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move on for such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.
David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm no longer sure what exactly is going on
here; I don't understand how the memory is growing. One thing I can
verify is that it's definitely caused by partitioning: I have another
similar huge table, unpartitioned, and no such issues show up with it.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.
Regards,
Dimitris