Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: Craig Milhiser <craig(at)milhiser(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-09 16:28:24
Message-ID: CA+wnhO2sHcUOfi6_gvaoPMk=2DsKUOWkDjdUtGxJOguQOXDvOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Oct 8 Andrei Lepikhov wrote
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?

I started with the Invalid DSA memory allocation error. I was asked to try
an experimental patch above. Then I got OOM with the patch only running
parallel. You will see below, there was an OOM but I do not believe it is
the query.

Thanks for the push on OOM. I should have ran this test earlier.

v17.0 and out of the box Postgres configuration.

I ran a new test on an instance with 512 GiB of memory. After I applied
the patch, the Invalid DSA memory allocation message was not replicated.
Running max_parallel_workers_per_gather = 0, the query took ~9.5 seconds
and used <1 GiB of memory. With max_parallel_workers_per_gather = 2 the
query used ~170 GiB of memory, ~70 GB of temp files were written and the
query ran for more than 1 hour until I ran out of disk space.

I moved from Invalid DSA memory allocation of ~2 GB to using 170 GB of RAM
and 70+GB of temp files with the patch. Only when using 2 parallel workers
per gather.

The new test:

This morning I increased the machine size from 32 GiB to 512 GiB RAM.

With the patch applied and max_parallel_workers_per_gather = 0 the query
worked in ~9.5 seconds at steady state. While it was running I captured
memory. I ran the query a few times earlier to get the buffers loaded.

total used free shared buff/cache
available
Mem: 493Gi 3.5Gi 484Gi 142Mi 8.4Gi
489Gi
Swap: 0B 0B 0B

With the patch applied and max_parallel_workers_per_gather = 2; the query
ran for more than 1 hour. During that time memory settled at:
total used free shared buff/cache
available
Mem: 493Gi 178Gi 209Gi 1.9Gi 110Gi
314Gi
Swap: 0B 0B 0B

Then the machine ran out of disk space: ERROR: could not write to file
"base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space
left on device

I captured top as well during the run.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
4951 postgres 20 0 46.8g 45.5g 1.9g D 5.6 9.2 2:40.40
postgres
4942 postgres 20 0 68.9g 65.6g 1.9g D 5.3 13.3 3:25.35
postgres
4952 postgres 20 0 68.4g 65.2g 1.9g D 5.3 13.2 3:07.43
postgres

After rebooting:
df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 266G 197G 70G 74% /

As you mentioned there are string aggregations. I ran with parallel=0 and
did some analysis. The aggregations do not seem to be creating something
that is out of line.

select max(length(groups)), sum(length(groups)), max(length(groupnames)),
sum(length(groupnames)) from milhiser_test;
max | sum | max | sum
-----+---------+-----+----------
143 | 6557620 | 499 | 22790616
(1 row)

Perhaps this is a different problem than the "invalid DSA memory alloc".
The patch might have addressed that problem and this is another issue. From
< 1 GiB to ~170 GiB of memory and using ~70 GB of log files when moving
from parallel = 0 to 2 seems something is off.

Summary before this test:
Before the patch linked above, I was receiving "ERROR: invalid DSA memory
alloc request size 1879048192" when I ran the query with
max_parallel_workers_per_gather = 2.

Before the patch with max_parallel_workers_per_gather = 0 the query worked
in ~10 seconds at steady state.

I applied the patch to v17.0 source, rebuilt, and passed tests.

With max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds
at steady state and took < 1 GiB of memory.

With max_parallel_workers_per_gather = 2, the machine ran out of memory.
This was a 32 GiB machine. The free memory when running without parallel
was ~30 GiB free.

Thanks

On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:

> On 10/7/24 18:42, Craig Milhiser wrote:
> >
> > On Oct 1, 2024 Andrei Lepikhov wrote
> >
> > > Can you provide an explain of this query?
> >
> > Apologies for the delay. I have been travelling since Wednesday night.
> > Thanks for your help and time with this issue.
> >
> > Below is the query, with specific values redacted. An explain with
> > max_parallel_workers_per_gather = 2 and explain analyze
> > max_parallel_workers_per_gather = 0.
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?
>
> OOM killer can be explained easily, because I see huge string_agg
> aggregate - workers can utilise memory more intensively. For now,
> explain of an Aggregate node don't show information about factual sort
> operation of each aggregate and memory consumption.
>
> --
> regards, Andrei Lepikhov
>
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2024-10-09 19:54:41 Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Previous Message Tom Lane 2024-10-09 14:51:54 Re: BUG #18647: INSERT statements execute functions twice.