From: Craig Milhiser <craig(at)milhiser(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-09 16:28:24
Message-ID: CA+wnhO2sHcUOfi6_gvaoPMk=2DsKUOWkDjdUtGxJOguQOXDvOQ@mail.gmail.com
Lists: pgsql-bugs
On Oct 8, Andrei Lepikhov wrote:
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?
I started with the invalid DSA memory allocation error and was asked to try the
experimental patch above. With the patch applied, I then hit an out-of-memory
condition, but only when running in parallel. As you will see below, the machine
did run out of resources, but I do not believe the query itself should require
that much.
Thanks for the push on OOM. I should have run this test earlier.
This is v17.0 with an out-of-the-box Postgres configuration.
I ran a new test on an instance with 512 GiB of memory. With the patch applied,
the invalid DSA memory allocation error did not recur.
With max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used
< 1 GiB of memory. With max_parallel_workers_per_gather = 2, the query used
~170 GiB of memory, wrote ~70 GB of temp files, and ran for more than an hour
until the machine ran out of disk space.
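For reference, the two runs were essentially just a session-level toggle; something like the following, with the query itself redacted as before and shown only as a placeholder:

SET max_parallel_workers_per_gather = 0;  -- serial run
<query>;                                  -- same (redacted) query as earlier in the thread

SET max_parallel_workers_per_gather = 2;  -- parallel run
<query>;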
So with the patch, I went from an invalid DSA memory alloc request of ~2 GB to
~170 GiB of RAM plus 70+ GB of temp files, and only when using 2 parallel
workers per gather.
The new test:
This morning I increased the machine size from 32 GiB to 512 GiB of RAM.
With the patch applied and max_parallel_workers_per_gather = 0, the query
completed in ~9.5 seconds at steady state. While it was running I captured
memory usage; I had run the query a few times beforehand to get the buffers
loaded.
               total        used        free      shared  buff/cache   available
Mem:           493Gi       3.5Gi       484Gi       142Mi       8.4Gi       489Gi
Swap:             0B          0B          0B
With the patch applied and max_parallel_workers_per_gather = 2, the query ran
for more than 1 hour. During that time memory settled at:
               total        used        free      shared  buff/cache   available
Mem:           493Gi       178Gi       209Gi       1.9Gi       110Gi       314Gi
Swap:             0B          0B          0B
Then the machine ran out of disk space: ERROR: could not write to file
"base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space
left on device
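For the next attempt I can also track temp file growth from inside Postgres rather than only at the filesystem level; something along these lines (cumulative counters per database):

select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
where datname = current_database();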
I captured top as well during the run.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 4951 postgres  20   0   46.8g   45.5g  1.9g D   5.6   9.2   2:40.40 postgres
 4942 postgres  20   0   68.9g   65.6g  1.9g D   5.3  13.3   3:25.35 postgres
 4952 postgres  20   0   68.4g   65.2g  1.9g D   5.3  13.2   3:07.43 postgres
After rebooting:
df -H
Filesystem Size Used Avail Use% Mounted on
/dev/root 266G 197G 70G 74% /
As you mentioned, there are string aggregations. I ran with
max_parallel_workers_per_gather = 0 and did some analysis. The aggregations do
not seem to be producing anything out of line:
select max(length(groups)), sum(length(groups)),
       max(length(groupnames)), sum(length(groupnames))
from milhiser_test;

 max |   sum   | max |   sum
-----+---------+-----+----------
 143 | 6557620 | 499 | 22790616
(1 row)
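If it would help, I can also check whether any single group produces an unusually large aggregated string. A rough sketch of what I have in mind, where group_key and the ',' delimiter are only placeholders for the real (redacted) grouping columns and aggregate arguments:

select max(agg_len) as max_group_len, sum(agg_len) as total_len
from (
    select length(string_agg(groups, ',')) as agg_len
    from milhiser_test
    group by group_key  -- placeholder for the actual redacted grouping columns
) s;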
Perhaps this is a different problem than the "invalid DSA memory alloc". The
patch might have addressed that problem and this is another issue. Going from
< 1 GiB to ~170 GiB of memory and ~70 GB of temp files when moving from
parallel = 0 to 2 suggests something is off.
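If it would help narrow down where the memory is going, on the next parallel run I can dump the memory contexts of the leader and the workers to the server log while the query is executing; roughly this, from a separate superuser session:

-- write each backend's memory context usage to the server log
select pid, backend_type, pg_log_backend_memory_contexts(pid)
from pg_stat_activity
where backend_type in ('client backend', 'parallel worker');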
Summary before this test:
Before the patch linked above, I was receiving "ERROR: invalid DSA memory
alloc request size 1879048192" when I ran the query with
max_parallel_workers_per_gather = 2.
Before the patch with max_parallel_workers_per_gather = 0 the query worked
in ~10 seconds at steady state.
I applied the patch to the v17.0 source, rebuilt, and the tests passed.
With max_parallel_workers_per_gather = 0 the query worked in ~9.5 seconds
at steady state and took < 1 GiB of memory.
With max_parallel_workers_per_gather = 2, the machine ran out of memory. This
was a 32 GiB machine; when running without parallel, ~30 GiB of memory was
free.
Thanks
On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> On 10/7/24 18:42, Craig Milhiser wrote:
> >
> > On Oct 1, 2024 Andrei Lepikhov wrote
> >
> > > Can you provide an explain of this query?
> >
> > Apologies for the delay. I have been travelling since Wednesday night.
> > Thanks for your help and time with this issue.
> >
> > Below is the query, with specific values redacted. An explain with
> > max_parallel_workers_per_gather = 2 and explain analyze
> > max_parallel_workers_per_gather = 0.
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?
>
> OOM killer can be explained easily, because I see huge string_agg
> aggregate - workers can utilise memory more intensively. For now,
> explain of an Aggregate node don't show information about factual sort
> operation of each aggregate and memory consumption.
>
> --
> regards, Andrei Lepikhov
>
>