Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Craig Milhiser <craig(at)milhiser(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-10 03:59:13
Message-ID: cc488d1c-6477-4e20-8bd0-cf152a2e59bf@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 10/9/24 23:28, Craig Milhiser wrote:
> On Oct 8 Andrei Lepikhov wrote
> > I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> > request size ...', but you write about issue with OOM killer. It is two
> > different issues, which one do you have exactly?
>
>
> I started with the Invalid DSA memory allocation error. I was asked to
> try an experimental patch above. Then I got OOM with the patch only
> running parallel. You will see below, there was an OOM but I do not
> believe it is the query.
So, I think the patch works, but you found out one more issue at the
same query. Awesome!

> I ran a new test on an instance with 512 GiB of memory.  After I applied
> the patch, the Invalid DSA memory allocation message was not
> replicated.  Running max_parallel_workers_per_gather = 0, the query took
> ~9.5 seconds and used <1 GiB of memory.  With
> max_parallel_workers_per_gather = 2 the query used  ~170 GiB of memory,
> ~70 GB of temp files were written and the query ran for more than 1 hour
> until I ran out of disk space.
It's fascinating. I have one user report like that, but they also didn't
provide any synthetic test. I think it is almost impossible to create
such reproduction without a minimal understanding of what's happening. I
can imagine only a data skew or a logical bug in this part of the code.
But without direct perf and gdb touch, it is hard to resolve the issue
by just gazing into the code.
Additional actions can provide some food for thought:
1. If you remove aggregates (STRING_AGG, count) from the selection list,
will the problem remain? What about OFFSET 0?
2. Can you build extended statistics on account_id,disabled,user_type_id
and provide an explain (and explain analyse)?
3. Can you use pg_query_state (unfortunately, it needs a patch and
re-compilation) and show us intermediate execution state snapshots?
4. I see a duplicate clause in the query: SGM.OBJECT_ID = U.OBJECT_ID.
For what reason you have it here? can you remove it from the query?
5. One more wild guess: can you analyse how much NULLS contains column
u.users_id at the moment when HashJoin evaluates clause (u.users_id =
cf.user_id)?

[1] https://github.com/postgrespro/pg_query_state

--
regards, Andrei Lepikhov

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2024-10-10 04:34:02 Re: BUG #18652: Planner can not find pathkey item to sort for query with expression and expression index
Previous Message Richard Guo 2024-10-10 03:52:41 Re: BUG #18652: Planner can not find pathkey item to sort for query with expression and expression index