Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From: Craig Milhiser <craig(at)milhiser(dot)com>
To: Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date: 2024-10-13 11:23:05
Message-ID: CA+wnhO3t7xqCiDmGduxgg1Mv5M5R-GeQd0kzVcJKkUFkj91Wbw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Thomas Munro wrote
> I wonder if there something could be wrong with Parallel Hash Right
Join...Definitely keen to see a reproducer
> with synthetic data if you can come up with one

Andrei Lepikhov wrote
> I can imagine only a data skew or a logical bug in this part of the code.
> But without direct perf and gdb touch, it is hard to resolve the issue
> by just gazing into the code.

Both of you are correct.

I have reproduced the problem with synthetic data. The script is below.
Thank you for your patience with me.

There are comments in the script. Please let me know of any questions or if
you cannot reproduce it. If you want me to file a report via the form, let
me know.

Using Postgres v17 with out of the box configuration.

drop table test_users;
create table test_users (account_id bigint not null, users_id bigint not
null constraint test_users_pkey primary key, first_name varchar(105),
last_name varchar(105), user_image_id bigint);

-- The account we are interested, data numbers are negative to eliminate
duplicates and help with debugging
insert into test_users (account_id, users_id, first_name, last_name,
user_image_id)
SELECT -1, -1 * i, md5(random()::text), md5(random()::text), case when
random() < 0.95 then null else -1 * i end
FROM generate_series(1, 925_000) AS t(i)
;

-- Make enough other records to get the skew to force a Parallel Hash Right
Join and the query breaks
-- Change the "< 0.50" to "< 0.95" to get a skew for a Parallel Hash Left
Join and the query works
-- 0.50 makes a right join and breaks; 0.95 makes a left join and works
-- Changes how many users are in user_image which, relative to the number
of users and accounts, is the key skew that I found
-- Data numbers are positive
insert into test_users(account_id, users_id, first_name, last_name,
user_image_id)
SELECT random(10, 50_000)::bigint, i, md5(random()::text),
md5(random()::text), case when random() < 0.50 then null else i end
FROM generate_series(1, 50_000_000) AS t(i)
;

create index user_img_fk_idx on test_users using btree (user_image_id);

drop table test_user_image;
create table test_user_image(user_image_id bigint not null constraint
test_user_image_pkey primary key);
insert into test_user_image(user_image_id) select user_image_id from
test_users where user_image_id is not null;

ALTER TABLE test_users ADD CONSTRAINT users_user_image_fk FOREIGN KEY
(user_image_id) REFERENCES test_user_image(user_image_id);

commit;
analyze test_users;
analyze test_user_image;

-- at 0 workers the query will work
set max_parallel_workers_per_gather = 0;

SELECT U.USERS_ID
, U.FIRST_NAME
, U.LAST_NAME
FROM test_USERS U
LEFT JOIN test_USER_IMAGE UI
ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID
, UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME ASC
, U.FIRST_NAME ASC
, U.USERS_ID ASC
LIMIT 20
OFFSET 0
;

set max_parallel_workers_per_gather = 2;
-- Explain the above query. For it to break, a Parallel Hash Right Join is
executed.
-- When a Parallel Hash Left Join is executed, the query works. Switch
between left and right by changing the skew as noted above.
-- when run with stock Postgres 17, the Invalid DSA memory allocation
occurs, which started this thread
-- when run with the patch to fix the Invalid DSA memory allocation, the
OOM occurs.
-- I reproduced the Invalid DSA memory allocation with AWS Aurora v16.2.
Naturally I cannot try patches there. The above was recreated with stock
Postgres v17 on a plain ec2 instance.

Thank you for your time

Craig

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Thomas Munro 2024-10-13 21:08:52 Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Previous Message Andrei Lepikhov 2024-10-13 09:57:59 Re: Question of Parallel Hash Join on TPC-H Benchmark