[Bug] Heap Use After Free in Window Aggregate Execution

From: Jayesh Dehankar <jayesh(dot)dp(at)zohocorp(dot)com>
To: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Cc: "zlabs-cstore(at)zohocorp(dot)com" <zlabs-cstore(at)zohocorp(dot)com>, "pgsql-hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "pgsql-bugs" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: [Bug] Heap Use After Free in Window Aggregate Execution
Date: 2024-11-13 15:22:11
Message-ID: 193261e2c4d.3dd3cd7c1842.871636075166132237@zohocorp.com
Lists: pgsql-bugs

Hi Developers,

We have found a bug in PostgreSQL v16.3 that affects a top-level window aggregate with a PARTITION BY clause. It occurs when the run condition fails and the window aggregate status changes from WINDOWAGG_RUN to WINDOWAGG_PASSTHROUGH_STRICT. The bug is also present in the latest STABLE branch.

What's the Issue?

During window function execution, the first window function is evaluated, and the per-tuple result is stored in econtext->ecxt_aggvalues. For instance, if a window function like lead() is evaluated on varlena columns, the result is copied into econtext->ecxt_per_tuple_memory. This memory is reset before window function evaluation begins for each row. When the run condition fails for a row, the window aggregate switches to pass-through mode: its status changes to WINDOWAGG_PASSTHROUGH_STRICT, and execution proceeds to the next tuple if rows remain in the partition.
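The status change described above can be sketched as a small state function. This is a toy model, not PostgreSQL source: the TOY_ names, the two boolean flags, and the non-strict and done states are assumptions added for illustration; the report itself only describes the move from WINDOWAGG_RUN to WINDOWAGG_PASSTHROUGH_STRICT for a top-level window aggregate with a partition-by clause.

```c
#include <stdbool.h>

/* Toy status values; the TOY_ prefix marks them as illustrative only. */
typedef enum {
    TOY_WINDOWAGG_DONE,
    TOY_WINDOWAGG_RUN,
    TOY_WINDOWAGG_PASSTHROUGH,
    TOY_WINDOWAGG_PASSTHROUGH_STRICT
} ToyWindowAggStatus;

/*
 * Decide the status after the run condition has been checked for one row.
 * Assumption: a top-level node without PARTITION BY can simply stop, a
 * top-level node with PARTITION BY goes to strict pass-through, and any
 * other node goes to plain pass-through.
 */
ToyWindowAggStatus toy_next_status(ToyWindowAggStatus current,
                                   bool run_condition_ok,
                                   bool top_window,
                                   bool has_partition_by)
{
    if (current != TOY_WINDOWAGG_RUN || run_condition_ok)
        return current;                 /* nothing changes */
    if (top_window && !has_partition_by)
        return TOY_WINDOWAGG_DONE;      /* no later row can match */
    return top_window ? TOY_WINDOWAGG_PASSTHROUGH_STRICT
                      : TOY_WINDOWAGG_PASSTHROUGH;
}

/* Window functions are evaluated only while the node is in run mode. */
bool toy_evaluates_window_functions(ToyWindowAggStatus status)
{
    return status == TOY_WINDOWAGG_RUN;
}
```

For the case in this report (top-level, partitioned, run condition false), toy_next_status returns TOY_WINDOWAGG_PASSTHROUGH_STRICT, and toy_evaluates_window_functions then returns false for every remaining row of the partition.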

During the next tuple's execution, econtext->ecxt_per_tuple_memory is reset, and window function evaluation is skipped because the status is WINDOWAGG_PASSTHROUGH_STRICT. However, ExecProject still runs for the tuple to store the aggregate results in the result slot. Because econtext->ecxt_aggvalues is not set to NULL after the run condition fails, ExecProject reads the previous tuple's window function result for the current row. When the window function operates on varlena columns, that result points into memory that has already been reset, which causes an invalid memory access at MakeExpandedObjectReadOnlyInternal(state->resvalue); (execExprInterp.c).

This bug occurs immediately after a failed run condition, particularly when using window functions like lead() on varlena columns.

Question: Why does ExecProject still execute if the current window aggregate status is WINDOWAGG_PASSTHROUGH_STRICT mode?

Possible Fixes

1) Skip the rest of the partition immediately after the run condition fails for a row, using windowaggstatus and winstate->partition_spooled to begin processing the next partition (windowagg_fix1.patch).

2) Set the aggregate values to NULL when the run condition fails and the status becomes WINDOWAGG_PASSTHROUGH_STRICT (windowagg_fix2.patch).
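The stale-pointer problem and the effect of the second fix can be illustrated with a small, safe simulation. This is a sketch under stated assumptions, not the attached patch: ToyArena stands in for the per-tuple memory context, ToyAggSlot for one entry of ecxt_aggvalues together with its null flag, and a generation counter replaces a real dangling read so the example never touches freed memory. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy per-tuple arena: a bump allocator with a reset counter. */
typedef struct {
    char     buf[64];
    size_t   used;
    unsigned generation;   /* incremented on every reset */
} ToyArena;

/* Toy result slot: a pointer into the arena plus a null flag. */
typedef struct {
    const char *value;
    unsigned    generation; /* arena generation when value was stored */
    bool        isnull;
} ToyAggSlot;

void toy_arena_reset(ToyArena *a)
{
    a->used = 0;
    a->generation++;
}

/* Copy a string into the arena; returns NULL if it does not fit. */
const char *toy_arena_strdup(ToyArena *a, const char *s)
{
    size_t n = strlen(s) + 1;
    if (a->used + n > sizeof(a->buf))
        return NULL;
    char *p = a->buf + a->used;
    memcpy(p, s, n);
    a->used += n;
    return p;
}

/* True if projecting the slot would read data from before the last reset. */
bool toy_slot_is_stale(const ToyArena *a, const ToyAggSlot *s)
{
    return !s->isnull && s->generation != a->generation;
}

/*
 * Tuple 1 evaluates the window function, then the run condition fails.
 * Tuple 2 resets the arena, skips evaluation, and still projects the slot.
 * Returns whether that projection would read stale memory.
 */
bool toy_second_tuple_reads_stale(bool null_out_on_failure)
{
    ToyArena   arena = {0};
    ToyAggSlot slot = { NULL, 0, true };

    toy_arena_reset(&arena);                     /* tuple 1 begins */
    slot.value = toy_arena_strdup(&arena, "bb"); /* lead(f) result */
    slot.generation = arena.generation;
    slot.isnull = false;

    if (null_out_on_failure) {                   /* idea behind fix 2 */
        slot.value = NULL;
        slot.isnull = true;
    }

    toy_arena_reset(&arena);                     /* tuple 2 begins */
    return toy_slot_is_stale(&arena, &slot);     /* projection step */
}
```

Calling toy_second_tuple_reads_stale(false) models the reported behaviour and returns true; calling it with true models clearing the slot on run-condition failure and returns false.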

How was the Bug Detected?

In our private project, we use a custom malloc allocator to make AddressSanitizer more effective at detecting memory-related bugs. Up to PostgreSQL v14, this allocator was implemented as a custom solution. However, starting from version 16, PostgreSQL restricts the use of custom allocators. Consequently, we integrated this allocator directly into PostgreSQL and are using it to run the regression tests (malloc_allocator.patch).

Our custom allocator precisely allocates the user's requested size, and pfree operations immediately release pointers without adding them to any freelist. Though this approach incurs performance degradation, it allows AddressSanitizer to detect use-after-free bugs more effectively.

Why the Aset Context Missed the Bug

The Aset context did not expose the bug because it does not free pointers immediately; instead, it adds them to a freelist.
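The difference between the two release strategies can be sketched as follows. This is a toy model, not PostgreSQL's aset.c and not the attached malloc_allocator.patch; ToyContext, the fixed chunk size, and the system_frees counter are assumptions made for illustration. The point is that a chunk parked on a freelist is never handed to free(), so a tool that instruments malloc/free has no freed region to flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CHUNK_SIZE 32

/* A released chunk is reused as a freelist link. */
typedef struct ToyChunk {
    struct ToyChunk *next;
} ToyChunk;

typedef struct {
    bool      use_freelist;  /* true: park chunks, false: free() at once */
    ToyChunk *freelist;
    int       system_frees;  /* pfree calls that reached free() */
} ToyContext;

void *toy_alloc(ToyContext *cx)
{
    if (cx->use_freelist && cx->freelist != NULL) {
        ToyChunk *c = cx->freelist;      /* recycle a parked chunk */
        cx->freelist = c->next;
        return c;
    }
    return malloc(TOY_CHUNK_SIZE);
}

void toy_pfree(ToyContext *cx, void *p)
{
    if (cx->use_freelist) {
        ToyChunk *c = p;                 /* memory stays allocated */
        c->next = cx->freelist;
        cx->freelist = c;
    } else {
        free(p);                         /* visible to the sanitizer */
        cx->system_frees++;
    }
}

/* Return all parked chunks to the system when the context goes away. */
void toy_destroy(ToyContext *cx)
{
    while (cx->freelist != NULL) {
        ToyChunk *c = cx->freelist;
        cx->freelist = c->next;
        free(c);
    }
}

/* How many free() calls does a single pfree trigger? -1 on allocation failure. */
int toy_system_frees_after_one_pfree(bool use_freelist)
{
    ToyContext cx = { use_freelist, NULL, 0 };
    void *p = toy_alloc(&cx);
    if (p == NULL)
        return -1;
    toy_pfree(&cx, p);
    int n = cx.system_frees;
    toy_destroy(&cx);
    return n;
}
```

With the freelist enabled, one pfree results in zero calls to free(); with it disabled, the same pfree results in exactly one, which is the behaviour the custom allocator relies on.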

Steps to Reproduce the Issue Using Custom Malloc Allocator

1) Apply the attached malloc_allocator.patch to the master branch (commit bfeeb065ea2c870cf4d9dfcd552d23d72432e692) and compile PG with the --enable-asan flag.

2) Execute the following SQL queries:

create table issue(a int, f text);

insert into issue values (1, 'aa'), (1, 'bb');

select * from ( select row_number() over (partition by a) as first, lead(f) over (partition by a) as third from issue) emp where first < 1;

NOTE: Running regressions with our custom allocator may reveal additional memory-related bugs or crashes in the master branch.

Please let me know if you have any questions or would like further details.

Thanks & Regards,
Jayesh Dehankar
Member Technical Staff
ZOHO Corporation

Attachment              Content-Type              Size
windowagg_fix1.patch    application/octet-stream  1.1 KB
windowagg_fix2.patch    application/octet-stream  2.3 KB
malloc_allocator.patch  application/octet-stream  17.4 KB
backtrace_asan.log      application/octet-stream  14.2 KB
