Re: dsa_allocate could not find 4 free pages

From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: dsa_allocate could not find 4 free pages
Date: 2017-12-06 00:52:49
Message-ID: 25884857-F310-4C10-AC97-3C85A5F2D8FD@gmail.com
Lists: pgsql-hackers


> On Dec 5, 2017, at 4:07 PM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Wed, Dec 6, 2017 at 9:35 AM, Mark Dilger <hornschnorter(at)gmail(dot)com> wrote:
>>> On Dec 5, 2017, at 11:25 AM, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan?
>>
>> This was encountered and logged by a Java client. The only data I got was:
>>
>> org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free pages
>> Where: parallel worker
>
> This means that the DSA area is corrupted. Presumably
> get_best_segment(area, 4) returned a segment that wasn't actually good
> for 4 pages, either because it was incorrectly binned or because its
> free space btree was corrupted. Another path would be that
> make_new_segment(area, 4) returned a segment that couldn't find 4
> pages, but that seems unlikely.
>
>> [query plan with one Gather and no Parallel Bitmap Heap Scan]
>
> I'm not sure why this plan would ever call dsa_allocate().
>
>> [query plan with no Gather but plenty of Bitmap Heap Scans]
>
> And this one certainly can't. I guess you must sometimes get a
> different variation that has Gather nodes and uses Parallel Bitmap
> Heap Scan.

Yes, I can believe that the plan is sometimes different. This error has
occurred several times now, but it is still rather infrequent, so either the
plan that triggers it is rare, or the bug is intermittent even with the same
plan being chosen, or perhaps both.

> Then the question is whether the es_query_dsa multiple
> Gather bug can explain this: for example, if dsa_free(wrong_dsa_area,
> p) was called, perhaps it could produce this type of corruption.
> Otherwise we have a different bug. Any clues on how to reproduce the
> problem would be very welcome.

I have written (and rewritten, and rewritten) a TAP test in the hopes of
getting a test case that reproduces this reliably (or even once), but
without luck so far. I will keep trying.

mark
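
For readers following the thread, the failure mode Thomas describes can be
illustrated with a toy model. This is only a sketch in Python, not
PostgreSQL's dsa.c: the Segment and Area classes, the recorded_run
bookkeeping, and the list of free runs are all hypothetical stand-ins for
the real segment bins and free page manager. It shows how an allocator that
trusts stale bookkeeping about a segment's largest free run can pick a
segment that cannot actually supply the requested pages.

```python
class Segment:
    """Toy segment: tracks the lengths of its contiguous free runs, in pages."""

    def __init__(self, free_runs):
        self.free_runs = list(free_runs)

    def largest_run(self):
        return max(self.free_runs, default=0)

    def take(self, npages):
        # Carve npages out of the first run that is large enough.
        for i, run in enumerate(self.free_runs):
            if run >= npages:
                self.free_runs[i] = run - npages
                return True
        return False


class Area:
    """Toy area: remembers, per segment, what its largest free run is
    believed to be (a hypothetical stand-in for segment binning)."""

    def __init__(self):
        self.entries = []  # list of (recorded_run, segment)

    def add(self, seg, recorded_run=None):
        # Passing recorded_run explicitly simulates corrupted bookkeeping.
        if recorded_run is None:
            recorded_run = seg.largest_run()
        self.entries.append((recorded_run, seg))

    def get_best_segment(self, npages):
        # Trusts the recorded value; never re-checks the segment itself.
        for recorded_run, seg in self.entries:
            if recorded_run >= npages:
                return seg
        return None

    def allocate(self, npages):
        seg = self.get_best_segment(npages)
        if seg is None:
            # Stand-in for creating a fresh segment of just the right size.
            seg = Segment([npages])
            self.add(seg)
        if not seg.take(npages):
            raise RuntimeError(
                f"dsa_allocate could not find {npages} free pages")
        return seg


# Consistent bookkeeping: the allocation succeeds.
ok = Area()
ok.add(Segment([4, 2]))
ok.allocate(4)

# Stale bookkeeping: the segment is recorded as having an 8-page run,
# but only runs of 2 and 1 pages are really free.
bad = Area()
bad.add(Segment([2, 1]), recorded_run=8)
try:
    bad.allocate(4)
except RuntimeError as err:
    print(err)
```

In this model the error can only appear when the recorded state and the
actual free space disagree, which is why the message points at corruption
rather than at ordinary memory exhaustion.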
