Quick Links

Re: dsa_allocate() faliure

From:	Fabio Isabettini <fisabettini(at)voipfuture(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Arne Roland <A(dot)Roland(at)index(dot)de>, Sand Stone <sand(dot)m(dot)stone(at)gmail(dot)com>, Rick Otten <rottenwindfish(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject:	Re: dsa_allocate() faliure
Date:	2019-01-29 11:32:47
Message-ID:	45C7466A-7D67-4BB4-829C-F66CBB2961DB@voipfuture.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-performance

Hello,
we are facing a similar issue on a Production system using a Postgresql 10.6:

org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages

The query reads remotely (via pl/proxy) tables containing a lot of data (up to millions or rows for each table/node) after a remote “group by" returns to the caller “master” node only a few hundreds of rows from each “slave” node.
The tables are partitioned using the INHERITANCE method that we are using since years with no issue. All tables have the same columns structure and number, about 300 columns. In the query there are no join, only a variable set of partitions depending on the date range.
The “REMOTE FATAL” refers to the pl/proxy that runs on 2 different slaves, [a0] and [a1], nodes with identical configuration and database structure, but it seems to fail only on node [a1].
When we get the error if we reduce the date range and therefore the quantity of data read, the error disappears, the same if we set max_parallel_workers_per_gather = 0.
Obviously we cannot force the user to use short periods of time to avoid the error and so we have disabled the parallel query feature for the time being.
It is difficult to reproduce the issue because not always the user gets the error, furthermore re-running the same query in different moments/days it usually works. It is a kind of weird.
We would like not to stop the Production system and upgrade it to PG11. And even though would this guarantee a permanent fix?
Any suggestion?

Regards,
Fabio Isabettini
Voipfuture (Germany)

The failing node [a1] configuration:

OS: Centos 7 kernerl 3.10.0-862.11.6.el7.x86_64
Postgres: postgres-10.5-862.11.6.1
RAM: 256 GB (The main server containing the master node and [a0] node, the slave that has no issue, has 384 GB of RAM)
CPU cores: 32

shared_buffers = 64GB
max_worker_processes = 32
max_parallel_workers_per_gather = 8
max_parallel_workers = 32

> On 28. Jan 2019, at 19:56:01, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Tue, Jan 29, 2019 at 2:50 AM Arne Roland <A(dot)Roland(at)index(dot)de> wrote:
>> does anybody have any idea what goes wrong here? Is there some additional information that could be helpful?
>
> Hi Arne,
>
> This seems to be a bug; that error should not be reached. I wonder if
> it is a different manifestation of the bug reported as #15585 (ie some
> type of rare corruption). Are you able to reproduce this
> consistently? Can you please show the query plan?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>

In response to

Re: dsa_allocate() faliure at 2019-01-28 18:56:01 from Thomas Munro

Responses

Re: dsa_allocate() faliure at 2019-01-30 03:13:14 from Thomas Munro

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kyotaro HORIGUCHI	2019-01-29 11:32:54	ALTER SESSION
Previous Message	Amit Kapila	2019-01-29 10:55:52	Re: WIP: Avoid creation of the free space map for small tables

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Saurabh Nanda	2019-01-29 11:39:14	Will higher shared_buffers improve tpcb-like benchmarks?
Previous Message	Shreeyansh Dba	2019-01-29 09:33:17	Re: pg_locks - what is a virtualxid locktype