Re: pg_stat_activity crashes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stat_activity crashes
Date: 2016-04-21 18:04:52
Message-ID: CA+TgmoYzBHMfc8e0ampp1cptBMtYGjyEDuho4U9MDmEsJm04Ow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 20, 2016 at 10:56 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Wed, 20 Apr 2016 15:14:16 +0200, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote in <571780A8(dot)4070902(at)2ndquadrant(dot)com>
>> I noticed sporadic segfaults when selecting from pg_stat_activity on
>> current HEAD.
>>
>> The culprit is the 53be0b1add7064ca5db3cd884302dfc3268d884e commit
>> which added more wait info into the pg_stat_get_activity(). More
>> specifically, the following code is broken:
>>
>> + proc = BackendPidGetProc(beentry->st_procpid);
>> + wait_event_type = pgstat_get_wait_event_type(proc->wait_event_info);
>>
>> This needs to check if proc is NULL. When reading the code I noticed
>> that the new functions pg_stat_get_backend_wait_event_type() and
>> pg_stat_get_backend_wait_event() suffer from the same problem.
>
> Good catch.

Agreed.

>> Here is PoC patch which fixes the problem. I am wondering if we should
>> raise warning in the pg_stat_get_backend_wait_event_type() and
>> pg_stat_get_backend_wait_event() like the pg_signal_backend() does
>> when proc is NULL instead of just returning NULL which is what this
>> patch does though.
>
> It still makes the two relevant columns in pg_stat_activity
> inconsistent each other since it reads the procarray entry twice
> without a lock on procarray.
>
> The attached patch adds pgstat_get_wait_event_info to read
> wait_event_info in more appropriate way. Then change
> pg_stat_get_wait_event(_type) to take the wait_event_info.
>
> Does this work for you?

This is a hideously way of fixing this problem. The whole point of
storing the wait event in a 4-byte integer is that we already assume
reads of 4 byte integers are atomic and thus no lock is needed. The
only thing we need to do is to prevent the value from being read
twice, and we already have precedent for how to prevent that in
freelist.c.

I've committed a modified version of Petr's patch which I think should
be sufficient. It survived this test on my machine:

pgbench -n -f f -c 64 -j 64 -T 120

Where f contained:

SELECT * FROM pg_stat_activity;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-04-21 18:05:28 Re: pg_stat_activity crashes
Previous Message Kevin Grittner 2016-04-21 17:53:12 Re: [COMMITTERS] pgsql: Inline initial comparisons in TestForOldSnapshot()