Re: Some bogus results from prairiedog

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some bogus results from prairiedog
Date: 2014-07-22 15:41:20
Message-ID: CA+TgmoaiVe+oioiLno8oFEFX+bdONOs9zDQWk_pdqvweQXCObA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 22, 2014 at 12:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> According to
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2014-07-21%2022%3A36%3A55
> prairiedog saw a crash in "make check" on the 9.4 branch earlier tonight;
> but there's not a lot of evidence as to why in the buildfarm report,
> because the postmaster log file is truncated well before where things got
> interesting. Fortunately, I was able to capture a copy of check.log
> before it got overwritten by the next run. I find that the place where
> the webserver report stops matches this section of check.log:
>
> [53cd99bb.134a:158] LOG: statement: create index test_range_gist_idx on test_range_gist using gist (ir);
> [53cd99bb.134a:159] LOG: statement: insert into test_range_gist select int4range(g, g+10) from generate_series(1,2000) g;
> ^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^\
> @^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)^@^(at)[53cd99ba(dot)1344:329] LOG: statement: INSERT INTO num_exp_div VALUES (7,8,'-1108.80577182462841041118');
> [53cd99ba.1344:330] LOG: statement: INSERT INTO num_exp_add VALUES (7,9,'-107955289.045047420');
> [53cd99ba.1344:331] LOG: statement: INSERT INTO num_exp_sub VALUES (7,9,'-58101680.954952580');
>
> The ^@'s represent nul bytes, which I find runs of elsewhere in the file
> as well. I think they are an artifact of OS X buffering policy caused by
> multiple processes writing into the same file without any interlocks.
> Perhaps we ought to consider making buildfarm runs use the logging
> collector by default? But in any case, it seems uncool that either the
> buildfarm log-upload process, or the buildfarm web server, is unable to
> cope with log files containing nul bytes.
>
> Anyway, to cut to the chase, the crash seems to be from this:
>
> TRAP: FailedAssertion("!(FastPathStrongRelationLocks->count[fasthashcode] > 0)", File: "lock.c", Line: 2957)
>
> which the postmaster shortly later says this about:
>
> [53cd99b6.130e:2] LOG: server process (PID 5230) was terminated by signal 6: Abort trap
> [53cd99b6.130e:3] DETAIL: Failed process was running: ROLLBACK PREPARED 'foo1';
> [53cd99b6.130e:4] LOG: terminating any other active server processes
>
> So there is still something rotten in the fastpath lock logic.

Gosh, that sucks.

The inconstancy of this problem would seem to suggest some kind of
locking bug rather than a flat-out concurrency issue, but it looks to
me like everything relevant is marked volatile. But ... maybe the
problem could be in s_lock.h? That machine is powerpc, and powerpc's
definition of S_UNLOCK looks like this:

#ifdef USE_PPC_LWSYNC
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__ (" lwsync \n"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#else
#define S_UNLOCK(lock) \
do \
{ \
__asm__ __volatile__ (" sync \n"); \
*((volatile slock_t *) (lock)) = 0; \
} while (0)
#endif /* USE_PPC_LWSYNC */

I believe Andres recently reported that this isn't enough to prevent
compiler reordering; for that, the __asm__ __volatile___ would need to
be augmented with ::: "memory". If that's correct, then the compiler
could decide to issue the volatile store before the lwsync or sync.
Then, the CPU could decide to perform the volatile store to the
spinlock before performing the update to FastPathStrongRelationLocks.
That would explain this problem.

The other possible explanation (at least, AFAICS) is that
lock_twophase_postcommit() is trying to release a strong lock count
that it doesn't actually hold. That would indicate a
strong-lock-count handling bug upstream someplace - e.g. that the
count got released when the transaction was prepared, or somesuch. I
certainly can't rule that out, although I don't know exactly where to
look. We could add an assertion to AtPrepare_Locks(), just before
setting locallock->holdsStrongLockCount = FALSE, that
locallock->holdsStrongLockCount is true if and only if the lock tag
and more suggest that it ought to be. But that's really just guessing
wildly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-07-22 15:51:18 Re: [COMMITTERS] pgsql: Diagnose incompatible OpenLDAP versions during build and test.
Previous Message MauMau 2014-07-22 15:13:26 Re: [bug fix] Suppress "autovacuum: found orphan temp table" message