From: | Paul Guyot <pguyot(at)kallisys(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #16990: Random PANIC in qemu user context |
Date: | 2021-05-02 20:20:39 |
Message-ID: | 86C24765-95F7-464F-9677-B09A396A5F69@kallisys.net |
Lists: | pgsql-bugs |
> Not sure what to tell you, other than "make sure qemu and your
> build toolchain are up-to-date".
In this scenario, I use PostgreSQL 11.11 as compiled by the Raspbian folks, together with the qemu binary provided by Ubuntu for focal, which happens to be 4.2 (not the latest).
I found out the corresponding function using readelf to locate the string constant.
For the record, the C function is here:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L811
The tight read loop is as follows:
32b548: e28d0004 add r0, sp, #4
32b54c: eb000679 bl 32cf38 <perform_spin_delay@@Base>
32b550: e5943004 ldr r3, [r4, #4]
32b554: e3130201 tst r3, #268435456 ; 0x10000000
32b558: 1afffffa bne 32b548 <RememberSimpleDeadLock@@Base+0xc4>
At address 32b550, it does perform a read, honoring the volatile pointer.
I guess the lock is acquired by the same function:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L824
The corresponding code is the following:
32b508: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b50c: e1953f9f ldrex r3, [r5]
32b510: e3832201 orr r2, r3, #268435456 ; 0x10000000
32b514: e1851f92 strex r1, r2, [r5]
32b518: e3510000 cmp r1, #0
32b51c: 1afffffa bne 32b50c <RememberSimpleDeadLock@@Base+0x88>
32b520: e3130201 tst r3, #268435456 ; 0x10000000
32b524: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b528: 0a00000e beq 32b568 <RememberSimpleDeadLock@@Base+0xe4>
mcr 15, 0, r0, cr7, cr10, {5} is __sync_synchronize() and based on the previous instructions, r5 is equal to r4+4 as used in the tight loop.
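For readers who prefer C to disassembly, here is a minimal sketch of what the two listings above appear to do, written with C11 atomics. It is not the PostgreSQL source: the names `sketch_lock`, `sketch_try_lock` and `sketch_lock_acquire` are made up for illustration, and only the bit value 0x10000000 is taken from the disassembly. A sequentially consistent `atomic_fetch_or` is the kind of operation a compiler could plausibly lower to the barrier, ldrex/orr/strex retry loop, barrier sequence shown above on a CPU without a dedicated barrier instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit tested and set in the disassembly: 0x10000000 (bit 28). */
#define SKETCH_FLAG_LOCKED ((uint32_t) 1 << 28)

/* Hypothetical stand-in for the lock's state word. */
typedef struct {
    _Atomic uint32_t state;
} sketch_lock;

/* One attempt: atomically set the bit; true if it was clear before. */
static bool sketch_try_lock(sketch_lock *lock)
{
    uint32_t old = atomic_fetch_or(&lock->state, SKETCH_FLAG_LOCKED);
    return (old & SKETCH_FLAG_LOCKED) == 0;
}

/* Acquire: retry the atomic OR, polling with plain loads in between. */
static void sketch_lock_acquire(sketch_lock *lock)
{
    while (!sketch_try_lock(lock)) {
        /*
         * Tight read loop (the ldr/tst/bne sequence): wait until the
         * bit looks clear before retrying the atomic operation. The
         * real code calls perform_spin_delay() on each iteration.
         */
        while (atomic_load(&lock->state) & SKETCH_FLAG_LOCKED)
            ;
    }
}
```

The inner loop only reads, so a waiter makes progress only if it eventually observes the holder's write that clears the bit.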
I also guess the corresponding unlock function just follows, and disassembling it reveals the same use of __sync_synchronize().
32b644: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b648: e1932f9f ldrex r2, [r3]
32b64c: e3c22201 bic r2, r2, #268435456 ; 0x10000000
32b650: e1831f92 strex r1, r2, [r3]
32b654: e3510000 cmp r1, #0
32b658: 1afffffa bne 32b648 <RememberSimpleDeadLock@@Base+0x1c4>
32b65c: ee070fba mcr 15, 0, r0, cr7, cr10, {5}
32b660: e8bd8070 pop {r4, r5, r6, pc}
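The unlock listing can be sketched the same way. Again this is an illustration under assumptions, not the PostgreSQL code: `sketch_release` is a hypothetical name, and the only detail taken from the disassembly is that the same bit (0x10000000) is cleared with an atomic read-modify-write (the bic between ldrex and strex) bracketed by barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Same bit as in the lock path: 0x10000000 (bit 28). */
#define RELEASE_FLAG_LOCKED ((uint32_t) 1 << 28)

/*
 * Atomically clear the lock bit and leave every other bit untouched.
 * Returns the previous value so a caller can check the bit was set.
 */
static uint32_t sketch_release(_Atomic uint32_t *state)
{
    return atomic_fetch_and(state, ~RELEASE_FLAG_LOCKED);
}
```

Because the other bits of the word are preserved, the clear has to be an atomic read-modify-write rather than a plain store, which is why the unlock path has its own ldrex/strex retry loop.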
QEMU user emulation documentation mentions something specific to threading on ARM.
https://qemu.readthedocs.io/en/latest/user/main.html
> Threading:
> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global lock in order to preserve their semantics.
I have yet to determine what impact it could have here. Can we imagine a situation where the memory barrier was not honored and an unlock would be overwritten with a lock?
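To make that question concrete, here is a small single-threaded replay of one hypothetical interleaving. It is purely a thought experiment and makes no claim about what qemu 4.2 actually does: the function name, the struct and the `store_detects_conflict` switch are all invented for illustration. A waiter (A) loads the word while the holder (B) still owns the lock, B's unlock completes in between, and then A's store either notices the intervening write (as ldrex/strex should) or does not.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPLAY_FLAG_LOCKED ((uint32_t) 1 << 28)

struct replay_result {
    uint32_t state;        /* final value of the lock word */
    bool waiter_owns_lock; /* does A believe it acquired the lock? */
};

/* Replay A's load/OR/store around B's unlock, step by step. */
static struct replay_result replay_interleaving(bool store_detects_conflict)
{
    uint32_t state = REPLAY_FLAG_LOCKED; /* B holds the lock */

    uint32_t seen = state;               /* A: exclusive load */
    state &= ~REPLAY_FLAG_LOCKED;        /* B: unlock completes here */

    if (store_detects_conflict)
        seen = state;                    /* A: store fails, A reloads */

    state = seen | REPLAY_FLAG_LOCKED;   /* A: store goes through */

    struct replay_result r = {
        state,
        (seen & REPLAY_FLAG_LOCKED) == 0 /* A owns it only if bit was clear */
    };
    return r;
}
```

With conflict detection, A ends up owning the lock. Without it, the word is left with the bit set although B has released it and A does not believe it holds it, so every later waiter would spin until it gives up.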
Finally, I tried running the whole script with taskset -c 0 (which is fine for the tests, as the target system, a Raspberry Pi Zero, is single core, while GitHub Linux runners have 2 vCPUs).
https://github.com/pguyot/pynab/commit/91011e68e446c69e317fd1198c58f85ff0cd5fb1
https://github.com/pguyot/pynab/runs/2486051700?check_suite_focus=true
I have run it four times so far, and no PostgreSQL PANIC has occurred. So your hypothesis of a bug (limitation) of qemu 4.2 seems probable…
FYI, newer ARM architectures, starting with ARMv7 (armv7l), have a dedicated memory barrier instruction (dmb), which is not used here because the Raspberry Pi Zero's CPU does not recognize it.
Paul