From: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
---|---|
To: | torikoshia <torikoshia(at)oss(dot)nttdata(dot)com> |
Cc: | Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: adding wait_start column to pg_locks |
Date: | 2021-02-09 13:54:49 |
Message-ID: | 64102638-7fc3-1941-12f1-e139e3463c03@oss.nttdata.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2021/02/09 19:11, Fujii Masao wrote:
>
>
> On 2021/02/09 18:13, Fujii Masao wrote:
>>
>>
>> On 2021/02/09 17:48, torikoshia wrote:
>>> On 2021-02-05 18:49, Fujii Masao wrote:
>>>> On 2021/02/05 0:03, torikoshia wrote:
>>>>> On 2021-02-03 11:23, Fujii Masao wrote:
>>>>>>> 64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating "waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock when reading "waitStart"?
>>>>>>
>>>>>> Also it might be worth thinking to use 64-bit atomic operations like
>>>>>> pg_atomic_read_u64(), for that.
>>>>>
>>>>> Thanks for your suggestion and advice!
>>>>>
>>>>> In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().
>>>>>
>>>>> waitStart is TimestampTz i.e., int64, but it seems pg_atomic_read_xxx and pg_atomic_write_xxx only supports unsigned int, so I cast the type.
>>>>>
>>>>> I may be using these functions not correctly, so if something is wrong, I would appreciate any comments.
>>>>>
>>>>>
>>>>> About the documentation, since your suggestion seems better than v6, I used it as is.
>>>>
>>>> Thanks for updating the patch!
>>>>
>>>> + if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
>>>> + pg_atomic_write_u64(&MyProc->waitStart,
>>>> + pg_atomic_read_u64((pg_atomic_uint64 *) &now));
>>>>
>>>> pg_atomic_read_u64() is really necessary? I think that
>>>> "pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.
>>>>
>>>> + deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
>>>> + pg_atomic_write_u64(&MyProc->waitStart,
>>>> + pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));
>>>>
>>>> Same as above.
>>>>
>>>> + /*
>>>> + * Record waitStart reusing the deadlock timeout timer.
>>>> + *
>>>> + * It would be ideal this can be synchronously done with updating
>>>> + * lock information. Howerver, since it gives performance impacts
>>>> + * to hold partitionLock longer time, we do it here asynchronously.
>>>> + */
>>>>
>>>> IMO it's better to comment why we reuse the deadlock timeout timer.
>>>>
>>>> proc->waitStatus = waitStatus;
>>>> + pg_atomic_init_u64(&MyProc->waitStart, 0);
>>>>
>>>> pg_atomic_write_u64() should be used instead? Because waitStart can be
>>>> accessed concurrently there.
>>>>
>>>> I updated the patch and addressed the above review comments. Patch attached.
>>>> Barring any objection, I will commit this version.
>>>
>>> Thanks for modifying the patch!
>>> I agree with your comments.
>>>
>>> BTW, I ran pgbench several times before and after applying
>>> this patch.
>>>
>>> The environment is virtual machine(CentOS 8), so this is
>>> just for reference, but there were no significant difference
>>> in latency or tps(both are below 1%).
>>
>> Thanks for the test! I pushed the patch.
>
> But I reverted the patch because buildfarm members rorqual and
> prion don't like the patch. I'm trying to investigate the cause
> of this failures.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10
- relation | locktype | mode
------------------+----------+---------------------
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR: invalid spinlock number: 0
"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that "waitStart"
atomic variable in the dummy proc created at the end of prepare transaction
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for the "waitStart" in the dummy proc for prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.
BTW, while investigating this issue, I found that pg_stat_wal_receiver also
could cause this error even in the current master (without the patch).
I will report that in separate thread.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16
"prion" reported the following error. But I'm not sure how the changes of
pg_locks caused this error. I found that Heikki also reported at [1] that
"prion" failed with the same error but was not sure how it happened.
This makes me think for now that this issue is not directly related to
the pg_locks changes.
-------------------------------------
pg_dump: error: query failed: ERROR: missing chunk number 0 for toast value 14444 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a.attoptions, ', ') AS attoptions,
CASE WHEN a.attcollation <> t.typcollation THEN a.attcollation ELSE 0 END AS attcollation,
pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(option_name) || ' ' || pg_catalog.quote_literal(option_value) FROM pg_catalog.pg_options_to_table(attfdwoptions) ORDER BY option_name), E',
') AS attfdwoptions,
a.attidentity,
CASE WHEN a.atthasmissing AND NOT a.attisdropped THEN a.attmissingval ELSE null END AS attmissingval,
a.attgenerated
FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t ON a.atttypid = t.oid
WHERE a.attrelid = '35987'::pg_catalog.oid AND a.attnum > 0::pg_catalog.int2
ORDER BY a.attnum
pg_dumpall: error: pg_dump failed on database "regression", exiting
waiting for server to shut down.... done
server stopped
pg_dumpall of post-upgrade database cluster failed
-------------------------------------
[1]
https://www.postgresql.org/message-id/f03ea04a-9b77-e371-9ab9-182cb35db1f9@iki.fi
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment | Content-Type | Size |
---|---|---|
v9.patch | text/plain | 12.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2021-02-09 14:06:20 | Routine usage information schema tables |
Previous Message | John Naylor | 2021-02-09 13:40:02 | Re: Perform COPY FROM encoding conversions in larger chunks |