From: | Paul Jungwirth <pj(at)illuminatedcomputing(dot)com> |
---|---|
To: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, jian he <jian(dot)universality(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com> |
Subject: | Re: SQL:2011 application time |
Date: | 2025-01-29 06:34:33 |
Message-ID: | e9eca7ab-3624-4054-9c93-bd115049aab8@illuminatedcomputing.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 1/27/25 07:47, Peter Eisentraut wrote:
> On 24.01.25 03:55, Tom Lane wrote:
>> I've now run an exhaustive search through the last three months of
>> buildfarm runs, and found just one additional instance of the same
>> failure. The three matches are
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2025-01-22%2005%3A49%3A08
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-01-22%2001%3A29%3A35
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-01-22%2001%3A17%3A14
>>
>> Since those are all post-1772d554b0, it's difficult to avoid the
>> conclusion that that either introduced the error or allowed a
>> pre-existing problem to become visible.
>
> I found a few more in the bowels of various Cirrus CI jobs:
>
> https://cirrus-ci.com/task/5125479033733120
> -> https://api.cirrus-ci.com/v1/artifact/task/5125479033733120/testrun/build/testrun/regress/
> regress/regression.diffs
>
> https://cirrus-ci.com/task/4562529080311808
> -> https://api.cirrus-ci.com/v1/artifact/task/4562529080311808/log/src/test/recovery/tmp_check/
> regression.diffs
>
> https://cirrus-ci.com/task/5985049025183744
> -> https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/
> regression.diffs
>
> https://cirrus-ci.com/task/4702566639992832
> -> https://api.cirrus-ci.com/v1/artifact/task/4702566639992832/log/src/test/recovery/tmp_check/
> regression.diffs
Thanks to you both for finding some more examples! This answers one question I've been wondering
about: Why do we get this failure for range types . . .:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out
/tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out 2025-01-25
03:14:11.906404866 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out 2025-01-25
03:21:08.218167913 +0000
@@ -1792,8 +1792,6 @@
SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN daterange('2018-01-01', '2018-01-05')
WHEN lower(valid_at) = '2018-02-01' THEN daterange('2018-01-05',
'2018-03-01') END
WHERE id = '[6,7)';
-ERROR: update or delete on table "temporal_rng" violates RESTRICT setting of foreign key
constraint "temporal_fk_rng2rng_fk" on table "temporal_fk_rng2rng"
-DETAIL: Key (id, valid_at)=([6,7), [2018-01-01,2018-02-01)) is referenced from table
"temporal_fk_rng2rng".
-- a PK update that fails because both are referenced (even before commit):
BEGIN;
ALTER TABLE temporal_fk_rng2rng
```
. . . but never a failure later in the file for the same scenario with multiranges? But Peter's
links [1] now include an example of that too:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out
/tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out 2025-01-25
04:31:18.353128126 +0000
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out 2025-01-25
04:35:22.020363327 +0000
@@ -2311,8 +2311,6 @@
SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN
datemultirange(daterange('2018-01-01', '2018-01-05'))
WHEN lower(valid_at) = '2018-02-01' THEN
datemultirange(daterange('2018-01-05', '2018-03-01')) END
WHERE id = '[6,7)';
-ERROR: update or delete on table "temporal_mltrng" violates RESTRICT setting of foreign key
constraint "temporal_fk_mltrng2mltrng_fk" on table "temporal_fk_mltrng2mltrng"
-DETAIL: Key (id, valid_at)=([6,7), {[2018-01-01,2018-02-01)}) is referenced from table
"temporal_fk_mltrng2mltrng".
-- a PK update that fails because both are referenced (even before commit):
BEGIN;
ALTER TABLE temporal_fk_mltrng2mltrng
```
So that is relieving. Still it's interesting that it's a 6:1 ratio.
I've attached a patch that causes both failures to appear every time (v48.0). It shows that if the
RESTRICT constraint accidentally loaded the cached query plan from the most recently cached NO
ACTION constraint (which we test just before testing RESTRICT), it would create matching failures.
So some kind of oid conflict could cause that.
On 1/27/25 07:56, Peter Eisentraut wrote:
> I think this call in ri_restrict()
>
> ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
>
> needs to use a different third argument for NO ACTION vs. RESTRICT, since we are now sometimes using
> different queries for them.
>
> However, the RI_QueryKey also uses the constraint OID as part of the hash key, so even this mistake
> would not trigger any bad effect unless we also have OID collisions?
That is my take too. I haven't worked out how an OID collision could happen though. Since we are
running the tests in parallel could the other tests generate enough oids to roll the counter around?
Surely not. And I confirmed the dynahash does a memcmp on the whole 64 bits of key->constr_id +
key->constr_queryno. Landing in the same hash bucket shouldn't be a problem (though I haven't tested
that, e.g. by using a debugger to manipulate the hash result). So I don't have anything plausible here.
I thought about introducing a new RI_PLAN_NO_ACTION constant back when I wrote the patch. It
shouldn't be needed, but it would be reassuring to include it, especially since the generated SQL
changes on NO ACTION vs RESTRICT. (On the other hand the generated SQL also depends on the PK/FK
attributes and their comparison operators, and we don't include *those* in the cache key.)
Is it possible to commit an RI_PLAN_NO_ACTION addition and see if that makes the buildfarm failures
go away? Here is a proposed patch for that (v48.1). I would understand if this is too questionable a
practice---but it would be nice to get sufficient test exposure to see if it makes a difference.
Since I still haven't reproduced this locally (despite running continuously for almost a week), it's
not an experiment I can do myself. If it *does* make the failures go away, then it suggests there is
still some latent problem somewhere.
I took a look through the dynahash code as well as GetNewOidWithIndex/GetNewObjectId. Neither of
these really seem like likely places to find a bug to me, considering how mature and heavily-used
they are. The hash table isn't even shared between backends. The way we keep a nextOid in shared
memory and keep incrementing it until we find a gap is maybe interesting. Since we drop & create a
constraint right before the failing test, I guess it's possible to cycle around and get the same oid
as the dropped constraint. I don't really buy it though. We would have to give the NO ACTION
constraint a very low oid (so there aren't lower-numbered gaps produced by the same test file), drop
it, somehow consume 2^32 oids (in the other tests from the parallel group), and then land back on
the low-numbered open oid. Also from the failure I checked I don't see any log messages about "new
OID has been assigned in relation ... after ... retries". I guess you could hit the right oid
without that, but it seems hard.
That's it so far. Adding v48.1 would at least give us some more evidence about where to look for
problems. In the meantime I'll keep searching for a way to reproduce it!
[1] https://cirrus-ci.com/task/5985049025183744 ->
https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/regression.diffs
Yours,
--
Paul ~{:-)
pj(at)illuminatedcomputing(dot)com
Attachment | Content-Type | Size |
---|---|---|
v48.0-0001-Trigger-the-RESTRICT-build-farm-failures-consis.patch.nocfbot | text/plain | 1.3 KB |
v48.1-0001-Cache-NO-ACTION-foreign-keys-separately-from-R.patch | text/x-patch | 2.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | John Naylor | 2025-01-29 06:46:11 | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. |
Previous Message | Sergey Tatarintsev | 2025-01-29 06:26:08 | Re: create subscription with (origin = none, copy_data = on) |