Re: SQL:2011 application time

From: Paul Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: Isaac Morland <isaac(dot)morland(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, jian he <jian(dot)universality(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: SQL:2011 application time
Date: 2024-06-05 20:56:15
Message-ID: f061295a-d972-42b9-bf1b-90a833e3f686@illuminatedcomputing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/21/24 11:27, Isaac Morland wrote:
> On Tue, 21 May 2024 at 13:57, Robert Haas <robertmhaas(at)gmail(dot)com <mailto:robertmhaas(at)gmail(dot)com>> wrote:
>
> What I think is less clear is what that means for temporal primary
> keys. As Paul pointed out upthread, in every other case, a temporal
> primary key is at least as unique as a regular primary key, but in
> this case, it isn't. And someone might reasonably think that a
> temporal primary key should exclude empty ranges just as all primary
> keys exclude nulls. Or they might think the opposite.
>
>
> Fascinating. I think you're absolutely right that it's clear that two empty intervals don't
> conflict. If somebody wants to claim two intervals conflict, they need to point to at least one
> instant in time that is common between them.
>
> But a major point of a primary key, it seems to me, is that it uniquely identifies a row. If items
> are identified by a time range, non-overlapping or not, then the empty range can only identify one
> item (per value of whatever other columns are in the primary key). I think for a unique key the
> non-overlapping restriction has to be considered an additional restriction on top of the usual
> uniqueness restriction.
>
> I suspect in many applications there will be a non-empty constraint; for example, it seems quite
> reasonable to me for a meeting booking system to forbid empty meetings. But when they are allowed
> they should behave in the mathematically appropriate way.

Finding a way forward for temporal PKs got a lot of discussion at pgconf.dev (thanks especially to
Peter Eisentraut and Jeff Davis!), so I wanted to summarize some options and describe what I think
is the best approach.

First the problem: empty ranges! A temporal PK/UNIQUE constraint is basically an exclusion
constraint that is `(id WITH =, valid_at WITH &&)`. But the special 'empty' value never overlaps
anything, *including itself*. (Note it has no "position": [3,3) is the same as [4,4).) Since the
exclusion constraint forbids overlapping ranges, and empties never overlap, your table can have
duplicates. (I'm talking about "literal uniqueness" as discussed in [1].) For instance:

CREATE EXTENSION btree_gist;
CREATE TABLE t (id int, valid_at daterange, name text);
ALTER TABLE t ADD CONSTRAINT tpk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS);
INSERT INTO t VALUES (1, 'empty', 'foo');
INSERT INTO t VALUES (1, 'empty', 'bar');

Multiranges have the same problem. So what do we do about that?

**Option 0**: Allow it but document it. It shouldn't happen in practice: there is no reason for an
empty range to get into a temporal table, and it arguably doesn't mean anything. The record is true
at no time? But of course it will happen anyway. It's a footgun and will break expectations for at
least some.

It causes problems for us too. If you say `SELECT name FROM t GROUP BY id, valid_at`, we recognize
that `name` is a functional dependency on the PK, so we allow it and give you the first row matching
each key. You might get "foo" or you might get "bar". Also the planner uses not-nullable uniqueness
to take many shortcuts. I couldn't create any concrete breakage there, but I bet someone else could.
PKs that are not literally unique seems like something that would cause headaches for years.

**Option 1**: Temporal PKs should automatically create a CHECK constraint that forbids empty ranges.
Should UNIQUE constraints too? I'm tempted to say no, since sometimes users surprise us by coming up
with new ways to use things. For instance one way to use empty ranges is to reference a temporal
table from a non-temporal table, since `'empty' <@ anything` is always true (though this has
questionable meaning or practical use). But probably we should forbid empties for UNIQUE constraints
too. Forbidding them is more aligned with the SQL standard, which says that when you have a PERIOD,
startcol < endcol (not <=). And it feels more consistent to treat both constraints the same way.
Finally, if UNIQUEs do allow empties, we still risk confusing our planner.

My last patch created these CHECK constraints for PKs (but not UNIQUEs) as INTERNAL dependencies.
It's pretty clunky. There are lots of cases to handle, e.g. `ALTER COLUMN c TYPE` may reuse the PK
index or may generate a new one. And what if the user already created the same constraint? Seeing
all the trouble giving PKs automatic (cataloged) NOT NULL constraints makes me wary about this
approach. It's not as bad, since there is no legacy, but it's still more annoying than I expected.

Finally, hanging the CHECK constraint off the PK sets us up for problems when we add true PERIODs.
Under 11.27 of SQL/Foundation, General Rules 2b says that defining a PERIOD should automatically add
a CHECK constraint that startcol < endcol. That is already part of my last patch in this series. But
that would be redundant with the constraint from the PK. And attaching the constraint to the PERIOD
is a lot simpler than attaching it to the PK.

**Option 2**: Add a new operator, called &&&, that works like && except an empty range *does*
overlap another empty range. Empty ranges should still not overlap anything else. This would fix the
exclusion constraint. You could add `(5, 'empty')` once but not twice. This would allow empties to
people who want to use them. (We would still forbid them if you define a PERIOD, because those come
with the CHECK constraint mentioned above.)
And there is almost nothing to code. But it is mathematically suspect to say an empty range overlaps
something small (something with zero width) but not something big. Surely if a && b and b <@ c, then
a && c? So this feels like the kind of elegant hack that you eventually regret.

**Option 3**: Forbid empties, not as a reified CHECK constraint, but just with some code in the
executor. Again we could do just PKs or PKs and UNIQUEs. Let's do both, for all the reasons above.
Not creating a CHECK constraint is much less clunky. There is no catalog entry to create/drop. Users
don't wonder where it came from when they say `\d t`. It can't conflict with constraints of their
own. We would enforce this in ExecConstraints, where we enforce NOT NULL and CHECK constraints, for
any table with constraints where conperiod is true. We'd also need to do this check on existing rows
when you create a temporal PK/UQ. This option also requires a new field in pg_class: just as we have
relchecks, relhasrules, relhastriggers, etc. to let us skip work in the relcache, I assume we'd want
relperiods.

**Option 4**: Teach GiST indexes to enforce uniqueness. We didn't discuss this at pgconf, at least
not in reference to the empties problem. But I was thinking about this request from Matthias for
temporal PKs & UQs to support `USING INDEX idx`.[2] It is confusing that a temporal index has
indisunique, but if you try to create a unique GiST index directly we say they don't support UNIQUE
indexes! Similarly `pg_indexam_has_property(783, 'can_unique')` returns false. There is something
muddled about all that. So how about we give the GiST AM handler amcanunique?

As I understand it, GiST indexes are capable of uniqueness,[3] and indeed today you can create an
exclusion constraint with the same effect, but in the past the core had no way of asking an opclass
which operator gave equality. With the stratnum support proc from 6db4598fcb (part of this patch
series, but reverted from v17), we could get a known operator for "equals". If the index's opclasses
had that sproc and it gave non-zero for RTEqualStrategyNumber, then CREATE UNIQUE INDEX would
succeed. We would just ("just") need to make GiST raise an error if it found a duplicate. And if
*that* was happening, the empty ranges wouldn't cause a problem.

I think Option 3 is good, but I like Option 4 a lot because (1) it doesn't assume ranges &
multiranges (2) it allows empties if users have some reason for them (3) since the real problem is
duplicates, forbidding them is a more precise solution, (4) it clears up the confusing situation of
GiST not being canunique, even though you can create an index with indisunique.

OTOH it is probably more work, and it is slower than just forbidding duplicates. (The unique check
requires a separate index search, according to [3], as an exclusion constraint would do.) Also if we
do it to make GiST be canunique, that can happen separately from the temporal work.

So I'm proceeding with Option 3, which at worst can eventually become an optimization for Option 4.
I don't think forbidding empty ranges is a great loss to be honest. But if anyone has any feedback,
please share: ojections, alternatives, advice---all is welcome.

[1]
https://www.postgresql.org/message-id/47550967-260b-4180-9791-b224859fe63e%40illuminatedcomputing.com
[2]
https://www.postgresql.org/message-id/CAEze2Wh21V66udM8cbvBBsAgyQ_5x9nfR0d3sWzbmZk%2B%2Bey7xw%40mail.gmail.com
[3] https://dsf.berkeley.edu/papers/sigmod97-gist.pdf

Yours,

--
Paul ~{:-)
pj(at)illuminatedcomputing(dot)com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paul Jungwirth 2024-06-05 20:57:40 Re: SQL:2011 application time
Previous Message Robert Haas 2024-06-05 20:55:58 Re: [multithreading] extension compatibility