From: | Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com> |
---|---|
To: | Corey Huinker <corey(dot)huinker(at)gmail(dot)com> |
Cc: | Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: SQL:2011 application time |
Date: | 2021-09-14 03:56:47 |
Message-ID: | CA+renyXTJ_mACO=_pZ7q06qaGy3KsG60Q66HXypCGzSxM_qRkA@mail.gmail.com |
Lists: | pgsql-hackers |
Hi Corey,
Thanks for all the good questions!
> 1. This patch creates a pg_period catalog table, whereas the system versioning relies on additions to pg_attribute to identify the start/end columns. Initially I thought this was because it was somehow possible to have multiple application periods defined on a table, but in reading [1] I see that there are some design suppositions that would make a second application period impossible[2]. I can also see where having this table would facilitate the easy creation of INFORMATION_SCHEMA.PERIODS. I was previously unaware that this info schema table was a thing, but I have found references to it, though I'm unclear as to whether it's supposed to have information about system versioned tables in it as well.
Yes, information_schema.periods is given by the standard. Having
pg_period seems like a natural place to store periods, since they are
separate entities. I think that is a better design than just storing
them as extra fields in pg_attribute. It follows normal normalization
rules.
The standard forbids multiple application-time periods per table. From
SQL:2011 in the SQL/Foundation section
(7IWD2-02-Foundation-2011-12.pdf available from
http://www.wiscorp.com/sql20nn.zip) under 11.27 <add table period
definition>:
> 5) If <table period definition> contains <application time period specification> ATPS, then:
> b) The table descriptor of T shall not include a period descriptor other than a system-time period descriptor.
In other words you can add both a SYSTEM TIME period and one other
application-time period (whose name is your choice), but if you
already have an application-time period, you can't add another one.
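To make the rule concrete, here is a small sketch in Python. It is illustrative only and is not how the patch enforces the rule: `Period` and `add_period` are hypothetical names, and comparing period names case-insensitively is a simplifying assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

SYSTEM_TIME = "SYSTEM_TIME"


@dataclass(frozen=True)
class Period:
    """Hypothetical stand-in for a period descriptor."""
    name: str
    start_column: str
    end_column: str

    @property
    def is_system_time(self) -> bool:
        # Assumption: names are compared case-insensitively.
        return self.name.upper() == SYSTEM_TIME


def add_period(existing: list[Period], new: Period) -> list[Period]:
    """Return a new list with `new` added, or raise ValueError."""
    if any(p.name.upper() == new.name.upper() for p in existing):
        raise ValueError(f"period {new.name!r} already exists")
    # Rule 5b: when adding an application-time period, the table may
    # hold no period other than the system-time one.
    if not new.is_system_time and any(not p.is_system_time for p in existing):
        raise ValueError("table already has an application-time period")
    return existing + [new]
```

With this sketch, adding `valid_at` and then `SYSTEM_TIME` succeeds, while adding a second application-time period such as `valid_at2` raises `ValueError`.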
I also checked other RDBMSes and none of them allow it either:
In MariaDB 10.6.4 (the latest) I get "ERROR 4154 (HY000): Cannot
specify more than one application-time period".
Oracle disallows it with a vague error:
SQL> create table t2 (id int, valid_from date, valid_til date,
period for valid_at (valid_from, valid_til), period for valid_at2
(valid_from, valid_til));
create table t2 (id int, valid_from date, valid_til date, period
for valid_at (valid_from, valid_til), period for valid_at2
(valid_from, valid_til))
*
ERROR at line 1:
ORA-55603: invalid flashback archive or valid time period command
(Using different start/end columns for each period doesn't change the result.)
In IBM DB2 you can only have one because application-time periods must
be named "business_time" (not joking).
Mssql (2019) doesn't support application periods.
Personally I feel like it's a weird limitation and I wouldn't mind
supporting more, but my current implementation only allows for one,
and I'd have to rethink some things to do it differently.
Also: I think information_schema.periods *should* include SYSTEM_TIME
periods. The spec says (in SQL/Schemata, file
7IWD2-11-Schemata-2011-12.pdf at the link above), "The PERIODS base
table has one row for each period defined for a table. It effectively
contains a representation of the period descriptors." It doesn't say
anything about excluding system-time periods.
I checked mariadb, mssql, oracle, and db2, and I could only find this
table in db2, as syscat.periods. It includes both application-time and
system-time periods.
The spec calls for the columns table_catalog, table_schema,
table_name, period_name, start_column_name, and end_column_name. There
isn't a column to distinguish the period type, but since a period is a
system-time period iff its name is "SYSTEM_TIME", technically such a
column isn't needed.
The db2 columns are periodname, tabschema, tabname, begincolname,
endcolname, periodtype, historytabschema, and historytabname. The
periodtype column is either A or S (for application-time or
system-time).
> Q 1.1. Would a bitemporal table have two entries in that view?
Yes.
> Q 1.2. Could you see being able to implement this without pg_period, using only additions to pg_attribute (start/end for system temporal, start/end for application, plus an addition for period name)?
Not just period name, but also the range type associated with the
period (which should be determined at period creation, so that you can
pass an option to disambiguate if there are two ranges defined for the
same base type), the constraint oid (to prevent end <= start), and
some more data for inherited tables (not really used yet). It seems
ugly to hang all these extra values on a pg_attribute record.
> Q 1.3. Can you see a way to represent the system versioning in pg_period such that bitemporal tables were possible?
Yes. Even though the name "SYSTEM_TIME" is technically enough, I'd
still include a pertype column to make distinguishing system vs
application periods easier and more obvious.
> 2. The system versioning effort has chosen 'infinity' as their end-time value, whereas you have chosen NULL as that makes sense for an unbounded range. Other databases seem to leverage '9999-12-31 23:59:59' (SQLServer, IIRC) whereas some others seem to use '2999-12-31 23:59:59' but those might have been home-rolled temporal implementations. To further add to the confusion, the syntax seems to specify the keyword of MAXVALUE, which further muddies things. The system versioning people went with 'infinity' seemingly because it doesn't prescribe an end to the world like SQLServer did, but also because it allowed for a primary key based on (id, endtime) and that's just not possible with NULL endtime values.
I think it's a little weird that our system-time patch mutates your
primary key. None of the other RDBMSes do that. I don't think it's
incompatible (as long as the system time patch knows how to preserve
the extra period/range data in an application-time temporal key), but
it feels messy to me.
I would prefer if system-time and application-time used the same value
to mean "unbounded". Using null means we can support any type (not
just types with +-Infinity). And it pairs nicely with range types. If
the only reason for system-time to use Infinity is the primary key, I
think it would be better not to mutate the primary key (and store the
historical records in a separate table as other RDBMSes do).
Btw Oracle also uses NULL to mean "unbounded".
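As a rough sketch of why a null bound generalizes better than an Infinity value, here is a half-open range in Python where `None` means "unbounded". This is not PostgreSQL's range implementation: `HalfOpenRange` is a hypothetical name, and empty ranges are ignored for brevity. The point is that it works for integers, which have no infinite value, as well as for dates.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class HalfOpenRange:
    """Hypothetical [lower, upper) range; None means unbounded."""
    lower: Optional[Any]
    upper: Optional[Any]

    def contains(self, value: Any) -> bool:
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value >= self.upper:
            return False
        return True

    def overlaps(self, other: HalfOpenRange) -> bool:
        # [a, b) and [c, d) overlap iff a < d and c < b, where a missing
        # bound compares as "before everything" or "after everything".
        self_starts_in_time = (
            self.lower is None or other.upper is None or self.lower < other.upper
        )
        other_starts_in_time = (
            other.lower is None or self.upper is None or other.lower < self.upper
        )
        return self_starts_in_time and other_starts_in_time
```

No sentinel value is needed for any element type; the only requirement is that the bounds can be compared with `<`.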
We presently forbid PKs from including expressions, but my patch lifts
that restriction so it can index a rangetype expression built from the
period start & end columns. So even if we must include the system-time
end column in a PK, perhaps it can use a COALESCE expression to store
Infinity even while using NULL to signify "currently true" from a user
perspective.
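As a rough illustration of the COALESCE idea, here is a runnable sketch using Python's sqlite3 module rather than PostgreSQL. Several things here are assumptions: it uses a unique expression index instead of a real primary key, the table and column names are made up, and the string 'infinity' is only a sentinel because SQLite has no infinite timestamp value.

```python
import sqlite3


def make_history_table() -> sqlite3.Connection:
    """Create an in-memory table keyed on (id, COALESCE(sys_end, ...))."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE t (
            id        INTEGER NOT NULL,
            sys_start TEXT    NOT NULL,
            sys_end   TEXT             -- NULL means "currently true"
        )
        """
    )
    # The index stores the sentinel in place of NULL, so two "current"
    # rows for the same id collide even though users only ever see NULL.
    conn.execute(
        "CREATE UNIQUE INDEX t_key ON t (id, COALESCE(sys_end, 'infinity'))"
    )
    return conn


def try_insert(conn: sqlite3.Connection, row_id, sys_start, sys_end) -> bool:
    """Insert a row; return False if the unique key rejects it."""
    try:
        conn.execute(
            "INSERT INTO t (id, sys_start, sys_end) VALUES (?, ?, ?)",
            (row_id, sys_start, sys_end),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch a closed historical row and one current row (NULL end) can coexist for the same id, but a second current row for that id is rejected, which a plain unique key over a nullable column would not guarantee.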
> 3. I noticed some inconsistency in the results from various "SELECT * FROM portion_of_test" examples. In some, the "valid_at" range is shown but not the columns that make it up, and in some others, the "valid_from" and "valid_to" columns are shown, with no mention of the period. From what I've seen, the period column should be invisible unless invoked, like ctid or xmin.
In most cases the tests test the same functionality with both PERIODs
and rangetype columns. For FKs they test all four combinations of
PERIOD-referencing-PERIOD, PERIOD-referencing-range,
range-referencing-PERIOD, and range-referencing-range. If valid_at is
a genuine column, it is included in SELECT *, but not if it is a
PERIOD.
> 4. The syntax '2018-03-04' AT TIME ZONE INTERVAL '2' HOUR TO MINUTE simply confounded me.
Me too! I have no idea what that is supposed to mean. But that
behavior predates my patch. I only had to deal with it because it
creates a shift-reduce conflict with `FOR PORTION OF valid_at FROM x
TO y`, where x & y are expressions. I asked about this syntax at my
PgCon 2020 talk, but I haven't ever received an answer. Perhaps
someone else knows what this kind of INTERVAL means (as a modifier of
a time value).
> 5. I haven't seen any actual syntax conflicts between this patch and the system versioning patch. Both teams added basically the same keywords, though I haven't dove more deeply into any bison incompatibilities. Still, it's a great start.
I think that's right. Early on the other patch used `FOR PERIOD SYSTEM
TIME (x, y)` instead of the standard `FOR PERIOD SYSTEM_TIME (x, y)`
but I believe that was fixed, so that the period name is an identifier
and not two keywords.
> 6. Overall, I'm really excited about what this will mean for data governance in postgres.
Me too, and thank you for the detailed review!
Yours,
Paul