Re: Oracle Style packages on postgres

From: "Dave Held" <dave(dot)held(at)arraysg(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Oracle Style packages on postgres
Date: 2005-05-11 20:55:06
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F90261850A@asg002.asg.local
Lists: pgsql-hackers

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Wednesday, May 11, 2005 2:22 PM
> To: Dave Held
> Cc: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Oracle Style packages on postgres
>
>
> "Dave Held" <dave(dot)held(at)arraysg(dot)com> writes:
> > /*
> > * We check the catalog name and then ignore it.
> > */
> > if (!isValidNamespace(name[0]))
> > {
> > if (strcmp(name[0], get_database_name(MyDatabaseId)) != 0)
> > ereport(ERROR,
>
> Which more or less proves my point: the syntax is fundamentally
> ambiguous.

Not at all. Ambiguity means that there are two equally valid
parses. Under the semantics I proposed, schema names take
precedence. That is, given:

db: foo
schema: bar
schema: foo.bar

The expression foo.bar.rel.col refers to schema foo.bar, and not
to db foo, schema bar. If by "fundamentally ambiguous", you mean
"there is no a priori reason to choose one set of semantics over
another", I would tend to disagree, but the syntax as I proposed
it is not ambiguous. We use precedence to eliminate otherwise
valid parses all the time.
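
To make the precedence rule concrete (the nested CREATE SCHEMA
syntax below is my proposal, not anything Postgres accepts today):

  -- assume the current database is named "foo"
  CREATE SCHEMA bar;        -- ordinary top-level schema
  CREATE SCHEMA foo;        -- top-level schema shadowing the db name
  CREATE SCHEMA foo.bar;    -- proposed: subschema "bar" under "foo"
  CREATE TABLE foo.bar.rel (col integer);

  SELECT foo.bar.rel.col FROM foo.bar.rel;
  -- foo.bar.rel.col resolves to schema "foo", subschema "bar",
  -- table "rel", column "col", and never to db "foo", schema "bar"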

> I suppose people would learn not to use schema names that
> match the database they are in, but that doesn't make it a
> good idea to have sensible behavior depend on non-overlap of
> those names.

There's nothing wrong with using a schema name that matches the
db. The only confusion comes when you put nested elements at
both the db level and schema level having the same names. Since
I presume most people don't specify db names in their queries,
having schemas take precedence makes the most sense to me.

> [ thinks for awhile ... ]
>
> OTOH, what if we pretended that two-level-nested schemas ARE
> catalogs in the sense that the SQL spec expects? Then we could
> get rid of the pro-forma special case here, which isn't ever
> likely to do anything more useful than throw an error anyway.
> Thus, we'd go back to the pre-7.3 notion that the current
> Postgres DB's name isn't part of the SQL naming scheme at all,
> and instead handle the spec's syntax requirements by setting up
> some conventions that make a schema act like what the spec says
> is a catalog.
> [...]

I think this would be worse than not having nested schemas at all.
It looks, feels, and smells like a hack. There should be a
reasonable limit on schema nesting depth, but it should be much
larger than 2; 8 seems far more reasonable. One can
argue that nested schemas are nothing more than syntactic sugar,
and this is most definitely true. But as programming language
design teaches us, syntactic sugar is everything. The better our
tools can model our problem spaces, the better they can help us
solve our problems.

A way in which nested schemas are more than syntactic sugar is in
the fact that they can provide a convenient means of additional
security management. Rather than twiddling with the privileges on
groups of objects within a schema, objects that should have similar
privileges can be put in the same subschema.
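
For instance (again in the proposed nested syntax, and with made-up
role and schema names, so this is a sketch rather than working DDL):

  CREATE SCHEMA payroll;
  CREATE SCHEMA payroll.public_info;   -- proposed subschema syntax
  CREATE SCHEMA payroll.restricted;

  -- one GRANT per privilege level instead of one per object
  GRANT USAGE ON SCHEMA payroll.public_info TO staff;
  GRANT USAGE ON SCHEMA payroll.restricted TO payroll_admins;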

However, returning to the original topic of the thread, nested
schemas are not nearly as interesting to me as the encapsulation
provided by a package-like feature. To be honest, though, what
tantalizes me is not the prospect of a package feature but an
expansion of the Type system.

As a reasonably popular production system, Postgres must necessarily
be conservative. But its roots lay in experimentation, and vestiges
of those roots can still be seen in its structure. Because of its
maturity, Postgres is well positioned to implement some rather
advanced concepts, but perhaps the most radical of them should be
implemented in a fork rather than the main system.

Traditionally, a database is seen as a warehouse of raw data.
ODBMSes position themselves as the next generation by viewing a
database as a collection of persistent, richly structured objects.
Both views have strengths and weaknesses. Postgres takes an
interesting middle ground position within the ORDBMS space. It
is heavily relational with strong support for standard SQL and
numerous query tuning options. But it also features an interesting
number of rather non-relational concepts, like custom operator
definitions, operator classes, user-defined conversions and types.
However, it seems to me that these features are probably very
underutilized.
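
To be clear about what is already there: this much is legal Postgres
today, a user-defined case-insensitive equality operator (a minimal
example, with made-up names):

  CREATE FUNCTION ci_eq(text, text) RETURNS boolean
    AS 'SELECT lower($1) = lower($2)'
    LANGUAGE sql IMMUTABLE;

  CREATE OPERATOR ==* (
    LEFTARG   = text,
    RIGHTARG  = text,
    PROCEDURE = ci_eq
  );

  SELECT 'Smith' ==* 'SMITH';   -- true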

There are probably two reasons for this: 1) most programmers aren't
used to being able to define custom operators in their favorite
programming language, so the concept isn't familiar enough for them
to try it in their DBMS; and 2) the supporting features aren't
designed or presented cohesively enough to persuade programmers that
this is a compelling and superior way to go about things.

The fact is, operator overloading is a *very* powerful way to
program. In particular, it is one of the key factors in supporting
generic programming in a natural way. People who are unsure of this
claim should examine the Spirit parser generator or Blitz++, among
numerous other examples. It's something of a tragedy that such a
powerful feature is sitting languishing in a database system, when
most mainstream PLs have yet to catch up! But that is exactly the
problem. Operator overloading is a concept that is ahead of its
time, and has been ever since it was invented way back in the early
days of Lisp.

Similarly, the user-defined conversions speak to the chimeric nature
of Postgres: its attitude towards typing. In some areas, Postgres is
strongly typed; in others, it is weakly or dynamically typed; in yet
others it is polymorphically typed. But Postgres isn't
alone in this conundrum. Languages that are very strongly typed tend
to require casting between types. Languages that are weakly typed
or dynamically typed are more prone to type errors (just ask Java
programmers about ClassCastException).
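
Postgres already lets you choose your point on that spectrum one
conversion at a time. A sketch with made-up composite types (I
believe casts between composites via a conversion function are
legal, but treat this as illustrative rather than tested DDL):

  CREATE TYPE fahrenheit AS (degrees numeric);
  CREATE TYPE celsius    AS (degrees numeric);

  CREATE FUNCTION f_to_c(fahrenheit) RETURNS celsius AS
    'SELECT ROW((($1).degrees - 32) * 5 / 9)::celsius'
    LANGUAGE sql IMMUTABLE;

  -- explicit only: no AS IMPLICIT, so it never happens silently
  CREATE CAST (fahrenheit AS celsius) WITH FUNCTION f_to_c(fahrenheit);

  SELECT CAST(ROW(212)::fahrenheit AS celsius);   -- (100)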

The trend in PL design is that strong typing is better. In fact, the
type system is the essence of a PL. The grammar determines what is a
legal program in a given language. The type system determines what is
*not* a legal program. By eliminating nonsensical programs, a type
system enforces good programming behavior. So the power of a type
system is not in what it enables, but rather what it *disables* or
disallows. That is why encapsulation is good. It does not add a
powerful new form of computation to a programming language. Rather,
it *restricts* what can be done so that dangerous operations can be
eliminated or at least localized to an easily-inspected region of
the code.

Postgres has an opportunity to revolutionize database design by
introducing a strongly typed DBMS that offers some of the features
of a pure ODBMS while providing the traditional RDBMS capabilities.
I don't know exactly what such a beast would look like, but I do know
that I see a lot of the pieces already in Postgres, waiting to be
chipped and molded and fit together into a more elegant whole. Let
me give an illustration of how this might possibly come about.

The canonical example is the Employee table. In a flat file world,
this table would be a simple set of records, each containing possibly
redundant information about a set of employees: names, SSNs, phones,
positions, salaries, etc. To access the employees, we fetch the records
of interest from the DB and manipulate them mostly in our client or
middleware code. The DB does nothing more than load and save our
data, and assist with searches.

In the relational world, we factor out the common data and put them
into different relations, linking to them with foreign keys. We
basically break the record up into pieces, and reassemble the pieces
as needed. For compiling reports that look at the data as an aggregate
in different ways, this is a very powerful and efficient way to go
about things. For dealing with individual employees as singular
entities, this is not such an efficient way to do things. It is very
space efficient, but it pays for that space efficiency with time.
When we want our time back, we flatten the relations and materialize
our views and basically tread back towards the flat file world.
When a client wants a tuple, the DBMS has to resurrect the tuple from
its component relations, as if it were breathing life into it from the
clay of the earth. However, the DBMS can be clever and offer some
server-side triggers or procedures to assist the client in manipulating
the object.

In the ODBMS world, we return partway to the flat file world by
serializing objects in a more or less flat way. However, the objects
can refer to each other in a relational way, and those relations are
captured as well. But the normalization typical in the relational
world is by no means typical of the OOP world, so the data factoring
is much less aggressive, which has both drawbacks and benefits,
depending on the application. Once again, the DBMS is just a data
server, serializing and unserializing objects to disk. The process is
very fast and clean for object-oriented data, but not so fast for
flat data.

The problem in the database world is the same as the problem in the
PL world: people are conflating types and representations. A
representation is a raw data format, like a 16-bit 2's complement
signed integer, or an 8-bit ASCII character. A type is a set of
values, an instance of which is stored in some representation. An
Integer may be stored in a 16-bit signed int, or it may be
represented by some 128-bit memory location. But really, most
real-world problems do not have Integers as a natural type in the
model. Going back to our Employee, the natural types will be more
like SSN: the values are always 9 digits long and there are rules
about allowed prefixes, etc. The set of values in the SSN type is
smaller than the set of values in the 9-digit-Integers type. So
9-digit-Integer can be a representation of SSN, but an SSN *type*
implies more. It doesn't make sense to do arithmetic on an SSN,
so arithmetical operations should neither be defined nor allowed
on SSNs. To do arithmetic on one, you should be forced to cast to
another type, and the cast should check whether this is a sensible
or valid operation in the current context.
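
Postgres's domains get partway there today. A minimal sketch (the
CHECK is illustrative, not the full set of SSA rules):

  -- area prefixes 000 and 666 are never issued
  CREATE DOMAIN ssn AS numeric(9,0)
    CHECK (VALUE > 0 AND trunc(VALUE / 1000000) NOT IN (0, 666));

  CREATE TABLE employee (
    name   text,
    social ssn     -- values failing the CHECK are rejected outright
  );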

Now the thing about the relational model is that it is essentially
about *representations*. So the things that we call "attribute
types" are really "attribute representations", which is why there
are so few of them by default. In reality, every problem domain
brings a large host of richly defined types that map onto those
representations to a greater or lesser degree.

The reason people want packages is because functions in Postgres,
and indeed, most RDBMSes, operate on representations, which allows
you to do something like take the square root of an SSN, which
makes perfect sense if the SSN is a numeric(9, 0). Programmers
who want greater correctness see that type safety is the way to
get there. And type safety can only be enforced in a strongly
typed environment. The interesting thing is that domains are
essentially the formal notion of a type in the type vs.
representation distinction. In a strongly typed ORDBMS, tuples
would be defined as sets of domains, rather than sets of "types"
(representations). And operations on tuples would be strictly
type checked.
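
The current loophole is easy to demonstrate: a Postgres domain
silently decays to its representation, so given the ssn domain
sketched above, this is accepted without complaint:

  SELECT sqrt(social) FROM employee;
  -- "social" coerces to numeric(9,0) and sqrt(numeric) happily
  -- applies; a strongly typed system would reject this outright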

Note that the other features of OOP, inheritance and polymorphism,
are really just ways to define exceptions or loopholes in the
type system. Inheritance allows you to say that one type is
substitutable for another, and polymorphism says that an operation
knows how to act on more than one type, even when the types aren't
related.
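
Postgres already carries both loopholes in embryonic form: table
inheritance on one side, and polymorphic functions on the other.
For instance, this single definition (a small example with made-up
names) acts on arrays of any element type:

  CREATE FUNCTION first_of(anyarray) RETURNS anyelement
    AS 'SELECT $1[1]' LANGUAGE sql IMMUTABLE;

  SELECT first_of(ARRAY[3, 1, 2]);        -- 3, an integer
  SELECT first_of(ARRAY['c', 'a', 'b']);  -- 'c', a text value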

So back to the future...in an ORDBMS world you should not ask for
a collection of representations. You should ask for a collection
of objects. You should not ask for transformation of tuples, but
rather transformation of objects and sets of objects. And the
user should be able to define a sufficiently rich interface to
those objects that the ORDBMS can easily fulfill requests on them.

Say you want to increase the salary of every employee at pay grade
3. You should not be able to directly manipulate the salary field
of the Employee relation. Rather, you should define a method
like RaiseSalary() that operates on Employees (or better yet, is
a member of the Employee type), and then define the set of
Employees over which that method should be applied. To use a
somewhat mathematical notation:

RaiseSalary({Employee e | e.payGrade() == 3});
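
In today's Postgres you would approximate that with something like
the following (a sketch; the table and column names are invented,
and SECURITY DEFINER merely stands in for real encapsulation):

  -- the set {Employee e | e.payGrade() == 3} becomes a WHERE
  -- clause; RaiseSalary() becomes a function hiding the UPDATE
  CREATE FUNCTION raise_salary(grade integer, pct numeric)
  RETURNS void AS
    'UPDATE employee SET salary = salary * (1 + $2 / 100.0)
      WHERE pay_grade = $1'
    LANGUAGE sql SECURITY DEFINER;  -- runs with the owner's rights

  REVOKE UPDATE ON employee FROM PUBLIC;  -- no direct salary writes
  SELECT raise_salary(3, 5);              -- apply over the grade-3 set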

That's not to say that SQL should go away. But it should be a language
of the representations, an implementation detail that is hidden by
a more powerful type-safe facade that helps ensure program
correctness.

Of course, I'm not making any specific proposal to change Postgres
at this time. I'm just giving the community some food for thought
to chew on when considering future directions of DBMSes in general
and Postgres in particular. I could go on about generic programming
and its relation to the ideas above and the current features in
Postgres, but I will save that discussion for a rainy day.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
