From: | Darren Duncan <darren(at)darrenduncan(dot)net> |
---|---|
To: | Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Fetter <david(at)fetter(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Christopher Manning <c(at)christophermanning(dot)org>, fabriziomello(at)gmail(dot)com |
Subject: | Re: Proposal to add --single-row to psql |
Date: | 2013-05-21 02:18:16 |
Message-ID: | 519AD968.3030209@darrenduncan.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I have actually been working on the task discussed in this thread, most relevant
parts quoted below, for awhile now, and hope to have something concrete that you
can use by the end of this summer.
My in-development Muldis D language is homoiconic as a core feature, its source
code is data to it, and the native syntax of the language resembles an abstract
syntax tree. This tree of nodes primarily defines behavior of the code, but it
also supports arbitrary metadata per node, which for example can be used to
preserve concrete syntax for any programming language that can be parsed into
Muldis D nodes or conversely generated from said.
For example, you can have one type of node defining a function, and its details
are defined by its attributes or child nodes, such as its result type, its
parameters, whether it is declared associative/commutative/etc or not, and the
expression(s) defining its body. Another type of node defines a call to a
function, another type defines a text literal, another a relation literal, and
so on. Conversely, a node can define a schema or package etc. Example
metadata, which is also structured, could include line numbers or code comments
or whitespace or exact numeric literal formats or quoting formats or exact
keyword choices, for example.
Using these node data types, we have a fairly normalized representation of any
source code (or data) that is easy to introspect or transform with code. A key
design of Muldis D is the clear separation of syntax and behavior.
A parser is just a function that takes (typically) a character string as input
and produces a node (tree) as output. A compiler is just a function or
operation that takes a node (tree) as input and produces machine code. An
optimizer can be a function that derives one node tree from another, either as
its own operation or typically as part of the compiler stage.
A compiler or optimizer generally can trivially ignore the node metadata, but a
code generator or debugger can use it; metadata can be stripped without
affecting behavior. The canonical tree form can also easily be mapped
losslessly with relation forms, such as typical information schemas have.
Practically all behavior is defined in terms of generic type and routine
definitions. Practically all system-defined types and routines are defined in
one or more libraries/modules/packages that have the same format as those users
would write like extensions. So, all the relational operators have the same
syntax as say string or numeric or array operators.
I envision that the most effective way to use Muldis D to handle an arbitrary
programming language, including the native SQL syntax+behavior of each SQL DBMS,
is to have a Muldis D library+parser pair for it.
For example, to represent PostgreSQL 9.2 most directly, we have a library with
an explicitly defined type and routine for every built-in that Pg 9.2 has, and
we also have a parser function that takes the SQL syntax that Pg 9.2 understands
and produces a Muldis D node tree consisting of calls to the routines of that
library or value selectors of types in that library (things like SELECT and
INSERT etc are each mapped to a routine too). That way, even with a standard
node format that isn't specific to a typical language or version, the code for
parsing Pg 9.2 SQL has the minimal amount of indirection that it has to deal
with, as each syntax element would have a direct library call counterpart.
Similarly, the existing Pg 9.2 SQL compiler would have the least indirection to
take said nodes and execute them. (The library would be named eg "Postgres_9_2"
for example, which makes it easier say to also have one side-by-side for other
versions, shims, legacy code you want to more easily support compatibility with.)
Where one decides to do cross-compilation, say make Oracle SQL run on Pg, that
could be done as simply as defining a library for Oracle SQL with the
routines+types that has, and then mapping it to a Pg one just in terms of shim
calls (which the compiler can optimize away as applicable), and so parsers or
compilers never necessarily have to deal with behavior compatibility issues.
I am presently working out the exact set of such language nodes, and making a
reference implementation which is mostly self-defined and would compile to Perl
code, and hope to have the first Muldis D executable by the end of this summer.
I am confident that an adaption of this design into C or whatever would serve
Postgres greatly in letting it effectively parse multiple languages, anywhere
from application programming languages to multiple SQL dialects or versions.
Even if you don't decide to use my design specifically (which is open source and
you can influence it), I think you should find some of the general design
principles I stated above to be useful. Representing behavior as libraries the
AST being flexible enough for any concrete language without being too specific
to details of one. And of course, cross-invocation of code written in multiple
languages is made much easier.
Note, just to be clear, my proposal does not necessitate that all of a node tree
has to be kept in memory at once. This design should be adaptable to a
streaming approach, especially as it is expected to be able to handle database
dumps or transfers of arbitrary size, same as SQL engines can today. That is in
contrast to probably what most application languages would assume, where
everything would fit in memory at once. But the ability to stream or not would
largely be an implementation detail. Realistically, all code should fit in
memory at once, and anything that would have to be buffered out of memory would
be say embedded data literals (whether large strings or simply large relations).
If you're not sure how my proposal would address any of the needs or wants
raised in the thread, go ahead and ask, and I will try and answer as time permits.
-- Darren Duncan
On 2013.05.11 9:27 AM, Tom Lane wrote:
> David Fetter <david(at)fetter(dot)org> writes:
>> On Sat, May 11, 2013 at 11:17:03AM -0400, Robert Haas wrote:
>>> Some kind of extendable parser would be awesome. It would need to tie
>>> into the rewriter also.
>>>
>>> No, I don't have a clue what the design looks like.
>
>> That's a direction several of the proprietary RDBMS vendors have
>> proposed. I think it'd be great :)
>
>> Pre-coffee (yeah, I know. Bad idea.) sketch of an idea: create an API
>> to pass expression trees in and out. This could have other benefits
>> as to clustering space, shortening the planning cycle, etc., but let's
>> not go there for now. My knowledge is very, very sketchy, but when I
>> squint, the expression trees we use look a lot like JSON. Are they
>> isomorphic?
>
> By the time you've got an expression tree, the problem is mostly solved,
> at least so far as parser extension is concerned.
>
> More years ago than I care to admit, I worked on systems that had
> run-time-extensible parsers at Hewlett-Packard, so technology for this
> does exist. But my (vague) memory of those systems is that the parser's
> language capabilities were more limited than bison's, perhaps only
> LL(1). Parsing spec-compatible SQL that way might be a challenge.
>
> A larger issue is that if you don't have the whole grammar available
> to check, it's difficult to be sure there are no parsing conflicts.
> I seem to remember that we hit some conflicts between different
> extension ROMs back at HP :-(
>
> Another point is that extensions that are actually interesting require
> a lot more than new syntax. Robert mentioned the rewriter, but you'd
> typically need planner and executor additions as well. It's possible to
> see how whole new plan node types might be added by a plugin so far as
> the executor is concerned, but I haven't a clue how we'd get the planner
> to emit them ...
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2013-05-21 02:38:50 | Re: [PATCH] Correct release notes about DROP TABLE IF EXISTS and add, link. |
Previous Message | Michael Paquier | 2013-05-20 23:50:25 | Removal of pageinspect--1.0.sql |