providing tokenized version of parsed SQL script (was: nodeToString format and exporting the SQL parser)

From: "Jehan-Guillaume (ioguix) de Rorthais" <ioguix(at)free(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, David Fetter <david(at)fetter(dot)org>, Markus Wanner <markus(at)bluegap(dot)ch>, Michael Tharp <gxti(at)partiallystapled(dot)com>
Subject: providing tokenized version of parsed SQL script (was: nodeToString format and exporting the SQL parser)
Date: 2010-04-27 23:45:34
Message-ID: 4BD7771E.8090805@free.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 25/04/2010 03:02, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> Sounds useful to me, though as a function like suggested in a later
>>> email.
>
>> If tool-builders think this is useful, I have no problem with making
>> it available. It should be suitably disclaimed: "We reserve the right
>> to rip out the entire flex/yacc-based lexer and parser at any time and
>> replace them with a hand-coded system written in Prolog that emits
>> tokenization information only in ASN.1-encoded pig latin. If massive
>> changes in the way this function works - or its complete disappearance
>> - are going to make you grumpy, don't call it."
>
> I'm a bit concerned with the vagueness of the goals here. We started
> with a request to dump out node trees, ie, post-parsing representation;
> but the example use case of syntax highlighting would find that
> representation quite useless. (Example: foo::bar and CAST(foo AS bar)
> yield the same parse tree.)

Well, the tokenizer stuff was actually my understanding of the following
quote from Michael Tharp :
« ... making the internal SQL parser available to clients via a
C-language SQL function. ».

I thought Michael was trying to write a tokenizer based on node tree
returned by raw_parser. As it seems Michael is not even sure about what
he's trying to do, I prefer refocus a bit this thread

> A syntax highlighter might get some use
> out of the lexer-output token stream, but I'm afraid from the proposed
> output that people might be expecting more semantic information than
> the lexer can provide. The lexer doesn't, for example, have any clue
> that some keywords are commands and others aren't; nor any very clear
> understanding about the semantic difference between the tokens '='
> and ';'.

Exact, a proper tokenizer function should be able to give some (simple)
information about the type of each token. That is what I tried to define
in this draft with the "type" field :

=> SELECT pgtokenize($script$
SELECT 1;
UPDATE test SET "a"=2;
$script$);

type | pos | value | line
-------------+-----+----------+------
SQL_COMMAND | 1 | 'SELECT' | 1
CONSTANT | 8 | '1' | 1
DELIMITER | 9 | ';' | 1
SQL_COMMAND | 11 | 'UPDATE' | 2
IDENTIFIER | 18 | 'test' | 2
SQL_KEYWORD | 23 | 'SET' | 2
IDENTIFIER | 27 | '"a"' | 2
OPERATOR | 30 | '=' | 2
CONSTANT | 31 | '1' | 2

>
> Also, if all you want is the lexer, it's not that hard to steal psql's
> version and adapt it to your purposes. The lexer doesn't change very
> fast, and it's not that big either.

Stealing the lexer from psql is possible...for C application.
Don't know yet if we could port it to other languages easily and if a
simple lexer would really answer the use cases here.

>
> Anyway, it certainly wouldn't be hard for an add-on module to provide a
> SRF that calls the lexer (or parser) and returns some sort of tabular
> representation of the results. I'm just not sure how useful it'll be
> in the real world.

Well, I would prefer not to tell users of pgAdmin or phpPgAdmin that
they depend on a contrib module.
Moreover, PostgreSQL already expose a lot of informations about its
internal mechanisms, configuration, ddl etc. I think having a proper
tokenizer function is just a natural new functionality for core if possible.

Having dropped an eye here and there in the parser code, I am not sure
where I could get required info and mix them to produce something close
to my draft yet.
But I prefer to discussing first before spending too much time and
throwing any potential code after...

>
> regards, tom lane

- --
JGuillaume (ioguix) de Rorthais
http://www.dalibo.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvXdxgACgkQxWGfaAgowiJujQCglXpCYpFttwHOkmkCd92zMxnv
r00An1sjmRrR6u61VjCtXputcNBevHsz
=ri3i
-----END PGP SIGNATURE-----

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2010-04-28 00:13:09 Re: testing HS/SR - 1 vs 2 performance
Previous Message Alvaro Herrera 2010-04-27 23:45:15 road.thepath no longer in pg_stats?