From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
Subject: | Re: fulltext parser strange behave |
Date: | 2007-11-08 20:11:44 |
Message-ID: | 47336D80.5000401@dunslane.net |
Lists: | pgsql-hackers pgsql-patches |
Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
>>
>>> Tom Lane wrote:
>>>
>>>> Well, the state machine definitely thinks that tag names should contain
>>>> only ASCII letters (with possibly a leading or trailing '/'). Given the
>>>> HTML examples I suppose we should allow non-first digits too. Is there
>>>> anything else that should be considered a tag? What about dash and
>>>> underscore for instance?
>>>>
>>
>>
>>> The docs say we specifically accept HTML tags. Are we really just
>>> accepting anything that is a string of ASCII letters as the tag
>>> name? Then we should adjust the docs. <foo> and <foo1234> are not
>>> HTML tags.
>>>
>>
>> I don't think I want to try to maintain a list of exactly which
>> identifiers are considered valid tag names ... and if I did, I wouldn't
>> put it into the parser. It would be a dictionary's job to tell valid
>> from invalid tag names, no?
>>
>>
>>
>
> I don't have a quarrel with that. But then we should be more clear
> about what we are recognizing. We could describe the thing as an
> HTML-like tag, possibly. I think the same probably goes for entities too.
>
>
I've just been looking at the state machine in wparser_def.c. I think
the processing for entities is also a few bob short in the pound. It
recognises decimal numeric character references, but not hexadecimal
numeric character references. That's fairly silly since the HTML spec
specifically says the latter are "particularly useful". The rules for
named entities are also deficient w.r.t. digits, just like the case of
tags that Tom noticed. This isn't academic: HTML features a number of
named entities with digits in the name (sup2, frac14 for example).
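
To make the three entity forms concrete, here is a minimal Python sketch. It is not the state machine from wparser_def.c; the pattern names and the classify_entity helper are hypothetical, and the named-entity rule (a letter followed by letters or digits) is only an assumption about what a fixed rule might look like.

```python
import re

# Hypothetical patterns for illustration; none of this is taken from wparser_def.c.
DECIMAL_REF = re.compile(r"&#[0-9]+;")           # e.g. &#229;
HEX_REF = re.compile(r"&#[xX][0-9A-Fa-f]+;")     # e.g. &#xE5;
# Assumed rule: a letter, then letters or digits, so sup2 and frac14 qualify.
NAMED_REF = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def classify_entity(text):
    """Return 'decimal', 'hex', 'named', or None for a candidate entity string."""
    if DECIMAL_REF.fullmatch(text):
        return "decimal"
    if HEX_REF.fullmatch(text):
        return "hex"
    if NAMED_REF.fullmatch(text):
        return "named"
    return None


if __name__ == "__main__":
    for candidate in ("&#229;", "&#xE5;", "&sup2;", "&frac14;", "&1abc;"):
        print(candidate, classify_entity(candidate))
```

A rule that only accepts letters after the '&' would reject the sup2 and frac14 cases, and a rule without the hex branch would reject &#xE5;.
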
In XML at least, legal names are defined by the following rules from the
spec:

    [4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] |
                           [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
                           [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
                           [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] |
                           [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
    [4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] |
                           #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
    [5]  Name          ::= NameStartChar (NameChar)*

Restricting this to ASCII, we get:

    [4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z]
    [4a] NameChar      ::= NameStartChar | "-" | "." | [0-9]
    [5]  Name          ::= NameStartChar (NameChar)*

or this regex for Name:

    [A-Za-z:_][A-Za-z0-9:_.-]*

I suggest we use that or something very close to it as the rule for
names in these patterns.
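
As a quick check of what that regex accepts, here is a small Python sketch. The is_ascii_name and tag_name helpers are hypothetical names for illustration, and the tag pattern assumes a bare tag with no attributes and an optional leading or trailing '/'. It is a sketch of the proposed rule, not of the existing parser.

```python
import re

# The ASCII restriction of the XML Name production, as given above.
NAME_PATTERN = r"[A-Za-z:_][A-Za-z0-9:_.-]*"
NAME = re.compile(NAME_PATTERN)

# Assumption: a bare tag with no attributes, with an optional leading
# or trailing '/', e.g. <foo>, </foo>, <foo/>.
BARE_TAG = re.compile(r"</?(" + NAME_PATTERN + r")/?>")


def is_ascii_name(s):
    """True if the whole string is a Name under the ASCII-restricted rule."""
    return NAME.fullmatch(s) is not None


def tag_name(tag):
    """Return the name inside a bare tag, or None if it is not one."""
    m = BARE_TAG.fullmatch(tag)
    return m.group(1) if m else None


if __name__ == "__main__":
    for s in ("foo", "foo1234", "frac14", "my-tag_2.x", "xsl:template", "1foo", "-foo"):
        print(s, is_ascii_name(s))
    for t in ("<foo1234>", "</h1>", "<br/>", "<1x>"):
        print(t, tag_name(t))
```

Under this rule digits, dash, dot, underscore and colon are all allowed after the first character, while a leading digit or dash is rejected.
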
cheers
andrew