Re: endash not a graphic character?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: rob stone <floriparob(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: endash not a graphic character?
Date: 2016-08-21 18:24:16
Message-ID: 25067.1471803856@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> However I am wondering about my use of [[:graph:]] to match characters
> that have glyphs. I was not expecting there to be characters that have
> glyphs to not be in the graph class. In the short term I might want to
> change the way I am testing that.

[ looks into code... ] The [[:foo:]] notations only work up to Unicode
code point U+7FF at the moment, per this comment in regc_pg_locale.c:

* Decide how many character codes we ought to look through. For C locale
* there's no need to go further than 127. Otherwise, if the encoding is
* UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
* extend it as far as we'd like (say, 0xFFFF, the end of the Basic
* Multilingual Plane) without creating significant performance issues due
* to too many characters being fed through the colormap code. This will
* need redesign to fix reasonably, but at least for the moment we have
* all common European languages covered. Otherwise (not C, not UTF8) go
* up to 255. These limits are interrelated with restrictions discussed
* at the head of this file.

Unfortunately, these particular characters are U+2013 and U+2014 so you
lose.

Obviously there's room for improvement here, but so far nobody's been
motivated to work on it. Last discussion about it (AFAIR) was this
thread:

https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

I'm not sure if any of the subsequent work on the regex engine would
make it any easier to fix than it seemed at the time.

regards, tom lane

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruno Wolff III 2016-08-21 18:55:19 Re: endash not a graphic character?
Previous Message Bruno Wolff III 2016-08-21 18:03:33 Re: endash not a graphic character?