From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jordan Hurwich <jhurwich(at)pulsasensors(dot)com> |
Cc: | pgsql-novice(at)lists(dot)postgresql(dot)org, Gautam Bellary <gautam(at)pulsasensors(dot)com> |
Subject: | Re: Issue Supporting Emojis in Full Text Search on Ubuntu |
Date: | 2020-07-01 00:27:23 |
Message-ID: | 547627.1593563243@sss.pgh.pa.us |
Lists: | pgsql-novice |
Jordan Hurwich <jhurwich(at)pulsasensors(dot)com> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-English, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.
With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8. (Perhaps there are other locale
settings that would consider it a letter? Not my expertise though.)
However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.
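A minimal sketch of how one might test this, assuming a scratch database named emoji_test (a hypothetical name, not from the original report). The locale of a database is fixed when it is created, so the sketch builds a new one from template0 rather than altering an existing one:

```sql
-- Hypothetical scratch database. template0 is used because a database
-- whose locale differs from template1's generally has to be cloned from it.
CREATE DATABASE emoji_test
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- After reconnecting to the new database (in psql: \c emoji_test),
-- ask the default parser how it classifies the emoji.
SELECT alias, description, token
FROM ts_debug('english', '😀');
```

If the C-locale behavior described above holds, the token should come back classified as a word rather than as blank space.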
macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.
> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:
That's unsurprising in itself, per the above. There's no standardization
worth mentioning in this area.
> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().
lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right. You gave few details
about how you tested this.
> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
> datname | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
> postgres | 6 | C | C | f
> template0 | 6 | C | C | t
> template1 | 6 | C | C | t
> testdb01 | 6 | C | C | f
Hm, how are you getting that result when your prompt says you're
connected to database "db"?
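One way to rule out that kind of mix-up is to ask the session itself which database it is in and what locale that database was created with. This is only a sketch; it assumes nothing beyond the pg_database columns already shown in the quoted output:

```sql
-- Which database is this session actually connected to?
SELECT current_database();

-- Locale settings recorded for that database only.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();
```

Running ts_debug() in the same session right afterwards ties the parser's answer to a known lc_ctype.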
> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX
This just indicates that you didn't install the package(s) that provide
alternative locales. On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones. I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.
Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do. It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.
regards, tom lane