Re: Issue Supporting Emojis in Full Text Search on Ubuntu

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jordan Hurwich <jhurwich(at)pulsasensors(dot)com>
Cc: pgsql-novice(at)lists(dot)postgresql(dot)org, Gautam Bellary <gautam(at)pulsasensors(dot)com>
Subject: Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Date: 2020-07-01 00:27:23
Message-ID: 547627.1593563243@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Jordan Hurwich <jhurwich(at)pulsasensors(dot)com> writes:
> We rely on the Postgres tsvector implementation to enable full text search
> in our app, but we're having some issues with getting the parser to
> recognize emoji characters (like "😀" <U+1F600>) as anything other than
> "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> are recognized as "word"/"Word, all letters" characters on Mac; and
> non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> "word" characters on both Mac and Ubuntu.

With a non-C lc_ctype setting, which characters are considered to be
letters is entirely up to the locale code supplied by the platform's libc.
I duplicate your result that U+1F600 is not considered a letter by glibc,
at least not with lc_ctype = en_US.utf8. (Perhaps there are other locale
settings that would consider it a letter? Not my expertise though.)

However, with lc_ctype set to C, you should get a platform-independent
result that any non-ASCII character is a letter.

macOS' utf8-based locales are almost completely broken, unfortunately,
so it's hard to predict what they will do except that it'll probably be
wrong.

> We have not modified the 'english' text search configuration on either
> instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> returns different results on MacOS 10.15.5 and our Ubuntu instance:

That's unsurprising in itself, per the above. There's no standardization
worth mentioning in this area.

> There are minor differences between MacOS and Ubuntu in pg_database as
> follows, however modifications to set datcollate and datctype to 'C' on
> Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> ts_debug().

lc_ctype = C should have done what you want (at least for this specific
symbol), so I wonder whether you did the test right. You gave few details
about how you tested this.

> db=# select datname, encoding, datcollate, datctype, datistemplate from
> pg_database;
> datname | encoding | datcollate | datctype | datistemplate
> ----------------+----------+------------+----------+---------------
> postgres | 6 | C | C | f
> template0 | 6 | C | C | t
> template1 | 6 | C | C | t
> testdb01 | 6 | C | C | f

Hm, how are you getting that result when your prompt says you're
connected to database "db"?

> The result of `$ locale` on both instances is similar, included below for
> Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
> are returned while only 4 entries are returned on Ubuntu, included below:
> - on Ubuntu
> $ locale -a
> C
> C.UTF-8
> en_US.utf8
> POSIX

This just indicates that you didn't install the package(s) that provide
alternative locales. On my RHEL box, "locale -a" reports 865 entries,
and I'm pretty sure I don't have all the odder ones. I do not know,
unfortunately, how Ubuntu/Debian divvy this stuff up into packages.

Anyway, the bottom line is that if you need platform-independent results
then setting lc_ctype to "C" is what to do. It will not be very bright
about non-ASCII characters, but at least the stupidity will be uniform.

regards, tom lane

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Chamath Sajeewa 2020-07-01 14:38:14 PostgreSQL Module to use Oracle specific functions as it is.
Previous Message Jordan Hurwich 2020-06-30 21:57:12 Issue Supporting Emojis in Full Text Search on Ubuntu