Re: Issue Supporting Emojis in Full Text Search on Ubuntu

From: Jordan Hurwich <jhurwich(at)pulsasensors(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-novice(at)lists(dot)postgresql(dot)org, Gautam Bellary <gautam(at)pulsasensors(dot)com>
Subject: Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Date: 2020-07-01 21:56:14
Message-ID: CAJKqsjvUZUwyYT6438C8VjnWzDJ9EL7qM_-pV0nwdQpERgW1MA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Thank you very much Tom!

You were completely right to question our previous manipulation of the
pg_database entry, and after revisiting that we were able to address the
issue by updating datcollate and datctype to "C" - ts_debug() now
recognizes emoji characters as "word" characters and text search
functionality is behaving as expected.

Thank you once again for taking the time to understand and figure out our
issue,
Jordan

On Tue, Jun 30, 2020 at 5:27 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jordan Hurwich <jhurwich(at)pulsasensors(dot)com> writes:
> > We rely on the Postgres tsvector implementation to enable full text
> search
> > in our app, but we're having some issues with getting the parser to
> > recognize emoji characters (like "😀" <U+1F600>) as anything other than
> > "blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
> > are recognized as "word"/"Word, all letters" characters on Mac; and
> > non-english, non-emoji characters (like "我" <U+6211>) are recognized as
> > "word" characters on both Mac and Ubuntu.
>
> With a non-C lc_ctype setting, which characters are considered to be
> letters is entirely up to the locale code supplied by the platform's libc.
> I duplicate your result that U+1F600 is not considered a letter by glibc,
> at least not with lc_ctype = en_US.utf8. (Perhaps there are other locale
> settings that would consider it a letter? Not my expertise though.)
>
> However, with lc_ctype set to C, you should get a platform-independent
> result that any non-ASCII character is a letter.
>
> macOS' utf8-based locales are almost completely broken, unfortunately,
> so it's hard to predict what they will do except that it'll probably be
> wrong.
>
> > We have not modified the 'english' text search configuration on either
> > instance, however the query "SELECT * FROM ts_debug('english', '😀');"
> > returns different results on MacOS 10.15.5 and our Ubuntu instance:
>
> That's unsurprising in itself, per the above. There's no standardization
> worth mentioning in this area.
>
> > There are minor differences between MacOS and Ubuntu in pg_database as
> > follows, however modifications to set datcollate and datctype to 'C' on
> > Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
> > ts_debug().
>
> lc_ctype = C should have done what you want (at least for this specific
> symbol), so I wonder whether you did the test right. You gave few details
> about how you tested this.
>
> > db=# select datname, encoding, datcollate, datctype, datistemplate from
> > pg_database;
> > datname | encoding | datcollate | datctype | datistemplate
> > ----------------+----------+------------+----------+---------------
> > postgres | 6 | C | C | f
> > template0 | 6 | C | C | t
> > template1 | 6 | C | C | t
> > testdb01 | 6 | C | C | f
>
> Hm, how are you getting that result when your prompt says you're
> connected to database "db"?
>
> > The result of `$ locale` on both instances is similar, included below for
> > Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of
> items
> > are returned while only 4 entries are returned on Ubuntu, included below:
> > - on Ubuntu
> > $ locale -a
> > C
> > C.UTF-8
> > en_US.utf8
> > POSIX
>
> This just indicates that you didn't install the package(s) that provide
> alternative locales. On my RHEL box, "locale -a" reports 865 entries,
> and I'm pretty sure I don't have all the odder ones. I do not know,
> unfortunately, how Ubuntu/Debian divvy this stuff up into packages.
>
> Anyway, the bottom line is that if you need platform-independent results
> then setting lc_ctype to "C" is what to do. It will not be very bright
> about non-ASCII characters, but at least the stupidity will be uniform.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Bzzzz 2020-07-01 22:31:16 Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Previous Message Stephen Frost 2020-07-01 14:41:50 Re: PostgreSQL Module to use Oracle specific functions as it is.