Issue Supporting Emojis in Full Text Search on Ubuntu

From: Jordan Hurwich <jhurwich(at)pulsasensors(dot)com>
To: pgsql-novice(at)lists(dot)postgresql(dot)org
Cc: Gautam Bellary <gautam(at)pulsasensors(dot)com>
Subject: Issue Supporting Emojis in Full Text Search on Ubuntu
Date: 2020-06-30 21:57:12
Message-ID: CAJKqsjs22F-sJVks96MO9PC9fZ-8FjK=GZOvQEvvoFVXCopbDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

We rely on the Postgres tsvector implementation to enable full text search
in our app, but we're having some issues with getting the parser to
recognize emoji characters (like "😀" <U+1F600>) as anything other than
"blank"/"Space symbols" on Ubuntu per ts_debug(). Notably the characters
are recognized as "word"/"Word, all letters" characters on Mac; and
non-english, non-emoji characters (like "我" <U+6211>) are recognized as
"word" characters on both Mac and Ubuntu.

We greatly appreciate your feedback, debug details below and happy to
provide more as requested,
Jordan
pulsasensors.com, jhurwich@

Platform:
- AWS Ubuntu 18.04.2 LTS vs MacOS 10.15.5
- postgres (PostgreSQL) 11.5

* ts_debug() differs on MacOS and Ubuntu *

We have not modified the 'english' text search configuration on either
instance, however the query "SELECT * FROM ts_debug('english', '😀');"
returns different results on MacOS 10.15.5 and our Ubuntu instance:
- on MacOS:
db=# select * from ts_debug('english', '😀');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | 😀 | {english_stem} | english_stem | {😀}

- on Ubuntu:
db=# SELECT * from ts_debug('english','😀');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
blank | Space symbols | 😀 | {} | |

Notably non-english, non-emoji characters like '我' behave as desired on
both instances, with the same result on both MacOS and Ubuntu for "SELECT *
FROM ts_debug('english', '我');":
db=# SELECT * FROM ts_debug('english', '我');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | 我 | {english_stem} | english_stem | {我}

* pg_database *

There are minor differences between MacOS and Ubuntu in pg_database as
follows, however modifications to set datcollate and datctype to 'C' on
Ubuntu or the more specific 'en_US.UTF-8' have not changed the result for
ts_debug(). See row for 'testdb01':
- on Mac:
db=# select datname, encoding, datcollate, datctype, datistemplate from
pg_database;
datname | encoding | datcollate | datctype | datistemplate
----------------+----------+------------+----------+---------------
postgres | 6 | C | C | f
template0 | 6 | C | C | t
template1 | 6 | C | C | t
testdb01 | 6 | C | C | f

- on Ubuntu:
db=# select datname, encoding, datcollate, datctype, datistemplate from
pg_database;
datname | encoding | datcollate | datctype | datistemplate
-----------+----------+-------------+-------------+---------------
postgres | 6 | C.UTF-8 | C.UTF-8 | f
template0 | 6 | C.UTF-8 | C.UTF-8 | t
template1 | 6 | en_US.UTF-8 | en_US.UTF-8 | t
testdb01 | 6 | en_US.UTF-8 | en_US.UTF-8 | f

* locale *

The result of `$ locale` on both instances is similar, included below for
Ubuntu. Though `$ locale -a` varies considerably, on MacOS dozens of items
are returned while only 4 entries are returned on Ubuntu, included below:

- on Ubuntu
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

- on Ubuntu
$ locale -a
C
C.UTF-8
en_US.utf8
POSIX

* Postgres installation *

On Mac, Postgres was installed and is managed by Homebrew via the
"postgresql(at)11" formula.
On Ubuntu, Postgres was installed from source at
https://ftp.postgresql.org/pub/source/v11.5/postgresql-11.5.tar.bz2.

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Tom Lane 2020-07-01 00:27:23 Re: Issue Supporting Emojis in Full Text Search on Ubuntu
Previous Message Laurenz Albe 2020-06-24 06:17:15 Re: manual for syntax for pg Query Tool