From: | "Phil Endecott" <spam_from_pgsql_lists(at)chezphil(dot)org> |
---|---|
To: | <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Text-indexing UTF-8 bytea, convert_from() immutability, null bytes... |
Date: | 2018-10-06 21:56:15 |
Message-ID: | 1538862975270@dmwebmail.dmwebmail.chezphil.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Dear Experts,
I have a table that contains bytea data which sometimes is UTF-8 text.
When it is - and that is indicated by another column - I want to
text-search index it. Something like this:
db=> create index ix on tbl using gin (to_tsvector('english',body)) where is_utf8_text;
(Note my client encoding is UTF-8.)
That doesn't work because to_tsvector doesn't take bytea. So I tried
to_tsvector('english',body::text) ; that almost works, but CRLFs in the
data become \015\012 (i.e. 6 characters) in the text and the tsvector
contains things like '12hello'.
Next I tried to_tsvector('english',convert_from(body::text,'UTF-8')). That
doesn't work because convert_from is not immutable. (This is 9.6; maybe that
has changed.) Is there a good reason for that? Maybe because I might change
the client encoding? As a hack I tried ALTER FUNCTION to make it immutable,
and now I get:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean? But
actually I'd be more than happy to ignore invalid UTF-8 here, since I'm
only using it for text search; there may be some truly invalid UTF-8
in the data. Is there a "permissive" mode for charset conversion?
(That error also suggests that the convert_from is not optimising the
conversion from UTF-8 to UTF-8 to a no-op.)
Anyway: given the problem of creating a text search index over bytea data
that contains UTF-8 text, which may include oddities like null bytes, what
would you do?
Thanks for any suggestions!
Phil.
From | Date | Subject | |
---|---|---|---|
Next Message | Paul McGarry | 2018-10-07 02:20:53 | Re: Why the index is not used ? |
Previous Message | Vladimir Ryabtsev | 2018-10-06 16:51:24 | Re: Why the index is not used ? |