Re: BUG #18149: Incorrect lexeme for english token "proxy"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Patrick Peralta <pperalta(at)gmail(dot)com>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date: 2023-10-07 16:37:37
Message-ID: 3149021.1696696657@sss.pgh.pa.us
Lists: pgsql-bugs

Patrick Peralta <pperalta(at)gmail(dot)com> writes:
> However I ran into an anomaly with this query:

> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
> ?column?
> ----------
> f
> (1 row)

Hmm. Investigating that a bit:

regression=# select * from ts_debug('english', 'cloud-proxy');
      alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
 asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
 hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                   | -           | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
(4 rows)

regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
      alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
 numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
 hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
 blank           | Space symbols                            | -                    | {}             |              |
 hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
 blank           | Space symbols                            | -                    | {}             |              |
 uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
 blank           | Space symbols                            | -                    | {}             |              |
 int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
(12 rows)

So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all. Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.

While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code. Is there any chance of adjusting the notation you're dealing
with here? I get sane-looking results from, for example,

regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
to_tsvector
----------------------------------------------------------------------------------------------
'-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)
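
For point (b), a custom configuration could look roughly like the sketch
below. The configuration name english_numstem is made up for illustration;
the token types and the english_stem dictionary are the ones shown in the
ts_debug output above. Note that this only changes which dictionary sees
numhword and hword_numpart tokens; it does nothing about (a) or (c), so by
itself it is not expected to make the original prefix query match.

```sql
-- Hypothetical configuration name; starts as a copy of the built-in
-- english configuration.
CREATE TEXT SEARCH CONFIGURATION english_numstem (COPY = pg_catalog.english);

-- Send letters-and-digits hyphenated words and their parts through
-- english_stem instead of the simple dictionary.
ALTER TEXT SEARCH CONFIGURATION english_numstem
    ALTER MAPPING FOR numhword, hword_numpart
    WITH english_stem;

-- Check how the problem string is handled under the new configuration.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english_numstem', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
```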

If that data format is being imposed on you then I'm not seeing a good
solution without custom C code. I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.
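
To make the wished-for lexeme set concrete, here is a plain SQL sketch that
lists the hyphen-delimited prefixes of the numhword token. It is only an
illustration of the desired output, not a parser change, and it does not
apply any stemming.

```sql
-- List the hyphen-delimited prefixes of the token, longest first,
-- stopping at the two-part prefix.
SELECT array_to_string(s.parts[1:n], '-') AS prefix
FROM (SELECT string_to_array('cloud-proxy-sep19-t1', '-') AS parts) AS s,
     generate_series(array_length(s.parts, 1), 2, -1) AS n;
```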

regards, tom lane
