From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Patrick Peralta <pperalta(at)gmail(dot)com> |
Cc: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18149: Incorrect lexeme for english token "proxy" |
Date: | 2023-10-07 16:37:37 |
Message-ID: | 3149021.1696696657@sss.pgh.pa.us |
Lists: | pgsql-bugs |
Patrick Peralta <pperalta(at)gmail(dot)com> writes:
> However I ran into an anomaly with this query:
> # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> @@ to_tsquery('english','cloud-proxy:*');
> ?column?
> ----------
> f
> (1 row)
Hmm. Investigating that a bit:
regression=# select * from ts_debug('english', 'cloud-proxy');
alias | description | token | dictionaries | dictionary | lexemes
-----------------+---------------------------------+-------------+----------------+--------------+---------------
asciihword | Hyphenated word, all ASCII | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
hword_asciipart | Hyphenated word part, all ASCII | cloud | {english_stem} | english_stem | {cloud}
blank | Space symbols | - | {} | |
hword_asciipart | Hyphenated word part, all ASCII | proxy | {english_stem} | english_stem | {proxi}
(4 rows)
regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
alias | description | token | dictionaries | dictionary | lexemes
-----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
numhword | Hyphenated word, letters and digits | CLOUD-PROXY-SEP19-T1 | {simple} | simple | {cloud-proxy-sep19-t1}
hword_asciipart | Hyphenated word part, all ASCII | CLOUD | {english_stem} | english_stem | {cloud}
blank | Space symbols | - | {} | |
hword_asciipart | Hyphenated word part, all ASCII | PROXY | {english_stem} | english_stem | {proxi}
blank | Space symbols | - | {} | |
hword_numpart | Hyphenated word part, letters and digits | SEP19 | {simple} | simple | {sep19}
blank | Space symbols | - | {} | |
hword_numpart | Hyphenated word part, letters and digits | T1 | {simple} | simple | {t1}
blank | Space symbols | - | {} | |
uint | Unsigned integer | 254 | {simple} | simple | {254}
blank | Space symbols | - | {} | |
int | Signed integer | -1695167380256 | {simple} | simple | {-1695167380256}
(12 rows)
So the difficulty is that (a) the default TS parser doesn't break down
this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
classified as numhword aren't passed through the english_stem dictionary
at all. Also, (c) I'm doubtful that the snowball stemmer would have
converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
the chance anyway.
While (b) would be easy to address with a custom TS configuration,
(a) and (c) can't be fixed without getting your hands dirty in
C code. Is there any chance of adjusting the notation you're dealing
with here? I get sane-looking results from, for example,
regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
to_tsvector
----------------------------------------------------------------------------------------------
'-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
(1 row)
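
As a sketch of the custom TS configuration mentioned for point (b), something along the following lines would route the letters-and-digits token types through english_stem instead of simple. The configuration name english_numstem is made up for this example. It only changes which dictionary sees those tokens, so it does nothing about (a) or (c) and is not expected to make the original query match by itself.

```sql
-- Hypothetical configuration name; start from a copy of the built-in english config.
CREATE TEXT SEARCH CONFIGURATION english_numstem (COPY = pg_catalog.english);

-- Send hyphenated letters-and-digits words, and their parts, to english_stem
-- rather than the simple dictionary they map to by default.
ALTER TEXT SEARCH CONFIGURATION english_numstem
    ALTER MAPPING FOR numhword, hword_numpart WITH english_stem;

-- Inspect how the problem string is now tokenized and which dictionary handles each token.
SELECT * FROM ts_debug('english_numstem', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
```

Comparing the ts_debug output with the default english configuration shows whether the stemmer changes anything for the numhword token.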
If that data format is being imposed on you then I'm not seeing a good
solution without custom C code. I'd be inclined to try to make the
parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
"cloud-proxy" from this input, but a custom TS parser is kind of a
high bar to clear.
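
As a rough illustration of the prefixes meant here, the query below lists them at the SQL level for the numhword token from the ts_debug output above. It is not part of the original message and is no substitute for a parser change. It only enumerates the strings such a parser would need to emit, assuming the token is split on single hyphens.

```sql
-- Split the lower-cased token on '-' and rebuild every prefix of two or more parts,
-- longest first.
SELECT array_to_string(s.parts[1:n], '-') AS prefix
FROM (SELECT string_to_array(lower('CLOUD-PROXY-SEP19-T1'), '-') AS parts) AS s,
     generate_series(2, cardinality(s.parts)) AS n
ORDER BY n DESC;
```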
regards, tom lane