Re: BUG #18149: Incorrect lexeme for english token "proxy"

From: Patrick Peralta <pperalta(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date: 2023-10-07 18:22:58
Message-ID: CADV9oGz+JRYg2n1X2Jzknr-2YSAjcGT9FJPY0wiMNBnV+ZJUgw@mail.gmail.com
Lists: pgsql-bugs

>
> So the difficulty is that (a) the default TS parser doesn't break down
> this multiply-hyphenated word quite the way you'd hoped
>

I see, thanks for pointing this out. Now I understand that this isn't a
lexeme issue; rather, it's a side effect of how the parser handles hyphens.

I tried this out with underscores, which produces a better result:

# select * from ts_debug('english', 'CLOUD_PROXY_SEP19_T1_254__1695167380256');
   alias   |       description        |     token     |  dictionaries  |  dictionary  |     lexemes
-----------+--------------------------+---------------+----------------+--------------+-----------------
 asciiword | Word, all ASCII          | CLOUD         | {english_stem} | english_stem | {cloud}
 blank     | Space symbols            | _             | {}             |              |
 asciiword | Word, all ASCII          | PROXY         | {english_stem} | english_stem | {proxi}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | SEP19         | {simple}       | simple       | {sep19}
 blank     | Space symbols            | _             | {}             |              |
 numword   | Word, letters and digits | T1            | {simple}       | simple       | {t1}
 blank     | Space symbols            | _             | {}             |              |
 uint      | Unsigned integer         | 254           | {simple}       | simple       | {254}
 blank     | Space symbols            | __            | {}             |              |
 uint      | Unsigned integer         | 1695167380256 | {simple}       | simple       | {1695167380256}
(11 rows)

# SELECT to_tsvector('english', 'CLOUD_PROXY__SEP19_T1-254__1695167380256')
@@ to_tsquery('english','cloud_proxy:*');
?column?
----------
t

>
> If that data format is being imposed on you
>

These are freeform fields, so we are not imposing a data format. However, I
wonder if we can do some data "massaging" to deal with hyphens. Writing our
own parser is definitely overkill for this edge case.
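
For example, something along these lines might work (just a sketch; the
"events" table and "description" column are hypothetical stand-ins for our
actual schema), normalizing hyphens to underscores so the default parser
sees separate words instead of one long hyphenated token:

-- Hypothetical table standing in for our freeform field
CREATE TABLE events (id bigint PRIMARY KEY, description text);

-- Massage hyphens into underscores before building the tsvector
CREATE INDEX events_description_fts_idx
    ON events
    USING gin (to_tsvector('english', replace(description, '-', '_')));

-- Query with the same expression so the index can be used
SELECT id
  FROM events
 WHERE to_tsvector('english', replace(description, '-', '_'))
       @@ to_tsquery('english', 'cloud & proxy:*');

That way the massaging happens once at index time, the query-side expression
matches the index expression, and we avoid any custom parser work.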

Really appreciate this insight, it was very helpful! We can close this
ticket.

Thanks,
Patrick

On Sat, Oct 7, 2023 at 12:37 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Patrick Peralta <pperalta(at)gmail(dot)com> writes:
> > However I ran into an anomaly with this query:
>
> > # SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
> >   @@ to_tsquery('english','cloud-proxy:*');
> > ?column?
> > ----------
> > f
> > (1 row)
>
> Hmm. Investigating that a bit:
>
> regression=# select * from ts_debug('english', 'cloud-proxy');
>       alias      |           description           |    token    |  dictionaries  |  dictionary  |    lexemes
> -----------------+---------------------------------+-------------+----------------+--------------+---------------
>  asciihword      | Hyphenated word, all ASCII      | cloud-proxy | {english_stem} | english_stem | {cloud-proxi}
>  hword_asciipart | Hyphenated word part, all ASCII | cloud       | {english_stem} | english_stem | {cloud}
>  blank           | Space symbols                   | -           | {}             |              |
>  hword_asciipart | Hyphenated word part, all ASCII | proxy       | {english_stem} | english_stem | {proxi}
> (4 rows)
>
> regression=# select * from ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
>       alias      |               description                |        token         |  dictionaries  |  dictionary  |        lexemes
> -----------------+------------------------------------------+----------------------+----------------+--------------+------------------------
>  numhword        | Hyphenated word, letters and digits      | CLOUD-PROXY-SEP19-T1 | {simple}       | simple       | {cloud-proxy-sep19-t1}
>  hword_asciipart | Hyphenated word part, all ASCII          | CLOUD                | {english_stem} | english_stem | {cloud}
>  blank           | Space symbols                            | -                    | {}             |              |
>  hword_asciipart | Hyphenated word part, all ASCII          | PROXY                | {english_stem} | english_stem | {proxi}
>  blank           | Space symbols                            | -                    | {}             |              |
>  hword_numpart   | Hyphenated word part, letters and digits | SEP19                | {simple}       | simple       | {sep19}
>  blank           | Space symbols                            | -                    | {}             |              |
>  hword_numpart   | Hyphenated word part, letters and digits | T1                   | {simple}       | simple       | {t1}
>  blank           | Space symbols                            | -                    | {}             |              |
>  uint            | Unsigned integer                         | 254                  | {simple}       | simple       | {254}
>  blank           | Space symbols                            | -                    | {}             |              |
>  int             | Signed integer                           | -1695167380256       | {simple}       | simple       | {-1695167380256}
> (12 rows)
>
> So the difficulty is that (a) the default TS parser doesn't break down
> this multiply-hyphenated word quite the way you'd hoped, and (b) fragments
> classified as numhword aren't passed through the english_stem dictionary
> at all. Also, (c) I'm doubtful that the snowball stemmer would have
> converted cloud-proxy-sep19-t1 to cloud-proxi-sep19-t1; but it didn't get
> the chance anyway.
>
> While (b) would be easy to address with a custom TS configuration,
> (a) and (c) can't be fixed without getting your hands dirty in
> C code. Is there any chance of adjusting the notation you're dealing
> with here? I get sane-looking results from, for example,
>
> regression=# select to_tsvector('english', 'CLOUD-PROXY--SEP19-T1-254--1695167380256');
>                                          to_tsvector
> ----------------------------------------------------------------------------------------------
>  '-1695167380256':8 '254':7 'cloud':2 'cloud-proxi':1 'proxi':3 'sep19':5 'sep19-t1':4 't1':6
> (1 row)
>
> If that data format is being imposed on you then I'm not seeing a good
> solution without custom C code. I'd be inclined to try to make the
> parser generate all of "cloud-proxy-sep19-t1", "cloud-proxy-sep19",
> "cloud-proxy" from this input, but a custom TS parser is kind of a
> high bar to clear.
>
> regards, tom lane
>
