From: | Oleg Bartunov <obartunov(at)gmail(dot)com> |
---|---|
To: | Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)postgrespro(dot)ru> |
Subject: | Re: old bug in full text parser |
Date: | 2016-02-10 10:04:07 |
Message-ID: | CAF4Au4xrkE5yHbNDBg+0Cn0VLKm9c+SD13No0yUix483_F2bvw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Feb 10, 2016 at 12:28 PM, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
> It looks like there is a very old bug in the full text parser (somebody
> pointed it out to me), which appeared after tsearch2 was moved into core. The
> problem is in how the full text parser processes hyphenated words. Our original
> idea was to report the hyphenated word itself as well as its parts, and to
> ignore the hyphen. That is how tsearch2 worked.
>
> This behaviour changed after tsearch2 was moved into core:
> 1. The hyphen is now reported by the parser, which is useless.
> 2. Hyphenated words containing numbers ('4-dot', 'dot-4') are processed
> differently from ones made of plain words like 'four-dot': the hyphenated
> word itself is not reported.
>
> I think we should consider this a bug and produce a fix for all supported
> versions.
>
> After investigation we found this commit:
>
> commit 73e6f9d3b61995525785b2f4490b465fe860196b
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Sat Oct 27 19:03:45 2007 +0000
>
> Change text search parsing rules for hyphenated words so that digit strings
> containing decimal points aren't considered part of a hyphenated word.
> Sync the hyphenated-word lookahead states with the subsequent part-by-part
> reparsing states so that we don't get different answers about how much text
> is part of the hyphenated word. Per my gripe of a few days ago.
>
>
> 8.2.23
>
> select tok_type, description, token from ts_debug('dot-four');
>   tok_type   |          description          |  token
> -------------+-------------------------------+----------
>  lhword      | Latin hyphenated word         | dot-four
>  lpart_hword | Latin part of hyphenated word | dot
>  lpart_hword | Latin part of hyphenated word | four
> (3 rows)
>
> select tok_type, description, token from ts_debug('dot-4');
>   tok_type   |          description          | token
> -------------+-------------------------------+-------
>  hword       | Hyphenated word               | dot-4
>  lpart_hword | Latin part of hyphenated word | dot
>  uint        | Unsigned integer              | 4
> (3 rows)
>
> select tok_type, description, token from ts_debug('4-dot');
>  tok_type |   description    | token
> ----------+------------------+-------
>  uint     | Unsigned integer | 4
>  lword    | Latin word       | dot
> (2 rows)
>
> 8.3.23
>
> select alias, description, token from ts_debug('dot-four');
>       alias      |           description           |  token
> -----------------+---------------------------------+----------
>  asciihword      | Hyphenated word, all ASCII      | dot-four
>  hword_asciipart | Hyphenated word part, all ASCII | dot
>  blank           | Space symbols                   | -
>  hword_asciipart | Hyphenated word part, all ASCII | four
> (4 rows)
>
> select alias, description, token from ts_debug('dot-4');
>    alias   |   description   | token
> -----------+-----------------+-------
>  asciiword | Word, all ASCII | dot
>  int       | Signed integer  | -4
> (2 rows)
>
> select alias, description, token from ts_debug('4-dot');
>    alias   |   description    | token
> -----------+------------------+-------
>  uint      | Unsigned integer | 4
>  blank     | Space symbols    | -
>  asciiword | Word, all ASCII  | dot
> (3 rows)
>
>
Oh, one more bug, which existed even in tsearch2: for '4-dot', the hyphenated word itself is not reported (8.2 output):
select tok_type, description, token from ts_debug('4-dot');
 tok_type |   description    | token
----------+------------------+-------
 uint     | Unsigned integer | 4
 lword    | Latin word       | dot
(2 rows)
>
> Regards,
> Oleg
>
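The intended behaviour described in the message (report the hyphenated word itself, then each of its parts, and drop the hyphen) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the PostgreSQL parser: the function name `expected_tokens` and the token labels `hword`, `hword_part`, `uint` and `word` are hypothetical and chosen for readability.

```python
def expected_tokens(text):
    """Illustrative sketch only, not PostgreSQL's parser.

    For a hyphenated input, return the whole word first, then each
    part, and never emit the hyphen as a token of its own.
    The labels used here are hypothetical.
    """
    parts = [p for p in text.split("-") if p]
    if len(parts) < 2:
        # Not a hyphenated word: report it as a single plain token.
        return [("word", text)] if text else []

    tokens = [("hword", text)]  # the hyphenated word itself
    for part in parts:
        # Numeric parts are labelled separately, but the enclosing
        # hyphenated word is still reported either way.
        kind = "uint" if part.isdigit() else "hword_part"
        tokens.append((kind, part))
    return tokens


if __name__ == "__main__":
    for sample in ("dot-four", "dot-4", "4-dot"):
        print(sample, expected_tokens(sample))
```

Under this sketch all three inputs from the examples are treated uniformly: each yields one whole-word token followed by its two parts, which is the consistency the message argues for.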