From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | valgog(at)gmail(dot)com |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #6375: tsearch does not recognize all valid emails |
Date: | 2012-02-07 17:41:38 |
Message-ID: | 20120207174138.GL19450@momjian.us |
Lists: | pgsql-bugs |
On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog(at)gmail(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 6375
> Logged by: Valentine Gogichashvili
> Email address: valgog(at)gmail(dot)com
> PostgreSQL version: 9.1.1
> Operating system: Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'normal(at)email(dot)com' );
> to_tsquery
> ────────────────────
> 'normal(at)email(dot)com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-normal(at)email(dot)com' );
> to_tsquery
> ──────────────────────────
> 'still-normal(at)email(dot)com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-dash-(at)email(dot)com'
> );
> to_tsquery
>
> ───────────────────────────────────────────────────────────────────────────────
> 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'
Yep. :-(
You can see the oddness here:
test=> SELECT alias, description, token FROM ts_debug('-myname(at)gmail(dot)com');
 alias |  description  |          token
-------+---------------+-------------------------
 blank | Space symbols | -
 email | Email address | myname(at)gmail(dot)com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me(at)gmail(dot)com');
 alias |  description  |          token
-------+---------------+--------------------------
 blank | Space symbols | -
 email | Email address | myna-me(at)gmail(dot)com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me-(at)gmail(dot)com');
      alias      |           description           |   token
-----------------+---------------------------------+-----------
 blank           | Space symbols                   | -
 asciihword      | Hyphenated word, all ASCII      | myna-me
 hword_asciipart | Hyphenated word part, all ASCII | myna
 blank           | Space symbols                   | -
 hword_asciipart | Hyphenated word part, all ASCII | me
 blank           | Space symbols                   | -@
 host            | Host                            | gmail.com
(7 rows)

The first and second show that the leading dash is split off as a
separate token. The third one shows that a trailing dash causes the
middle dash to be split off as well, so the address is no longer
recognized as an email at all.
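
To check a batch of addresses the same way, something like the following should work. It is only a sketch: the example.com addresses are placeholders and not from the bug report, and the function call in FROM that refers to t.addr relies on implicit LATERAL, so it assumes PostgreSQL 9.3 or later.

```sql
-- Sketch: summarize how the parser tokenizes each sample address.
-- The addresses below are hypothetical placeholders.
SELECT t.addr,
       count(*) AS total_tokens,
       sum(CASE WHEN d.alias = 'email' THEN 1 ELSE 0 END) AS email_tokens,
       sum(CASE WHEN d.alias = 'blank' THEN 1 ELSE 0 END) AS blank_tokens
FROM (VALUES ('myname@example.com'),
             ('-myna-me@example.com'),
             ('-myna-me-@example.com')) AS t(addr),
     ts_debug('simple', t.addr) AS d  -- implicit LATERAL, needs 9.3+
GROUP BY t.addr
ORDER BY t.addr;
```

An address that is recognized whole should report email_tokens = 1. One that is split up like the third ts_debug example above should report 0 there, with the pieces showing up as word and host tokens instead.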
This email thread from 2010 has a similar problem:
http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
What is holding up a fix is that it would change existing behavior and
would break existing indexes that are carried over by pg_upgrade.
I have added your email to the existing TODO item:
http://wiki.postgresql.org/wiki/Todo#Text_Search
    Improve handling of dash and plus signs in email address user names, and
    perhaps improve URL parsing
        http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
        tsearch does not recognize all valid emails
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +