Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores

From:	"Dan O'Hara" <danarasoftware(at)gmail(dot)com>
To:	Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
Date:	2009-10-22 19:54:56
Message-ID:	557802370910221254k624306eg81ae6176eb3bd9d4@mail.gmail.com
Lists:	pgsql-bugs pgsql-hackers

I agree that it isn't easy to determine if given text is a valid email
address. As I couldn't use ts_parse, I ended up using a regex, which
worked substantially better at pulling out the emails from the text
stream. I haven't looked at the code, but perhaps it is possible to
do the same thing here? Even a regex that is 99% correct would be
better than the current tokenizer, which is right only about 80-85% of
the time.

My workaround looked something like this:

select regexp_matches(resumetext,E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}','gi')
as email
from "Resume";

cheers
Dan
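
A minimal, self-contained sketch of the regex workaround above, for trying it without the "Resume" table. The inline sample rows and the column name txt are made up for illustration; the pattern is the one from the query above, written with a literal dot and at sign. Since the pattern has no capturing groups, regexp_matches returns a one-element array per match, so [1] pulls out the address as plain text.

```sql
-- Hypothetical sample data standing in for "Resume".resumetext.
SELECT s.txt,
       (regexp_matches(s.txt,
                       E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}',
                       'gi'))[1] AS email   -- g: every match, i: case-insensitive
FROM (VALUES ('reach me at foo.bar@baz.com'),
             ('reach me at foo+bar@baz.com'),
             ('reach me at foo_bar@baz.com or Jane_Doe@Example.ORG')
     ) AS s(txt);
```

Each address, including the ones with + and _, should come back whole, one row per match. Note that the {2,4} bound will not match longer top-level domains such as .museum in full, so it may need widening depending on the data.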

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira
<euler(at)timbira(dot)com> wrote:
> Robert Haas escreveu:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. The tsearch claims to identify types of tokens but it doesn't
> correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
> to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
> but the function fails to report that.
>
> It is not that simple to identify an e-mail address that conforms to the RFC. As
> that code is a state machine, IMHO it decides too early (when it finds _) that
> that string is not an e-mail address. AFAIR, that's not a one-line fix.
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo.bar@baz.com');
> email
> ─────────────────
> foo.bar@baz.com
> (1 row)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo+bar@baz.com');
> email
> ─────────────
> foo
> +
> bar@baz.com
> (3 rows)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo_bar@baz.com');
> email
> ─────────────
> foo
> bar@baz.com
> _
> (3 rows)
>
>
> --
> Euler Taveira de Oliveira
> http://www.timbira.com/
>

--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware(at)gmail(dot)com
613 288-8733
