From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | Thom Brown <thom(at)linux(dot)com> |
Cc: | PGSQL Mailing List <pgsql-general(at)postgresql(dot)org>, chris(at)chrullrich(dot)net |
Subject: | Re: Text search parser's treatment of URLs and emails |
Date: | 2011-02-01 20:14:30 |
Message-ID: | 201102012014.p11KEUi17488@momjian.us |
Lists: | pgsql-general |
I have added this as a TODO:
* Improve handling of plus signs in email address user
names, and perhaps improve URL parsing
* http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
---------------------------------------------------------------------------
Thom Brown wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
>   alias   |  description  |                                    token
> ----------+---------------+------------------------------------------------------------------------------
>  protocol | Protocol head | http://
>  url      | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host     | Host          | www.postgresql.org:2345
>  url_path | URL path      | /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path. Ideally, I'd sort of expect:
>
>     alias     |  description  |                                    token
> --------------+---------------+------------------------------------------------------------------------------
>  protocol     | Protocol head | http://
>  url          | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host         | Host          | www.postgresql.org
>  port         | Port          | 2345
>  url_path     | URL path      | /directory/page.html
>  query_string | Query string  | version=9.1&build=alpha1
>  fragment     | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't. But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
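
For comparison, the split proposed above matches the generic URL components that most URL libraries expose. The following is only a sketch in Python using the standard library's urllib.parse.urlsplit. It is not how the text search parser works, and the dictionary keys simply reuse the alias names suggested in the quoted message (port, query_string and fragment are hypothetical aliases, not existing token types).

```python
from urllib.parse import urlsplit

# URL taken from the ts_debug() example in the quoted message.
url = ("http://www.postgresql.org:2345/directory/page.html"
       "?version=9.1&build=alpha1#summary")

parts = urlsplit(url)

# Keys mirror the aliases proposed in the message; they are illustrative
# labels only, not token types the text search parser actually emits.
components = {
    "protocol": parts.scheme + "://",
    "host": parts.hostname,        # host name without the port
    "port": parts.port,            # integer, or None when absent
    "url_path": parts.path,        # path only, no query or fragment
    "query_string": parts.query,   # text after '?', before '#'
    "fragment": parts.fragment,    # text after '#'
}

for alias, token in components.items():
    print(f"{alias:<12} | {token}")
```

Here the host and port come out as separate values, and the path stops before the "?", which is the behaviour the proposed table describes.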
>
> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority(at)gmail(dot)com');
>
> Yields:
>
>    alias   |   description   |           token
> -----------+-----------------+---------------------------
>  asciiword | Word, all ASCII | myname
>  blank     | Space symbols   | +
>  email     | Email address   | priority(at)gmail(dot)com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon. Shouldn't such email address styles be taken
> into account? The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname(at)gmail(dot)com.
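
To make the plus-sign case concrete, here is a minimal Python sketch of how a "user+tag" address can be taken apart. It assumes the common sub-addressing convention (the part after "+" is a tag and the mailbox is the part before it), which individual mail providers may or may not follow. The function name split_plus_address is hypothetical, the address is written with a literal "@" and "." on the assumption that the "(at)" and "(dot)" in this archive are obfuscation, and this is not a full RFC 5322 address parser.

```python
def split_plus_address(address: str) -> tuple[str, str, str]:
    """Split 'user+tag@domain' into (user, tag, domain).

    The tag is an empty string when the local part has no '+'.
    Hypothetical helper for illustration; not a complete address validator.
    """
    # Split on the last '@' so the domain is everything after it.
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Split the local part on the first '+'.
    user, _, tag = local.partition("+")
    return user, tag, domain


user, tag, domain = split_plus_address("myname+priority@gmail.com")
print(f"full address : {user}+{tag}@{domain}")
print(f"base mailbox : {user}@{domain}")
```

The point of the sketch is that the whole string is one address whose local part is "myname+priority". Treating "+" as a separator and reporting only the text after it as the address loses the user name.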
>
> --
> Thom Brown
> Twitter: @darkixion
> IRC (freenode): dark_ixion
> Registered Linux user: #516935
>
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +