From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | teodor(at)sigaev(dot)ru, pgsql-hackers(at)postgresql(dot)org |
Subject: | [PATCH] tsearch parser inefficiency if text includes urls or emails |
Date: | 2009-11-01 15:19:43 |
Message-ID: | 200911011619.44683.andres@anarazel.de |
Lists: | pgsql-hackers |
Hi,
While playing around with/evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is due
to the use of a separate TParser in p_ishost/p_isURLPath in wparser_def.c.
If a multibyte encoding is in use, TParserInit copies the whole remaining input
and converts it to wchar_t or pg_wchar - for every email or protocol-prefixed
URL in the document, which obviously is bad.
I solved the issue by adding a separate TParserCopyInit/TParserCopyClose which
reuses the already converted strings of the original TParser - only at
different offsets.
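To illustrate the idea, here is a rough, self-contained sketch. It is not the
attached patch: SketchParser and the sketch_* functions are hypothetical names,
and the byte-by-byte widening only stands in for the real multibyte conversion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical, heavily simplified stand-in for the parser state. */
typedef struct SketchParser
{
    const char *str;     /* input in the server encoding */
    size_t      lenstr;  /* length of str in bytes */
    wchar_t    *wstr;    /* wide-character version of the input */
    size_t      lenwstr; /* length of wstr in characters */
    size_t      posbyte; /* current position in str */
    size_t      poschar; /* current position in wstr */
} SketchParser;

/* Full init: allocates and converts the whole input, cost O(len). */
SketchParser *
sketch_init(const char *str, size_t len)
{
    SketchParser *prs = calloc(1, sizeof(SketchParser));
    size_t      i;

    if (prs == NULL)
        return NULL;
    prs->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (prs->wstr == NULL)
    {
        free(prs);
        return NULL;
    }
    /* Placeholder conversion (assumption): one byte becomes one character. */
    for (i = 0; i < len; i++)
        prs->wstr[i] = (wchar_t) (unsigned char) str[i];
    prs->wstr[len] = L'\0';
    prs->str = str;
    prs->lenstr = len;
    prs->lenwstr = len;
    return prs;
}

/* Copy init: sub-parser that shares the parent's buffers, cost O(1). */
SketchParser *
sketch_copy_init(const SketchParser *orig)
{
    SketchParser *prs;

    assert(orig->posbyte <= orig->lenstr && orig->poschar <= orig->lenwstr);
    prs = calloc(1, sizeof(SketchParser));
    if (prs == NULL)
        return NULL;
    /* Same memory as the parent, only starting at its current offsets. */
    prs->str = orig->str + orig->posbyte;
    prs->lenstr = orig->lenstr - orig->posbyte;
    prs->wstr = orig->wstr + orig->poschar;
    prs->lenwstr = orig->lenwstr - orig->poschar;
    return prs;
}

/* Close a copy: the buffers belong to the parent, so free only the struct. */
void
sketch_copy_close(SketchParser *prs)
{
    free(prs);
}

/* Close a full parser: it owns the converted string. */
void
sketch_close(SketchParser *prs)
{
    free(prs->wstr);
    free(prs);
}
```

With something along these lines, each nested host/URL-path check costs a small
constant amount of work instead of another conversion of the remaining input.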
Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so I
wanted to get some feedback first.
Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 5835.676 ms
With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';
─────────────────────────────────────────────────────────────────────────────────────────────────────
...
(1 row)
Time: 395.341 ms
I'll clean up the patch if it seems like a sensible solution...
Is this backpatch-worthy?
Andres
PS: I left the additional define in for the moment so that it's easier to see the
performance differences.
Attachment | Content-Type | Size |
---|---|---|
reuse-strings-in-tparser-recursion.patch | text/x-patch | 3.2 KB |