TODO item: Implement Boyer-Moore searching (First time hacker)

From: "David Rowley" <dgrowley(at)gmail(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: TODO item: Implement Boyer-Moore searching (First time hacker)
Date: 2008-08-31 00:18:26
Message-ID: 4563A3CA8A9D4B7FB9BADF1A7F92F4B3@amd64
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

My apologies if this is a duplicate. I think the attachment size was too
big.

Reference: Bruce Momjian writes: ->
http://archives.postgresql.org/pgsql-committers/2007-09/msg00402.php

Other references: Boyer Moore?? ->
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-exa
mple.html

As a first time hacker of postgresql I've decided to attempt something that
does not go too deep in to the heart of the software.

Before I make an attempt at writing a patch, I've decided to start
benchmarking outside of Postgres. On reading the link underneath the Boyer
Moore item in the in the new TODO list wiki, I learned that someone else had
a shot at speeding up strpos around this time last year using KMP algotithm.
On looking a little deeper into Boyer Moore's algorithm it's quite obvious
that there is some overhead involved with preprocessing the search key
(needle). My target has been to not have any trade offs with this possibly
to be patch. I didn't want smaller searches to suffer.

I'd like to receive some feedback on the various functions in the attached
before I go ahead and work on this any more. I've written 8 different
versions of the function. Version 8 seems to be the fastest out of them,
however I've not tested with multi byte chars.

Please note that these are just some initial ideas. For example startpos is
not yet implemented, but there is very little overhead to that anyway.

For comparision sake I mocked up the current postgresql 8.3's strpos() from
varlena.c. I've no idea how accuratly this will be to the real version.
Probably find out once I write the patch. I replaced strncmp with my own
version as I was having problems with my compilers optimizer, it optimized
too much and I was unable to benchmark. I felt running without and
optimization was unfair as strncmp is. Perhaps if someone else wants to
benchmark with -O2 on gcc they should first replace pg_strncmp with strncmp.

Currently I have a single file which can be compiled itself that contains
benchmarks for the functions I've tested so far. All of these implement a
stripped out version of the Boyer Moore algotithm. I've done away with one
of the character maps in a bid to reduce the overhead involved creating
them.

The position calculation will likely need changed for multi byte characters.
I'm suspecting I'm not meant to be doing length calculations on pointers.
Can someone confirm?

To keep this email short I've put most of the information required in as
comments into the source of the program.

I've uploaded some benchmark results using the attached program. The
benchmarking was done on windows.

Please see http://www.unixbeast.com/~fat/test1.html there are 10 tests in
total, in each I attempt to exploit weaknesses, some with my functions and
some with the existing function.

My method for version 8 of the function probably looks quite unusual as it
does some simple cost based optimisation

I look forward to receiving feedback on this.

David.

Attachment Content-Type Size
BMStrpos.tar.gz application/x-gzip 6.0 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Brendan Jurd 2008-08-31 00:50:16 Re: Attaching error cursor position to invalid constant values
Previous Message Stephen R. van den Berg 2008-08-30 23:29:59 Re: Attaching error cursor position to invalid constant values