From: | Eric B(dot)Ridge <ebr(at)tcdi(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Postgres + Xapian (was Re: fulltext searching via a custom index type ) |
Date: | 2004-01-02 04:19:07 |
Message-ID: | CE6BCC23-3CDA-11D8-BF11-000A95D98B3E@tcdi.com |
Lists: | pgsql-general pgsql-hackers |
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr(at)tcdi(dot)com> writes:
>> Xapian has its own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.
I still think this is one of the best quotes I've heard in a while. :)
> It might be worth pointing out here that an index AM is not bound to
> use exactly the typical Postgres page layout.
Thanks again for this little bit of info. It was just enough to get me
thinking about how to make "it work".
Xapian is basically a big btree index, except it's 5 btree indexes.
One for terms, one for posts (terms with positions), one for positions,
one for arbitrary document values, and one for the documents
themselves. Each index is made up of 3 physical files on disk. All
told there are 17 files for a single Xapian index (15 db files, a
versioninfo file, and a lock file).
I couldn't think of a way to create a whole new database type for
Xapian that could deal with managing 5 btree indexes inside of Postgres
(other than using tables w/ standard postgres btree index on certain
fields), so instead, I dug into Xapian and abstracted out its
filesystem i/o (open, read, write, etc).
(as an aside, I did spend some time pondering ways to adapt Postgres'
nbtree AM to handle this, but I just don't understand how it works)
Once I had Xapian's filesystem i/o encapsulated into a nice little C++
class, I embarked on creating a mini "filesystem" on top of Postgres'
storage subsystem. In essence, I've now got a Postgres access method
that mirrors the basics of a filesystem, from creating/opening files to
reading from and writing to them, in addition to truncation and
deletion.
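To make the idea concrete, here is a minimal sketch of a "filesystem" layered over fixed-size pages. It is not the actual code: the class name `PagedFileStore`, the 8192-byte page size (chosen to match Postgres' default block size), and the in-memory `std::map` standing in for Postgres buffers are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: every name here is invented for illustration and is
// neither Xapian's nor Postgres' API. Pages live in memory instead of in
// Postgres buffers.
constexpr std::size_t kPageSize = 8192;  // assumed page size

class PagedFileStore {
public:
    // Create the file if it does not exist; opening an existing file is a no-op.
    void open(const std::string& name) { files_[name]; }

    bool exists(const std::string& name) const { return files_.count(name) != 0; }

    // Logical length in bytes (0 for an unknown file).
    std::size_t size(const std::string& name) const {
        auto it = files_.find(name);
        return it == files_.end() ? 0 : it->second.length;
    }

    // Write len bytes at offset, allocating zero-filled pages as needed.
    void write(const std::string& name, std::size_t offset,
               const void* buf, std::size_t len) {
        File& f = files_[name];
        const char* src = static_cast<const char*>(buf);
        const std::size_t end = offset + len;
        const std::size_t pages_needed = (end + kPageSize - 1) / kPageSize;
        if (f.pages.size() < pages_needed)
            f.pages.resize(pages_needed, Page(kPageSize, 0));
        for (std::size_t done = 0; done < len;) {
            const std::size_t pos = offset + done;
            const std::size_t in_page = pos % kPageSize;
            const std::size_t chunk = std::min(len - done, kPageSize - in_page);
            std::memcpy(&f.pages[pos / kPageSize][in_page], src + done, chunk);
            done += chunk;
        }
        f.length = std::max(f.length, end);
    }

    // Read up to len bytes at offset; returns the byte count (short at EOF).
    std::size_t read(const std::string& name, std::size_t offset,
                     void* buf, std::size_t len) const {
        auto it = files_.find(name);
        if (it == files_.end() || offset >= it->second.length) return 0;
        const File& f = it->second;
        len = std::min(len, f.length - offset);
        char* dst = static_cast<char*>(buf);
        for (std::size_t done = 0; done < len;) {
            const std::size_t pos = offset + done;
            const std::size_t in_page = pos % kPageSize;
            const std::size_t chunk = std::min(len - done, kPageSize - in_page);
            std::memcpy(dst + done, &f.pages[pos / kPageSize][in_page], chunk);
            done += chunk;
        }
        return len;
    }

    // Set the logical length, dropping or adding whole pages as required.
    void truncate(const std::string& name, std::size_t new_length) {
        auto it = files_.find(name);
        if (it == files_.end()) return;
        File& f = it->second;
        f.pages.resize((new_length + kPageSize - 1) / kPageSize, Page(kPageSize, 0));
        const std::size_t tail = new_length % kPageSize;
        // When shrinking, zero the rest of the last page so a later
        // extension reads back zeros rather than stale bytes.
        if (new_length < f.length && tail != 0)
            std::memset(&f.pages.back()[tail], 0, kPageSize - tail);
        f.length = new_length;
    }

    void remove(const std::string& name) { files_.erase(name); }

private:
    using Page = std::vector<char>;
    struct File {
        std::vector<Page> pages;
        std::size_t length = 0;
    };
    std::map<std::string, File> files_;
};
```

A caller would use it much like ordinary file i/o: `open` a name, `write` and `read` at byte offsets that may span page boundaries, then `truncate` or `remove`. In a real access method each page access would go through the Postgres buffer manager rather than a vector.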
After that, it was just a matter of the glue code to teach Xapian to
use this "filesystem" for all its filesystem i/o, and voila!, Xapian
works on top of Postgres' storage subsystem and I didn't have to rewrite
Xapian from scratch. And surprisingly, despite the additional overhead
of this filesystem abstraction layer, it's still very fast... esp. once
Buffers get cached.
I've still got more work to do (like dealing with locking and general
concurrency issues, not to mention bugs I haven't found yet), but it's
working *really* well in a single-user environment.
So here's the important question: How stupid is this?
I've done some benchmarking against tsearch2. Attached are the queries
and execution times on my dual 2 GHz G5 w/ 2 GB RAM.
The table contains 51,160 records. It's every text file contained on
my computer (which includes multiple copies of all my java projects).
All told, it's 337,343,569 bytes of data, with an average file size of
6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is
"@@". I ran each query 6 times, and just took the best execution time.
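For anyone wanting to repeat the "run it 6 times, keep the best" measurement, a sketch of that procedure might look like the following. The helper name `best_of` and the callable standing in for a query are hypothetical; the actual timings in the attachment were not necessarily collected this way.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Hypothetical helper: run `query` `runs` times and return the fastest
// wall-clock time in milliseconds. `query` stands in for whatever executes
// the SQL statement being measured.
double best_of(int runs, const std::function<void()>& query) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        query();
        const std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Taking the minimum rather than the mean favors the warm-cache case, which fits the observation that things get fast once buffers are cached.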
It's also worth noting that my document parser is much different than
tsearch2's. I'm splitting words on non-alphanumerics (and currently am
not using stopwords), and it seems that tsearch2 tries to do something
more intelligent, so the # of results returned varies widely between
tsearch2 and Xapian. I'm not offering an opinion on which way is
"better".
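As an illustration of the splitting rule described above (break on anything non-alphanumeric, no stopword filtering), here is a small sketch. The function name `split_terms` is made up, and details such as case handling are assumptions; the real parser may differ.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch of the splitting rule: every maximal run of
// alphanumeric characters becomes a term. No stopwords are removed and
// case is left untouched (an assumption, not stated in the post).
std::vector<std::string> split_terms(const std::string& text) {
    std::vector<std::string> terms;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) terms.push_back(current);
    return terms;
}
```

Under this rule an identifier like `foo.bar(baz_1)` yields four terms (`foo`, `bar`, `baz`, `1`), which is one reason result counts can differ so much from a parser that treats such tokens differently.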
I've got a few more questions about transactions, locking, and a few
other things, but I just thought I'd throw this out as a status report
and to see if there's any kind of reaction.
thanks for your time.
eric
Attachment | Content-Type | Size |
---|---|---|
query_timings.txt | text/plain | 1.6 KB |