From: | Dann Corbit <DCorbit(at)connx(dot)com> |
---|---|
To: | 'Sergey Konoplev' <gray(dot)ru(at)gmail(dot)com>, Janek Sendrowski <janek12(at)web(dot)de> |
Cc: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Fastest Index/Algorithm to find similar sentences |
Date: | 2013-07-30 05:15:06 |
Message-ID: | 87F42982BF2B434F831FCEF4C45FC33E64F16BD6@EXCHANGE.corporate.connx.com |
Lists: | pgsql-general |
I worked on a library project once that needed to perform similarity searches.
The first step was to build a word dictionary that assigned a number to each word:
1, 'aardvark'
...
99999, 'zygote'
Then you need a list of stop words like 'AND', 'THE':
https://en.wikipedia.org/wiki/Stop_words
Then you write a sentence parser that turns each word into its number.
So now a bibliography entry (for example) becomes a vector of numbers.
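The steps so far (dictionary, stop words, parser) can be sketched roughly as follows. This is only an illustration: the names `WORD_IDS`, `STOP_WORDS`, and `sentence_to_vector` are hypothetical, the tiny dictionary is made up, and dropping unknown words is an assumption rather than what the original project did.

```python
# Hypothetical word dictionary: each known word maps to a number.
WORD_IDS = {"aardvark": 1, "eats": 2, "ants": 3, "zygote": 99999}

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "a", "of"}


def sentence_to_vector(sentence, word_ids=WORD_IDS, stop_words=STOP_WORDS):
    """Turn a sentence into a list of word numbers, skipping stop words.

    Unknown words are dropped here (an assumption); assigning them new
    numbers would be another reasonable choice.
    """
    vector = []
    for word in sentence.lower().split():
        if word in stop_words:
            continue
        word_id = word_ids.get(word)
        if word_id is not None:
            vector.append(word_id)
    return vector
```

Word order is kept in the vector, which matters for positional queries.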
You can query with things like wordcount, word x NEAR word y, etc.
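As a rough sketch of those two kinds of query over a vector of word numbers: the function names and the `max_distance` parameter below are assumptions for illustration, and NEAR is taken to mean "within a given number of positions", which may differ from what a particular system uses.

```python
from collections import Counter


def word_count(vector, word_id):
    """Count how often a word number occurs in a vector."""
    return Counter(vector)[word_id]


def near(vector, x, y, max_distance=3):
    """Return True if word numbers x and y occur within max_distance positions."""
    positions_x = [i for i, w in enumerate(vector) if w == x]
    positions_y = [i for i, w in enumerate(vector) if w == y]
    return any(
        abs(i - j) <= max_distance
        for i in positions_x
        for j in positions_y
    )
```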
If the database supports it, you can also query with bitmap indexes.
I have not used the PostgreSQL bitmap indexes much, but they look like they might be quite useful:
http://wiki.postgresql.org/wiki/Bitmap_Indexes
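To show the general idea of a bitmap index over word numbers (not how PostgreSQL or any particular database implements it), here is a minimal in-memory sketch. It assumes each document is already a vector of word numbers; `build_bitmaps` and `documents_with_all` are hypothetical names.

```python
def build_bitmaps(documents):
    """Map each word number to an int whose bit i is set if document i contains it."""
    bitmaps = {}
    for doc_index, vector in enumerate(documents):
        for word_id in set(vector):
            bitmaps[word_id] = bitmaps.get(word_id, 0) | (1 << doc_index)
    return bitmaps


def documents_with_all(bitmaps, word_ids):
    """Return indexes of documents that contain every word number in word_ids."""
    word_ids = list(word_ids)
    if not word_ids:
        return []
    # Start with the first word's bitmap and AND in the others.
    result = bitmaps.get(word_ids[0], 0)
    for word_id in word_ids[1:]:
        result &= bitmaps.get(word_id, 0)
    return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

Combining words is then a bitwise AND per word, which is what makes this kind of index attractive for "contains all of these words" queries.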
We used something called ALA library parsing rules that stripped off special characters, made capitalization uniform, etc.
http://www.ala.org/tools/guidelines/standardsguidelines
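A very simplified normalization step in that spirit might look like the sketch below. It is not the ALA rules themselves, only an assumed stand-in that replaces punctuation with spaces, lowercases everything, and collapses whitespace; `normalize` is a hypothetical name.

```python
import re


def normalize(text):
    """Replace special characters with spaces, lowercase, collapse whitespace."""
    # Anything that is not a letter, digit, or whitespace becomes a space
    # (underscore is handled explicitly because \w would otherwise keep it).
    cleaned = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(cleaned.lower().split())
```

Running this before the word-to-number lookup means 'Aardvark,' and 'aardvark' end up as the same dictionary entry.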
The outcome was something like this project:
http://www.ala.org/lita/ital/21/4/su
You might look into library software. Maybe you can find something useful here:
http://www.loc.gov/marc/marctools.html
I see that there are some sourceforge MARC record projects:
http://sourceforge.net/directory/os:windows/freshness:recently-updated/?q=marc%20records