From: | tim(dot)child(at)comcast(dot)net |
---|---|
To: | Shmagi Kavtaradze <kavtaradze(dot)s(at)gmail(dot)com> |
Cc: | pgsql-novice(at)postgresql(dot)org |
Subject: | Re: Combine Top-k with similarity search extensions |
Date: | 2015-11-20 16:42:44 |
Message-ID: | 1386988458.1073161.1448037764581.JavaMail.zimbra@comcast.net |
Lists: | pgsql-novice |
OK, it does add complexity.
Here is a functional md5 index on the whole string:
drop table if exists text_table;
create table text_table
(
    mystring text
);
-- expression index on the md5 hash of the whole string
create index text_md5 on text_table (md5(mystring));
insert into text_table (mystring) values
    ('John Smith'), ('John Smith'), ('John Smith'), ('John Smith'),
    ('Ian Smith'), ('Ian Smith'), ('Ian Smith'), ('Ian Smith'),
    ('Jim Smith'), ('J Smith');
-- signatures that occur more than once
select md5, count from
    ( select md5(mystring) as md5, count(*) as count from text_table group by md5(mystring) ) subq
where count > 1;
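The query above returns only the signatures. As a minimal sketch of one way to get back the rows themselves for further analysis, a window function can attach the per-signature count to every row. This assumes the same text_table; the alias names signature and copies are illustrative only.

```sql
-- Sketch: list the rows whose md5 signature occurs more than once.
-- Assumes the text_table created above; "signature" and "copies"
-- are illustrative alias names, not part of the original example.
select mystring, signature, copies
from (
    select mystring,
           md5(mystring) as signature,
           -- number of rows that share this row's signature
           count(*) over (partition by md5(mystring)) as copies
    from text_table
) sig
where copies > 1
order by signature;
```

With the sample data, this keeps the 'John Smith' and 'Ian Smith' rows and drops 'Jim Smith' and 'J Smith', whose signatures are unique.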
----- Original Message -----
From: "Shmagi Kavtaradze" <kavtaradze(dot)s(at)gmail(dot)com>
To: "tim child" <tim(dot)child(at)comcast(dot)net>
Cc: pgsql-novice(at)postgresql(dot)org
Sent: Friday, November 20, 2015 8:13:15 AM
Subject: Re: [NOVICE] Combine Top-k with similarity search extensions
It will add complexity, and I also have no idea how to do it. Is there any alternative?
On Fri, Nov 20, 2015 at 5:00 PM, < tim(dot)child(at)comcast(dot)net > wrote:
Shmagi,
Take the first 20 text characters and compute and store the CRC32 or MD5 of that value. That value acts as a signature. You can then find all distinct signatures, or all rows with duplicate signatures, for further analysis. You could even try building a signature on the full text string.
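A minimal sketch of the 20-character variant, assuming a table text_table with a text column mystring; the index name text_md5_prefix and the aliases are hypothetical. It uses an expression index instead of a stored column, which is one possible choice rather than the only one.

```sql
-- Sketch: signature built from the first 20 characters only.
-- text_table(mystring text) is assumed; text_md5_prefix is a
-- hypothetical index name chosen for this illustration.
create index text_md5_prefix on text_table (md5(left(mystring, 20)));

-- Prefix signatures shared by more than one row.
select md5(left(mystring, 20)) as prefix_signature,
       count(*) as copies
from text_table
group by md5(left(mystring, 20))
having count(*) > 1;
```

Rows that share a prefix signature only agree on their first 20 characters, so they are candidates for a closer comparison rather than confirmed duplicates.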
From: "Shmagi Kavtaradze" < kavtaradze(dot)s(at)gmail(dot)com >
To: pgsql-novice(at)postgresql(dot)org
Sent: Friday, November 20, 2015 2:21:36 AM
Subject: [NOVICE] Combine Top-k with similarity search extensions
I am performing a similarity check over a column in a table with about 3500 entries. The column is populated with text data from a text file. Performing the check results in 3500 * 3500 rows, and it takes forever to calculate on my virtual machine. Is there any way to calculate only the top-k results, to decrease the amount of work and the time needed? What I mean is that, for example, when checking two sentences, if the first several words do not match, the check should stop for those sentences and move on.