Re: Query on indexed table too slow

From: Fábio Moreira <fabio(at)dias(dot)moreira(dot)nom(dot)br>
To: Shmagi Kavtaradze <kavtaradze(dot)s(at)gmail(dot)com>
Cc: pgsql-novice novice <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Query on indexed table too slow
Date: 2016-03-08 00:20:22
Message-ID: CANQddpOTBZz+HJm0ttugYPAMKmyro6OA+YWNe0XrhYkvydN9Sg@mail.gmail.com
Lists: pgsql-novice

Hi Shmagi,

Your table is tiny -- tiny enough that the actual working set (the rows of
chunks2, without the "doc" field) easily fits in memory, so an index is
unlikely to buy you a large improvement here, especially since, by the looks
of the query plan, you have relatively few distinct chunkid values (is that
so? Can you post the output of SELECT COUNT(DISTINCT chunkid) FROM chunks2?).

Two things come to mind here:

1) That similarity comparison you're doing looks expensive: a cast to TEXT
plus string manipulation for every row of a ~100M-row join. I'm not sure
PostgreSQL has a native bit count function to speed this part of the code up;
you could implement one in C to see if that improves things, but it might be
easier (if indeed you only have 10 bits) to precompute a lookup table instead:

-- one row per possible BIT(10) value, with its population count
CREATE TEMPORARY TABLE popcount AS
SELECT
    i::BIT(10) AS value,
    Length(Replace(Cast(i::BIT(10) AS TEXT), '0', '')) AS popcount
FROM
    generate_series(0, 1023, 1) AS temp(i);

and join against that table (again, it's tiny, so no real need for an index
here); a sketch of the rewritten query is below.
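
Something like this should work -- untested, and assuming the chunks2 schema
from your message below ("similarity" is just a made-up column alias):

SELECT
    a.sentenceid,
    b.sentenceid,
    a.chunkid,
    -- the lookup replaces the per-row Replace()/Length() string work
    p.popcount::float / Length(a.chunk)::float AS similarity
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid
INNER JOIN popcount p
    ON p.value = (a.chunk & b.chunk);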

2) Do you really need all ~100M rows of the output? I assume this is some
document similarity metric you're after? What will be done with this data
after it is generated? If you only care about sufficiently similar chunks, it
is much cheaper to filter inside the query than to materialize everything.
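
For example, using the popcount table from (1) (the 0.8 cutoff is made up;
substitute whatever threshold you actually need):

SELECT a.sentenceid, b.sentenceid, a.chunkid
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid
INNER JOIN popcount p
    ON p.value = (a.chunk & b.chunk)
WHERE p.popcount >= 8;  -- similarity >= 0.8, since chunks are 10 bits wide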

[]s, Fábio.

On Tue, Mar 8, 2016 at 2:12 AM, Shmagi Kavtaradze <kavtaradze(dot)s(at)gmail(dot)com>
wrote:

> I have a table with 46230 rows (1 doc, 4623 sentences and 10 chunks for
> each sentence):
>
> create table chunks(
> doc varchar,
> sentenceid int,
> chunkid int,
> chunk bit(10)
> );
>
> With the query I want to compare sentence chunks with other sentence
> chunks that have same chunkid:
>
> SELECT
> a.sentenceid,
> b.sentenceid, a.chunkid,
> Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float /
> Length(a.chunk)::float
> FROM chunks2 a
> INNER JOIN chunks2 b
> ON a.sentenceid < b.sentenceid and a.chunkid = b.chunkid;
>
> I ran EXPLAIN ANALYZE on the unindexed table, with a composite index, and
> with each column indexed separately, but the time is the same in all cases:
>
> Indexed on (sentenceid, chunkid):
>
> Hash Join (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1)
>   Hash Cond: (a.chunkid = b.chunkid)
>   Join Filter: (a.sentenceid < b.sentenceid)
>   Rows Removed by Join Filter: 106883760
>   ->  Seq Scan on chunks2 a (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1)
>   ->  Hash (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1)
>         Buckets: 65536 Batches: 1 Memory Usage: 2680kB
>         ->  Seq Scan on chunks2 b (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1)
> Planning time: 1.209 ms
> Execution time: 1212779.012 ms
>
>
> I know they all perform the same operations and no index was used. Where is
> my mistake, and how can I speed up the query with an index? Or how can I
> use indexes efficiently in my case?
>
>

--
Fábio Dias Moreira
