Re: faster way to calculate top "tags" for a "resource" based on a column

From: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
To: Jonathan Vanasco <postgres(at)2xlp(dot)com>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: faster way to calculate top "tags" for a "resource" based on a column
Date: 2014-10-07 14:02:05
Message-ID: B6F6FD62F2624C4C9916AC0175D56D8828AF19D3@jenmbs01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


>I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.
>
>I have 2 tables:
> resource
> resource_2_tag
>
>I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource" table.
>
>both tables have around 1.6million records.
>
>If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
>If the database doesn't need to warm up, it averages 1.76seconds.
>
>The 1.76s time is troubling me.
>Searching for the discrete elements of this is pretty lightweight.
>
>here's an explain -- http://explain.depesz.com/s/PndC
>
>I tried a subquery instead of a join, and the query optimized the plan to the same.
>
>i'm hoping someone will see something that I just don't see.
>
>
>
> Table "public.resource_2_tag"
> Column | Type | Modifiers
>-----------------------+---------+-----------
> resource_id | integer |
> tag_id | integer |
>Indexes:
> "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
> "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
> Table "public.resource"
> Column | Type | Modifiers
>-------------------------------------+-----------------------------+----------------------------------------------------------
> id | integer | not null default nextval('resource_id_seq'::regclass)
>resource_attribute1_id | integer |
>lots of other columns | |
>Indexes:
> "resource_attribute1_idx" btree (resource_attribute1_id)
>
>--------------------------------------------------------------------------------
>
>select count(*) from resource;
>-- 1669729
>
>select count(*) from resource_2_tag;
>-- 1676594
>
>select count(*) from resource where resource_attribute1_id = 614;
>-- 5184
>-- 4.386ms
>
>select id from resource where resource_attribute1_id = 614;
>-- 5184
>-- 87.303ms
>
>popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
>EXPLAIN ANALYZE
>SELECT
> resource_2_tag.tag_id AS resource_2_tag_tag_id,
> count(resource_2_tag.tag_id) AS counted
>FROM
> resource_2_tag
>JOIN resource ON resource.id = resource_2_tag.resource_id
>WHERE
> resource.resource_attribute1_id = 614
>GROUP BY resource_2_tag.tag_id
>ORDER BY counted DESC
>LIMIT 25 OFFSET 0;

Hi,
it seems to me that your subquery may deliver duplicate ids.
And with the selectivity of your example, I would expect an index usage
instead of a table scan. You may check how up to date your statistics are
and try to raise the statistic target on the column resource_2_tag.tag_id.
Also try a CTE form for your query:

WITH A as (SELECT DISTINCT id FROM resource
WHERE resource_attribute1_id = 614
)
SELECT
resource_2_tag.tag_id AS resource_2_tag_tag_id,
count(resource_2_tag.tag_id) AS counted
FROM
resource_2_tag
JOIN A ON A.id = resource_2_tag.resource_id
ORDER BY counted DESC
LIMIT 25;

regards,
Marc Mamin

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Sergey Konoplev 2014-10-07 15:29:50 Re: Converting char to varchar automatically
Previous Message Tom Lane 2014-10-07 13:45:25 Re: Really strange foreign key constraint problem blocking delete