Re: faster way to calculate top "tags" for a "resource" based on a column

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Jonathan Vanasco <postgres(at)2xlp(dot)com>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: faster way to calculate top "tags" for a "resource" based on a column
Date: 2014-10-06 21:56:19
Message-ID: 54331003.1020505@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 10/3/14, 11:21 AM, Jonathan Vanasco wrote:
> I've been able to fix most of my slow queries into something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.
>
> I have 2 tables:
> resource
> resource_2_tag
>
> I want to calculate the top 25 "tag_ids" in "resource_2_tag " for resources that match a given attribute on the "resource" table.
>
> both tables have around 1.6million records.
>
> If the database needs to warm up and read into cache, this can take 60seconds to read the data off disk.
> If the database doesn't need to warm up, it averages 1.76seconds.
>
> The 1.76s time is troubling me.
> Searching for the discrete elements of this is pretty lightweight.
>
> here's an explain -- http://explain.depesz.com/s/PndC
>
> I tried a subquery instead of a join, and the query optimized the plan to the same.
>
> i'm hoping someone will see something that I just don't see.
>
>
>
> Table "public.resource_2_tag"
> Column | Type | Modifiers
> -----------------------+---------+-----------
> resource_id | integer |
> tag_id | integer |
> Indexes:
> "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
> "_idx_speed_resource_2_tag__tag_id" btree (tag_id)
>
> Table "public.resource"
> Column | Type | Modifiers
> -------------------------------------+-----------------------------+----------------------------------------------------------
> id | integer | not null default nextval('resource_id_seq'::regclass)
> resource_attribute1_id | integer |
> lots of other columns | |
> Indexes:
> "resource_attribute1_idx" btree (resource_attribute1_id)
>
> --------------------------------------------------------------------------------
>
> select count(*) from resource;
> -- 1669729
>
> select count(*) from resource_2_tag;
> -- 1676594
>
> select count(*) from resource where resource_attribute1_id = 614;
> -- 5184
> -- 4.386ms
>
> select id from resource where resource_attribute1_id = 614;
> -- 5184
> -- 87.303ms
>
> popping the 5k elements into an "in" clause, will run the query in around 100ms.
>
>
> EXPLAIN ANALYZE
> SELECT
> resource_2_tag.tag_id AS resource_2_tag_tag_id,
> count(resource_2_tag.tag_id) AS counted
> FROM
> resource_2_tag
> JOIN resource ON resource.id = resource_2_tag.resource_id
> WHERE
> resource.resource_attribute1_id = 614
> GROUP BY resource_2_tag.tag_id
> ORDER BY counted DESC
> LIMIT 25 OFFSET 0;
Don't join to the resource table; there's no reason to because you're not pulling anything from it.

If for some reason you do need data out of the resource table, do the LIMIT 25 first, in a sub-select:

SELECT r.*, counted
FROM resource r
JOIN (
SELECT tag_id, count(*)
FROM resource_2_tag
GROUP BY tag_id
ORDER BY tag_id
LIMIT 25
) t ON ...
;

--
--
Jim Nasby, Data Architect, Blue Treble
Data in Trouble? Get it in Treble! http://BlueTreble.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jim Nasby 2014-10-06 22:00:12 Re: Processor usage/tuning question
Previous Message Jim Nasby 2014-10-06 21:47:26 Re: table versioning approach (not auditing)