From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: Abelard Hoffman <abelardhoffman(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Will pg_repack improve this query performance?
Date: 2014-10-15 09:03:38
Message-ID: E3E6B7B4-C00E-411F-BF9D-C34BA058A4FC@gmail.com
Lists: pgsql-general
On 15 Oct 2014, at 4:33, Abelard Hoffman <abelardhoffman(at)gmail(dot)com> wrote:
> I believe this query is well optimized, but it's slow if the all the blocks aren't already in memory.
>
> Here's example explain output. You can see it takes over 7 seconds to run when it needs to hit the disk, and almost all of it is related to checking if the user has "messages."
>
> http://explain.depesz.com/s/BLT
From that plan it is obvious that the index scan takes most of the time. It looks like 3315 rows match to_id = users.user_id, of which only 10 survive the filter conditions.
With your current setup, the database first needs to find candidate rows in the index and then has to check the other conditions against the table, which is likely to involve some disk access.
> On a second run, it's extremely fast (< 50ms). So I'm thinking it's a lack of clustering on the "Index Cond: (to_id = users.user_id)" that's the culprit.
That probably means that the relevant parts of the table were still in memory, which means the scan did not need to visit the disk to load the matched rows to filter the NULL conditions in your query.
> I'm afraid of using CLUSTER due to the exclusive lock, but I found pg_repack while researching:
> http://reorg.github.io/pg_repack/
A CLUSTER would help put rows with the same to_id together. Disk access would be less random that way, so it would help somewhat.
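A minimal sketch of that approach; the table name "messages" and the index name "messages_to_id_idx" are assumptions, since the actual schema isn't shown in the thread:

```sql
-- Rewrite the table physically ordered by the to_id index.
-- Note: CLUSTER holds an ACCESS EXCLUSIVE lock on the table
-- for its whole duration, blocking reads and writes.
CLUSTER messages USING messages_to_id_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE messages;
```

Keep in mind that PostgreSQL does not maintain the clustered order for rows inserted or updated afterwards, so the operation has to be repeated periodically to stay effective.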
According to your query plan, accessing disk (assuming that’s what made the difference) was 154 times (7700 ms / 50 ms) slower than accessing memory. I don’t have the numbers for your disks or memory, but that doesn’t look like an incredibly unrealistic difference. That raises the question: how random was that disk access, and how much can be gained from clustering that data?
Did you try a partial index on to_id with those NULL conditions? That should result in a much smaller index, which in turn makes it faster to scan, especially if the index is too large to keep in memory. More importantly, the scan wouldn’t need to visit the table to verify those NULL fields, because only rows satisfying them are in the index at all.
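Such a partial index might look like the following. This is only a sketch: the table name "messages" and the columns "read_at" and "deleted_at" are placeholders, since the original query isn't quoted here; substitute the actual NULL-tested columns from the slow query.

```sql
-- Sketch, with assumed names: index only the rows the query can
-- actually return. The WHERE clause must match the query's
-- conditions for the planner to use this index.
-- CONCURRENTLY avoids blocking writes while the index is built.
CREATE INDEX CONCURRENTLY messages_to_id_unread_idx
    ON messages (to_id)
    WHERE read_at IS NULL
      AND deleted_at IS NULL;
```

Because rows failing the WHERE predicate are simply absent from the index, matching index entries no longer require a heap visit just to discard non-matching rows, which is where most of the 7 seconds in the posted plan was spent.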
> Does it seem likely that doing an --order-by on the to_id column would have a significant impact in this case? pg_repack seems pretty stable and safe at this point?
Not being familiar with pg_repack I can’t advise on that.
> I am going to try and test this in a dev environment first but wanted feedback if this seemed like a good direction?
You can try that CLUSTER or the pg_repack approach regardless of my suggestion about the partial index. It should speed up disk access to those records no matter how they are indexed.
Good luck!
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.