Re: Will pg_repack improve this query performance?

From: Abelard Hoffman <abelardhoffman(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Will pg_repack improve this query performance?
Date: 2014-10-16 21:41:11
Message-ID: CACEJHMipcSV+mUAV2U2Gew1N2ec6snLsN8HqLsKgfKZPBikXxQ@mail.gmail.com
Lists: pgsql-general

On Wed, Oct 15, 2014 at 2:03 AM, Alban Hertroys <haramrae(at)gmail(dot)com> wrote:

>
> On 15 Oct 2014, at 4:33, Abelard Hoffman <abelardhoffman(at)gmail(dot)com> wrote:
>
> > I believe this query is well optimized, but it's slow if all the
> blocks aren't already in memory.
> >
> > Here's example explain output. You can see it takes over 7 seconds to
> run when it needs to hit the disk, and almost all of it is related to
> checking if the user has "messages."
> >
> > http://explain.depesz.com/s/BLT
>
> From that plan it is obvious that the index scan takes the most time. It
> looks like you have 3315 rows matching to_id = users.id, of which only 10
> match your query conditions after applying the filter.
>
> With your current setup, the database first needs to find candidate rows
> in the index and then has to check the other conditions against the table,
> which is likely to involve some disk access.
>
> > On a second run, it's extremely fast (< 50ms). So I'm thinking it's a
> lack of clustering on the "Index Cond: (to_id = users.user_id)" that's the
> culprit.
>
> That probably means that the relevant parts of the table were still in
> memory, which means the scan did not need to visit the disk to load the
> matched rows to filter the NULL conditions in your query.
>
> > I'm afraid of using CLUSTER due to the exclusive lock, but I found
> pg_repack while researching:
> > http://reorg.github.io/pg_repack/
>
> A CLUSTER would help by putting rows with the same to_id together. Disk
> access would be less random that way, so it would help some.
>
> According to your query plan, accessing disks (assuming that’s what made
> the difference) was 154 (7700 ms / 50 ms) times slower than accessing
> memory. I don’t have the numbers for your disks or memory, but that doesn’t
> look like an incredibly unrealistic difference. That begs the question, how
> random was that disk access and how much can be gained from clustering that
> data?
>
> Did you try a partial index on to_id with those NULL conditions? That
> should result in a much smaller index, which in turn is faster to scan -
> especially so if the index is too large to keep in memory. More
> importantly though, the scan wouldn’t need to visit the table to verify
> those NULL fields.
>

No, I haven't tried a more constrained index. Good point, makes sense.
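
Something along these lines, I assume (table and column names are made
up here, since the real schema isn't in the thread; the WHERE clause
would match the actual NULL conditions from the query):

    -- hypothetical table/columns; adjust to the real schema and
    -- to the query's actual NULL conditions
    CREATE INDEX CONCURRENTLY messages_to_id_unread_idx
        ON messages (to_id)
        WHERE read_at IS NULL;

CONCURRENTLY should also avoid the kind of exclusive lock I was worried
about with CLUSTER.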

> > Does it seem likely that doing an --order-by on the to_id column would
> have a significant impact in this case? pg_repack seems pretty stable and
> safe at this point?
>
> Not being familiar with pg_repack I can’t advise on that.
>
> > I am going to try and test this in a dev environment first, but wanted
> feedback on whether this seems like a good direction.
>
> You can try that CLUSTER or the approach with pg_repack regardless of my
> suggestion for the partial index. It should speed up disk access to those
> records however they are indexed.
>

I tried pg_repack in dev and it did make a dramatic improvement (pg_repack
took ~ 65 minutes to run). After the repack, I couldn't get the query to
take longer than 750ms. Should be much, much faster in production too.
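
In case it helps anyone searching the archives, the dev invocation was
roughly this (table and database names are placeholders; --order-by is
what makes pg_repack rewrite the table in to_id order, like CLUSTER but
without holding the exclusive lock for the whole rewrite):

    # table/db names are placeholders
    pg_repack --table=messages --order-by=to_id mydb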

It seems like maybe the partial index is a better long-term fix though.

And thank you, Josh, for the tip on table bloat. I'll take a look at that
too.
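
Presumably something like pgstattuple would show how bad the bloat is
(table name is a placeholder again):

    -- 'messages' is a placeholder table name
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT dead_tuple_percent, free_percent
      FROM pgstattuple('messages');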

--
Best,
AH
