Re: Understanding query planner cpu usage

From: Lucas Fairchild-Madar <lucas(dot)madar(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Understanding query planner cpu usage
Date: 2018-02-22 00:53:15
Message-ID: CAJmoq7NtboWu13rKznLgjSkud-soQwvth3XVgxyG2-psg5Mx-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Feb 21, 2018 at 4:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Lucas Fairchild-Madar <lucas(dot)madar(at)gmail(dot)com> writes:
> > I'm having an perplexing issue in PG 10.1 wherein deleting a large amount
> > of rows from a table causes query planning time to spike dramatically
> for a
> > while. This happens with or without autovacuums so vacuuming isn't the
> > issue.
>
> Would the deleted rows happen to be the extremal values of some indexed
> column that is a join key in the slowly-planned queries?
>

Yes. Basically, this is sort of a moving set of several weeks of data. We
have a job that goes through (currently, every 3 hours) and periodically
wipes out data from the very beginning. At the same time, more data is
continually being appended to the table. All of these rows are immutable.
The vast majority of the work happening in the database happens in the last
two or three million rows of these tables; the rest is just there for
auditing and analytics.

> If so, this might be some manifestation of a problem we've seen before:
> the planner tries to find out the current live max value of the column
> by scanning the index, and that's really slow if there are a lot of
> recently-dead entries at the index end, because each of them has to be
> inspected and then hinted dead. You'd pay that overhead at some point
> anyway, of course. The cases where it becomes a problem are where the
> planner inspects these values but *can't* hint them dead, such as when
> the deletion hasn't committed yet, or they're newly inserted rows that
> likewise aren't committed. Then each incoming query does the work
> over again until the transient state is resolved.
>
> We've done various things to ameliorate this, but maybe you've found
> some new way to cause it to be a pain point. Is there anything special
> about the way you're deleting the rows? Maybe there's a long-running
> transaction in the background that can still see the deleted rows?
>
>
This sounds like the trigger here. We have a long-running (~20 minute)
transaction that's doing analytical rollups on these sets of tables. It's
doing them only on very recent data, but I assume having an open
transaction means the tuples can't be marked as dead yet because the
queries within that transaction could still potentially access them.

I took a look further in this direction and found that our cleanup process
and this analytical rollup both happened to run very close to each other
twice a day. I moved the cleanup process to run at a time when it should
never collide with analytical rollups and we'll see if the load spike
happens again. When this happens, the long-running query takes almost an
hour so the problem compounds.

It would be helpful to have a way to introspect in to what the query
planner is doing. For instance, if I could explain (analyze, verbose,
queryplan) the query and the queryplan told me that it spent 5000ms
skipping over dead tuples in the index, knowing that would arm me with more
information to tune the system better.

Perhaps this is a stupid question, but would creating the index in
descending order solve this problem (as a unique index, not a primary key)?
What is the planner doing when trying to find the current live max value of
the column?

Thanks,
Lucas

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David Steele 2018-02-22 00:56:38 Re: initdb when data/ folder has mount points
Previous Message Tom Lane 2018-02-22 00:47:16 Re: initdb when data/ folder has mount points