Re: Autovacuum issues with truncate and create index ...

From: Baptiste LHOSTE <blhoste(at)alaloop(dot)com>
To: Kevin Grittner <kgrittn(at)mail(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org, Sylvain CAILLET <scaillet(at)alaloop(dot)com>
Subject: Re: Autovacuum issues with truncate and create index ...
Date: 2012-12-31 16:20:26
Message-ID: 1969340798.3975703.1356970826505.JavaMail.root@alaloop.com
Lists: pgsql-admin

Hi,

> Thanks. I wasn't suggesting you increase the duration; I just
> wanted perspective on whether it could be the result of unusually
> long run times rather than blocking, and how severe that increase
> was known to be.

> Thank you very much. With that much information we should be much
> better able to get a sense of the nature of the problem. It will
> take a while to sift through it and properly analyze it. But even
> on a first pass I think there is a lot of valuable information that
> jumps out:

> (1) There is no sign of unusual pressure on OS memory; OS cache
> usage remains stable from before the incident to the end of the
> monitored period.

> (2) Disk wait time climbed early in the incident and remained high
> to the end.

Just for information, our server has four SAS disks in RAID 0. I do not remember whether I have already mentioned it.

> (3) Disk read volume *before* the incident shows a peak of about
> the same as during the incident, with somewhat lower disk wait
> time. (Do we know what was causing that? It ended before the other
> results were captured.)

Maybe it is a consequence of step 4 of our process: "we copy new data (around 150,000 rows) into the target partition."
These data are read from files and then copied into the database, which could explain the peak in disk read volume.
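For context, step 4 amounts to a bulk load of roughly 150,000 rows into the current partition. A minimal sketch of that kind of load is below; the table name, column list, and file path are illustrative assumptions, not our actual schema:

```sql
-- Hypothetical sketch of step 4: bulk-load ~150,000 rows, previously
-- read from files, into the target partition. Names are placeholders.
COPY flows_partition_current (start_date, src_ip, dst_ip, bytes)
FROM '/var/lib/import/flows.csv'
WITH (FORMAT csv);
```

A COPY of this size reads the whole input file and writes every row, so a disk-read spike around that step would be expected.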

> (4) Not a single incident of blocking was captured in any of the
> lists of pg_stat_activity or pg_locks.

> (5) The TRUNCATE statements completed and moved on to CREATE INDEX,
> which continued to accrue CPU time during the episode.

> (6) Of the three autovacuum worker processes, two were running just
> an ANALYZE on every sample, and were moving on from table to table.

> (7) One autovacuum process was running VACUUM ANALYZE against a
> single table for the entire duration of the incident. It was slowly
> accumulating CPU time during the incident.

> On the face of it, it appears that with your normal production
> settings your storage system is right at the edge of what it can
> handle, and making autovacuum more aggressive to try to keep the
> statistics on the second type of table more up-to-date is pushing
> the load past its capacity. You might be able to change the
> autovacuum thresholds and scale factors without changing
> autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit (or
> making smaller changes to them). You could probably throw hardware
> at it to fix the problem. Even with settings which work fine when
> everything is up-to-date you may experience some impact on
> production when you first turn it on and autovacuum is attempting
> to "catch up".

We will try to change autovacuum thresholds and scale factors without changing autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit.
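One way to do that is with per-table storage parameters, which override the global thresholds and scale factors while leaving the cost delay and cost limit untouched. The sketch below uses a hypothetical table name, and the values are illustrative starting points rather than tested recommendations:

```sql
-- Make autovacuum consider this table sooner, without changing
-- autovacuum_vacuum_cost_delay / autovacuum_vacuum_cost_limit globally.
-- Table name and threshold values are illustrative only.
ALTER TABLE flows_partition_current SET (
    autovacuum_vacuum_scale_factor  = 0.05,  -- global default is 0.2
    autovacuum_analyze_scale_factor = 0.02,  -- global default is 0.1
    autovacuum_vacuum_threshold     = 1000,
    autovacuum_analyze_threshold    = 500
);
```

Setting these per table keeps the more aggressive sampling confined to the tables whose statistics go stale quickly, instead of increasing pressure across the whole cluster.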

> I'm not actually clear, when I look back, at what the problem is
> that you are trying to solve -- you say that a particular type of
> query is running for 2 to 3 minutes, and note that statistics on a
> particular type of table are only being re-sampled once every 5 to
> 6 days. It's not clear that more frequent statistical sampling of
> the tables would change the plans. Perhaps you should post one such
> query to the performance list, with supporting data, and see
> whether someone can suggest a way to speed that query.

> http://wiki.postgresql.org/wiki/SlowQueryQuestions

These queries are very simple: delete from table where start_date < availableTimestamp.
We ran EXPLAIN to try to understand what the problem could be.
The query planner indicated that the index on start_date could not be used because the statistics were not up-to-date.
That is why we decided to increase the frequency of the autovacuum process.
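For reference, the kind of check we ran looks roughly like this; the table name and cutoff value are placeholders, not our production identifiers:

```sql
-- Refresh planner statistics for the partition by hand, then
-- re-check whether the plan picks the index on start_date.
-- Table name and timestamp are hypothetical.
ANALYZE flows_partition_current;

EXPLAIN
DELETE FROM flows_partition_current
WHERE start_date < '2012-12-24 00:00:00';
```

If a manual ANALYZE flips the plan from a sequential scan to an index scan on start_date, that would support the idea that stale statistics, rather than the query itself, are the cause of the slow deletes.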

I find it very confusing that a server with 8 CPUs and a load average of 0.56 over the last 15 minutes cannot handle 3 autovacuum processes.

Thanks anyway for your time and help.

Happy New Year's Eve.

Best regards,

Baptiste

---
Baptiste LHOSTE
blhoste(at)alaloop(dot)com

ALALOOP S.A.S. - Technopole Izarbel - 64210 Bidart
Téléphone : +33 (0) 5 59 41 51 10
www.alaloop.com
