Re: Vacuum Full Analyze Stalled

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, Jeff Kirby <Jeff(dot)Kirby(at)wicourts(dot)gov>, pgsql-admin(at)postgresql(dot)org
Subject: Re: Vacuum Full Analyze Stalled
Date: 2005-10-03 19:41:14
Message-ID: 20051003194114.GD40138@pervasive.com
Lists: pgsql-admin pgsql-hackers

I haven't looked in detail at autovacuum in 8.1 yet, but in previous versions
it was a bad idea to depend on it to vacuum a small, heavily updated table
frequently enough. The problem is that autovacuum is single-threaded: once it
starts vacuuming a large table, it can be hours before it gets around to the
small table again.

In a case like this, your best bet is either to have the application vacuum
the small table periodically or simply to do it once a minute via cron.
Vacuuming a 72-row table should take next to no time, so running vacuum every
minute adds very little overhead even when it isn't needed.
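For example (a rough sketch, assuming the database is "dtr" and the table is
the mixed-case "DbTranImageStatus" from your report, which needs the extra
layer of quoting), a crontab entry for the postgres user could be:

    * * * * * vacuumdb -q -z -d dtr -t '"DbTranImageStatus"'

The -z makes it analyze the table as well; drop it if you only want the vacuum.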

On Mon, Oct 03, 2005 at 02:09:37PM -0500, Kevin Grittner wrote:
> The previous report was from the same databases -- I'm a consultant
> tasked with evaluating the various open source options and making
> one of them work with our existing framework. Jeff developed a new
> app (now in beta testing), which is our first use of PostgreSQL
> under real production load for prolonged periods of time.
>
> Jeff was monitoring the app over the weekend, and posted here when
> he had problems. Unfortunately, he didn't capture enough information
> to diagnose much, but I'm passing along the sequence of events,
> since they led up to a failure this morning for which we have more
> detail.
>
> Saturday, 9:00 a.m. -- Noticed performance beginning to degrade.
> Saturday, 1:00 p.m. -- After continued degradation, stops client app,
> and runs vacuum full analyze on both Windows and Linux boxes.
> Saturday, 3:00 p.m. -- Windows box completes the vacuum.
> Sunday, 1:00 p.m. -- Vacuum on Linux still not done. CPU on Linux
> box is completely idle. Killed the vacuum process. Ran vacuum
> full analyze on each table individually. All fine except for the 72 row
> table with very frequent updates. Left this one running.
> Sunday, 9:00 p.m. -- Vacuum of 72 row table still running. Killed all
> processes and restarted postgres. Started a vacuum full analyze
> of the 72 row table again.
> Sunday, 11:00 p.m. -- Vacuum of 72 row table finishes. Started
> applications, which had been stopped since Saturday, 1:00 p.m.
> Monday, 9:20 a.m. -- Performance is beginning to degrade again.
> Detailed information from this point below.
>
> dtr=# select * from pg_locks order by pid;
> locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted
> ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
> transactionid | | | | | 36430565 | | | | 36430565 | 2771 | ExclusiveLock | t
> relation | 30793 | 30976 | | | | | | | 36430565 | 2771 | AccessShareLock | t
> relation | 30793 | 30976 | | | | | | | 36430565 | 2771 | RowExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 36430565 | 2771 | AccessShareLock | t
> relation | 30793 | 30871 | | | | | | | 36430565 | 2771 | RowExclusiveLock | t
> relation | 30793 | 30976 | | | | | | | 36430561 | 2777 | AccessShareLock | t
> relation | 30793 | 30976 | | | | | | | 36430561 | 2777 | RowExclusiveLock | t
> transactionid | | | | | 36430561 | | | | 36430561 | 2777 | ExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 36430561 | 2777 | AccessShareLock | t
> relation | 30793 | 30871 | | | | | | | 36430561 | 2777 | RowExclusiveLock | t
> transactionid | | | | | 36430563 | | | | 36430563 | 2783 | ExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 36430563 | 2783 | AccessShareLock | t
> relation | 30793 | 30871 | | | | | | | 36430563 | 2783 | RowExclusiveLock | t
> relation | 30793 | 30976 | | | | | | | 36430563 | 2783 | AccessShareLock | t
> relation | 30793 | 30976 | | | | | | | 36430563 | 2783 | RowExclusiveLock | t
> relation | 30793 | 30976 | | | | | | | 36430559 | 2790 | AccessShareLock | t
> relation | 30793 | 30976 | | | | | | | 36430559 | 2790 | RowExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 36430559 | 2790 | AccessShareLock | t
> relation | 30793 | 30871 | | | | | | | 36430559 | 2790 | RowExclusiveLock | t
> transactionid | | | | | 36430559 | | | | 36430559 | 2790 | ExclusiveLock | t
> relation | 30793 | 30976 | | | | | | | 35633343 | 3158 | ShareUpdateExclusiveLock | t
> transactionid | | | | | 35633343 | | | | 35633343 | 3158 | ExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 35633343 | 3158 | ShareUpdateExclusiveLock | t
> relation | 30793 | 10340 | | | | | | | 36430567 | 4238 | AccessShareLock | t
> transactionid | | | | | 36430567 | | | | 36430567 | 4238 | ExclusiveLock | t
> (25 rows)
>
> dtr=# \q
> postgres(at)linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres 28937 0.0 0.0 2436 992 pts/1 S Sep30 0:00 su postgres
> postgres 28938 0.0 0.0 2944 1620 pts/1 S Sep30 0:00 bash
> postgres 2612 0.0 0.0 170968 4952 ? S Oct02 0:00 /usr/local/pgsql/bin/postmaster
> postgres 2614 0.0 1.9 171200 165528 ? S Oct02 0:08 postgres: writer process
> postgres 2615 0.0 0.0 7240 2788 ? S Oct02 0:27 postgres: stats buffer process
> postgres 2617 0.0 0.0 6556 1988 ? S Oct02 0:22 postgres: stats collector process
> postgres 2724 0.0 0.0 171828 7156 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32986) idle
> postgres 2771 6.4 2.0 171996 167176 ? S Oct02 44:33 postgres: dtr dtr 127.0.0.1(32990) idle
> postgres 2773 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32991) idle
> postgres 2774 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32992) idle
> postgres 2775 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32993) idle
> postgres 2776 0.0 0.0 171720 6848 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32994) idle
> postgres 2777 6.4 2.0 171996 167228 ? R Oct02 44:29 postgres: dtr dtr 127.0.0.1(32995) UPDATE
> postgres 2778 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32996) idle
> postgres 2779 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32997) idle
> postgres 2780 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32998) idle
> postgres 2781 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(32999) idle
> postgres 2782 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33000) idle
> postgres 2783 6.4 2.0 172016 167292 ? S Oct02 44:38 postgres: dtr dtr 127.0.0.1(33001) idle
> postgres 2784 0.0 0.0 171460 5824 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33002) idle
> postgres 2785 0.0 0.0 171460 5828 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33003) idle
> postgres 2786 6.4 2.0 172000 167216 ? S Oct02 44:38 postgres: dtr dtr 127.0.0.1(33004) idle
> postgres 2787 6.4 2.0 172000 167236 ? R Oct02 44:33 postgres: dtr dtr 127.0.0.1(33005) UPDATE
> postgres 2788 0.0 0.0 171460 5828 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33006) idle
> postgres 2789 0.0 0.0 171460 5828 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33007) idle
> postgres 2790 6.4 2.0 171992 167280 ? R Oct02 44:27 postgres: dtr dtr 127.0.0.1(33008) UPDATE
> postgres 2791 0.0 0.0 171460 5828 ? S Oct02 0:00 postgres: dtr dtr 127.0.0.1(33009) idle
> postgres 3158 0.0 0.0 188324 8044 ? S Oct02 0:00 postgres: autovacuum process dtr
> postgres 4289 0.0 0.0 2148 672 pts/1 R+ 09:21 0:00 ps aux
> postgres 4290 0.0 0.0 1688 688 pts/1 R+ 09:21 0:00 grep ^postgres
>
> oid 30871 is the 72 row table.
> oid 30976 is the index for its primary key.
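> (For reference, the oid-to-name mapping can be confirmed from pg_class,
> e.g. select oid, relname from pg_class where oid in (30871, 30976);)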
>
> We stopped the applications, and the autovacuum process didn't
> wake up.
>
> dtr=# select * from pg_locks order by pid;
> locktype | database | relation | page | tuple | transactionid | classid | objid | objsubid | transaction | pid | mode | granted
> ---------------+----------+----------+------+-------+---------------+---------+-------+----------+-------------+------+--------------------------+---------
> relation | 30793 | 30976 | | | | | | | 35633343 | 3158 | ShareUpdateExclusiveLock | t
> transactionid | | | | | 35633343 | | | | 35633343 | 3158 | ExclusiveLock | t
> relation | 30793 | 30871 | | | | | | | 35633343 | 3158 | ShareUpdateExclusiveLock | t
> relation | 30793 | 10340 | | | | | | | 36600517 | 4498 | AccessShareLock | t
> transactionid | | | | | 36600517 | | | | 36600517 | 4498 | ExclusiveLock | t
> (5 rows)
>
> dtr=# \q
> postgres(at)linpost:/var/pgsql/data> ps aux|grep ^postgres
> postgres 2612 0.0 0.0 170968 4952 ? S Oct02 0:00 /usr/local/pgsql/bin/postmaster
> postgres 2614 0.0 1.9 171200 165528 ? S Oct02 0:10 postgres: writer process
> postgres 2615 0.0 0.0 7240 2788 ? S Oct02 0:32 postgres: stats buffer process
> postgres 2617 0.0 0.0 6556 1988 ? S Oct02 0:26 postgres: stats collector process
> postgres 3158 0.0 0.0 188324 8044 ? S Oct02 0:00 postgres: autovacuum process dtr
> postgres 4481 0.0 0.0 2432 988 pts/1 S 10:48 0:00 su postgres
> postgres 4482 0.0 0.0 2944 1604 pts/1 S 10:48 0:00 bash
> postgres 4499 0.0 0.0 2148 672 pts/1 R+ 10:49 0:00 ps aux
> postgres 4500 0.0 0.0 1680 636 pts/1 S+ 10:49 0:00 grep ^postgres
> postgres(at)linpost:/var/pgsql/data> uptime
> 10:48am up 6 days 19:20, 2 users, load average: 0.00, 0.28, 0.73
> postgres(at)linpost:/var/pgsql/data> vmstat 10
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 8 278376 62676 7770684 0 0 15 7 17 18 11 1 87 1
> 0 0 8 278376 62700 7770660 0 0 0 11 1035 0 1 0 99 0
> 0 0 8 278440 62700 7770660 0 0 0 0 1013 0 0 0 100 0
> 0 0 8 278440 62700 7770660 0 0 0 0 1016 0 1 0 99 0
> 0 0 8 278440 62700 7770660 0 0 0 4 1018 0 1 0 99 0
>
> We found the 72 row table somewhat bloated, so we cleaned it up.
>
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO: vacuuming "public.DbTranImageStatus"
> INFO: index "DbTranImageStatusPK" now contains 72 row versions in 6665 pages
> DETAIL: 6555 index pages have been deleted, 6555 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.01 sec.
> INFO: "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 479137 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.01s/0.01u sec elapsed 0.02 sec.
> INFO: vacuuming "pg_toast.pg_toast_272793"
> INFO: index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL: 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: analyzing "public.DbTranImageStatus"
> INFO: "DbTranImageStatus": scanned 3000 of 4244 pages, containing 68 live rows and 0 dead rows; 68 rows in sample, 96 estimated total rows
> VACUUM
> dtr=# vacuum full analyze verbose "DbTranImageStatus";
> INFO: vacuuming "public.DbTranImageStatus"
> INFO: "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 4244 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 68 to 68 bytes long.
> There were 479137 unused item pointers.
> Total free space (including removable row versions) is 32760236 bytes.
> 4227 pages are or will become empty, including 0 at the end of the table.
> 4244 pages containing 32760236 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL: 0 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "DbTranImageStatus": moved 22 row versions, truncated 4244 to 1 pages
> DETAIL: CPU 0.00s/0.02u sec elapsed 0.03 sec.
> INFO: index "DbTranImageStatusPK" now contains 72 row versions in 6654 pages
> DETAIL: 22 index row versions were removed.
> 6544 index pages have been deleted, 6544 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: vacuuming "pg_toast.pg_toast_272793"
> INFO: "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> Nonremovable row versions range from 0 to 0 bytes long.
> There were 0 unused item pointers.
> Total free space (including removable row versions) is 0 bytes.
> 0 pages are or will become empty, including 0 at the end of the table.
> 0 pages containing 0 free bytes are potential move destinations.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL: 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: analyzing "public.DbTranImageStatus"
> INFO: "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
> VACUUM
> dtr=# reindex table "DbTranImageStatus";
> REINDEX
> dtr=# vacuum analyze verbose "DbTranImageStatus";
> INFO: vacuuming "public.DbTranImageStatus"
> INFO: index "DbTranImageStatusPK" now contains 72 row versions in 2 pages
> DETAIL: 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "DbTranImageStatus": found 0 removable, 72 nonremovable row versions in 1 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 48 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: vacuuming "pg_toast.pg_toast_272793"
> INFO: index "pg_toast_272793_index" now contains 0 row versions in 1 pages
> DETAIL: 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "pg_toast_272793": found 0 removable, 0 nonremovable row versions in 0 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: analyzing "public.DbTranImageStatus"
> INFO: "DbTranImageStatus": scanned 1 of 1 pages, containing 72 live rows and 0 dead rows; 72 rows in sample, 72 estimated total rows
> VACUUM
>
> These all ran sub-second. We updated postgresql.conf for more
> aggressive autovacuum, and restarted postgres, then restarted the
> application.
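> (As an illustration only -- the exact values aren't reproduced here --
> an 8.1 configuration more aggressive than the defaults might look like:
>
>   stats_start_collector = on    # both stats settings are required
>   stats_row_level = on          # for autovacuum to run at all
>   autovacuum = on
>   autovacuum_naptime = 60                # seconds between runs
>   autovacuum_vacuum_threshold = 200      # lower than the 8.1 default
>   autovacuum_vacuum_scale_factor = 0.2   # lower than the 8.1 default
> )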
>
> This failure seems rather different from the one I previously posted,
> since it was blocking on the application table, rather than
> pg_constraint_contypid_index, and it did not wake up when all
> other processes were stopped.
>
> As before, both boxes are running 8.1beta2. Windows is not showing
> these autovacuum blocking problems; only Linux is.
>
> -Kevin
>
>
> >>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> 10/02/05 8:53 PM >>>
> "Jeff Kirby" <Jeff(dot)Kirby(at)wicourts(dot)gov> writes:
> > the Linux box however is still chugging away this morning... and
> > appears to be stuck on vacuuming "pg_constraint_contypid_index". How
> > do I know... well I don't really... I'm inferring based on the order
> > of the log output on the Windows box.
>
> Looking in pg_locks would give you a more reliable indicator of what the
> VACUUM is currently working on.
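> (For example, joining pg_locks to pg_class shows relation names rather
> than bare oids -- something along these lines, substituting the vacuum
> backend's pid:
>
>   select l.pid, c.relname, l.mode, l.granted
>     from pg_locks l join pg_class c on c.oid = l.relation
>    where l.pid = <vacuum pid>;
> )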
>
> Is the Linux box otherwise idle? There was another report recently of a
> vacuum hung up for a long time on pg_constraint_contypid_index, but it
> seemed to be due to extremely heavy usage of domain types ...
>
> regards, tom lane
>

--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
