Optimizing around retained tuples

From: James Parks <james(dot)parks(at)meraki(dot)net>
To: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Optimizing around retained tuples
Date: 2017-03-21 19:24:52
Message-ID: CAJ3Xv+gQwqnkdzA-h=uxWSbha_L76a_Ea9F=h2UBL2s37Gxtdg@mail.gmail.com
Lists: pgsql-performance

Hi pgsql-performance!

So I have a PostgreSQL database -- version "PostgreSQL 9.4.8 on
x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit",
specifically.

In it, I have essentially two categories of tables:
- small tables that are updated frequently and are often queried in their
entirety (seq scan)
- large tables that are updated infrequently and are typically queried via
an index

Let us assume that I have a table "A" that falls in the small category, and
a table "B" that falls in the large category.

The problem I'm having is that it is very difficult to do any amount of
maintenance on tables like B without effectively DoSing every query that
references table A. The reason, as far as I can tell, is that any
long-running statement against a table like B forces all dead row versions
produced by updates to table A to be retained until the statement on B
completes (as per the READ COMMITTED transaction isolation level -- the
statement against B must only see rows committed before it started, so
VACUUM cannot remove them). This makes sense -- it's required to preserve
ACID.
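
To illustrate (a sketch only; I'm assuming pg_stat_activity.backend_xmin on
9.4 is the right thing to look at), the backends pinning the xmin horizon --
and therefore forcing dead rows in A to be kept -- can be listed with
something like:

-- oldest snapshots first; these are what hold back vacuum on A
select pid, state, xact_start, backend_xmin, query
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc;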

However, there are times where I need to do large operations to B -- and
these operations can be literally anything, but I'll focus on my most
recent need: running a "pg_dump" against table B.
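
Concretely, the dump is nothing exotic -- a single-table dump along these
lines (the database and output file names here are just placeholders):

pg_dump --table=public.b database > b.sql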

I should add that table B is never involved in any query that touches
table A -- in this case, B is an append-only table that records changes to
a table C, which is likewise never involved with table A.

So, on to the data from which I base the above claims:

Let table A have 43 thousand rows:
database=> select count(*) from a;
-[ RECORD 1 ]
count | 43717
Time: 10447.681 ms

Let table B have 21 million rows:
meraki_shard_production=> select count(id) from b;
-[ RECORD 1 ]---
count | 21845610
Time: 116873.051 ms

Assume a pg_dump operation is copying table B, i.e. there's a currently
running query that looks like "COPY public.b (id, ...) TO STDOUT"

Then this is what I get for running a verbose vacuum against A:

database=> vacuum verbose a;
INFO: vacuuming "public.a"
INFO: index "a_pkey" now contains 2119583 row versions in 9424 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.04s/0.03u sec elapsed 0.49 sec.
INFO: "a": found 0 removable, 2112776 nonremovable row versions in 185345
out of 186312 pages
DETAIL: 2069676 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 1.28s/1.15u sec elapsed 22.93 sec.
INFO: vacuuming "pg_toast.pg_toast_18889"
INFO: index "pg_toast_18889_index" now contains 31 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_18889": found 0 removable, 31 nonremovable row versions in
7 out of 7 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 23035.282 ms

... and here's how long it takes to read all of the rows:
database=> select max(an unindexed bigint column) from a;
-[ RECORD 1 ]--------
max | <some number>
Time: 10624.368 ms

Running this another time immediately afterward (to show the cached speed)
returns:
Time: 13782.363 ms

If I go to a separate database cluster that has an equivalent schema, and
roughly equivalent table a (+- 2% on the number of rows), the above queries
look more like this:

meraki_shard_production=> vacuum verbose a;
INFO: vacuuming "public.a"
INFO: index "a_pkey" now contains 42171 row versions in 162 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "a": found 487 removable, 42286 nonremovable row versions in 7809
out of 7853 pages
DETAIL: 373 dead row versions cannot be removed yet.
There were 42436 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.02u sec elapsed 0.01 sec.
INFO: vacuuming "pg_toast.pg_toast_19037"
INFO: index "pg_toast_19037_index" now contains 57 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_19037": found 0 removable, 57 nonremovable row versions in
12 out of 12 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 32.890 ms

database=> select max(the same unindexed bigint column) from a;
max
-----------------
<some number>
(1 row)
Time: 16.696 ms
(The second iteration takes 15.320 ms)

So, the way I see it, my problem boils down to table "A" getting roughly
100-1000x slower when it gets roughly 20-50x bigger (depending on whether
you measure in pages or tuples). Unfortunately, in my use case, table "A"
acts as a join table for a lot of aspects of our company's webapp. Every 10
minutes, the table is queried for 35 million rows via sequential scan (~800
seq scans per 10 minutes, ~1.3 per second on average), and 6.5 million rows
via index lookup. When a sequential scan over 40k rows takes less than 1
second, everything is fine -- when it takes 10+ seconds, the database
starts to slow down significantly. Thankfully, queries can share sequential
scans, but you can imagine how the responsiveness of the webapp suffers as
a consequence. There's also a secondary effect: once the query on B finally
completes, there are now many queries against A (and other related tables)
running slowly enough to potentially bloat A even further. It is not
uncommon for queries involving A that normally complete in roughly 300 ms
to take upwards of 30 minutes after some maintenance query against B has
finished.
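
You can watch the dead tuples pile up while the long statement on B is
running with the standard statistics view (shown here just for
illustration):

select relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
from pg_stat_user_tables
where relname = 'a';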

Our go-to solution has been to detect and stop these maintenance queries if
they take too long, and then to CLUSTER table A. This puts a cap on how
long any maintenance query can take -- down to somewhere around 1 hour.
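
Roughly, that cleanup amounts to something like the following (a sketch
only -- the LIKE pattern and the one-hour threshold are illustrative, not
our actual watchdog code):

-- kill the offending maintenance statement on B
select pg_terminate_backend(pid)
from pg_stat_activity
where query like 'COPY public.b%'
  and now() - query_start > interval '1 hour';

-- then rewrite A to reclaim the dead space
cluster a using a_pkey;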

And thus my query to you guys:

What can I do to keep running long maintenance operations on large tables
(SELECTing significant fractions of B, DELETEing significant fractions of
B, running VACUUM FULL on B) without denying other PostgreSQL backends the
ability to efficiently query table A? Or, in other words, how do I avoid
incurring the cost of transaction isolation for queries against B on
a case-by-case basis?
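
(For example -- purely a sketch of the kind of thing I mean by
"case-by-case", with illustrative id ranges -- dumping B in short key-range
chunks would keep any single statement from pinning a snapshot for long, at
the cost of a consistent dump:

copy (select * from b where id >= 0 and id < 1000000) to stdout;
copy (select * from b where id >= 1000000 and id < 2000000) to stdout;
-- ...one short-lived statement per chunk

but that doesn't obviously generalize to DELETEs or VACUUM FULL.)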

Anything is on the table for implementation:
- moving tables to a different database / cluster / a completely different
DBMS
- designing an extension to tune either set of queries
- partitioning tables
- etc.
... although the simpler the better. If you were in this position, what
would you do?

Regards,
James
