Re: Degraded performance during table rewrite

From: Mohamed Wael Khobalatte <mkhobalatte(at)grubhub(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Degraded performance during table rewrite
Date: 2020-07-04 02:26:42
Message-ID: CABZeWdy663x5pdCDGbgc3XNSRreP7hMuGkiW99Li9J+boK2cEQ@mail.gmail.com
Lists: pgsql-general

On Fri, Jul 3, 2020 at 10:16 PM Mohamed Wael Khobalatte <mkhobalatte(at)grubhub(dot)com> wrote:

>
> On Fri, Jul 3, 2020 at 5:26 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Mohamed Wael Khobalatte <mkhobalatte(at)grubhub(dot)com> writes:
>> > ... the migration itself runs as follows (each in a transaction, looping
>> > through records and sleeping for a bit)
>>
>> > WITH del AS (
>> >   DELETE FROM #{old_table}
>> >   WHERE id IN (
>> >     SELECT id
>> >     FROM #{old_table}
>> >     WHERE id > #{max_deleted_id} -- This is the max deleted from the previous batch, we grab it programmatically.
>> >     ORDER BY id ASC
>> >     LIMIT #{batch_size}
>> >   )
>> >   RETURNING *
>> > )
>> > INSERT INTO #{table}
>> > SELECT * FROM del
>> > RETURNING id
>>
>> > This spends 150ms per batch, which climbs to 700ms per batch. A vacuum
>> > of the old table lowers it back to 150ms, but I don't understand why,
>> > because we structure the query to jump over all previously dead rows.
>> > There is an old thread in which Tom Lane mentions that the planner might
>> > itself be walking that primary index. Is this applicable here? And is
>> > there anything we can do besides more aggressive and continued vacuuming
>> > of the old table (or a change in autovacuum settings)? Ideally, we want
>> > to run this overnight without much supervision.
>>
>> Yeah, given that the slowdown seems to be in the planner, and given your
>> further observation that v12 is better, I'd say that this is an issue
>> with get_actual_variable_range. That's going to be invoked to try to
>> determine the selectivity of the "WHERE id > #{max_deleted_id}" clause,
>> if the constant is past the last value in the histogram for the id
>> column.
>>
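(For anyone else hitting this: a quick way to check whether a batch constant
has crossed the histogram endpoint is to look at pg_stats directly. This is
just a sketch; the table and column names are the ones from our example.)

  -- Last histogram bound the planner has for old_table.id. If max_deleted_id
  -- is greater than the final entry here, the planner resorts to probing the
  -- index itself via get_actual_variable_range().
  SELECT histogram_bounds
  FROM pg_stats
  WHERE tablename = 'old_table'
    AND attname = 'id';
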
>> The improvement you see in v12 actually came in in v11, and I think
>> I'll just quote the commit log:
>>
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Branch: master Release: REL_11_BR [3ca930fc3] 2017-09-07 19:41:51 -0400
>>
>> Improve performance of get_actual_variable_range with recently-dead tuples.
>>
>> In commit fccebe421, we hacked get_actual_variable_range() to scan the
>> index with SnapshotDirty, so that if there are many uncommitted tuples
>> at the end of the index range, it wouldn't laboriously scan through all
>> of them looking for a live value to return. However, that didn't fix it
>> for the case of many recently-dead tuples at the end of the index;
>> SnapshotDirty recognizes those as committed dead and so we're back to
>> the same problem.
>>
>> To improve the situation, invent a "SnapshotNonVacuumable" snapshot type
>> and use that instead. The reason this helps is that, if the snapshot
>> rejects a given index entry, we know that the indexscan will mark that
>> index entry as killed. This means the next get_actual_variable_range()
>> scan will proceed past that entry without visiting the heap, making the
>> scan a lot faster. We may end up accepting a recently-dead tuple as
>> being the estimated extremal value, but that doesn't seem much worse than
>> the compromise we made before to accept not-yet-committed extremal values.
>>
>> The cost of the scan is still proportional to the number of dead index
>> entries at the end of the range, so in the interval after a mass delete
>> but before VACUUM's cleaned up the mess, it's still possible for
>> get_actual_variable_range() to take a noticeable amount of time, if you've
>> got enough such dead entries. But the constant factor is much much better
>> than before, since all we need to do with each index entry is test its
>> "killed" bit.
>>
>> We chose to back-patch commit fccebe421 at the time, but I'm hesitant to
>> do so here, because this form of the problem seems to affect many fewer
>> people. Also, even when it happens, it's less bad than the case fixed
>> by commit fccebe421 because we don't get the contention effects from
>> expensive TransactionIdIsInProgress tests.
>>
>> Dmitriy Sarafannikov, reviewed by Andrey Borodin
>>
>> Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru
>>
>>
>> There are a number of possibilities for working around this in your
>> particular situation, short of an upgrade to v11+. You could try doing a
>> manual VACUUM between deletion steps, but that could fail to fix it if
>> anything else is running concurrently (because the VACUUM might not think
>> it's safe to recycle the recently-dead tuples yet). I think possibly
>> a better approach is to try to avoid the situation wherein estimating
>> "WHERE id > #{max_deleted_id}" requires determining the table's true
>> endpoint id value. For that, the last id value seen in the pg_stats
>> histogram for the id column has to be greater than the max_deleted_id
>> value. So you might find that increasing the deletion batch size
>> (thereby reducing max_deleted_id) does the trick; or you could increase
>> the statistics target for that column, making the histogram larger and
>> hence (probably) making its endpoint higher.
>>
>> regards, tom lane
>>
>
> Hi Tom, thanks for your response.
>
> I did increase the target to 10_000 in my local testing, and that didn't
> do the trick; the time per batch still increases. A regular vacuum analyze
> does bring it down though, as we established. The issue is that some of
> these tables are very large (north of 400GB for some), so such vacuums
> will take a while, and I'm not sure the gain is worth running them
> frequently only to see the problem come back again (and to make matters
> worse, there *will* be concurrent activity, making the vacuums less
> likely to do the job).
>
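For completeness, what I tried locally was roughly the following (identifiers
are placeholders for our real tables), combining the larger statistics target
with an occasional manual vacuum:

  -- Enlarge the id histogram so its endpoint (hopefully) stays above
  -- max_deleted_id, then refresh the stats for that column.
  ALTER TABLE old_table ALTER COLUMN id SET STATISTICS 10000;
  ANALYZE old_table (id);

  -- The manual cleanup that brings the per-batch time back to ~150ms,
  -- at the cost of a long run on a 400GB table.
  VACUUM ANALYZE old_table;
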
> One curious thing I noticed is that in my testing with autovacuum
> disabled, the batch time sometimes comes back to 150ms on its own, so
> something must be making the planner's life easier, but I can't tell what.
>
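(The only thing I have thought to watch while that happens is the stats view,
to see whether anything vacuumed or analyzed the table behind my back; a
sketch below, with a placeholder table name.)

  -- Did anything clean up old_table while autovacuum was supposedly off?
  SELECT relname, n_dead_tup, last_vacuum, last_autovacuum, last_autoanalyze
  FROM pg_stat_user_tables
  WHERE relname = 'old_table';
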
> Do you happen to know if there is an upper limit to how much time the
> planner is willing to spend on this? Since I've seen it climb to four
> seconds, I suppose not, but I am not sure. It would help us estimate the
> timing of these runs better.
>
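
In case it is useful, the way I have been watching the planner overhead per
batch is roughly this (a sketch; the constant and limit stand in for the real
max_deleted_id and batch_size):

  -- The planning-time line of the output is where the index probe from
  -- get_actual_variable_range() shows up; it climbs as dead index entries
  -- pile up and drops again after a vacuum.
  EXPLAIN (ANALYZE, TIMING OFF)
  SELECT id
  FROM old_table
  WHERE id > 123456   -- stand-in for max_deleted_id
  ORDER BY id ASC
  LIMIT 1000;         -- stand-in for batch_size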

Forgot to address one data point from your response, Tom: increasing the
batch size. That will depend on the size of the tuples. We might be able to
afford it for some tables, but the wider the records, the more trouble we've
seen with WAL build-up and replication or HA lag. The plan is to find a
sweet spot, but the issue with the planner is trickier.
