From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MultiXact\SLRU buffers configuration
Date: 2020-11-02 12:45:33
Message-ID: 65C1B4BA-D16F-4939-978B-AC8F370F5A5E@yandex-team.ru
Lists: pgsql-hackers
> On 29 Oct 2020, at 18:49, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:
>>
>>
>>> On 29 Oct 2020, at 04:32, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>> It's not my intention to be mean or anything like that, but to me this
>>> means we don't really understand the problem we're trying to solve. Had
>>> we understood it, we should be able to construct a workload reproducing
>>> the issue ...
>>>
>>> I understand what the individual patches are doing, and maybe those
>>> changes are desirable in general. But without any benchmarks from a
>>> plausible workload I find it hard to convince myself that:
>>>
>>> (a) it actually will help with the issue you're observing on production
>>>
>>> and
>>> (b) it's actually worth the extra complexity (e.g. the lwlock changes)
>>>
>>>
>>> I'm willing to invest some of my time into reviewing/testing this, but I
>>> think we badly need better insight into the issue, so that we can build
>>> a workload reproducing it. Perhaps collecting some perf profiles and a
>>> sample of the queries might help, but I assume you already tried that.
>>
>> Thanks, Tomas! This totally makes sense.
>>
>> Indeed, collecting queries has not helped yet. We have a load-test environment equivalent to production (but with 10x fewer shards) and a copy of the production workload queries, but the problem does not manifest there.
>> Why do I think the problem is in MultiXacts?
>> Here is a chart with the number of wait events on each host:
>>
>> [chart image omitted here]
>>
>> During the problem, MultiXactOffsetControlLock and SLRURead dominate all other wait event types. After the primary was switched over to another node, SLRURead continued there for a while and then disappeared.
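
(For context, counts like those in the chart can be sampled from pg_stat_activity; the query below is only a sketch of that kind of monitoring, not the exact production query. With the wait event names shown above, MultiXactOffsetControlLock appears as an LWLock wait and SLRURead as an IO wait.)

-- Sketch: count active backends per wait event (names as in PostgreSQL 10-12).
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;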
>
> OK, so most of this seems to be due to SLRURead and
> MultiXactOffsetControlLock. Could it be that there were too many
> multixact members, triggering autovacuum to prevent multixact
> wraparound? That might generate a lot of IO on the SLRU. Are you
> monitoring the size of the pg_multixact directory?
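
(A rough sketch of the kind of check that question implies; mxid_age() and pg_database.datminmxid are standard catalog pieces, the exact layout here is only illustrative.)

-- Sketch: distance to multixact wraparound per database; autovacuum starts
-- anti-wraparound vacuums once this exceeds autovacuum_multixact_freeze_max_age.
SELECT datname, mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY multixact_age DESC;
-- The on-disk SLRU footprint itself can be watched from the shell, e.g.:
--   du -sh $PGDATA/pg_multixact/offsets $PGDATA/pg_multixact/members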
Yes, we had some problems with 'multixact "members" limit exceeded' a long time ago.
We tuned autovacuum_multixact_freeze_max_age = 200000000 and vacuum_multixact_freeze_table_age = 75000000 (half of the defaults; sketched below) and have not hit that problem since (~5 months).
But the MultiXactOffsetControlLock problem persists. It was partially mitigated by adding more shards, but when one of the shards runs into trouble, the cause is either MultiXacts or vacuum-induced relation truncation (unrelated to this thread).
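
For reference, that tuning expressed as ALTER SYSTEM commands (a sketch; note that autovacuum_multixact_freeze_max_age is a postmaster-level setting and only takes effect after a server restart):

-- Defaults are 400000000 and 150000000 respectively.
ALTER SYSTEM SET autovacuum_multixact_freeze_max_age = 200000000;  -- needs a restart
ALTER SYSTEM SET vacuum_multixact_freeze_table_age = 75000000;
SELECT pg_reload_conf();  -- applies the second setting; the first waits for restart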
Thanks!
Best regards, Andrey Borodin.