From: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Speedup twophase transactions
Date: 2017-01-25 14:55:31
Message-ID: CAMGcDxeoJp1S3KnwDnMJNT6sSGE6bwAiN33mpOsX3MOX2CAw=A@mail.gmail.com
Lists: pgsql-hackers
> We are talking about the recovery/promote code path. Specifically this
> call to KnownPreparedRecreateFiles() in PrescanPreparedTransactions().
>
> We write the files to disk and they immediately get read back in the
> following code. We could instead not write the files to disk at all and
> read KnownPreparedList in the code path that follows, as well as elsewhere.
Thinking more on this.
The only optimization really remaining is the handling of prepared
transactions that have not yet been committed or that will linger around
for a long time. The short-lived 2PC transactions have already been
optimized by this patch.
The question remains whether saving a few fsyncs/reads for these
long-lived prepared transactions is worth the additional code churn.
Even if we add code to go through KnownPreparedList, we will still
have to go through the other on-disk 2PC transactions anyway. So,
maybe not.
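
To make that trade-off concrete, here is a minimal standalone model of
the idea (only KnownPreparedList is a name taken from the patch; the
entry layout and helper names below are invented for illustration):
2PC state for transactions prepared during replay is kept in memory,
entries disappear again when the matching COMMIT/ABORT PREPARED record
is replayed, and only the survivors are materialized as state files at
the end.

/*
 * Standalone model (not the actual patch) of keeping 2PC state for
 * recently prepared transactions in memory during recovery instead of
 * writing and fsyncing a state file for each of them.  Only
 * KnownPreparedList is a name taken from the patch; everything else
 * here is invented for this sketch.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct KnownPreparedEntry
{
    uint32_t    xid;            /* id of the prepared transaction */
    struct KnownPreparedEntry *next;
} KnownPreparedEntry;

static KnownPreparedEntry *KnownPreparedList = NULL;

/* Replay of a PREPARE record: remember it in memory, no file, no fsync. */
static void
replay_prepare(uint32_t xid)
{
    KnownPreparedEntry *e = malloc(sizeof(KnownPreparedEntry));

    e->xid = xid;
    e->next = KnownPreparedList;
    KnownPreparedList = e;
}

/* Replay of COMMIT/ABORT PREPARED: forget the in-memory entry again. */
static void
replay_finish(uint32_t xid)
{
    KnownPreparedEntry **p = &KnownPreparedList;

    while (*p != NULL)
    {
        if ((*p)->xid == xid)
        {
            KnownPreparedEntry *dead = *p;

            *p = dead->next;
            free(dead);
            return;
        }
        p = &(*p)->next;
    }
}

/*
 * At the end of recovery (or at a checkpoint) only the transactions
 * still in the list need a state file on disk.
 */
static void
recreate_files_for_survivors(void)
{
    KnownPreparedEntry *e;

    for (e = KnownPreparedList; e != NULL; e = e->next)
        printf("would write and fsync a 2PC state file for xid %u\n",
               (unsigned) e->xid);
}

int
main(void)
{
    replay_prepare(100);        /* short-lived: prepared ... */
    replay_finish(100);         /* ... and committed during replay */
    replay_prepare(101);        /* long-lived: still open at the end */
    recreate_files_for_survivors();     /* only xid 101 touches the disk */
    return 0;
}

The long-lived survivors are exactly the case discussed above: they end
up on disk either way, so the extra bookkeeping mainly pays off for the
short-lived ones.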
Regards,
Nikhils
>
> Regards,
> Nikhils
>
>
>>> The difference between those two is likely noise.
>>>
>>> By the way, in those measurements the OS cache is still filled with
>>> the past WAL segments, which is rather a best case, no? What happens
>>> if you do the same kind of test on a box where memory is busy doing
>>> something else and replayed WAL segments get evicted from the OS cache
>>> more aggressively once the startup process switches to a new segment?
>>> This could be tested, for example, on a VM with little memory (say 386MB
>>> or less), so that the startup process needs to access the past WAL
>>> segments again to recover the 2PC information and has to get them back
>>> directly from disk... One trick you could use here would be to
>>> tweak the startup process so that it drops the OS cache once a segment
>>> has finished replaying, and then see the effect of aggressive OS cache
>>> eviction. This patch shows really nice improvements with the OS cache
>>> backing up the data, but it would still make sense to test things with
>>> a worse case and see if things could be done better. The startup
>>> process currently only reads records sequentially, not randomly; random
>>> reads are a concept that this patch introduces.
>>>
>>> Anyway, perhaps this does not matter much: the non-recovery code path
>>> does the same thing as this patch, and the improvement is too large to
>>> be ignored. So for consistency's sake we could go with the proposed
>>> approach, which has the advantage of not putting any restriction on the
>>> size of a 2PC file, contrary to what an implementation that keeps the
>>> contents of the 2PC files in memory would need to do.
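
As a side note, the per-segment cache-drop tweak suggested above can be
approximated with posix_fadvise(POSIX_FADV_DONTNEED). The sketch below
is only test-harness code, not part of the patch, and the segment path
is just an example:

/*
 * Rough sketch of the cache-eviction experiment suggested above: once the
 * startup process has finished replaying a WAL segment, advise the kernel
 * that its pages are no longer needed so they get dropped from the OS
 * cache.  The segment path used in main() is only an example.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
drop_segment_from_cache(const char *path)
{
    int     fd = open(path, O_RDONLY);
    int     rc;

    if (fd < 0)
    {
        perror(path);
        return;
    }

    /* offset 0 and length 0 cover the whole file */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise(%s): %s\n", path, strerror(rc));

    close(fd);
}

int
main(void)
{
    /* would be called for each segment the startup process is done with */
    drop_segment_from_cache("pg_wal/000000010000000000000047");
    return 0;
}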
>>
>> Maybe I’m missing something, but I don’t see how the OS cache can affect anything here.
>>
>> Total WAL size was 0x44 * 16 = 1088 MB, and recovery time is about 20s. Sequentially reading 1GB of data
>> is an order of magnitude faster even on an old HDD, not to mention an SSD. Also, you can take a look at the flame graphs
>> attached to the previous message: the majority of time during recovery is spent in pg_qsort while replaying
>> PageRepairFragmentation, while the whole of xact_redo_commit() takes about 1% of the time. That amount can
>> grow in the case of uncached disk reads, but taking the total recovery time into account this should not matter much.
>>
>> If you are talking about uncached access only during checkpoints, then here we are restricted by
>> max_prepared_transactions, so at most we will read about a hundred small files (usually fitting into one filesystem page), which will also
>> be barely noticeable compared to the recovery time between checkpoints. Also, WAL segment cache eviction during
>> replay doesn’t seem like a standard scenario to me.
>>
>> Anyway, I took a machine with an HDD to slow down read speeds and ran the tests again. During one of the runs I
>> launched, in parallel, a bash loop that dropped the OS cache each second (while replaying one WAL segment
>> also takes about one second).
>>
>> 1.5M transactions
>> start segment: 0x06
>> last segment: 0x47
>>
>> patched, with constant cache_drop:
>> total recovery time: 86s
>>
>> patched, without constant cache_drop:
>> total recovery time: 68s
>>
>> (while the difference is significant, I bet it happens mostly because database file segments have to be re-read after each cache drop)
>>
>> master, without constant cache_drop:
>> time to recover 35 segments: 2h 25m (after that I got tired of waiting)
>> expected total recovery time: 4.5 hours
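
For reference, the cache-drop loop described above boils down to writing
to /proc/sys/vm/drop_caches once per second. A rough C rendering of that
bash loop (Linux-only, needs root) would be:

/*
 * C equivalent of the cache-dropping loop described above: flush dirty
 * pages, ask the Linux kernel to drop the page cache, and repeat once per
 * second until interrupted.
 */
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    for (;;)
    {
        FILE   *f;

        sync();                 /* write back dirty pages first */

        f = fopen("/proc/sys/vm/drop_caches", "w");
        if (f != NULL)
        {
            fputs("3\n", f);    /* 3 = page cache plus dentries and inodes */
            fclose(f);
        }

        sleep(1);               /* roughly the time to replay one segment */
    }
}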
>>
>> --
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>>
>>
>
>
>
> --
> Nikhil Sontakke http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services