From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Bruce Momjian <bruce(at)momjian(dot)us> |
Cc: | Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Tom Kincaid <tomjohnkincaid(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> |
Subject: | Re: storing an explicit nonce |
Date: | 2021-05-27 16:03:00 |
Message-ID: | CA+Tgmobg+1Gypkyb8FbEhzt9Ve-4QF=HqrWWUN-eP2=Rqq_hdQ@mail.gmail.com |
Lists: | pgsql-hackers |

On Thu, May 27, 2021 at 11:19 AM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Thu, May 27, 2021 at 10:47:13AM -0400, Robert Haas wrote:
> > On Wed, May 26, 2021 at 4:40 PM Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > > You are saying that by using a non-LSN nonce, you can write out the page
> > > with a new nonce, but the same LSN, and also discard the page during
> > > crash recovery and use the WAL copy?
> >
> > I don't know what "discard the page during crash recovery and use the
> > WAL copy" means.
>
> I was asking how decoupling the nonce from the LSN allows for us to
> avoid full page writes for hint bit changes. I am guessing you are
> saying that on recovery, if we see a hint-bit-only change in the WAL
> (with a new nonce), we just throw away the page because it could be torn
> and use the WAL full page write version.

Well, in the design where the nonce is stored in the page, there is no
need for every hint-type change to appear in the WAL at all. Once per
checkpoint cycle, you need to write a full page image, as we do for
checksums or wal_log_hints. The rest of the time, you can just bump
the nonce and rewrite the page, same as we do today.
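
A minimal sketch of the write path described above, in Python rather than PostgreSQL's C. It is illustrative only: `Page`, `write_hinted_page`, `last_fpi_checkpoint`, and the list standing in for the WAL are hypothetical names, and it assumes the nonce lives in the page and is bumped on every write.

```python
from dataclasses import dataclass


@dataclass
class Page:
    nonce: int = 0                 # hypothetical nonce stored in the page itself
    last_fpi_checkpoint: int = -1  # checkpoint cycle that last logged a full page image


def write_hinted_page(page: Page, checkpoint_no: int, wal: list) -> None:
    """Write out a page whose only change since the last write is a hint bit."""
    if page.last_fpi_checkpoint != checkpoint_no:
        # First write in this checkpoint cycle: log one full page image,
        # as is done for checksums or wal_log_hints.
        wal.append(("FPI", checkpoint_no))
        page.last_fpi_checkpoint = checkpoint_no
    # Every write bumps the nonce; no WAL record is needed for that.
    page.nonce += 1


wal = []
page = Page()
for _ in range(100):             # 100 hint-only writes within checkpoint cycle 0
    write_hinted_page(page, 0, wal)
write_hinted_page(page, 1, wal)  # first write after the next checkpoint
```

In this model the page is written 101 times, but only the first write in each checkpoint cycle adds anything to the WAL.
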
> Yes, it might be 1e100+++ more expensive too, but we don't know, and I
> am not ready to add a lot of complexity for such an unknown.

No, it can't be 1e100+++ more expensive, because it's not
realistically possible for a page to be written to disk 1e100+++ times
per checkpoint cycle. It is however entirely possible for it to be
written 100 times per checkpoint cycle. That is not something unknown
about which we need to speculate; it is easy to see that this can
happen, even on a simple test like pgbench with a data set larger than
shared buffers.
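
A rough worked example of that cost, as a sketch under stated assumptions: the default 8 kB block size, 100 writes of one page per checkpoint cycle, and an alternative design that logs a full page image on every such write. WAL record headers and compression are ignored, and `fpi_bytes` is a hypothetical helper, not a PostgreSQL function.

```python
BLOCK_SIZE = 8192  # PostgreSQL default block size in bytes


def fpi_bytes(writes_per_cycle: int, fpi_on_every_write: bool) -> int:
    """Bytes of full page images logged for one page in one checkpoint cycle."""
    if fpi_on_every_write:
        images = writes_per_cycle
    else:
        # At most one full page image per checkpoint cycle.
        images = min(writes_per_cycle, 1)
    return images * BLOCK_SIZE


every_write = fpi_bytes(100, fpi_on_every_write=True)
once_per_cycle = fpi_bytes(100, fpi_on_every_write=False)
ratio = every_write // once_per_cycle
```

Under these assumptions the per-page WAL volume differs by a factor equal to the number of writes per cycle, which is why the cost is workload-dependent rather than unknown.
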
It is not right to confuse "we have no idea whether this will be
expensive" with "how expensive this will be is workload-dependent,"
which is what you seem to be doing here. If we had no idea whether
something would be expensive, then I agree that it might not be worth
adding complexity for it, or maybe some testing should be done first
to find out. But if we know for certain that in some workloads
something can be very expensive, then we had better at least talk
about whether it is worth adding complexity in order to resolve the
problem. And that is the situation here.

I am not even convinced that storing the nonce in the block is going
to be more complex, because it seems to me that the patches I posted
upthread worked out pretty cleanly. There are some things to discuss
and think about there, for sure, but it is not like we are talking
about inventing warp drive.

--
Robert Haas
EDB: http://www.enterprisedb.com