From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: XLog changes for 9.3 |
Date: | 2012-06-07 16:36:37 |
Message-ID: | 201206071836.37718.andres@2ndquadrant.com |
Lists: | pgsql-hackers |
On Thursday, June 07, 2012 05:35:11 PM Heikki Linnakangas wrote:
> On 07.06.2012 17:18, Andres Freund wrote:
> > On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
> >> 3. Move the only field, xl_rem_len, from the continuation record header
> >> straight to the xlog page header, eliminating XLogContRecord altogether.
> >> This makes it easier to calculate in advance how much space a WAL record
> >> requires, as it no longer depends on how many pages it has to be split
> >> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> >
> > +1. I don't think this will waste a measurable amount in real-world
> > scenarios. A very big percentage of pages have continuation records.
>
> Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
> 64-bit architectures) even when there is a continuation record, because
> of alignment:
>
> typedef struct XLogPageHeaderData
> {
> uint16 xlp_magic; /* magic value for correctness checks */
> uint16 xlp_info; /* flag bits, see below */
> TimeLineID xlp_tli; /* TimeLineID of first record on page */
> XLogRecPtr xlp_pageaddr; /* XLOG address of this page */
>
> + uint32 xlp_rem_len; /* bytes remaining of continued record */
> } XLogPageHeaderData;
>
> The page header is currently 16 bytes in length, so adding a 4-byte
> field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
> can well live with that.
At that point we can just do the
#define SizeofXLogPageHeaderData \
(offsetof(XLogPageHeaderData, xlp_rem_len) + sizeof(uint32))
dance. If the record can be smeared over two pages there is no point in
storing it aligned. Then we don't waste any additional space in comparison to
the current state.
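
A minimal sketch of that offsetof approach, to make the sizes concrete. It assumes XLogRecPtr is a plain 64-bit integer and uses stdint types as stand-ins for the PostgreSQL typedefs; the macro name SizeOfXLogPageHeaderUnpadded and the helper page_header_padding() are made up for illustration. The unpadded size is taken as the end of the last field, xlp_rem_len.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (assumption: XLogRecPtr is a plain 64-bit integer). */
typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

typedef struct XLogPageHeaderData
{
    uint16_t    xlp_magic;      /* magic value for correctness checks */
    uint16_t    xlp_info;       /* flag bits */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
    uint32_t    xlp_rem_len;    /* bytes remaining of continued record */
} XLogPageHeaderData;

/* Hypothetical name: header size up to the end of the last field. */
#define SizeOfXLogPageHeaderUnpadded \
    (offsetof(XLogPageHeaderData, xlp_rem_len) + sizeof(uint32_t))

/* Trailing padding that sizeof() includes but the macro above does not. */
static size_t
page_header_padding(void)
{
    assert(sizeof(XLogPageHeaderData) >= SizeOfXLogPageHeaderUnpadded);
    return sizeof(XLogPageHeaderData) - SizeOfXLogPageHeaderUnpadded;
}
```

With this layout the unpadded size is 20 bytes. Where a 64-bit integer needs 8-byte alignment, sizeof() rounds the struct up to 24, so page_header_padding() reports the 4 bytes under discussion.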
> > If we do that we can remove all the alignment padding as well. Which would
> > be a problem for you anyway, wouldn't it?
> It's not a problem. You just MAXALIGN the size of the record when you
> calculate how much space it needs, and then all records become naturally
> MAXALIGNed. We could quite easily remove the alignment on-disk if we
> wanted to, ReadRecord() already always copies the record to an aligned
> buffer, but I wasn't planning to do that.
What's the reasoning for having alignment on disk if the records aren't stored
contiguously?
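
For reference, the "MAXALIGN the size when reserving" idea from the quoted text can be sketched roughly as follows. This assumes an 8-byte maximum alignment and defines a simplified stand-in for the MAXALIGN macro; reserve_record_space() is a hypothetical helper, not something from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8-byte maximum alignment, as on common 64-bit platforms. */
#define MAXIMUM_ALIGNOF 8

/* Simplified stand-in for MAXALIGN: round len up to a multiple of 8. */
#define MAXALIGN(len) \
    (((uint64_t) (len) + (MAXIMUM_ALIGNOF - 1)) & \
     ~((uint64_t) (MAXIMUM_ALIGNOF - 1)))

/*
 * Hypothetical helper: reserve space for a record of rec_len bytes.
 * Returns the start position and advances *bytepos by the aligned size,
 * so an aligned start position always leaves an aligned end position.
 */
static uint64_t
reserve_record_space(uint64_t *bytepos, uint32_t rec_len)
{
    uint64_t    start = *bytepos;

    assert(start % MAXIMUM_ALIGNOF == 0);
    *bytepos += MAXALIGN(rec_len);
    return start;
}
```

Starting from position 0, reserving a 13-byte record advances the position to 16, so the next record again starts on an 8-byte boundary.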
> >> These changes will help the XLogInsert scaling patch, by making the
> >> space calculations simpler. In essence, to reserve space for a WAL
> >> record of size X, you just need to do "bytepos += X". There are a lot
> >> more details to that, like mapping from the contiguous byte position
> >> to an XLogRecPtr that takes page headers into account, and noticing
> >> RedoRecPtr changes safely, but it's a start.
> >
> > Hm. Wouldn't you need to remove short/long page headers for that as well?
>
> No, those are ok because they're predictable.
I haven't read your scalability patch, so I am not really sure what you
need...
The "bytepos += X" from above isn't as easy that way. But yes, it's not that
complicated.
> Although it would make the
> mapping simpler. To convert from a contiguous xlog byte position that
> excludes all headers, to XLogRecPtr, you need to do something like this
> (I just made this up, probably has bugs, but it's about this complex):
>
> #define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
> #define UsableBytesInSegment ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * \
> UsableBytesInPage - (SizeOfXLogLongPHD - SizeOfXLogShortPHD))
>
> uint64 xlogrecptr;
> uint64 full_segments = bytepos / UsableBytesInSegment;
> int offset_in_segment = bytepos % UsableBytesInSegment;
>
> xlogrecptr = full_segments * XLOG_SEG_SIZE;
> /* is it on the first page? */
> if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
> xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
> else
> {
> /* first page is fully used */
> xlogrecptr += XLOG_BLCKSZ;
> /* add other full pages */
> offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
> xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
> /* and finally offset within the last page */
> xlogrecptr += SizeOfXLogShortPHD + offset_in_segment % UsableBytesInPage;
> }
> /* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
> XLogRecPtr.xlogid = xlogrecptr >> 32;
> XLogRecPtr.xrecoff = xlogrecptr & 0xffffffff;
It's a bit more complicated than that: records can span a good bit more than
just two pages (even more than two segments), and you need to decide for each
of those pages whether it has a long or a short header.
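
A compilable sketch of the byte-position mapping quoted above, for a single position (start or end of a record). The block size, segment size and both header sizes are assumed values chosen for illustration, and bytepos_to_recptr() is a made-up name. The result is kept as a plain 64-bit integer rather than being split into xlogid/xrecoff, and the sketch accounts for the short header of the final page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; real ones come from the build. */
#define XLOG_BLCKSZ         8192u
#define XLOG_SEG_SIZE       (16u * 1024u * 1024u)
#define SizeOfXLogShortPHD  24u     /* assumed short page header size */
#define SizeOfXLogLongPHD   40u     /* assumed long page header size */

#define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/* Map a header-free byte position to a 64-bit WAL address. */
static uint64_t
bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t    full_segments = bytepos / UsableBytesInSegment;
    uint64_t    left = bytepos % UsableBytesInSegment;
    uint64_t    recptr = full_segments * XLOG_SEG_SIZE;

    if (left < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* on the first page of the segment, after the long header */
        recptr += SizeOfXLogLongPHD + left;
    }
    else
    {
        /* first page is fully used */
        recptr += XLOG_BLCKSZ;
        left -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        /* skip the other full pages */
        recptr += (left / UsableBytesInPage) * XLOG_BLCKSZ;
        /* short header of the last page, then the offset within it */
        recptr += SizeOfXLogShortPHD + left % UsableBytesInPage;
    }

    /* the result must never point into a page header */
    assert(recptr % XLOG_BLCKSZ >= SizeOfXLogShortPHD);
    return recptr;
}
```

Because the byte position excludes all headers, a record's end can be found by mapping "start + length" the same way. Only the first page of a segment takes the long-header branch, so the long/short decision is made once per position.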
> Encapsulated in a function, that's not too bad. But if we want to make
> that simpler, one idea would be to allocate the whole 1st page in each
> WAL segment for metadata. That way all the actual xlog pages would hold
> the same amount of xlog data.
It's a bit easier then, but you probably still need to loop over the size and
subtract until you reach the final point. It's no problem to produce a 100MB
WAL record. But then that's probably nothing to design for.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services