Re: [HACKERS] pg_upgrade to clusters with a different WAL segment size

From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Bossart, Nathan" <bossartn(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] pg_upgrade to clusters with a different WAL segment size
Date: 2017-11-13 21:00:52
Message-ID: CA+fnDAb+8wfxT6zU_pVQCi1TXxJvfFGqZVsut6aZ3R4+T=sfnA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 10, 2017 at 4:04 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Sat, Nov 11, 2017 at 12:46 AM, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
>> Allowing changes to the WAL segment size during pg_upgrade seems like a
>> nice way to avoid needing a dump and load, so I would like to propose
>> adding support for this. I'd be happy to submit patches for this in the
>> next commitfest.
>
> That's a worthy goal.

I'm also interested in this item and I helped Nathan with a little of
the initial testing. Also, having changed redo log sizes on other
database platforms a couple of times (a simple & safe runtime operation
there), it seems to me that a feature like this would benefit
PostgreSQL.

I would add that we increased the max segment size in pg10 - but the
handful of users who are in the most pain with very high activity
rates on running systems are still limited to logical upgrades or
dump-and-load to get the benefit of larger WAL segment sizes. From a
technical perspective, it doesn't seem like it should be too
complicated to implement this in pg_upgrade since you're moving into a
new cluster anyway.

On Fri, Nov 10, 2017 at 7:46 AM, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> We've had success with our initial testing of upgrades to larger WAL
> segment sizes, including post-upgrade pgbench runs.

Just to fill this out a little: our very first test was to take a
9.6.5 16MB-WAL post-pgbench db and pg_upgrade it to 10.0 128MB-WAL
with no changes except removing the WAL segment size check from
check_control_data(), then doing more pgbench runs on the same db
post-upgrade. We checked for errors or problematic variation in TPS.
More of a smoke test than a thorough test, but everything looks good
so far.

On Fri, Nov 10, 2017 at 7:46 AM, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> Beyond adjusting
> check_control_data(), it looks like the 'pg_resetwal -l' call in
> copy_xact_xlog_xid() may need to be adjusted to ensure that the WAL
> starting address is set to a valid value.

This was related to one interesting quirk we observed. pg_upgrade
tried to call pg_resetwal on the *new* database with a WAL starting
filename that assumes the *old* WAL segment size. In our test, it called
"pg_resetwal -l 000000010000000200000071", which is an invalid filename
with 128MB WAL segments: only 32 (0x20) segments fit in each 4GB
middle-field range, so the last field can't exceed 0x1F. In order to
get a sensible filename, PostgreSQL took the 0x71 (113 = 3 * 32 + 17),
wrapped three times, and carried the 3 into the middle field to get a
new WAL filename of "000000010000000500000011".
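
The wrap described above can be reproduced with a few lines of
arithmetic. This is only a sketch: normalize_wal_position() is a
hypothetical helper, not a PostgreSQL function, and it assumes the
middle filename field covers 4GB of WAL, as XLogSegmentsPerXLogId()
does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of PostgreSQL): fold an out-of-range
 * last field back into range by carrying whole 4GB units into the
 * middle ("log") field.
 */
static void
normalize_wal_position(uint32_t *log, uint32_t *seg, uint64_t wal_segsz_bytes)
{
    uint64_t segs_per_log;

    assert(wal_segsz_bytes > 0);
    segs_per_log = UINT64_C(0x100000000) / wal_segsz_bytes;

    *log += (uint32_t) (*seg / segs_per_log);   /* carry */
    *seg = (uint32_t) (*seg % segs_per_log);    /* remainder */
}
```

Starting from log = 2, seg = 0x71 with 128MB segments, this carries 3
into the middle field and leaves 0x11 in the last field, matching the
filename we saw.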

This actually raises a really interesting concern with pg_upgrade and
different WAL segment sizes. We have WAL filenames and then we have
XLogSegNo. If pg_upgrade just chooses the next valid filename, then
XLogSegNo will decrease and overlap when the WAL segment size goes up.
If pg_upgrade calculates the next XLogSegNo then the WAL segment
filename will decrease and overlap when the WAL segment size goes
down.

from xlog_internal.h:
#define XLogFileName(fname, tli, logSegNo, wal_segsz_bytes) \
    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, \
             (uint32) ((logSegNo) / XLogSegmentsPerXLogId(wal_segsz_bytes)), \
             (uint32) ((logSegNo) % XLogSegmentsPerXLogId(wal_segsz_bytes)))

...

#define XLogFromFileName(fname, tli, logSegNo, wal_segsz_bytes) \
    do { \
        uint32 log; \
        uint32 seg; \
        sscanf(fname, "%08X%08X%08X", tli, &log, &seg); \
        *logSegNo = (uint64) log * XLogSegmentsPerXLogId(wal_segsz_bytes) + seg; \
    } while (0)
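
To make the two overlap cases concrete, here is a standalone sketch of
the same conversions as plain functions. The names segs_per_xlogid(),
wal_name_to_segno() and segno_to_wal_name() are made up for this
example; only the arithmetic is taken from the macros above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_NAME_LEN 24     /* three fields of 8 hex digits */

/* Segments per 4GB "xlogid", same arithmetic as XLogSegmentsPerXLogId(). */
static uint64_t
segs_per_xlogid(uint64_t wal_segsz_bytes)
{
    assert(wal_segsz_bytes > 0);
    return UINT64_C(0x100000000) / wal_segsz_bytes;
}

/* Function form of XLogFromFileName(); returns 0 on success, -1 on a bad name. */
static int
wal_name_to_segno(const char *fname, uint64_t wal_segsz_bytes,
                  uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;

    if (strlen(fname) != WAL_NAME_LEN ||
        sscanf(fname, "%08X%08X%08X", &t, &log, &seg) != 3)
        return -1;
    *tli = t;
    *segno = (uint64_t) log * segs_per_xlogid(wal_segsz_bytes) + seg;
    return 0;
}

/* Function form of XLogFileName(); buf needs WAL_NAME_LEN + 1 bytes. */
static void
segno_to_wal_name(char *buf, size_t buflen, uint32_t tli,
                  uint64_t segno, uint64_t wal_segsz_bytes)
{
    uint64_t per = segs_per_xlogid(wal_segsz_bytes);

    snprintf(buf, buflen, "%08X%08X%08X", (unsigned int) tli,
             (unsigned int) (segno / per), (unsigned int) (segno % per));
}
```

With these, "000000010000000200000071" is segment 625 at 16MB, while
"000000010000000500000011" is only segment 177 at 128MB, so XLogSegNo
goes backwards when the size goes up. In the other direction, segment
178 at 16MB is "0000000100000000000000B2", which sorts before
"000000010000000500000011".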

If there's an archive_command script that simply copies WAL files
somewhere then it might overwrite old logs when filenames overlap. I
haven't surveyed all the postgres backup tools & scripts out there but
it also seems conceivable that some tools will do the equivalent of
XLogFromFileName() so that they can be aware of whether there are missing logs
in a recovery scenario. Those tools could conceivably get broken by
an overlapping/decremented XLogSegNo.
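
As an illustration of the kind of check such a tool might do, here is
a sketch of a continuity test over a sorted archive listing. The
function first_wal_gap() is hypothetical and is not taken from any
existing backup tool; it assumes all files are on one timeline and
that the tool applies a single segment size to every name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical check: names[] is sorted ascending.  Returns the index of
 * the first file whose segment number is not exactly previous + 1 (or
 * whose name cannot be parsed), or -1 if the sequence is continuous.
 */
static long
first_wal_gap(const char *const names[], size_t n, uint64_t wal_segsz_bytes)
{
    uint64_t per, prev = 0;
    size_t i;

    assert(wal_segsz_bytes > 0);
    per = UINT64_C(0x100000000) / wal_segsz_bytes;

    for (i = 0; i < n; i++)
    {
        unsigned int tli, log, seg;
        uint64_t segno;

        if (sscanf(names[i], "%08X%08X%08X", &tli, &log, &seg) != 3)
            return (long) i;
        segno = (uint64_t) log * per + seg;
        if (i > 0 && segno != prev + 1)
            return (long) i;
        prev = segno;
    }
    return -1;
}
```

An archive holding ...0200000070 and ...0200000071 from the old
cluster followed by 000000010000000500000011 from the new one fails
this check at the third file, whether the tool assumes 16MB or 128MB
segments.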

I haven't fully thought through replication to consider whether
anything could break there, but that's another open question.

There are a few different approaches that could be taken to determine
the next WAL sequence number:
1) simplest: increment the filename's middle field by 1, zero out the
   right field. no filename overlap, don't need to know WAL segment
   size. has XLogSegNo overlap.
2) use the next valid WAL filename with segment size awareness. no
   filename overlap, has XLogSegNo overlap.
3) translate old DB filename to XLogSegNo, XLogSegNo++, translate to
   new DB filename. no XLogSegNo overlap, has filename overlap.
4) most complex: XLogSegNo++, translate to new DB filename, then
   increase filename until it's greater than last used filename in old
   db. Always has gaps, never overlaps.
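
One way option 4 could be computed is sketched below. This is not
what pg_upgrade does today; next_segno_no_overlap() is a hypothetical
name, and the sketch ignores timelines. It relies on the fact that,
within the new cluster, filename order and XLogSegNo order agree, so
taking the larger of the "next segment number" and "next valid
filename" candidates satisfies both constraints without a loop.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of option 4.  old_log/old_seg are the middle and
 * last fields of the last WAL filename used by the old cluster.  Returns
 * the XLogSegNo the new cluster should start at.
 */
static uint64_t
next_segno_no_overlap(uint32_t old_log, uint32_t old_seg,
                      uint64_t old_segsz, uint64_t new_segsz)
{
    uint64_t old_per, new_per, by_segno, by_name;

    assert(old_segsz > 0 && new_segsz > 0);
    old_per = UINT64_C(0x100000000) / old_segsz;
    new_per = UINT64_C(0x100000000) / new_segsz;

    /* As in option 3: continue the old cluster's XLogSegNo sequence. */
    by_segno = (uint64_t) old_log * old_per + old_seg + 1;

    /* In the spirit of option 2: first valid new name sorting after the old one. */
    if ((uint64_t) old_seg + 1 < new_per)
        by_name = (uint64_t) old_log * new_per + old_seg + 1;
    else
        by_name = ((uint64_t) old_log + 1) * new_per;

    /* Option 4: the larger candidate avoids both kinds of overlap. */
    return by_segno > by_name ? by_segno : by_name;
}
```

Going from 16MB to 128MB after 000000010000000200000071, this gives
segment 626, i.e. 000000010000001300000012. Going from 128MB to 16MB
after 000000010000000500000011, it gives segment 1298, i.e.
000000010000000500000012.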

I'm thinking option 4 sounds the most correct. Any thoughts from
others to the contrary?

Anything else that is worth testing to look for potential problems
after pg_upgrade with different WAL segment sizes?

-Jeremy

--
http://about.me/jeremy_schneider
