Re: [HACKERS] Unlogged tables cleanup

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, konstantin knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Unlogged tables cleanup
Date: 2019-05-14 04:28:21
Message-ID: 20190514042821.im3cxio326gnowxh@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-05-14 12:50:09 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote:
> > I was wrong here - I thought we WAL logged the main fork creation even
> > for unlogged tables. I think it's foolish that we don't, but we don't.
>
> Why? The main fork is not actually necessary, and the beginning of
> recovery would make sure that any existing main fork gets removed and
> that the main fork gets recreated by copying it from the init fork,
> which is WAL-logged.

Several things:

1) A main fork that's created while a base-backup is in progress might
only partially exist on a HS node. We won't necessarily remove it,
because the init fork might *not* have been copied. The init fork
will be re-created by WAL replay, but that'll be too late for the
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) call.

Which then will

a) cause ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to fail, because the
target file already exists, as mentioned elsewhere in this thread

b) cause trouble when access the unlogged table - unfortunately there
are several ways to access unlogged tables on a HS node - so the
main fork better either not exist, or be in a valid empty
state. Case in point:
COPY blarg TO STDOUT;
does not throw an error due to unlogged-ness.

If we instead properly WAL logged the MAIN fork creation b) wouldn't
be a problem.

And we could have the UNLOGGED_RELATION_INIT case re-use the
hashtable built at UNLOGGED_RELATION_CLEANUP time, and avoid a second
scan of the filesystem. As all relations that were created after the
LSN crash recovery started at would be valid, we would not need to do
a full second scan: All unlogged relations created after that LSN are
guaranteed to have a consistent main fork.

2) With a small bit of additional work, the sketch to fix 1) would allow
us to allow read access to unlogged relations during HS. We'd just
have do the UNLOGGED_RELATION_INIT work at the start of recovery.

3) I'm somewhat certain that there are relfilenode reuse issues with the
current approach, again during base backups (as there the checkpoint
logic to prevent relfilenode reuse doesn't work). I think you can
easily end up with main / vm / fsm / init forks that actually don't
belong to the same relation for a base backup. If you don't force
delete all other forks at the time the init fork record is replayed,
there's no defense against that.

There's probably more, but I gotta cook.

> So we definitely have to log the init fork, but logging the main fork
> is just useless data in the WAL record. Or you have something else in
> mind?

Well, as I argued somewhere else in this thread, I think this all should
just be one WAL record.

Something very roughly like:

struct xl_smgr_create
{
RelFileNode rnode;
ForkNumber forkNum;
bool remove_all_other_forks;
bool copy_init_to_main_fork;
size_t data_length;
char fork_contents[];
};

Presumably with a flags argument instead of separate booleans.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-05-14 04:29:33 Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Previous Message Amit Langote 2019-05-14 04:23:36 Re: [HACKERS] advanced partition matching algorithm for partition-wise join