Re: Unlogged relations and WAL-logging

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged relations and WAL-logging
Date: 2023-07-07 15:21:44
Message-ID: 868b64cf-0d0f-7d49-d985-a0574a8678e1@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 28/01/2022 15:57, Robert Haas wrote:
> On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> Unlogged relations are not WAL-logged, but creating the init-fork is.
>> There are a few things around that seem sloppy:
>>
>> 1. In index_build(), we do this:
>>
>>> */
>>> if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>>> !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
>>> {
>>> smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
>>> indexRelation->rd_indam->ambuildempty(indexRelation);
>>> }
>>
>> Shouldn't we call log_smgrcreate() here? Creating the init fork is
>> otherwise not WAL-logged at all.
>
> Yes, that's a bug.

Pushed and backpatched this patch (commit 3142a8845b).

>> 2. Some implementations of ambuildempty() use the buffer cache (hash,
>> gist, gin, brin), while others bypass it and call smgrimmedsync()
>> instead (btree, spgist, bloom). I don't see any particular reason for
>> those decisions, it seems to be based purely on which example the author
>> happened to copy-paste.
>
> I thought that this inconsistency was odd when I was developing the
> unlogged feature, but I tried to keep each routine's ambuildempty()
> consistent with whatever ambuild() was doing. I don't mind if you want
> to change it, though.
>
>> 3. Those ambuildempty implementations that bypass the buffer cache use
>> smgrwrite() to write the pages. That doesn't make any difference in
>> practice, but in principle it's wrong: You are supposed to use
>> smgrextend() when extending a relation.
>
> That's a mistake on my part.
>
>> 4. Also, the smgrwrite() calls are performed before WAL-logging the
>> pages, so the page that's written to disk has 0/0 as the LSN, not the
>> LSN of the WAL record. That's harmless too, but seems a bit sloppy.
>
> That is also a mistake on my part.

I'm still sitting on these fixes. I think the patch I posted still makes
sense, but I got carried away with a more invasive approach that
introduces a whole new set of functions for bulk-creating a relation,
which would handle WAL-logging, smgrimmedsync() and all that (see
below). We have some repetitive, error-prone code in all the index build
functions for that. But that's not backpatchable, so I'll rebase the
original approach next week.

>> 5. In heapam_relation_set_new_filenode(), we do this:
>>
>>>
>>> /*
>>> * If required, set up an init fork for an unlogged table so that it can
>>> * be correctly reinitialized on restart. An immediate sync is required
>>> * even if the page has been logged, because the write did not go through
>>> * shared_buffers and therefore a concurrent checkpoint may have moved the
>>> * redo pointer past our xlog record. Recovery may as well remove it
>>> * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>>> * record. Therefore, logging is necessary even if wal_level=minimal.
>>> */
>>> if (persistence == RELPERSISTENCE_UNLOGGED)
>>> {
>>> Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>> rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>> rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>>> smgrcreate(srel, INIT_FORKNUM, false);
>>> log_smgrcreate(newrnode, INIT_FORKNUM);
>>> smgrimmedsync(srel, INIT_FORKNUM);
>>> }
>>
>> The comment doesn't make much sense, we haven't written nor WAL-logged
>> any page here, with nor without the buffer cache. It made more sense
>> before commit fa0f466d53.
>
> Well, it seems to me (and perhaps I am just confused) that complaining
> that there's no page written here might be a technicality. The point
> is that there's no synchronization between the work we're doing here
> -- which is creating a fork, not writing a page -- and any concurrent
> checkpoint. So we both need to log it, and also sync it immediately.

I see. I pushed the fix from the other thread that makes smgrcreate()
call register_dirty_segment (commit 4b4798e13). I believe that makes
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
with a redo pointer greater than this WAL record, it must've received
the fsync request created by smgrcreate(). That depends on the fact that
we write the WAL record *after* smgrcreate(). Subtle..

Hmm, we have a similar smgrimmedsync() call after index build, because
we have written pages directly with smgrextend(skipFsync=true). If no
checkpoints have occurred during the index build, we could call
register_dirty_segment() instead of smgrimmedsync(). That would avoid
the fsync() latency when creating an index on an empty or small index.

This is all very subtle to get right though. That's why I'd like to
invent a new bulk-creation facility that would handle this stuff, and
make the callers less error-prone.

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Schoemans Maxime 2023-07-07 15:40:43 Re: Implement missing join selectivity estimation for range types
Previous Message David G. Johnston 2023-07-07 14:04:41 Re: CHECK Constraint Deferrable