From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work |
Date: | 2022-01-19 19:07:35 |
Message-ID: | 20220119190735.yorvuilxugkubjnw@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
> Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> writes:
> > On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123(at)gmail(dot)com> wrote:
> >> Maybe I'm missing something but wouldn't
> >> https://commitfest.postgresql.org/36/3448/ better solve the problem?
>
> > The error can cause the new background process proposed there in that
> > thread to restart, which is again costly. Since we have LOG-only and
> > continue behavior in CheckPointSnapBuild already, having the same
> > behavior for CheckPointLogicalRewriteHeap helps a lot.
>
> [ stares at CheckPointLogicalRewriteHeap for awhile ... ]
>
> This code has got more problems than that. It took me awhile to
> absorb it, but we don't actually care about the contents of any of
> those files; all of the information is encoded in the file *names*.
I'm not following - we *do* need the contents of the files? They're applied
as-needed in ApplyLogicalMappingFile().
> (This strikes me as probably not a very efficient design, compared
> to putting the same data into a single text file; but for now I'll
> assume we're not up for a complete rewrite.) That being the case,
> I wonder what it is we expect fsync'ing the surviving files to do
> exactly. We should be fsync'ing the directory not the files
> themselves, no?
Fsyncing the directory doesn't guarantee anything about the contents of
files. But, you're right, we need an fsync of the directory too.
> Other things that seem poorly thought out:
>
> * Why is the check for "map-" prefix after, rather than before,
> the lstat?
It doesn't seem to matter much - there shouldn't be a meaningful amount of
other files in there.
> * Why is it okay to ignore lstat failure? Seems like we might
> as well not even have the lstat.
Yea, that seems odd, not sure why that ended up this way. I guess the aim
might have been to tolerate random files we don't have permissions for or
such?
> * The sscanf on the file name would not notice trailing junk,
> such as an editor backup marker. Is that okay?
I don't really see a problem with it - there shouldn't be other files matching
the pattern - but it couldn't hurt to check the pattern matches exhaustively.
> As far as the patch itself goes, I agree that failure to unlink
> is noncritical, because such a file would have no further effect
> and we can just ignore it.
I don't agree. We iterate through the directory regularly on systems with
catalog changes + logical decoding. An ever increasing list of gunk will make
that more and more expensive. And I haven't heard a meaningful reason why we
would have map-* files that we can't remove.
Ignoring failures like this just makes problems much harder to debug and they
tend to bite harder for it.
> I think I also agree that failure of the sscanf is noncritical, because the
> implication of that is that the file name doesn't conform to our
> expectations, which means it's basically just like the check that causes us
> to ignore names not starting with "map-". (Actually, isn't the separate
> check for "map-" useless, given that sscanf will make the equivalent check?)
Well, this way only files starting with "map-" are expected to conform to a
strict format, the rest is ignored?
> Anyway, I think possibly we can drop the bottom half of the loop
> (the part trying to fsync non-removed files) in favor of fsync'ing
> the directory afterwards. Thoughts?
I don't think that'd be correct.
In short: We should add a directory fsync, I'm fine with improving the error
checking, but the rest seems like a net-negative with no convincing reasoning.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | John Naylor | 2022-01-19 19:14:13 | Re: do only critical work during single-user vacuum? |
Previous Message | Tom Lane | 2022-01-19 18:45:07 | Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work |