From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | guopa(at)vmware(dot)com |
Cc: | masao(dot)fujii(at)oss(dot)nttdata(dot)com, alvherre(at)2ndquadrant(dot)com, a(dot)lubennikova(at)postgrespro(dot)ru, apraveen(at)pivotal(dot)io, leiwang(at)pivotal(dot)io, thomas(dot)munro(at)gmail(dot)com, horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: standby recovery fails (tablespace related) (tentative patch and discussion) |
Date: | 2021-01-05 01:07:24 |
Message-ID: | 20210105.100724.1255681825377651645.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Wed, 8 Jul 2020 12:56:44 +0000, Paul Guo <guopa(at)vmware(dot)com> wrote in
> On 2020/01/15 19:18, Paul Guo wrote:
> I further fixed the last test failure (due to a small bug in the test, not in code). Attached are the new patch series. Let's see the CI pipeline result.
>
> Thanks for updating the patches!
>
> I started reading the 0003 patch.
>
> The approach that the 0003 patch uses is not the perfect solution.
> If the standby crashes after tblspc_redo() removes the directory and before
> its subsequent COMMIT record is replayed, PANIC error would occur since
> there can be some unresolved missing directory entries when we reach the
> consistent state. The problem would very rarely happen, though...
> Just idea; calling XLogFlush() to update the minimum recovery point just
> before tblspc_redo() performs destroy_tablespace_directories() may be
> safe and helpful for the problem?
It seems to me that what the current patch does is too complex. What
we need to do here is to remember every invalid operation then forget
it when the prerequisite object is dropped.
When a table space is dropped before consistency is established, we
don't need to care what has been performed inside the tablespace. In
this perspective, it is enough to remember tablespace ids when failed
to do something inside it due to the absence of the tablespace and
then forget it when we remove it. We could remember individual
database id to show them in error messages, but I'm not sure it's
useful. The reason log_invalid_page records block numbers is to allow
the machinery handle partial table truncations, but this is not the
case since dropping tablespace cannot leave some of containing
databases.
As the result, we won't see an unresolved invalid operations in a
dropped tablespace.
Am I missing something?
dbase_redo:
+ if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
+ XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
This means "record the belonging table space directory if it is not
found OR it is not a directory". The former can be valid but the
latter is unconditionally can not (I don't think we bother considering
symlinks there).
+ /*
+ * Source directory may be missing. E.g. the template database used
+ * for creating this database may have been dropped, due to reasons
+ * noted above. Moving a database from one tablespace may also be a
+ * partner in the crime.
+ */
+ if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
+ {
+ XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, src_path);
This is a part of *creation* of the target directory. Lack of the
source directory cannot be valid even if the source directory is
dropped afterwards in the WAL stream and we can allow that if the
*target* tablespace is dropped afterwards. As the result, as I
mentioned above, we don't need to record about the database directory.
By the way the name XLogLogMiss.. is somewhat confusing. How about
XLogReportMissingDir (named after report_invalid_page).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Josef Šimánek | 2021-01-05 01:32:55 | Re: [PATCH] Simple progress reporting for COPY command |
Previous Message | Thomas Munro | 2021-01-05 01:01:03 | Re: language cleanups in code and docs |