Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>
Cc: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, MBeena Emerson <mbeena(dot)emerson(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-08-07 07:15:17
Message-ID: CAE9k0P=B_F1yYKpYE+EcQ25=pzvvYcm=R6X1DbeptTi2P03U+A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks Rajkumar for testing the patch.

Here are some of the additional test-cases that I would suggest you to
execute, if possible:

1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.

2) You may also try running some concurrent test-cases for e.g. try
running these functions with VACUUM or some other sql commands
(preferable DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
<rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com> wrote:
>
> I have been doing some user-level testing of this feature, apart from sanity test for extension and it's functions
>
> I have tried to corrupt tuples and then able to fix it using heap_force_freeze/kill functions.
>
>
> --corrupt relfrozenxid, cause vacuum failed.
>
> update pg_class set relfrozenxid = (relfrozenxid::text::integer + 10)::text::xid where relname = 'test_tbl';
>
> UPDATE 1
>
> insert into test_tbl values (2, 'BBB');
>
>
> postgres=# vacuum test_tbl;
>
> ERROR: found xmin 507 from before relfrozenxid 516
>
> CONTEXT: while scanning block 0 of relation "public.test_tbl"
>
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
> a | b | ctid | xmin | xmax
>
> ---+-----+-------+------+------
>
> 1 | AAA | (0,1) | 505 | 0
>
> 2 | BBB | (0,2) | 507 | 0
>
> (2 rows)
>
>
> --fixed using heap_force_freeze
>
> postgres=# select heap_force_freeze('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>
> heap_force_freeze
>
> -------------------
>
>
> postgres=# vacuum test_tbl;
>
> VACUUM
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
> a | b | ctid | xmin | xmax
>
> ---+-----+-------+------+------
>
> 1 | AAA | (0,1) | 505 | 0
>
> 2 | BBB | (0,2) | 2 | 0
>
> (2 rows)
>
>
> --corrupt table headers in base/oid. file, cause table access failed.
>
> postgres=# select ctid, * from test_tbl;
>
> ERROR: could not access status of transaction 4294967295
>
> DETAIL: Could not open file "pg_xact/0FFF": No such file or directory.
>
>
> --removed corrupted tuple using heap_force_kill
>
> postgres=# select heap_force_kill('test_tbl'::regclass, ARRAY['(0,2)']::tid[]);
>
> heap_force_kill
>
> -----------------
>
>
>
> (1 row)
>
>
> postgres=# select ctid, * from test_tbl;
>
> ctid | a | b
>
> -------+---+-----
>
> (0,1) | 1 | AAA
>
> (1 row)
>
>
> I will be continuing with my testing with the latest patch and update here if found anything.
>
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com> wrote:
>>
>> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for the review. Please find my comments inline:
>> >
>> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> > >
>> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>> > > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
>> > >
>> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>> > >
>> > > the -> a
>> > >
>> > > I also suggest: heap table -> relation, because we might want to
>> > > extend this to other cases later.
>> > >
>> >
>> > Corrected.
>> >
>> > > +-- toast table.
>> > > +begin;
>> > > +create table ttab(a text);
>> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
>> > > 65), '') from generate_series(1,10000);
>> > > +select * from ttab where xmin = 2;
>> > > + a
>> > > +---
>> > > +(0 rows)
>> > > +
>> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
>> > > + heap_force_freeze
>> > > +-------------------
>> > > +
>> > > +(1 row)
>> > > +
>> > >
>> > > I don't understand the point of this. You're not testing the function
>> > > on the TOAST table; you're testing it on the main table when there
>> > > happens to be a TOAST table that is probably getting used for
>> > > something. But that's not really relevant to what is being tested
>> > > here, so as written this seems redundant with the previous cases.
>> > >
>> >
>> > Yeah, it's being tested on the main table, not on a toast table. I've
>> > removed this test-case and also restricted direct access to the toast
>> > table using heap_force_kill/freeze functions. I think we shouldn't be
>> > using these functions to do any changes in the toast table. We will
>> > only use these functions with the main table and let VACUUM remove the
>> > corresponding data chunks (pointed by the tuple that got removed from
>> > the main table).
>> >
>> > Another option would be to identify all the data chunks corresponding
>> > to the tuple (ctid) being killed from the main table and remove them
>> > one by one. We will only do this if the tuple from the main table that
>> > has been marked as killed has an external storage. We will have to add
>> > a bunch of code for this otherwise we can let VACUUM do this for us.
>> > Let me know your thoughts on this.
>> >
>> > > +-- test pg_surgery functions with the unsupported relations. Should fail.
>> > >
>> > > Please name the specific functions being tested here in case we add
>> > > more in the future that are tested separately.
>> > >
>> >
>> > Done.
>> >
>> > > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
>> > >
>> > > I think we could drop _funcs from the file name.
>> > >
>> >
>> > Done.
>> >
>> > > +#ifdef PG_MODULE_MAGIC
>> > > +PG_MODULE_MAGIC;
>> > > +#endif
>> > >
>> > > The #ifdef here is not required, and if you look at other contrib
>> > > modules you'll see that they don't have it.
>> > >
>> >
>> > Okay, done.
>> >
>> > > I don't like all the macros at the top of the file much. CHECKARRVALID
>> > > is only used in one place, so it seems to me that you might as well
>> > > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
>> > >
>> >
>> > Done.
>> >
>> > > Once you do that, heap_force_common() can just compute the number of
>> > > array elements once, instead of doing it once inside ARRISEMPTY, then
>> > > again inside SORT, and then a third time to initialize ntids. You can
>> > > just have a local variable in that function that is set once and then
>> > > used as needed. Then you are only doing ARRNELEMS once, so you can get
>> > > rid of that macro too. The same technique can be used to get rid of
>> > > ARRPTR. So then all the macros go away without introducing any code
>> > > duplication.
>> > >
>> >
>> > Done.
>> >
>> > > +/* Options to forcefully change the state of a heap tuple. */
>> > > +typedef enum HTupleForceOption
>> > > +{
>> > > + FORCE_KILL,
>> > > + FORCE_FREEZE
>> > > +} HTupleForceOption;
>> > >
>> > > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
>> > > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.
>> >
>> > Done.
>> >
>> > Also, how about
>> > > option -> operation?
>> > >
>> >
>> > I think both look okay to me.
>> >
>> > > + return heap_force_common(fcinfo, FORCE_KILL);
>> > >
>> > > I think it might be more idiomatic to use PG_RETURN_DATUM here. I
>> > > know it's the same thing, though, and perhaps I'm even wrong about the
>> > > prevailing style.
>> > >
>> >
>> > Done.
>> >
>> > > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
>> > >
>> > > I think this is unnecessary. It's an enum with 2 values.
>> > >
>> >
>> > Removed.
>> >
>> > > + if (ARRISEMPTY(ta))
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("empty tid array")));
>> > >
>> > > I don't see why this should be an error. Why can't we just continue
>> > > normally and if it does nothing, it does nothing? You'd need to change
>> > > the do..while loop to a while loop so that the end condition is tested
>> > > at the top, but that seems fine.
>> > >
>> >
>> > I think it's okay to have this check. So, just left it as-is. We do
>> > have such checks in other contrib modules as well wherever the array
>> > is being passed as an input to the function.
>> >
>> > > + rel = relation_open(relid, AccessShareLock);
>> > >
>> > > Maybe we should take RowExclusiveLock, since we are going to modify
>> > > rows. Not sure how much it matters, though.
>> > >
>> >
>> > I don't know how it would make a difference, but still as you said
>> > replaced AccessShare with RowExclusive.
>> >
>> > > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
>> > > + errmsg("must be superuser or object owner to use %s.",
>> > > + force_opt == FORCE_KILL ? "heap_force_kill()" :
>> > > + "heap_force_freeze()")));
>> > >
>> > > This is the wrong way to do a permissions check, and it's also the
>> > > wrong way to write an error message about having failed one. To see
>> > > the correct method, grep for pg_class_aclcheck. However, I think that
>> > > we shouldn't in general trust the object owner to do this, unless the
>> > > super-user gave permission. This is a data-corrupting operation, and
>> > > only the boss is allowed to authorize it. So I think you should also
>> > > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
>> > > have this check as a backup. Then, the superuser is always allowed,
>> > > and if they choose to GRANT EXECUTE on this function to some users,
>> > > those users can do it for their own relations, but not others.
>> > >
>> >
>> > Done.
>> >
>> > > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> > > + errmsg("only heap AM is supported")));
>> > > +
>> > > + check_relation_relkind(rel);
>> > >
>> > > Seems like these checks are in the wrong order.
>> >
>> > I don't think there is anything wrong with the order. I can see the
>> > same order in other contrib modules as well.
>> >
>> > Also, maybe you could
>> > > call the function something like check_relation_ok() and put the
>> > > permissions test, the relkind test, and the relam test all inside of
>> > > it, just to tighten up the code in this main function a bit.
>> > >
>> >
>> > Yeah, I've added a couple of functions named sanity_check_relation and
>> > sanity_check_tid_array and shifted all the sanity checks there.
>> >
>> > > + if (noffs > maxoffset)
>> > > + ereport(ERROR,
>> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>> > > + errmsg("number of offsets specified for block %u exceeds the max
>> > > offset number %u",
>> > > + blkno, maxoffset)));
>> > >
>> > > Hmm, this doesn't seem quite right. The actual problem is if an
>> > > individual item pointer's offset number is greater than maxoffset,
>> > > which can be true even if the total number of offsets is less than
>> > > maxoffset. So I think you need to remove this check and add a check
>> > > inside the loop which follows that offnos[i] is in range.
>> > >
>> >
>> > Agreed and done.
>> >
>> > > The way you've structured that loop is actually problematic -- I don't
>> > > think we want to be calling elog() or ereport() inside a critical
>> > > section. You could fix the case that checks for an invalid force_opt
>> > > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
>> > > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
>> > > NOTICE case you have here is a bigger problem.
>> >
>> > Done.
>> >
>> > You really cannot
>> > > modify the buffer like this and then decide, oops, never mind, I think
>> > > I won't mark it dirty or write WAL for the changes. If you do that,
>> > > the buffer is still in memory, but it's now been modified. A
>> > > subsequent operation that modifies it will start with the altered
>> > > state you created here, quite possibly leading to WAL that cannot be
>> > > correctly replayed on the standby. In other words, you've got to
>> > > decide for certain whether you want to proceed with the operation
>> > > *before* you enter the critical section. You also need to emit any
>> > > messages before or after the critical section. So you could:
>> > >
>> >
>> > This is still not clear. I think Robert needs to respond to my earlier comment.
>> >
>> > > I believe this violates our guidelines on message construction. Have
>> > > two completely separate messages -- and maybe lose the word "already":
>> > >
>> > > "skipping tid (%u, %u) because it is dead"
>> > > "skipping tid (%u, %u) because it is unused"
>> > >
>> > > The point of this is that it makes it easier for translators.
>> > >
>> >
>> > Done.
>> >
>> > > I see very little point in what verify_tid() is doing. Before using
>> > > each block number, we should check that it's less than or equal to a
>> > > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
>> > > any case to avoid funny errors; and then the check here against
>> > > specifically InvalidBlockNumber is redundant. For the offset number,
>> > > same thing: we need to check each offset against the page's
>> > > PageGetMaxOffsetNumber(page); and if we do that then we don't need
>> > > these checks.
>> > >
>> >
>> > Done.
>> >
>> > Please check the attached patch for the changes.
>>
>> I also looked at this version patch and have some small comments:
>>
>> + Oid relid = PG_GETARG_OID(0);
>> + ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
>> + ItemPointer tids;
>> + int ntids;
>> + Relation rel;
>> + Buffer buf;
>> + Page page;
>> + ItemId itemid;
>> + BlockNumber blkno;
>> + OffsetNumber *offnos;
>> + OffsetNumber offno,
>> + noffs,
>> + curr_start_ptr,
>> + next_start_ptr,
>> + maxoffset;
>> + int i,
>> + nskippedItems,
>> + nblocks;
>>
>> You declare all variables at the top of heap_force_common() function
>> but I think we can declare some variables such as buf, page inside of
>> the do loop.
>>
>> ---
>> + if (offnos[i] > maxoffset)
>> + {
>> + ereport(NOTICE,
>> + errmsg("skipping tid (%u, %u) because it
>> contains an invalid offset",
>> + blkno, offnos[i]));
>> + continue;
>> + }
>>
>> If all tids on a page take the above path, we will end up logging FPI
>> in spite of modifying nothing on the page.
>>
>> ---
>> + /* XLOG stuff */
>> + if (RelationNeedsWAL(rel))
>> + log_newpage_buffer(buf, true);
>>
>> I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.
>>
>> Regards,
>>
>> --
>> Masahiko Sawada http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kasahara Tatsuhito 2020-08-07 07:38:13 Re: Creating a function for exposing memory usage of backend process
Previous Message Konstantin Knizhnik 2020-08-07 07:08:23 Re: [Patch] Optimize dropping of relation buffers using dlist