Re: BUG #5915: OldSerXidAdd inflates pg_serial too much

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <drkp(at)csail(dot)mit(dot)edu>,<heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <yamt(at)mwd(dot)biglobe(dot)ne(dot)jp>,<pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #5915: OldSerXidAdd inflates pg_serial too much
Date: 2011-03-06 18:32:27
Message-ID: 4D737EDB020000250003B531@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Heikki Linnakangas wrote:

> Here's what I had in mind. Can you review

The additions and modifications to the comments all look good to me.
I can see why you renamed one field and eliminated another; no
problem there. I'm really surprised, when you ignore those changes,
how little was needed to move this to checkpoint time. I hadn't
looked into that, and expected it to be much harder to do, even
though I should know better by now. :-)

I looked it over and pondered a bit, and have three concerns.

(1) Why is it safe to assume that checkpoint will occur before this
wraps around? Is it just that it takes a billion transactions, and
one can reasonably expect a checkpoint before reaching that point, or
is there some other safety net?

(2) What if there are only occassional clusters of serializable
transactions? If the one SLRU page is held so that it isn't
repeatedly truncated and re-zeroed, does it pose a risk if
transaction IDs wrap around while that page is held? (I don't think
this is a new problem with the proposed patch, it just made it more
obvious.) Should we be using RecentGlobalXmin or something in that
checkpoint function to truncate that last page when it is safe to do
so?

(3) That unnecessary SLRU flush should probably be conditioned on a
#ifdef or some not-very-visible GUC or something. With the problems
which some people have with writes glutting the I/O during checkpoint
I'd hate to add to the writes at checkpoint time just for debugging
info -- at least without a specific request for that behavior.

Other than that, it seems like a very good idea.

> do you have something to test this with?

Well, I tested it with the src/test/isolation/ tests with and without
TEST_OLDSERXID defined, so it passes that much[1]; but if Dan can
grab another day on the MIT benchmarking server for a new round of
DBT-2 tests (at least some of which should be run with TEST_OLDSERXID
defined), it would help shake out any problems. I have to say,
though, that Takashi-san seems to be the master of chasing down
corner condition problems with this patch.

Constructing test scripts for these corner cases which can be run
under the normal PostgreSQL regression testing framework is hard, but
I intend to continue to try to develop some before release. But as
Dan pointed out recently, some of these issues are really hard to hit
without pounding on the software for hours with high concurrency on a
machine with a lot of CPUs.

-Kevin

[1] Actually, the last of the tests has been showing failure when
running with TEST_OLDSERXID defined, because it gives up on the
transaction set and forces a transaction to roll back a little
earlier due to the summarization of the conflict data. I'm attaching
another file we may want to add to src/test/isolation/expected to
avoid confusion when people test a build with that defined.

Attachment Content-Type Size
multiple-row-versions_1.out application/octet-stream 821 bytes

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Itagaki Takahiro 2011-03-07 11:44:45 pg_stat_all_tables.vacuum_count corrupted after pg_stat_reset()
Previous Message Vik Reykja 2011-03-06 14:57:53 Re: Can't use WITH in a PERFORM query in PL/pgSQL?