From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amir Rohan <amir(dot)rohan(at)zoho(dot)com>
Cc: Amir Rohan <amir(dot)rohan(at)mail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, gsmith(at)gregsmith(dot)com
Subject: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Date: 2015-10-03 13:47:33
Message-ID: CAB7nPqShUDat+iWpTNWMdrJCn3mRXqaEs_=mqeDBMdKeHL9uPA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 3, 2015 at 9:50 PM, Amir Rohan <amir(dot)rohan(at)zoho(dot)com> wrote:
> On 10/03/2015 02:38 PM, Michael Paquier wrote:
>> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
>>> On 10/02/2015 03:33 PM, Michael Paquier wrote:
>>>>> 4) Port assignment relies on liveness checks on running servers.
>>>>> If a server is shut down and a new one instantiated, the port will get
>>>>> reused, data will get trashed, and various confusing things can happen.
>>>>
>>>> Right. The safest way to do that is to check in get_free_port if a
>>>> port number is used by a registered node, and continue to loop if
>>>> that's the case. So done.
>>>>
>>>
>>> That eliminates the "sweet and gentle" variant of the scenario, but it's
>>> susceptible to the "ABA problem":
>>> https://en.wikipedia.org/wiki/ABA_problem
>>> https://youtu.be/CmxkPChOcvw?t=786
>>
>> I learnt a new thing here. That's basically a problem already present
>> in the existing perl test infrastructure relying on TestLib.pm when
>> tests are run in parallel. What we would need here is a global mapping
>> file storing all the port numbers used by all the nodes currently in
>> the tests.
>>
>
> Yeah, a poor man's way to ensure ports aren't reused (I wasn't very
> clear at the top of the post) is something like:
>
> global nPortsAssigned = 0;
>
> AssignPort():
>     basePort = BASE_PORT    # the lowest port we use
>     while (!available(basePort + nPortsAssigned)):
>         basePort++
>
>     nPortsAssigned++
>
>     return basePort
>
> It has its glaring faults, but would probably work ok.
> In any case, I'm sure you can do better.

Yeah, this would improve the existing port lookup. I don't mind adding
a global variable in get_free_port for this purpose. This would
accelerate finding a free port in many cases for sure.
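
Just to illustrate the idea, here is a rough sketch in Perl (simplified,
not the actual patch code; the liveness check here just tries to connect
to the port):

    use strict;
    use warnings;
    use IO::Socket::INET;

    # Sketch only: never hand out the same port twice within one run.
    my $last_port_assigned = 5432 + int(rand(1000));    # arbitrary start

    sub get_free_port
    {
        my $port = $last_port_assigned;
        while (1)
        {
            $port++;
            # Keep this port only if nothing answers on it.
            last unless IO::Socket::INET->new(
                PeerAddr => '127.0.0.1',
                PeerPort => $port,
                Proto    => 'tcp',
                Timeout  => 1);
        }
        $last_port_assigned = $port;    # remembered, never reused
        return $port;
    }

Because $last_port_assigned only ever grows, the port of a node that has
been torn down is never handed out again within the same run, which is
what kills the ABA scenario.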

>>>
>>> Granted, you have to try fairly hard to shoot yourself in the foot,
>>> but since the solution is so simple, why not? If we never reuse ports
>>> within a single test, this goes away.
>>
>> Well, you can reuse the same port number in a test. Simply tear down
>> the existing node and then create a new one. I think that port
>> number assignment to a node should be transparent to the caller, in
>> our case the perl test script holding a scenario.
>>
>
> I was saying you *never* want to reuse port numbers. That is, as long
> as the lib ensures we never reuse ports within one test, all kinds
> of corner cases are eliminated.

Hm, sure. Though I don't really see why it would be mandatory to enforce
this condition as long as the list of occupied ports is kept in a single
place (as long as tests are not run in parallel...).

>>>>> 5) Servers are shut down with -m 'immediate', which can lead to races
>>>>> in the script when archiving is turned on. That may be good for some
>>>>> tests, but there's no control over it.
>>>>
>>>> I hesitated with fast here actually, so I changed it this way. We would
>>>> want as well a teardown command to stop the node with immediate and
>>>> unregister the node from the active list.
>>>>
>>>
>>> In particular, I was shutting down an archiving node and the archive
>>> was truncated. I *think* smart doesn't do this. But again, it's really
>>> that the test writer can't easily override it, not that the default is wrong.
>>
>> Ah, OK. Then fast is just fine. It shuts down the node correctly.
>> "smart" would wait for all the current connections to finish but I am
>> wondering if it currently matters here: I don't see yet a clear use
>> case yet where it would make sense to have multi-threaded script... If
>> somebody comes back with a clear idea here perhaps we could revisit
>> that but it does not seem worth it now.
>>
>
> My mistake. Perhaps what would be useful, though, is a way
> to force a "messy" shutdown: a kill -9, basically. It's a recovery
> test suite, right?

That's what the teardown is aimed at; the immediate stop mode plays
that role well enough. There has also been a patch from Tom Lane
floating around to stop a server should its postmaster.pid go
missing...
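
If we really wanted an explicit pull-the-plug helper on top of that, it
could be as simple as this sketch (kill9_node is only an illustrative
name, reading the postmaster PID from the first line of postmaster.pid):

    use strict;
    use warnings;

    # Sketch only: simulate a hard crash, bypassing any shutdown mode.
    sub kill9_node
    {
        my $pgdata = shift;
        open my $fh, '<', "$pgdata/postmaster.pid"
            or die "could not open $pgdata/postmaster.pid: $!";
        my $pid = <$fh>;    # first line is the postmaster PID
        close $fh;
        chomp $pid;
        kill 'KILL', $pid;
    }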

>>>>> Other issues:
>>>>> 6. Directory structure: one directory was used per thing, but it is more
>>>>> logical to place all things related to an instance under a single
>>>>> directory, and name them according to role (57333_backup, and so on).
>>>>
>>>> Er, well. The first version of the patch did so, and then I switched
>>>> to an approach closer to what the existing TAP facility is doing. But
>>>> well let's simplify things a bit.
>>>>
>>>
>>> I know, I know, but:
>>> 1) an instance is a "thing" in your script, so having its associated
>>> paraphernalia in one place makes more sense (maybe only to me).
>>> 2) That's often how folks (well, how I) arrange things in deployment,
>>> at least with archive/backup as symlinks to the nas.
>>>
>>> Alternatively, naming the dirs with a prefix (srv_foo_HASH,
>>> backup_foo_backupname_HASH, etc.) would work as well.
>>
>> The useful part about tempdir is that it cleans itself up
>> automatically should an error happen. It does not seem to me that we
>> would want to lose that.
>>
>
> Ensuring cleanup and directory structure aren't inherently related.
> Testlib makes cleanup easy if you're willing to accept its flat
> structure. But writing something that does cleanup and lets you
> control directory structure is perfectly doable.
>
> The question is only whether you agree that having per-server
> directories could be convenient. Tying into the next point: if you
> don't think anyone will ever need to look into these directories
> (which I disagree with), then dir structure indeed doesn't matter.

So your point is having one temp dir for the whole thing, right? I don't
disagree with that.
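
For instance, something like this sketch (names are only illustrative)
would keep one temp dir for the whole test while still giving each node
its own subtree:

    use strict;
    use warnings;
    use File::Temp ();
    use File::Path qw(make_path);

    # Sketch only: one top-level temp dir, one subdirectory per node
    # holding its data directory, WAL archive and backups.
    my $test_root = File::Temp::tempdir(
        'recovery_test_XXXX', TMPDIR => 1, CLEANUP => 1);

    sub node_dirs
    {
        my $name = shift;    # e.g. "master", "standby_1"
        my $base = "$test_root/$name";
        make_path("$base/pgdata", "$base/archive", "$base/backup");
        return $base;
    }

That way the automatic cleanup from File::Temp is kept while a test
writer can still find everything belonging to one node in one place.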

>> I would expect hackers to run those runs
>> until the end.
>
> I agree -- when you're running them, but what about when you're
> /writing/ them?

Well, I enforce CLEANUP=0 manually in TestLib.pm for now.

>>>>> 11. a canned "server is responding to queries" helper would be convenient.
>>>>
>>>> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
>>>>
>>>
>>> Block until recovery is finished, before testing. It eliminates races and
>>> avoids the stupid sleep(3) I used.
>>
>> TODO

Well, I just recalled this item in the list of things you mentioned: I
marked it but forgot to address it. It sounds right that we may want
something using pg_isready in a loop, as a node in recovery would
reject connections.
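
Something like this sketch is what I have in mind (wait_for_node is
only an illustrative name):

    use strict;
    use warnings;

    # Sketch only: block until the node accepts connections, e.g. once
    # recovery has finished, instead of using a fixed sleep().
    sub wait_for_node
    {
        my ($port, $timeout) = @_;
        $timeout //= 60;    # seconds
        for (1 .. $timeout)
        {
            return 1 if system('pg_isready', '-p', $port) == 0;
            sleep 1;
        }
        return 0;           # gave up waiting
    }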

>>
>>>>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>>>>> into calmly_bring_to_a_close(), pull_electric_plug() and
>>>>> drop_down_the_stairs_into_swimming_pool().
>>>>
>>>> Nope, not agreeing here. "immediate" is a rather violent way to stop a
>>>> node, hence I have switched it to use "fast", and there is now a
>>>> teardown_node routine that uses immediate, which is more aimed at
>>>> cleaning up existing things fiercely.
>>>>
>>>
>>> Ok, not as the default, but possible to request a specific kind of
>>> shutdown. I needed smart in my case. Plus, in a scenario, you might
>>> expressly be testing behavior for a specific mode, so it needs to be
>>> controllable.
>>
>> If your test script is running with a single thread, "fast" or "smart"
>> would not really make a difference, no?
>
> It would if there's a bug in one of them and I'm trying to write
> a regression test for it. Recall, this was part of the broader
> "provide defaults, allow override" view I was suggesting.

We could then extend stop_node with an optional argument containing a
mode, with fast being the default. Sounds right?
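
Something along these lines, as a sketch only (relying on
system_or_bail from TestLib, as the existing TAP tests do):

    use strict;
    use warnings;
    use TestLib;

    # Sketch only: stop a node with an optional mode, default "fast".
    sub stop_node
    {
        my ($pgdata, $mode) = @_;
        $mode //= 'fast';    # 'smart', 'fast' or 'immediate'
        system_or_bail('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
    }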

>>>> I have as well moved RecoveryTest.pm to src/test/perl so that all the
>>>> consumers of prove_check can use it by default, and decoupled
>>>> start_node from make_master and make_*_standby so that it is possible to
>>>> add, for example, new parameters to their postgresql.conf and
>>>> recovery.conf files before starting them.
>>>>
>>>> Thanks a lot for the feedback! Attached is an updated patch with all
>>>> the things mentioned above done. Also included are the typo fixes
>>>> you sent upthread.
>>>> Regards,
>>>>
>>>
>>> Great! I'll look at it and post again if there's more. If any of the
>>> above extra explanations make it clearer why I suggested some changes
>>> you didn't like...
>>
>> I am attaching a new version. I found a small bug in test case 001
>> when checking if standby 2 has caught up. There is also a dump
>> function that is helpful. The -i switch in the cp command has been
>> removed as well.
>>
>
> I'm sorry I didn't review the code, but honestly my perl is so rusty I'm
> afraid I'll embarrass myself :)

I don't pretend mine is any good :) So that makes two of us.
--
Michael
