From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby |
Date: | 2014-06-17 23:17:40 |
Message-ID: | CAMkU=1wKqmmvCYWWDb1Pip8G1YZY7BOiSSNasqvUMEN5N_rpLQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
(Redirected from Bugs to Hackers, as it seems not to be the same bug
as the original one.)
On Mon, Jun 9, 2014 at 3:50 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> Jeff Janes wrote:
>
>> This patch seems to solve a problem I've also been having with non-existent
>> "pg_multixact/members/13D35" files in my testing of torn page write and fk
>> locks again recent 9.4. However, sometimes I've found the problem even
>> before multi wrap around occured, and my reading of this thread suggests
>> that the current issue would not show up in that case, so I don't know how
>> much comfort to take in the knowledge that the problem seems to have gone
>> away--as they say, things that go away by themselves without explanation
>> also tend to come back by themselves.
>
> I think this problem is rather related to the code touched by commit
> 722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
> unneeded files in pg_multixact/members. I checked your files, and it
> seems that pg_multixact/members files that should still exist are gone;
> for instance, pg_controldata says that the oldest multi is 2586314407,
> but if I run
> select * from pg_get_multixact_members('that value')
> I get an error saying that pg_multixact/members/218B does not exist.
>
> It seems that there are a lot of files that were improperly removed.
> I expect that existing files should be sequential, but there are a lot
> of holes in a listing of pg_multixact/members.
I think the holes are because a truncation was in progress and
interrupted by a 'pg_ctl stop -mi' before it completed. The files are
considered in an apparently random order, so if it is interrupted
midway through it will leave a collection of holes that would have
been closed up had it been allowed to continue.
I instrumented the code to log file deletion (attached patch) and I
see that the it starts complaining about the absence of 5183 just a
fraction of a second after it was deleted, so the deletion must have
been inappropriate.
12538 2014-06-17 12:10:02.925 PDT:LOG: JJ deleting 0xb66b20 5183
12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.
12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.
12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL: Could not open file
"pg_multixact/members/5183": No such file or directory.
This problem was initially fairly easy to reproduce, but since I
started adding instrumentation specifically to catch it, it has become
devilishly hard to reproduce.
I think my next step will be to also log each of the values which goes
into the complex if (...) expression that decides on the deletion.
Cheers,
Jeff
Attachment | Content-Type | Size |
---|---|---|
member_delete_log.patch | application/octet-stream | 999 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Pavan Deolasee | 2014-06-18 07:01:59 | Re: BUG #10675: alter database set tablespace and unlogged table |
Previous Message | smsiebe | 2014-06-17 17:55:11 | BUG #10680: LDAP bind password leaks to log on failed authentication |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2014-06-17 23:22:07 | Re: comparison operators |
Previous Message | Andrew Dunstan | 2014-06-17 23:12:39 | comparison operators |