From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thom Brown <thom(at)linux(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Fabrízio Mello <fabriziomello(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Error with index on unlogged table |
Date: | 2015-12-10 17:36:32 |
Message-ID: | 20151210173632.GD14789@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2015-12-10 12:19:12 -0500, Robert Haas wrote:
> On Thu, Dec 10, 2015 at 11:32 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I've, pushed the fix for the promotion related issue. I'm afraid that
> > the ALTER TABLE <unlogged-table> SET TABLESPACE issue is a bit bigger
> > than it though.
>
> I think that I would have preferred to fix this by flushing unlogged
> buffers in bulk at the end of recovery, rather than by flushing them
> as they are generated. This approach seems needlessly inefficient.
We touched on that somewhere in the thread - having to scan all of
shared buffers isn't free either, and it'd mean that promotion would
take longer because it'd do all the writes at once. As we don't fsync
them during the flush itself, and as init forks don't ever get
rewritten, I don't think it makes any actual difference? The total
number of writes to the OS is the same, no?
> Maybe it doesn't matter, but you made the same sort of decision with
> the find_multixact_start() fix: flushing sooner instead of working
> harder to make the delayed flush safe. I think that's a bad direction
> in general, and that it will bite us.
Working harder often means added complexity, and complexity for things
executed relatively infrequently has the tendency to bite. I don't think
that's a simple tradeoff.
E.g. in the find_multixact_start() case the flushing isn't just about
"itself", it's also about interactions with SlruScanDirectory. Requiring
that to also check in-memory buffers wouldn't be entirely trivial... And
all that for an option that's essentially executed infrequently.
> > The real problem there imo isn't that the copy_relation_data() doesn't
> > deal with 0 block tables, but that ATExecSetTableSpace() doesn't have a
> > unlogged table specific codepath like heap_create_with_catalog() has.
>
> It looks to me like somewhere we need to do log_smgrcreate(...,
> INIT_FORKNUM) in the unlogged table case.
Yes.
> RelationCreateStorage()
> skips this for the main forknum of an unlogged table, which seems OK,
> but there's nothing that even attempts it for the init fork, which
> does not seem OK.
We unfortunately can't trivially delegate that work to
RelationCreateStorage(). E.g. heap_create() documents that only the main
fork is created :(
> I guess that logic should happen in
> ATExecSetTableSpace just after smgrcreate(dstrel, forkNum, false).
Looks like it's the easiest place.
It sounds worthwhile to check that other locations rewriting tables,
e.g. cluster/vacuum full/reindex are safe.
> > A second problem is that the smgrimmedsync() in copy_relation_data()
> > isn't called for the init fork of unlogged relations, even if it needs
> > to.
>
> That seems easy enough to fix. Can't we just sync all forks at that
> location for permanent relations, and additionally sync the init fork
> only for unlogged relations?
Yes, it's not hard, I just wanted to mention it so it doesn't get
forgotten.
- Andres
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2015-12-10 17:37:48 | Re: Tab-comletion for RLS |
Previous Message | Robert Haas | 2015-12-10 17:21:59 | Re: Minor comment update in setrefs.c |