From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
Cc: | pgsql-hackers(at)postgresql(dot)org, "Stephen R(dot) van den Berg" <srb(at)cuci(dot)nl> |
Subject: | Re: Is it really such a good thing for newNode() to be a macro? |
Date: | 2008-08-29 19:16:49 |
Message-ID: | 26133.1220037409@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Tom Lane wrote:
>> I considered that one, but since part of my argument is that inlining
>> this is a waste of code space, it seems like a better inlining
>> technology isn't really the answer.
> The compiler presumably has the intelligence and the command-line options to
> control how much inlining one wants to do. But without any size vs.
> performance measurements it is an idle discussion. Getting rid of a global
> variable and macro ugliness is a worthwhile goal of its own.
I got around to doing some experiments. The method suggested by Heikki
(out-of-line subroutine for everything except the MemSetTest) reduces
the size of the backend executable by about 0.5% (about 20K) in CVS HEAD
on Fedora 9 x86_64, in a non-assert-enabled build. However it also
makes it measurably slower. I couldn't detect any difference in a
regular pgbench run, so instead I timed iterations of this:
explain select * from tenk1 a join tenk1 b using(unique1)
join tenk1 c on a.unique1 = c.unique2
join tenk1 d on a.unique1 = d.thousand
join tenk1 e on a.unique1 = e.ten
join tenk1 f on a.unique1 = f.tenthous
join tenk1 g on a.unique1 = g.unique2
where exists(select 1 from int4_tbl where f1 = b.unique2);
in the regression database. Put the above (as a single line!) into
"explainjoin.sql" and try
pgbench -c 1 -t 100 -n -f explainjoin.sql regression
This is mostly stressing the planner, which is pretty newNode-heavy.
I get consistently about 14.1 tps on straight CVS HEAD and about 13.8
with the partially out-of-line implementation.
I also tried the "static inline" implementation, but that doesn't work
at all: gcc refuses to inline it, which makes the palloc0fast call a
dead loss.
So indeed what we need here is a better inlining technology. I looked
into using gcc's "a compound statement enclosed in parentheses is an
expression" extension, thus:
#define newNode(size, tag) \
({ Node *newNodeMacroHolder; \
AssertMacro((size) >= sizeof(Node)); /* need the tag, at least */ \
newNodeMacroHolder = (Node *) palloc0fast(size); \
newNodeMacroHolder->type = (tag); \
newNodeMacroHolder; \
})
This gets rid of the global, but incredibly, it's even slower: 13.5 tps
on the explain test. I do not understand that result. I looked at the
generated machine code to verify that it was what I expected, and indeed
it's about the same as CVS HEAD except that there's no store-and-fetch
into a global.
Getting rid of the global variable accesses reduces the size of the
backend by about 12K on this architecture, and the only theory I can
think of is that that moves things around enough to make the instruction
cache less efficient on some code path that this test happens to
exercise heavily.
In theory the above implementation of newNode should be a clear win,
so I'm thinking this result must be an artifact of some kind. I'm
going to go try it on PPC and HPPA machines next; does anyone want to
try it on something else?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Treat | 2008-08-29 21:01:17 | Re: [patch] GUC source file and line number |
Previous Message | D'Arcy J.M. Cain | 2008-08-29 18:32:01 | Re: Proposal: new border setting in psql |