From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Lukas Fittl <lukas(at)fittl(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc? |
Date: | 2022-07-01 17:26:39 |
Message-ID: | 20220701172639.ty4iu5almspueriu@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
> On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> > The attached patches are really just a prototype. I'm also not really
> > planning to work on getting this into a "production ready" patchset
> > anytime soon. I developed it primarily because I found it the overhead
> > made it too hard to nail down in which part of a query tree performance
> > changed. If somebody else wants to continue from here...
> >
> > I do think it's be a pretty significant improvement if we could reduce
> > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
> > bunch of low-level code.
> >
>
> Based on an off-list conversation with Andres, I decided to dust off this
> old
> patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
> improvements (especially when using rdtsc instead of rdtsc*p*) seem to
> warrant
> giving this a more thorough look.
>
> See attached an updated patch (adding it to the July commitfest), with a few
> changes:
>
> - Keep using clock_gettime() as a fallback if we decide to not use rdtsc
Yep.
> - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.
> - In an abundance of caution, for now I've decided to only enable this if we
> are on Linux/x86, and the current kernel clocksource is TSC (the kernel
> has
> quite sophisticated logic around making this decision, see [1])
I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...
> Note that if we implemented the decision logic ourselves (instead of relying
> on the Linux kernel), I'd be most worried about older virtualization
> technology. In my understanding getting this right is notably more
> complicated
> than just checking cpuid, see [2].
> Known WIP problems with this patch version:
>
> * There appears to be a timing discrepancy I haven't yet worked out, where
> the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
> reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
> higher for \timing than for the EXPLAIN ANALYZE time reported on the
> server
> side, only when rdtsc measurement is used -- its likely there is a problem
> somewhere with how we perform the cycles to time conversion
Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.
> * Possibly related, the floating point handling for the cycles_to_sec
> variable
> is problematic in terms of precision (see FIXME, taken over from Andres'
> POC)
And probably also performance...
> Open questions from me:
>
> 1) Do we need to account for different TSC offsets on different CPUs in SMP
> systems? (the Linux kernel certainly has logic to that extent, but [3]
> suggests this is no longer a problem on Nehalem and newer chips, i.e.
> those
> having an invariant TSC)
I don't think we should cater to systems where we need that.
> 2) Should we have a setting "--with-tsc" for configure? (instead of always
> enabling it when on Linux/x86 with a TSC clocksource)
Probably not worth it.
> 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
> current instructions to finish -- the prior discussion seemed to suggest
> we don't want it for node instruction measurements, but possibly we do
> want
> this in other cases?)
I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?
> 4) Should we support using the "mrs" instruction on ARM? (which is similar
> to
> rdtsc, see [4])
I'd leave that for later personally.
> #define NS_PER_S INT64CONST(1000000000)
> #define US_PER_S INT64CONST(1000000)
> #define MS_PER_S INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>
> #define INSTR_TIME_SET_ZERO(t) ((t) = 0)
>
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;
This should be extern and inside the ifdef below.
> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
> {
> struct timespec tmp;
>
> +#if defined(__x86_64__) && defined(__linux__)
> + if (use_rdtsc)
> + return __rdtsc();
> +#endif
> +
> clock_gettime(PG_INSTR_CLOCK, &tmp);
>
> return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
> }
>
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2022-07-01 17:30:16 | Re: EINTR in ftruncate() |
Previous Message | Mark Wong | 2022-07-01 17:18:20 | Re: real/float example for testlibpq3 |