Re: That EXPLAIN ANALYZE patch still needs work

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: That EXPLAIN ANALYZE patch still needs work
Date: 2006-06-06 22:17:30
Message-ID: 23740.1149632250@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> This doesn't make any sense at all. How can a sampling run that only
> sampled 7% of the actual tuples end up with an actual measured time
> that's more than 50% of the actual final runtime?

AFAICS, the only conclusion is that the sampled executions are in fact
taking longer than supposedly-equivalent unsampled ones, and by a pretty
good percentage too. I'm growing unsure again about the mechanism
responsible for that, however.

> Can you get an estimate of the actual overhead (ie compare wall clock
> time of EXPLAIN ANALYZE vs actual query). The cost of gettimeofday() is
> on the order of a few hundred cycles, I'm not sure about the cost of
> TLB flushes (that's probably highly architecture dependent).

Here are some examples. Keep in mind I've already determined that
gettimeofday() takes about 3 usec on this hardware ...
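
For reference, a standalone toy loop along the lines below is enough to
put a ballpark number on that; it's only an illustration (not anything
from the tree), and it only measures the userspace-visible call cost:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	struct timeval tv;
	double		start,
				stop;
	int			i;
	const int	n = 1000000;

	/* one reading for the start time */
	gettimeofday(&tv, NULL);
	start = tv.tv_sec + tv.tv_usec / 1e6;

	/* hammer gettimeofday() n times */
	for (i = 0; i < n; i++)
		gettimeofday(&tv, NULL);

	/* and one more reading for the stop time */
	gettimeofday(&tv, NULL);
	stop = tv.tv_sec + tv.tv_usec / 1e6;

	printf("%.3f usec per gettimeofday() call\n",
		   (stop - start) * 1e6 / n);
	return 0;
}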

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=797.412..797.416 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.208..2576.528 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=208.190..2082.577 rows=10000 loops=1)
               Hash Cond: (a.unique1 = b.unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.082..3.718 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=207.933..207.933 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.017..3.583 rows=10000 loops=1)
 Total runtime: 805.036 ms
(8 rows)

Time: 816.463 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
count
-------
10000
(1 row)

Time: 816.970 ms

The actual elapsed time for EXPLAIN ANALYZE seems to jump around quite
a bit, probably because of the random variation we're using in the sampling
interval. This particular combination was unusually close. But in any
case, the *actual* overhead of EXPLAIN ANALYZE is clearly pretty small
here; the problem is that we're incorrectly extrapolating the measured
runtime to the unmeasured executions.
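
(To spell out what I mean by "extrapolating": as I understand the sampling
logic, the figure reported for a node is essentially

	estimated_total_time = time_spent_in_sampled_executions
			       / number_of_sampled_executions
			       * total_number_of_executions

so if the sampled executions run, say, 2.5 times slower than unsampled
ones -- because the act of measuring is what slows them down -- the
reported total comes out 2.5 times too high, even though the real overhead
added to the query is small. That is consistent with the Hash Join line
above claiming ~2083 ms inside a query whose total wall-clock time was
~805 ms.)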

What's especially interesting is that the excess time doesn't seem to
show up if I form the query in a way that doesn't require pushing as
much data around:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1341.00..1341.01 rows=1 width=0) (actual time=212.313..212.317 rows=1 loops=1)
   ->  Hash Join  (cost=483.00..1316.00 rows=10000 width=0) (actual time=88.061..160.886 rows=10000 loops=1)
         Hash Cond: (a.unique1 = b.unique2)
         ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.071..4.068 rows=10000 loops=1)
         ->  Hash  (cost=458.00..458.00 rows=10000 width=4) (actual time=87.862..87.862 rows=10000 loops=1)
               ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=4) (actual time=0.031..4.780 rows=10000 loops=1)
 Total runtime: 221.022 ms
(7 rows)

Time: 229.377 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2) ss;
count
-------
10000
(1 row)

Time: 202.531 ms
regression=#

(Without the OFFSET 0, the planner flattens the subquery and discovers
that it doesn't actually need to fetch any of the non-join-key table
columns.) Note the only plan nodes showing whacked-out timings are the
ones returning wide tuples (large "width" values). I'm not entirely
sure what to make of this. It could be interpreted as evidence for my
theory about TLB reloads during userspace data access being the problem.
But I'm getting a bit disenchanted with that theory after running the
same test case in 8.1:

regression=# explain analyze select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2609.00..2609.01 rows=1 width=0) (actual time=1033.866..1033.870 rows=1 loops=1)
   ->  Limit  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.961..989.272 rows=10000 loops=1)
         ->  Hash Join  (cost=825.00..2484.00 rows=10000 width=488) (actual time=330.946..920.398 rows=10000 loops=1)
               Hash Cond: ("outer".unique1 = "inner".unique2)
               ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.050..66.861 rows=10000 loops=1)
               ->  Hash  (cost=458.00..458.00 rows=10000 width=244) (actual time=330.745..330.745 rows=10000 loops=1)
                     ->  Seq Scan on tenk1 b  (cost=0.00..458.00 rows=10000 width=244) (actual time=0.010..65.890 rows=10000 loops=1)
 Total runtime: 1041.293 ms
(8 rows)

Time: 1051.594 ms
regression=# select count(*) from (select * from tenk1 a join tenk1 b on a.unique1 = b.unique2 offset 0) ss;
count
-------
10000
(1 row)

Time: 870.348 ms

Both runtimes are pretty repeatable. That says that on this example,
where we have sampled 40000 node executions (remember HASH is only
called once), the overhead of EXPLAIN ANALYZE is about 4.5 usec per
node, which is not very far from twice the gettimeofday() time;
it certainly doesn't show any evidence for a huge additional overhead
due to TLB flushes.
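
(Working that arithmetic out explicitly, treating the difference between
the two psql wall-clock times as pure instrumentation cost:

	(1051.594 ms - 870.348 ms) / 40000 executions ~= 4.5 usec/execution

versus roughly 2 x 3 usec = 6 usec for the pair of gettimeofday() calls
bracketing each node execution.)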

One thing I was wondering about was whether the samplings of upper nodes
might be correlating to a greater-than-random extent with the samplings
of lower nodes. That would result in the observed problem of
overestimating upper nodes. It doesn't explain the apparent dependence
on targetlist width, though. So I'm confused again.
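
For what it's worth, a toy simulation along the lines below (all the
constants are made up; this only illustrates the arithmetic, it is not
the real instrumentation code) shows how much such a correlation could
matter. A "child" node costs 10 usec per tuple plus 6 usec of timing
overhead on the tuples it samples; an "upper" node adds 5 usec of its own
work, and its measured interval necessarily includes whatever the child
spent underneath it:

#include <stdio.h>
#include <stdlib.h>

#define ITERS		10000	/* tuples through each node */
#define CHILD_WORK	10.0	/* real child cost per tuple, usec */
#define UPPER_WORK	5.0	/* real upper-node cost per tuple, usec */
#define TIMER_COST	6.0	/* one pair of gettimeofday() calls, usec */

int
main(void)
{
	double		true_total = ITERS * (CHILD_WORK + UPPER_WORK);
	double		sum_indep = 0,
				sum_corr = 0;
	int			n_indep = 0,
				n_corr = 0;
	int			i;

	srand(1);
	for (i = 0; i < ITERS; i++)
	{
		/* each node nominally samples about 7% of its executions */
		int			child_sampled = (rand() % 100) < 7;
		int			upper_indep = (rand() % 100) < 7;
		int			upper_corr = child_sampled;	/* fully correlated */

		/* wall-clock interval the upper node would see for this tuple */
		double		interval = UPPER_WORK + CHILD_WORK +
					(child_sampled ? TIMER_COST : 0);

		if (upper_indep)
		{
			sum_indep += interval;
			n_indep++;
		}
		if (upper_corr)
		{
			sum_corr += interval;
			n_corr++;
		}
	}

	printf("true total:                 %.0f usec\n", true_total);
	printf("extrapolated (independent): %.0f usec\n",
		   sum_indep / n_indep * ITERS);
	printf("extrapolated (correlated):  %.0f usec\n",
		   sum_corr / n_corr * ITERS);
	return 0;
}

With these made-up numbers the independent case overestimates the true
total by only a few percent, while the fully correlated case overestimates
it by about 40%; partial correlation lands somewhere in between.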

> To be honest, I wonder about caching effects, but for the disks.

All of the above are repeated executions that should be fully cached
in RAM.

regards, tom lane
