Re: slower merge join on sorted data chosen over

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: slower merge join on sorted data chosen over
Date: 2005-10-17 19:55:03
Message-ID: 20051017195503.GR86144@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 11, 2005 at 10:58:58AM +0100, Simon Riggs wrote:
> On Mon, 2005-10-10 at 15:14 -0500, Kevin Grittner wrote:
> > We are looking at doing much more with PostgreSQL over the
> > next two years, and it seems likely that this issue will come up
> > again where it is more of a problem. It sounded like there was
> > some agreement on HOW this was to be fixed, yet I don't see
> > any mention of doing it in the TODO list.
>
> > Is there any sort of
> > estimate for how much programming work would be involved?
>
> The main work here is actually performance testing, not programming. The
> cost model is built around an understanding of the timings and costs
> involved in the execution.
>
> Once we have timings to cover a sufficiently large range of cases, we
> can derive the cost model. Once derived, we can program it. Discussing
> improvements to the cost model without test results is never likely to
> convince people. Everybody knows the cost models can be improved, the
> only question is in what cases? and in what ways?
>
> So deriving the cost model needs lots of trustworthy test results that
> can be assessed and discussed, so we know how to improve things. [...and
> I don't mean 5 minutes with pg_bench...]
>
> Detailed analysis such as that is time consuming and also needs to be
> done in a sufficiently reproducible manner that we can rely on it.
>
> Your help would be greatly appreciated in that area. You and your team
> clearly have an eye for the fine detail of these issues.
>
> ...IIRC there is a TODO item relating to that.
>
> Perhaps we should put a more general call out on the TODO list towards
> detailed, complete, accurate and reproducible performance test results?

I touched on some of this in
http://archives.postgresql.org/pgsql-performance/2005-05/msg00336.php:

"In terms of a testing system, here's what I'm thinking of. For each cost
estimate, there will be a number of input variables we want to vary, and
then check to see how changes in them effect run time. Using index scan
as a simple example, 1st order variables will likely be index and table
size (especially in relation to cache size), and correlation. So, we
need some kind of a test harness that can vary these variables
(prefferably one at a time), and run a test case after each change. It
would then need to store the timing info, possibly along with other
information (such as explain output). If I'm the one to write this it'll
end up in perl, since that's the only language I know that would be able
to accomplish this. DBT seems to be a reasonable test database to do
this testing with, especially since it would provide a ready means to
provide external load."

Of course, this work can be done by hand, but it's a very slow, tedeous
process. It's also rather error-prone.

There's been some discussion on the osdl-dbt mailing lists about
providing a front-end that would allow for scheduling tests and storing
results in a database where you could data-mine more easily than you
currently can (currently everything's just stored as files on a disk
somewhere). ISTM that having that would make this kind of testing much
easier to do. But I've also been working with dbt quite a bit recently,
so it's my hammer that makes everything performance related look like a
nail...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2005-10-17 20:30:24 Re: slower merge join on sorted data chosen over
Previous Message Jim C. Nasby 2005-10-17 19:17:20 Re: slower merge join on sorted data chosen over nested loop