From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: pgsql(at)mohawksoft(dot)com
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Mark Kirkwood <markir(at)coretech(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-14 17:55:38
Message-ID: 4210E61A.7090503@cheapcomplexdevices.com
Lists: pgsql-hackers

pgsql(at)mohawksoft(dot)com wrote:
>
> You know, I don't think a lot of people "get" the issues I was describing,
> or maybe they don't believe it, I don't know, but I think it would be a
> useful contrib project to create an 'analyze_special('table', 'column',
> 'method')' function that does a better job of calculating the stats for
> tables that contain multiple trend waveforms. A separate function will
> probably work well, as the trends within the data probably only apply to
> specific rows.
I've done something similar, but simpler, for the Census/TIGER data.
If you loaded each TIGER file sequentially, as I did, the data was
grouped by county as it was loaded; so all the geographical columns
(zip, county, state, census tract) are actually grouped tightly on
disk, though ANALYZE can't see this because the values are not
strictly ascending or descending.
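For what it's worth, the low correlation that ANALYZE does compute
for those columns is easy to check from the pg_stats view:

   select tablename, attname, correlation
     from pg_stats
    where schemaname = 'tiger'
      and tablename  = 'rt1'
      and attname in ('zipl', 'statel', 'countyl');
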
Since I observed that the geospatial columns were all clustered
pretty well, I simply set the correlation to the same fairly large
value for all of those columns with a bunch of statements like this:

   update pg_statistic
      set stanumbers3[1] = 0.8
    where starelid = 31412043
      and staattnum = 3;

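(The magic numbers there are just the table's pg_class OID and the
column's attnum; a quick catalog query digs them up if you want to try
the same trick. Note also that the "3" in stanumbers3 is simply the
slot whose stakindN value was 3, the correlation entry, for these
columns; strictly speaking you should check the stakindN values rather
than assume slot 3.)

   select c.oid as starelid, a.attnum as staattnum
     from pg_class c
     join pg_attribute a on a.attrelid = c.oid
    where c.relname = 'rt1'
      and a.attname = 'zipl';
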
Instead of a complicated analyze function, how about just letting the
user "tell" the optimizer that a column is well clustered, with a
function like:

   force_correlation('schema', 'table', 'column', 'value')

That would work well for your data too. Since you know your distinct
values lie on a relatively small number of pages, if you merely did:

   force_correlation('tiger', 'rt1', 'zipl',    0.8);
   force_correlation('tiger', 'rt1', 'statel',  0.8);
   force_correlation('tiger', 'rt1', 'countyl', 0.8);

the optimizer would then see that not many disk pages need to be hit
for a single zip code.
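Nothing like that exists today, of course; just to make the idea
concrete, here is a rough sketch of how force_correlation() might be
written as a PL/pgSQL wrapper around the same pg_statistic update I
showed above. It assumes ANALYZE has already created a statistics row
for the column, that the correlation entry is whichever slot has
stakind = 3, and that you're a superuser (pg_statistic isn't writable
otherwise); it's an illustration, not a polished contrib module.

   create or replace function force_correlation(p_schema text,
                                                p_table  text,
                                                p_column text,
                                                p_corr   real)
   returns void as $$
   declare
       v_relid  oid;
       v_attnum int2;
   begin
       -- look up the table's OID and the column's attribute number
       select c.oid, a.attnum
         into v_relid, v_attnum
         from pg_class c
         join pg_namespace n on n.oid = c.relnamespace
         join pg_attribute a on a.attrelid = c.oid
        where n.nspname = p_schema
          and c.relname = p_table
          and a.attname = p_column;

       -- overwrite whichever of the four stat slots holds the
       -- correlation entry (stakind = 3)
       update pg_statistic
          set stanumbers1 = case when stakind1 = 3
                                 then array[p_corr] else stanumbers1 end,
              stanumbers2 = case when stakind2 = 3
                                 then array[p_corr] else stanumbers2 end,
              stanumbers3 = case when stakind3 = 3
                                 then array[p_corr] else stanumbers3 end,
              stanumbers4 = case when stakind4 = 3
                                 then array[p_corr] else stanumbers4 end
        where starelid  = v_relid
          and staattnum = v_attnum;

       return;
   end;
   $$ language plpgsql;

With that in place the three calls above work as written, and a later
ANALYZE simply overwrites the forced value, which is probably the
right behavior for a quick hack like this.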
> It's interesting, because I don't think it needs to calculate a perfect
> representation of the data so much as a better clue to its nature for
> the optimizer.
Indeed. Using the very arbitrary value 0.8 as the correlation for all
the geography-related columns in the TIGER data, the optimizer chose a
good plan almost every time on my company's 200GB geographical
database.
> When I get the time (or can get someone to pay me to do it) I'm going to
> try it.
I still suspect that the right way to do it would be to keep not the
single "correlation" but two statistics: one for estimating how
sequential or random the accesses would be, and one for estimating the
number of pages that would be hit. I think the existing correlation
does well for the first estimate but, for many data sets, poorly for
the second.
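If I'm reading costsize.c correctly, cost_index() currently makes the
single correlation value do both jobs at once, by interpolating between
its best-case (perfectly ordered) and worst-case (random) I/O estimates
roughly as

   indexed I/O cost ~ max_IO_cost + correlation^2 * (min_IO_cost - max_IO_cost)

so a column like zipl, whose values sit on relatively few pages but in
no particular order, gets charged nearly the full worst-case cost even
though few pages are actually touched.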
If you want to start a contrib project that looks into additional
stats that may help, I might be interested.
Ron