One Big trend vs multiple smaller trends in table statistics

From: pgsql(at)mohawksoft(dot)com
To: pgsql-hackers(at)postgresql(dot)org
Subject: One Big trend vs multiple smaller trends in table statistics
Date: 2005-02-08 16:14:01
Message-ID: 16739.24.91.171.78.1107879241.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

A couple of us using the US Census TIGER database have noticed something
about how analyze gathers statistics. If you follow the thread "Query
Optimizer 8.0.1" you'll see the progression of the debate.

To summarize what I think we've seen:

The current implementation of analyze is designed around sampling a table
to characterize the basic trend of the data. The problem with this approach
is that it assumes the data has a single, table-wide trend.

Greg Stark's post "Cross column statistics" touches on the general problem.

The best analogy so far is the difference between an oscilloscope and a
spectrum analyzer. The current statistics gathering is like a sampling
oscilloscope trying to display a single wave form.

Some data trends are more like audio signals where the data has many
smaller trends in a seemingly random stream. With a spectrum analyzer you
can see the various components. Use Winamp or XMMS for a visualization.

Let's assume the data is in a multi-column sort order, say a set of street
addresses sorted by:

state, streetname, streettype, address

MA, ABBOT, RD, 100
MA, ABBOT, RD, 200
MA, ABBOT, RD, 300
MA, ABBOT, ST, 100
MA, ABBOT, ST, 200
MA, MAPLE, RD, 100
MA, MAPLE, RD, 200
MA, MAPLE, ST, 100
...
...
WY, ABBOT, RD, 100
etc.

This table has MILLIONS of rows, every single address in the country. The
"trend" of state is clearly an increasing step ramp over the entire table.
The trend of streetname can be imagined as a waveform of a series of ramps
for each state. The trend of streettype, similarly, is a series of ramps
per street name, and the wave form for address is a ramp for each
streettype.

The statistics PostgreSQL currently employs will work great for "state,"
but much less so for "streetname."

A query of "select * from addresses where streetname = 'ABBOT'" will be
estimated as more expensive than it really is. Most of the ABBOTs will be
stored together in about 50 clusters (one for each state, assuming every
state has at least one "ABBOT"), but the current stats are not designed to
detect this.

Yes, eventually, if the sub-trends are small enough, the index scans
become more expensive than table scans, but the current stats can't tell
where that point is. Clearly it is not at the secondary sort (or
"streetname") level.

I've found that increasing the sample size in analyze.c can help in
specific cases, but the overall problem remains.

The question is: Is this really a problem? If so, what can we do?

I was thinking of trying to compute a sliding-window standard deviation,
which should be able to detect the smaller trends within an overall table;
this would require a lot of work in analyze.c.

If the sliding-window deviation is low, then the column's correlation
should be increased, telling the planner that an index scan is a better
choice. The actual math behind the values has to be worked out, of course,
but what do you think about the idea?
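Here is a rough sketch of how the sliding-window measurement might look.
To be clear, this is illustration only, not proposed analyze.c code; the
function names, the fixed window size, and feeding it the column's value
ranks in physical order are all assumptions on my part. The idea is to
compare the average standard deviation inside a small window of adjacent
sample rows against the standard deviation of the whole sample:

#include <math.h>
#include <stddef.h>

static double
stddev(const double *v, size_t n)
{
    double sum = 0, sumsq = 0;

    for (size_t i = 0; i < n; i++)
    {
        sum += v[i];
        sumsq += v[i] * v[i];
    }

    double mean = sum / n;
    double var = sumsq / n - mean * mean;

    return var > 0 ? sqrt(var) : 0;       /* guard against rounding */
}

/*
 * Returns a value in (0, 1]: near 0 when the sample is locally ordered
 * (many small ramps), near 1 when it is essentially random.
 */
double
local_disorder(const double *ranks, size_t nranks, size_t window)
{
    double total = stddev(ranks, nranks);
    double acc = 0;
    size_t nwin = 0;

    if (nranks < window)
        return 1.0;                       /* too few rows to judge */
    if (total == 0)
        return 0.0;                       /* single value: perfectly clustered */

    for (size_t i = 0; i + window <= nranks; i += window, nwin++)
        acc += stddev(ranks + i, window);

    return (acc / nwin) / total;
}

For the streetname-style data above (50 ramps of 1000 values) a window of
100 gives a ratio of roughly 0.1, while randomly ordered data stays near
1.0, so a low number could be used to push the stored correlation toward 1,
as suggested above.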
