Doing range-predicates right (correction)

From: Mischa Sandberg <mischa(dot)sandberg(at)telus(dot)net>
To: pgsql-sql(at)postgresql(dot)org
Subject: Doing range-predicates right (correction)
Date: 2005-05-08 16:50:38
Message-ID: 1115571038.427e435ee7f3c@webmail.telus.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

A few weeks ago I posted a way to do efficient range predicate joins,
given only B-tree indexes. I've since gotten back home and looked at the
code I last used. My apologies for an offhand hasty posting.

The following is the solution I worked out when I used this method on a
large data conversion. It has the advantage (over what I posted) that
the indexable ranges fit more tightly around the actual ranges --- a
pseudo-range will never be more than twice as wide as the original range.

For those who didn't see the original, the problem was, how do you get
efficient lookups against large tables of the form:

TABLE Ranges(rangeid <sometype>, lower INT, upper INT)

... or any range-bound type that can be mapped to int (e.g. by mapping
timestamps to epoch-seconds).

... with queries like

select rangeid, sampleid
from Samples
join Ranges on Samples.val between Ranges.lower and Ranges.upper

The problem is that a Btree index on Ranges.lower (or .upper) is
ineffective; in most cases, the query planner will rightly ignore it.

One (my) solution is to map the given ranges into slightly larger
ranges, that have a small number of different widths, and all start on a
more limited set of boundaries.

One way to do this is to map all ranges to ranges that have a width that
is a power of two, and that begin on a whole multiple of that power of two.

Unfortunately, if you just map a range (width) to the smallest power of
2 greater than (upper-lower), then lower and upper may be in two
different ranges of that width. For example, if your original range is
[1003..1030] (i.e. width 28), the power of 2 that covers this range is
32, but 1003 is in the range [992..1023] and 1030 is the one above it. A
sloppy fix for this is to take the next higher power of two as the
pseudo-width.

The original solution created a new version of Ranges that had as many
rows as the original Ranges table. The following solution creates a new
version with no more than twice as many rows, but with.

-- Function to return the lowest power of two greater than
-- a given inclusive interval:

create or replace function width2(int, int) returns int
immutable language 'plpgsql' as '
begin
return pow(2, ceil(log(2, $2-$1+1)))::int;
end';

-- Construct an augmented Ranges table:

select rangeid, lower, upper, width,
start-mod(start, width) as start
into PseudoRange
from ( select rangeid, lower, upper, start,
width2(start, finish) as width
from (
select rangeid, lower, upper,
lower as start,
upper-mod(upper, width2(lower,upper))-1 as finish
from Range
union all
select rangeid, lower, upper,
upper-mod(upper, width2(lower,upper)) as start,
upper as finish
from Range
) as X where start <= finish
) as Y;

create unique index PR_start_width on PseudoRange(start,width);

The query using PseudoRange also uses a table PseudoWidth.
If (lower) and (upper) are ints, this table can at most have 31 rows
with values (1,2,4,8,...). This can also be calculated by:

select distinct width into PseudoWidth from PseudoRange;

... which will have fewer values, for proportionately faster lookups.

The lookup query becomes:

select rangeid, sampleid
from Samples, PseudoRange as PR, PseudoWidth as PW
where PR.width = PW.width
and PR.start = Samples.val - (Samples.val % PW.width)
and Samples.val between PR.lower and PR.upper

The usual query plan will do at most (!) 31 index lookups per Sample
row. If this is unacceptable, a power greater than 2 can be used.
--
"Dreams come true, not free." -- S.Sondheim

Browse pgsql-sql by date

  From Date Subject
Next Message Aarni Ruuhimäki 2005-05-08 17:20:17 Re: encoding
Previous Message Tom Lane 2005-05-08 16:43:30 Re: encoding