Re: optimize self-join query

From: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
To: Ty Busby <tybusby(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: optimize self-join query
Date: 2011-10-28 05:12:41
Message-ID: CANnCtn+_P7=_y4aLHJCAJCvcR_4korJ7fN-r8yME0XYSDVCOvw@mail.gmail.com
Lists: pgsql-sql

On Tue, Oct 25, 2011 at 2:37 PM, Ty Busby <tybusby(at)gmail(dot)com> wrote:

> I have a table that stores a very large starting number called
> epc_start_numeric and a quantity. I've apparently built the most
> inefficient query possible for doing the job I need: find out if any records
> overlap. Imagine the epc_start_numeric + quantity representing a block of
> numbers. I need to find out if any of these blocks overlap.
>
> Here is the table:
>
> CREATE TABLE ftp_epc_audit
> (
> record_id serial NOT NULL,
> sent_filename text NOT NULL,
> pcid text NOT NULL,
> tsid text NOT NULL,
> user_sec_id text NOT NULL,
> format_set text NOT NULL,
> format text NOT NULL,
> epc_start text NOT NULL,
> quantity integer,
> epc_decimal_start numeric(29)
> )
> WITH OIDS;
> ALTER TABLE ftp_epc_audit OWNER TO postgres;
>

Not having access to your dataset, I created the following test data:

DROP TABLE IF EXISTS ftp_epc_audit;
CREATE TABLE ftp_epc_audit (
record_id serial NOT NULL,
quantity integer,
epc_decimal_start numeric(29)
)
WITH OIDS;
ALTER TABLE ftp_epc_audit ADD PRIMARY KEY(record_id);

INSERT INTO ftp_epc_audit (epc_decimal_start, quantity)
SELECT
generate_series(5, 10000, 5) AS epc_decimal_start,
round(ceiling(20 * random()) / ceiling(5 * random()))::int AS quantity
;

The strange final column creates something like a stepped Poisson
distribution: most values of quantity are < 5, but they can go as high as
20. Since epc_decimal_start increases in steps of five, this ensures we
have many records with nonoverlapping ranges and many that overlap one,
two, or three adjacent ranges.
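As a quick sanity check on the generated data (just for inspection, not
part of the solution), you can look at the distribution of quantity:

-- Inspect the distribution of quantity in the test data;
-- most values should be small, with a tail up to 20.
SELECT quantity, count(*)
FROM ftp_epc_audit
GROUP BY quantity
ORDER BY quantity;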

>
> And the query is currently this:
>
> SELECT '', 0, a.*, '', 0, b.* FROM ftp_epc_audit_staging a,
> ftp_epc_audit_staging b
> WHERE a.sent_filename <> b.sent_filename
> AND a.quantity > 0
> AND b.quantity > 0
> AND a.locked = 1
> AND b.locked = 1
> AND(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND
> b.epc_decimal_start + b.quantity - 1 )
> OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND
> b.quantity - 1 )
> OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND
> a.epc_decimal_start + a.quantity -1 > b.epc_decimal_start + b.quantity - 1
> ))
>

I rewrote this to run against my pared down 2000 record test table as:

SELECT
*
FROM
ftp_epc_audit a CROSS JOIN ftp_epc_audit b
WHERE
a.record_id <> b.record_id AND
(( a.epc_decimal_start BETWEEN b.epc_decimal_start AND b.epc_decimal_start + b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 BETWEEN b.epc_decimal_start AND b.quantity - 1 )
OR ( a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start AND a.epc_decimal_start + a.quantity - 1 > b.epc_decimal_start + b.quantity - 1 ))
;

The conditions inside AND ((…)) look unnecessarily complex. In fact, the
last OR (…) clause can never evaluate to TRUE, but it does hurt query
performance.

If x = b.epc_decimal_start
and y = b.epc_decimal_start + b.quantity - 1
and quantity is an integer restricted by an earlier clause to be > 0
Then x ≤ y
Therefore there is no z such that z < x and z > y

Testing this on the 2000 record table, dropping that clause alone made the
query time drop from ~20 seconds to ~12 seconds. I also confirmed that
dropping that clause returns the same number of records.
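You can confirm that the clause is dead weight by running it in isolation;
with quantity restricted to > 0 it can never match:

-- The last OR clause by itself: expect a count of 0,
-- since no value can be both below and above a nonempty range.
SELECT count(*)
FROM ftp_epc_audit a CROSS JOIN ftp_epc_audit b
WHERE a.record_id <> b.record_id
AND a.quantity > 0 AND b.quantity > 0
AND a.epc_decimal_start + a.quantity - 1 < b.epc_decimal_start
AND a.epc_decimal_start + a.quantity - 1 > b.epc_decimal_start + b.quantity - 1;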

The final four lines can be further simplified to

…AND
a.epc_decimal_start < b.epc_decimal_start
AND b.epc_decimal_start < a.epc_decimal_start + a.quantity

This says, for each record, compare only the records with a greater range
start, and return any record whose range starts before my range stops. This
also returns the same number of records as the previous version and on the
2000 record table runs in ~2.6 seconds.
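Spelled out in full against the test table, the simplified query reads:

-- Same record count as the OR version, but a single overlap
-- test: b's range starts inside a's range.
SELECT *
FROM ftp_epc_audit a CROSS JOIN ftp_epc_audit b
WHERE a.epc_decimal_start < b.epc_decimal_start
AND b.epc_decimal_start < a.epc_decimal_start + a.quantity;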

There are further speed gains from using a WITH clause so that the range
limits are only evaluated once (~2.0 seconds), and from using a window
function to ORDER the recordset by range start, then comparing the rank
instead of the range start (~1.5 seconds). It ends up looking like this:

WITH rank_epc AS (
SELECT
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a CROSS JOIN rank_epc b
WHERE
a.rank < b.rank AND b.range_low < a.range_high
;

This ran on the 2000 record table in ~1.5 seconds and on a 20,000 record
dataset in 172 seconds.
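One more thing worth trying (I haven't benchmarked this against your data,
so treat it as a suggestion, and the index name here is just a placeholder):
an index on the range start column may help the sort behind the window
function and the range-start comparisons:

-- Untested suggestion: index the range start column.
CREATE INDEX idx_ftp_epc_audit_start
ON ftp_epc_audit (epc_decimal_start);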

Finally, depending on your use case and what you want to do once you
identify the problematic records, it might be useful to only compare
adjacent ranges to see if they overlap. If adjacent ranges are then "fixed"
so that they don't overlap anymore, there won't be any left that overlap two
or three or more ranges. This can be accomplished with:

WITH rank_epc AS (
SELECT
record_id,
epc_decimal_start AS range_low,
epc_decimal_start + quantity AS range_high,
rank() OVER (ORDER BY epc_decimal_start)
FROM
ftp_epc_audit
)
SELECT
*
FROM
rank_epc a JOIN rank_epc b
ON (a.rank = b.rank - 1 AND b.range_low < a.range_high)
;

This ran on the 20,000 record dataset in 300 milliseconds.

Hope this helps.

--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/
