Re: Large Table - Slow Window Functions (Better Approach?)

From: Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Large Table - Slow Window Functions (Better Approach?)
Date: 2013-03-11 15:48:17
Message-ID: CA+BdxK856M1Z8oXhQ3_PofhqwhmH-9Kzbb+p=768rpu+kB5xCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thanks again. The sorting does appear to be the issue. I will test out your
cursor idea...

On Mon, Mar 11, 2013 at 11:34 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:

> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>:
> > Pavel,
> >
> > Thanks for the response. I have not yet had the opportunity to use
> cursors,
> > but am now curious. Could you perhaps provide a bit more detail as to
> what
> > the implementation of your suggested approach would look like?
>
> an example:
>
> $$
> DECLARE
> r record;
> prev_r record;
>
> BEGIN
> FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
> LOOP
> IF prev_r IS NOT NULL THEN
> /* do some counting */
> prev_r contains previous row, r contains current row
> do some
> RETURN NEXT .. /* return data in defined order */
> END IF;
> prev_r = r;
> END LOOP;
>
>
> Probably slow part of your query is sorting - first can be accelerated
> by index, but second (as CTE result cannot) - you can try increase
> work_mem ??
>
> Regards
>
> Pavel
>
> >
> >
> > On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com
> >
> > wrote:
> >>
> >> Hello
> >>
> >> you can try procedural solution - use a cursor over ordered data in
> >> plpgsql and returns table
> >>
> >> Regards
> >>
> >> Pavel Stehule
> >>
> >> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>:
> >> > Greetings,
> >> >
> >> >
> >> >
> >> > I have a large table (~90 million rows) containing vessel positions.
> In
> >> > addition to a column that contains the location information
> (the_geom),
> >> > the
> >> > table also contains two columns that are used to uniquely identify the
> >> > vessel (mmsi and name) and a column containing the Unix time (epoch)
> at
> >> > which the position information was logged. I frequently need to assign
> >> > records to vessel transits. To do this, I currently create a CTE that
> >> > uses a
> >> > Window function (partitioning the data by mmsi and name ordered by
> >> > epoch) to
> >> > examine the time that has elapsed between successive position reports
> >> > for
> >> > individual vessels. For every position record for a vessel (as
> >> > identified
> >> > using mmsi and name), if the time elapsed between the current position
> >> > record and the previous record (using the lag function) is less than
> or
> >> > equal to 2 hours, I assign the record a value of 0 to a CTE column
> named
> >> > tr_index. If the time elapsed is greater than 2 hours, I assign the
> >> > record a
> >> > value of 1 to the tr_index column. I then use the CTE to generate
> >> > transit
> >> > numbers by summing the values in the tr_index field across a Window
> that
> >> > also partitions the data by mmsi and name and is ordered by epoch.
> This
> >> > works, but is very slow (hours). The table is indexed (multi-column
> >> > index on
> >> > mmsi, name and index on epoch). Does anyone see a way to get what I am
> >> > after
> >> > in a more efficient manner. What I am after is an assignment of
> transit
> >> > number to vessels' position records based on whether the records were
> >> > within
> >> > two hours of each other. The SQL that I used is provided below. Any
> >> > advice
> >> > would be greatly appreciated...
> >> >
> >> >
> >> >
> >> > WITH
> >> >
> >> > cte_01 AS
> >> >
> >> > (
> >> >
> >> > SELECT
> >> >
> >> > a.id,
> >> >
> >> > a.mmsi,
> >> >
> >> > a.name,
> >> >
> >> > a.epoch,
> >> >
> >> > a.the_geom
> >> >
> >> > CASE
> >> >
> >> > WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
> >> >
> >> > ELSE 0
> >> >
> >> > END AS tr_index
> >> >
> >> > FROM table a
> >> >
> >> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
> >> >
> >> > )
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > SELECT
> >> >
> >> > a.id,
> >> >
> >> > a.mmsi,
> >> >
> >> > a.name,
> >> >
> >> > a.epoch,
> >> >
> >> > a.the_geom,
> >> >
> >> > 1 + sum(a.tr_index) OVER w AS transit,
> >> >
> >> > a.active
> >> >
> >> > FROM cte_01 a
> >> >
> >> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
> >> >
> >> >
> >> >
> >> > --
> >> > Jeff
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Benjamin Krajmalnik 2013-03-11 15:53:33 The dreaded semwait on FreeBSD
Previous Message Pavel Stehule 2013-03-11 15:34:17 Re: Large Table - Slow Window Functions (Better Approach?)