Re: Large Table - Slow Window Functions (Better Approach?)

From: Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Large Table - Slow Window Functions (Better Approach?)
Date: 2013-03-12 13:08:31
Message-ID: CA+BdxK8uH4ex7nL3FUF9r6PjgD-Ut0-rLkJT7p6FyJA=21Kdbw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

So, I tested out the cursor approach, and it still chugs along for hours.
If the result set is large (and the available memory to process small),
does it matter what goes on within the cursor. Will it still choke trying
assemble and spit out the large result set?

On Mon, Mar 11, 2013 at 11:48 AM, Jeff Adams - NOAA Affiliate <
jeff(dot)adams(at)noaa(dot)gov> wrote:

> Thanks again. The sorting does appear to be the issue. I will test out
> your cursor idea...
>
>
> On Mon, Mar 11, 2013 at 11:34 AM, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>wrote:
>
>> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>:
>> > Pavel,
>> >
>> > Thanks for the response. I have not yet had the opportunity to use
>> cursors,
>> > but am now curious. Could you perhaps provide a bit more detail as to
>> what
>> > the implementation of your suggested approach would look like?
>>
>> an example:
>>
>> $$
>> DECLARE
>> r record;
>> prev_r record;
>>
>> BEGIN
>> FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
>> LOOP
>> IF prev_r IS NOT NULL THEN
>> /* do some counting */
>> prev_r contains previous row, r contains current row
>> do some
>> RETURN NEXT .. /* return data in defined order */
>> END IF;
>> prev_r = r;
>> END LOOP;
>>
>>
>> Probably slow part of your query is sorting - first can be accelerated
>> by index, but second (as CTE result cannot) - you can try increase
>> work_mem ??
>>
>> Regards
>>
>> Pavel
>>
>> >
>> >
>> > On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <
>> pavel(dot)stehule(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> Hello
>> >>
>> >> you can try procedural solution - use a cursor over ordered data in
>> >> plpgsql and returns table
>> >>
>> >> Regards
>> >>
>> >> Pavel Stehule
>> >>
>> >> 2013/3/11 Jeff Adams - NOAA Affiliate <jeff(dot)adams(at)noaa(dot)gov>:
>> >> > Greetings,
>> >> >
>> >> >
>> >> >
>> >> > I have a large table (~90 million rows) containing vessel positions.
>> In
>> >> > addition to a column that contains the location information
>> (the_geom),
>> >> > the
>> >> > table also contains two columns that are used to uniquely identify
>> the
>> >> > vessel (mmsi and name) and a column containing the Unix time (epoch)
>> at
>> >> > which the position information was logged. I frequently need to
>> assign
>> >> > records to vessel transits. To do this, I currently create a CTE that
>> >> > uses a
>> >> > Window function (partitioning the data by mmsi and name ordered by
>> >> > epoch) to
>> >> > examine the time that has elapsed between successive position reports
>> >> > for
>> >> > individual vessels. For every position record for a vessel (as
>> >> > identified
>> >> > using mmsi and name), if the time elapsed between the current
>> position
>> >> > record and the previous record (using the lag function) is less than
>> or
>> >> > equal to 2 hours, I assign the record a value of 0 to a CTE column
>> named
>> >> > tr_index. If the time elapsed is greater than 2 hours, I assign the
>> >> > record a
>> >> > value of 1 to the tr_index column. I then use the CTE to generate
>> >> > transit
>> >> > numbers by summing the values in the tr_index field across a Window
>> that
>> >> > also partitions the data by mmsi and name and is ordered by epoch.
>> This
>> >> > works, but is very slow (hours). The table is indexed (multi-column
>> >> > index on
>> >> > mmsi, name and index on epoch). Does anyone see a way to get what I
>> am
>> >> > after
>> >> > in a more efficient manner. What I am after is an assignment of
>> transit
>> >> > number to vessels' position records based on whether the records were
>> >> > within
>> >> > two hours of each other. The SQL that I used is provided below. Any
>> >> > advice
>> >> > would be greatly appreciated...
>> >> >
>> >> >
>> >> >
>> >> > WITH
>> >> >
>> >> > cte_01 AS
>> >> >
>> >> > (
>> >> >
>> >> > SELECT
>> >> >
>> >> > a.id,
>> >> >
>> >> > a.mmsi,
>> >> >
>> >> > a.name,
>> >> >
>> >> > a.epoch,
>> >> >
>> >> > a.the_geom
>> >> >
>> >> > CASE
>> >> >
>> >> > WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
>> >> >
>> >> > ELSE 0
>> >> >
>> >> > END AS tr_index
>> >> >
>> >> > FROM table a
>> >> >
>> >> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >> >
>> >> > )
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > SELECT
>> >> >
>> >> > a.id,
>> >> >
>> >> > a.mmsi,
>> >> >
>> >> > a.name,
>> >> >
>> >> > a.epoch,
>> >> >
>> >> > a.the_geom,
>> >> >
>> >> > 1 + sum(a.tr_index) OVER w AS transit,
>> >> >
>> >> > a.active
>> >> >
>> >> > FROM cte_01 a
>> >> >
>> >> > WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Jeff
>>
>

--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Victor Yegorov 2013-03-12 13:36:54 Re: Large Table - Slow Window Functions (Better Approach?)
Previous Message Misa Simic 2013-03-12 03:55:26 Slow concurrent processing