Re: Improve PostGIS performance with 62 million rows?

From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
Cc: Israel Brewster <israel(at)ravnalaska(dot)net>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Improve PostGIS performance with 62 million rows?
Date: 2017-01-05 22:38:07
Message-ID: CAJvUf_vckEcm3+cDNaGDNvjcFDZ3yBumYL8gz5ahvNHnoKioxQ@mail.gmail.com
Lists: pgsql-general

Hey,
1 sec seems really good in this case,
and I'm assuming you have tuned Postgres so the main index fits into RAM
(work_mem and all the other settings).
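
For illustration, tuning along these lines is what I mean -- the values are
only a guess for a 64GB machine like yours, not your actual settings, and
shared_buffers needs a restart to take effect:

ALTER SYSTEM SET shared_buffers = '16GB';        -- let a large share of the index stay cached
ALTER SYSTEM SET effective_cache_size = '48GB';  -- hint to the planner about OS cache
ALTER SYSTEM SET work_mem = '256MB';             -- per-sort / per-hash memory
SELECT pg_reload_conf();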

You could avoid a CTE by merging the two CTEs into one:

WITH pts AS (
  SELECT (pt).geom, (pt).path[1] AS vert
  FROM ST_DumpPoints(
         ST_Segmentize(
           ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
           600
         )::geometry
       ) AS pt
)
SELECT elevation
FROM data
INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) AS short_line
            FROM pts a
            INNER JOIN pts b ON a.vert = b.vert - 1 AND b.vert > 1) segments
  ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC LIMIT 1;

Then you could remove the unnecessary (and potentially explosive, if you have
a large number of dump points) self-join on points:
"FROM pts a
INNER JOIN pts b "

You could simply use a window function to generate the segments, as done here
<https://github.com/Remi-C/PPPP_utilities/blob/master/postgis/rc_DumpSegments.sql#L51>.
The idea is to dump the points, order them by path, and then link each point
to the previous one (the lag() window function).
Assuming you don't want to use the available function,
this would be something like:

WITH segments AS (
  SELECT ST_MakeLine(lag((pt).geom, 1, NULL) OVER (ORDER BY (pt).path),
                     (pt).geom) AS short_line
  FROM ST_DumpPoints(
         ST_Segmentize(
           ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
           600
         )::geometry
       ) AS pt
)
SELECT elevation
FROM data, segments
WHERE segments.short_line IS NOT NULL -- the first segment is NULL by design (lag function)
  AND ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC
LIMIT 1;

I don't know if you can further improve this query after that,
but I'd guess it would reduce your time and scale more safely.

If you want to further improve your result,
you'll have to reduce the number of rows in your index,
that is, partition your table into several tables!

This is not easy to do with current Postgres partitioning methods as far
as I know
(partitioning is easy, automatically routing queries efficiently is hard).
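
Just to sketch the idea (untested, names are made up, plain 9.6
inheritance-style partitioning on latitude bands -- constraint exclusion can
prune children on a latitude predicate, but not on ST_DWithin alone, which is
exactly the hard part):

-- children inherit latitude/longitude/elevation/location from "data"
CREATE TABLE data_lat_60_65 (CHECK (latitude >= 60 AND latitude < 65)) INHERITS (data);
CREATE TABLE data_lat_65_70 (CHECK (latitude >= 65 AND latitude < 70)) INHERITS (data);
-- each child gets its own, much smaller, spatial index
CREATE INDEX data_lat_60_65_gix ON data_lat_60_65 USING GIST (location);
CREATE INDEX data_lat_65_70_gix ON data_lat_65_70 USING GIST (location);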

Another way would be to reduce your requirements, and consider that in some
cases you may want less detail in the altimetry, which would allow you to
use a Level Of Detail approach.
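
As a rough, untested sketch of that (the thinning rule and the table name are
only an example; pick whatever resolution fits your use case):

-- keep only the highest elevation per ~0.01 degree cell in a small side table
CREATE TABLE data_coarse AS
SELECT DISTINCT ON (ST_SnapToGrid(location::geometry, 0.01))
       elevation, location
FROM data
ORDER BY ST_SnapToGrid(location::geometry, 0.01), elevation DESC;
CREATE INDEX data_coarse_gix ON data_coarse USING GIST (location);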

Congrats on the well-explained query/problem anyway!
Cheers,
Rémi-C

2017-01-05 23:09 GMT+01:00 Paul Ramsey <pramsey(at)cleverelephant(dot)ca>:

> Varying the segment length upwards might have a salutary effect for a
> while, as the efficiency improvement of fewer inner loops battles with the
> inefficiency of having more points selected by the index filter. Worth an
> experiment.
>
> P
>
> On Thu, Jan 5, 2017 at 1:00 PM, Israel Brewster <israel(at)ravnalaska(dot)net>
> wrote:
>
>>
>> On Jan 5, 2017, at 10:38 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
>> wrote:
>>
>> Yes, you did. You want a query that spits out a tupleset of geometries
>> (one each for each wee segment), and then you can join that set to your
>> main table using st_dwithin() as the join clause.
>> So start by ditching the main table and just work on a query that
>> generates a pile of wee segments.
>>
>>
>> Ahhh, I see you've done this sort of thing before
>> (http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html) :-)
>>
>> So following that advice I came up with the following query:
>>
>> WITH dump AS (SELECT
>> ST_DumpPoints(
>> ST_Segmentize(
>> ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056
>> 61.179167,-156.77 71.285833)'),
>> 600
>> )::geometry
>> ) as pt
>> ),
>> pts AS (
>> SELECT (pt).geom, (pt).path[1] as vert FROM dump
>> )
>> SELECT elevation
>> FROM data
>> INNER JOIN (SELECT
>> ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
>> FROM pts a
>> INNER JOIN pts b
>> ON a.vert=b.vert-1 AND b.vert>1) segments
>> ON ST_DWithin(location, segments.short_line, 600)
>> ORDER BY elevation DESC limit 1;
>>
>> Which yields the following EXPLAIN ANALYZE (https://explain.depesz.com/s/RsTD):
>>
>>
>> QUERY PLAN
>>
>>
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>> --------------------------------------------------------
>> Limit (cost=11611706.90..11611706.91 rows=1 width=4) (actual
>> time=1171.814..1171.814 rows=1 loops=1)
>> CTE dump
>> -> Result (cost=0.00..5.25 rows=1000 width=32) (actual
>> time=0.024..1.989 rows=1939 loops=1)
>> CTE pts
>> -> CTE Scan on dump (cost=0.00..20.00 rows=1000 width=36) (actual
>> time=0.032..4.071 rows=1939 loops=1)
>> -> Sort (cost=11611681.65..11611768.65 rows=34800 width=4) (actual
>> time=1171.813..1171.813 rows=1 loops=1)
>> Sort Key: data.elevation DESC
>> Sort Method: top-N heapsort Memory: 25kB
>> -> Nested Loop (cost=0.55..11611507.65 rows=34800 width=4)
>> (actual time=0.590..1167.615 rows=28408 loops=1)
>> -> Nested Loop (cost=0.00..8357.50 rows=1665 width=64)
>> (actual time=0.046..663.475 rows=1938 loops=1)
>> Join Filter: (a.vert = (b.vert - 1))
>> Rows Removed by Join Filter: 3755844
>> -> CTE Scan on pts b (cost=0.00..22.50 rows=333
>> width=36) (actual time=0.042..0.433 rows=1938 loops=1)
>> Filter: (vert > 1)
>> Rows Removed by Filter: 1
>> -> CTE Scan on pts a (cost=0.00..20.00 rows=1000
>> width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
>> -> Index Scan using location_gix on
>> data (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256
>> rows=15 loops=1938)
>> Index Cond: (location &&
>> _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography,
>> '600'::double precision))
>> Filter: (((st_makeline(ARRAY[a.geom,
>> b.geom]))::geography && _st_expand(location, '600'::double precision)) AND
>> _st_dwithin(location, (st_makeline(ARRAY[a.geom,
>> b.geom]))::geography, '600'::double precision, true))
>> Rows Removed by Filter: 7
>> Planning time: 4.318 ms
>> Execution time: 1171.994 ms
>> (22 rows)
>>
>> So not bad. Went from 20+ seconds to a little over 1 second. Still
>> noticeable for an end user, but definitely usable - and as mentioned,
>> that's a worst-case scenario query. Thanks!
>>
>> Of course, if you have any suggestions for further improvement, I'm all
>> ears :-)
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
>>
>>
>> On Thu, Jan 5, 2017 at 11:36 AM, Israel Brewster <israel(at)ravnalaska(dot)net>
>> wrote:
>>
>>> On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca>
>>> wrote:
>>>
>>>
>>> The index filters using bounding boxes. A long, diagonal route will
>>> have a large bounding box, relative to the area you actually care about
>>> (within a narrow strip of the route). Use ST_Segmentize() to add points to
>>> your route, ST_DumpPoints() to dump those out as points and ST_MakeLine to
>>> generate new lines from those points, each line very short. The maximum
>>> index effectiveness will come when your line length is close to your buffer
>>> width.
>>>
>>> P
>>>
>>>
>>> Ok, I think I understand the concept. So attempting to follow your
>>> advice, I modified the query to be:
>>>
>>> SELECT elevation
>>> FROM data
>>> WHERE
>>> ST_DWithin(
>>> location,
>>> (SELECT ST_MakeLine(geom)::geography as split_line
>>> FROM (SELECT
>>> (ST_DumpPoints(
>>> ST_Segmentize(
>>> ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056
>>> 61.179167,-156.77 71.285833)'),
>>> 600
>>> )::geometry
>>> )).geom
>>> ) s1),
>>> 600
>>> )
>>> ORDER BY elevation DESC limit 1;
>>>
>>> It took some fiddling to find a syntax that PostgreSQL would accept, but
>>> eventually that's what I came up with. Unfortunately, far from improving
>>> performance, it killed it - in running the query, it went from 22 seconds
>>> to several minutes (EXPLAIN ANALYZE has yet to return a result). Looking at
>>> the query execution plan shows, at least partially, why:
>>>
>>> QUERY PLAN
>>>
>>> ------------------------------------------------------------
>>> ------------------
>>> Limit (cost=17119748.98..17119748.98 rows=1 width=4)
>>> InitPlan 1 (returns $0)
>>> -> Aggregate (cost=17.76..17.77 rows=1 width=32)
>>> -> Result (cost=0.00..5.25 rows=1000 width=32)
>>> -> Sort (cost=17119731.21..17171983.43 rows=20900890 width=4)
>>> Sort Key: data.elevation DESC
>>> -> Seq Scan on data (cost=0.00..17015226.76 rows=20900890
>>> width=4)
>>> Filter: st_dwithin(location, $0, '600'::double precision)
>>> (8 rows)
>>>
>>> So apparently it is now doing a sequential scan on data rather than
>>> using the index. And, of course, sorting 20 million rows is not trivial
>>> either. Did I do something wrong with forming the query?
>>>
>>> -----------------------------------------------
>>> Israel Brewster
>>> Systems Analyst II
>>> Ravn Alaska
>>> 5245 Airport Industrial Rd
>>> Fairbanks, AK 99709
>>> (907) 450-7293
>>> -----------------------------------------------
>>>
>>>
>>> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <israel(at)ravnalaska(dot)net>
>>> wrote:
>>>
>>>> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of
>>>> latitude (numeric), longitude(numeric), elevation(integer) data, along with
>>>> a PostGIS (2.3.0) geometry column (location), running on a CentOS 6.8 box
>>>> with 64GB RAM and a RAID10 SSD data drive. I'm trying to get the maximum
>>>> elevation along a path, for which purpose I've come up with the following
>>>> query (for one particular path example):
>>>>
>>>> SELECT elevation FROM data
>>>> WHERE ST_DWithin(location, ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'), 600)
>>>> ORDER BY elevation DESC LIMIT 1;
>>>>
>>>> The EXPLAIN ANALYZE output of this particular query (
>>>> https://explain.depesz.com/s/heZ) shows:
>>>>
>>>>
>>>>
>>>> QUERY PLAN
>>>>
>>>>
>>>> ------------------------------------------------------------
>>>> ------------------------------------------------------------
>>>> ------------------------------------------------------------
>>>> ------------------------------------------------------------
>>>> ------------------------------------------------------------
>>>> ------------------------------------------
>>>> Limit (cost=4.83..4.83 rows=1 width=4) (actual
>>>> time=22653.840..22653.842 rows=1 loops=1)
>>>> -> Sort (cost=4.83..4.83 rows=1 width=4) (actual
>>>> time=22653.837..22653.837 rows=1 loops=1)
>>>> Sort Key: elevation DESC
>>>> Sort Method: top-N heapsort Memory: 25kB
>>>> -> Index Scan using location_gix on data (cost=0.42..4.82
>>>> rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>>>> Index Cond: (location && '0102000020E6100000020000002C1
>>>> 1A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD2514
>>>> 0'::geography)
>>>> Filter: (('0102000020E6100000020000002
>>>> C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography
>>>> && _st_expand(location, '600'::double precision)) AND
>>>> _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C
>>>> 062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography,
>>>> '600'::double precision, true))
>>>> Rows Removed by Filter: 4934534
>>>> Planning time: 0.741 ms
>>>> Execution time: 22653.906 ms
>>>> (10 rows)
>>>>
>>>> So it is using the index properly, but still takes a good 22 seconds to
>>>> run, most of which appears to be in the Index Scan.
>>>>
>>>> Is there any way to improve this, or is this going to be about as good
>>>> as it gets with the number of rows being dealt with? I was planning to use
>>>> this for a real-time display - punch in a couple of points, get some
>>>> information about the route between, including maximum elevation - but with
>>>> it taking 22 seconds for the longer routes at least, that doesn't make for
>>>> the best user experience.
>>>>
>>>> It's perhaps worth noting that the example above is most likely a worst
>>>> case scenario. I would expect the vast majority of routes to be
>>>> significantly shorter, and I want to say the shorter routes query much
>>>> faster [testing needed]. That said, the faster the better, even for short
>>>> routes :-)
>>>> -----------------------------------------------
>>>> Israel Brewster
>>>> Systems Analyst II
>>>> Ravn Alaska
>>>> 5245 Airport Industrial Rd
>>>> Fairbanks, AK 99709
>>>> (907) 450-7293
>>>> -----------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
