| From: | Israel Brewster <israel(at)ravnalaska(dot)net> | 
|---|---|
| To: | Rémi Cura <remi(dot)cura(at)gmail(dot)com> | 
| Cc: | Paul Ramsey <pramsey(at)cleverelephant(dot)ca>, pgsql-general <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: Improve PostGIS performance with 62 million rows? | 
| Date: | 2017-01-05 22:55:06 | 
| Message-ID: | 29E6361D-19DA-433E-BB82-10395F4FDF30@ravnalaska.net | 
| Lists: | pgsql-general | 
On Jan 5, 2017, at 1:38 PM, Rémi Cura <remi(dot)cura(at)gmail(dot)com> wrote:
> 
> Hey,
> 1 sec seems really good in this case,
> and I'm assuming you have tuned Postgres so the main index fits into RAM (work_mem and all the other settings).
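> 
> For reference, such tuning might look like this in postgresql.conf (a sketch only; the values are illustrative guesses for a 64 GB box, not settings reported in this thread):
> 
> shared_buffers = 16GB        # keep hot index pages cached in Postgres itself
> effective_cache_size = 48GB  # tell the planner the OS cache is large
> work_mem = 64MB              # per-sort / per-hash-node memory
> maintenance_work_mem = 2GB   # speeds up index builds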
> 
> You could avoid one of the CTEs by merging the two into one:
> 
> WITH pts AS (
>     SELECT (pt).geom, (pt).path[1] as vert
>     FROM 
>     ST_DumpPoints(
>         ST_Segmentize(
>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>             600
>         )::geometry
>     ) as pt
> ) 
> SELECT elevation 
> FROM data 
> INNER JOIN (SELECT 
>     ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
>     FROM pts a 
>     INNER JOIN pts b 
>     ON a.vert=b.vert-1 AND b.vert>1) segments
> ON  ST_DWithin(location, segments.short_line, 600)
> ORDER BY elevation DESC limit 1;
> 
> 
> Then you could remove the unnecessary (and potentially explosive, if you have a large number of dumped points) self-join on points:
> "FROM pts a 
>     INNER JOIN pts b " 
> 
> You could simply use a window function to generate the segments, as in https://github.com/Remi-C/PPPP_utilities/blob/master/postgis/rc_DumpSegments.sql#L51.
> The idea is to dump the points, order them by path, and then link each point to the previous one (the lag() function).
> Assuming you don't want to use the available function,
> it would look something like this:
> 
> WITH segments AS (
>     SELECT ST_MakeLine( lag((pt).geom , 1, NULL) OVER (ORDER BY (pt).path)
>                           ,(pt).geom) AS short_line
>     FROM ST_DumpPoints(
>           ST_Segmentize(
>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>             600
>         )::geometry
>     ) as pt
> ) 
> SELECT elevation 
> FROM data ,segments
> WHERE segments.short_line IS NOT NULL --the first segment is null by design (lag function)
>   AND  ST_DWithin(location, segments.short_line, 600) = TRUE
> ORDER BY elevation DESC 
> limit 1;
> 
> 
> I don't know if you can improve this query much further,
> but I'd guess it will reduce your time and scale more safely.
> 
> 
> If you want to further improve your result,
> you'll have to reduce the number of rows in your index,
> that is, partition your table into several tables!
> 
> This is not easy to do with current Postgres partitioning methods as far as I know
> (partitioning is easy; automatic, efficient querying is hard).
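> 
> For illustration, a minimal inheritance-based sketch (PostgreSQL 9.6 predates declarative partitioning; the longitude split and table names here are invented):
> 
> CREATE TABLE data_west (CHECK (longitude <  -150)) INHERITS (data);
> CREATE TABLE data_east (CHECK (longitude >= -150)) INHERITS (data);
> -- existing rows would have to be moved into the children by hand
> CREATE INDEX data_west_gix ON data_west USING GIST (location);
> CREATE INDEX data_east_gix ON data_east USING GIST (location);
> -- Constraint exclusion only prunes partitions when the WHERE clause
> -- constrains longitude directly, which is exactly the hard part for
> -- spatial predicates like ST_DWithin.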
> 
> Another way would be to relax your requirements and accept that in some cases you may want less detail in the altimetry, which would allow you to use a Level Of Detail approach.
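> 
> For example, a coarser copy of the data could be precomputed up front (again a sketch; the grid size and table name are invented):
> 
> CREATE TABLE data_lod AS
>     SELECT ST_Centroid(ST_Collect(location::geometry))::geography AS location,
>            max(elevation) AS elevation
>     FROM data
>     GROUP BY ST_SnapToGrid(location::geometry, 0.01, 0.01);
> 
> CREATE INDEX data_lod_gix ON data_lod USING GIST (location);
> -- Long routes can query data_lod; short routes keep the full-resolution table.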
> 
> Congrats on the well-explained query/problem, anyway!
> Cheers,
> Rémi-C
Ooooh, nice use of a window function - that change right there cut the execution time in half! I was able to shave off a few hundredths of a second more by tweaking the ST_Segmentize length parameter up to 5,000 (I still have to play with that number some), so execution time is now down to the sub-300ms range. If I reduce the search radius around the line, I can improve the time further, to around 200 ms, but I'm not sure that will be an option. Regardless, 300ms is rather impressive, I think. Thanks!
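
For reference, the tuned query is presumably just Rémi's window-function version above with the ST_Segmentize length parameter raised to 5,000:

WITH segments AS (
    SELECT ST_MakeLine( lag((pt).geom, 1, NULL) OVER (ORDER BY (pt).path)
                          ,(pt).geom) AS short_line
    FROM ST_DumpPoints(
          ST_Segmentize(
            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
            5000
        )::geometry
    ) as pt
)
SELECT elevation
FROM data, segments
WHERE segments.short_line IS NOT NULL
  AND ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC
LIMIT 1;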
> 
> 2017-01-05 23:09 GMT+01:00 Paul Ramsey <pramsey(at)cleverelephant(dot)ca>:
> Varying the segment length upwards might have a salutary effect for a while, as the efficiency improvement of fewer inner loops battles with the inefficiency of having more points selected by the index filter. Worth an experiment.
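> 
> A quick harness for that experiment might look like this (a sketch: the candidate lengths are arbitrary, and the query body is the window-function version from earlier in the thread):
> 
> DO $$
> DECLARE
>     len integer;
>     t0  timestamptz;
>     ele integer;
> BEGIN
>     FOREACH len IN ARRAY ARRAY[600, 1200, 2500, 5000, 10000] LOOP
>         t0 := clock_timestamp();
>         SELECT d.elevation INTO ele
>         FROM data d
>         JOIN (SELECT ST_MakeLine(lag((pt).geom, 1, NULL) OVER (ORDER BY (pt).path),
>                                  (pt).geom) AS short_line
>               FROM ST_DumpPoints(
>                        ST_Segmentize(
>                            ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>                            len
>                        )::geometry
>                    ) AS pt) s
>           ON ST_DWithin(d.location, s.short_line, 600)
>         WHERE s.short_line IS NOT NULL
>         ORDER BY d.elevation DESC
>         LIMIT 1;
>         RAISE NOTICE 'segment length % m: max elevation %, elapsed %', len, ele, clock_timestamp() - t0;
>     END LOOP;
> END $$;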
> 
> P
> 
> On Thu, Jan 5, 2017 at 1:00 PM, Israel Brewster <israel(at)ravnalaska(dot)net> wrote:
> 
>> On Jan 5, 2017, at 10:38 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca> wrote:
>> 
>> Yes, you did. You want a query that spits out a tupleset of geometries (one for each wee segment), and then you can join that set to your main table using ST_DWithin() as the join clause.
>> So start by ditching the main table and just work on a query that generates a pile of wee segments.
> 
> Ahhh, I see you've done this sort of thing before (http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html) :-)
> 
> So following that advice I came up with the following query:
> 
> WITH dump AS (SELECT
>     ST_DumpPoints(
>         ST_Segmentize(
>             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>             600
>         )::geometry
>     ) as pt
> ),
> pts AS (
>     SELECT (pt).geom, (pt).path[1] as vert FROM dump
> )
> SELECT elevation 
> FROM data 
> INNER JOIN (SELECT 
>     ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
>     FROM pts a 
>     INNER JOIN pts b 
>     ON a.vert=b.vert-1 AND b.vert>1) segments
> ON  ST_DWithin(location, segments.short_line, 600)
> ORDER BY elevation DESC limit 1;
> 
> Which yields the following EXPLAIN ANALYZE (https://explain.depesz.com/s/RsTD):
> 
>                                                                                                                  QUERY PLAN                                                                                                                 
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=11611706.90..11611706.91 rows=1 width=4) (actual time=1171.814..1171.814 rows=1 loops=1)
>    CTE dump
>      ->  Result  (cost=0.00..5.25 rows=1000 width=32) (actual time=0.024..1.989 rows=1939 loops=1)
>    CTE pts
>      ->  CTE Scan on dump  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.032..4.071 rows=1939 loops=1)
>    ->  Sort  (cost=11611681.65..11611768.65 rows=34800 width=4) (actual time=1171.813..1171.813 rows=1 loops=1)
>          Sort Key: data.elevation DESC
>          Sort Method: top-N heapsort  Memory: 25kB
>          ->  Nested Loop  (cost=0.55..11611507.65 rows=34800 width=4) (actual time=0.590..1167.615 rows=28408 loops=1)
>                ->  Nested Loop  (cost=0.00..8357.50 rows=1665 width=64) (actual time=0.046..663.475 rows=1938 loops=1)
>                      Join Filter: (a.vert = (b.vert - 1))
>                      Rows Removed by Join Filter: 3755844
>                      ->  CTE Scan on pts b  (cost=0.00..22.50 rows=333 width=36) (actual time=0.042..0.433 rows=1938 loops=1)
>                            Filter: (vert > 1)
>                            Rows Removed by Filter: 1
>                      ->  CTE Scan on pts a  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
>                ->  Index Scan using location_gix on data  (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256 rows=15 loops=1938)
>                      Index Cond: (location && _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision))
>                      Filter: (((st_makeline(ARRAY[a.geom, b.geom]))::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, (st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision, true))
>                      Rows Removed by Filter: 7
>  Planning time: 4.318 ms
>  Execution time: 1171.994 ms
> (22 rows)
> 
> So not bad. Went from 20+ seconds to a little over 1 second. Still noticeable for an end user, but definitely usable - and as mentioned, that's a worst-case scenario query. Thanks!
> 
> Of course, if you have any suggestions for further improvement, I'm all ears :-)
> -----------------------------------------------
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> -----------------------------------------------
> 
>> 
>> On Thu, Jan 5, 2017 at 11:36 AM, Israel Brewster <israel(at)ravnalaska(dot)net> wrote:
>> On Jan 5, 2017, at 8:50 AM, Paul Ramsey <pramsey(at)cleverelephant(dot)ca> wrote:
>>> 
>>> The index filters using bounding boxes. A long, diagonal route will have a large bounding box relative to the area you actually care about (a narrow strip along the route). Use ST_Segmentize() to add points to your route, ST_DumpPoints() to dump those out as points, and ST_MakeLine() to generate new lines from those points, each line very short. Maximum index effectiveness will come when your line length is close to your buffer width.
>>> 
>>> P
>> 
>> Ok, I think I understand the concept. So attempting to follow your advice, I modified the query to be:
>> 
>> SELECT elevation
>> FROM data
>> WHERE
>>     ST_DWithin(
>>         location,
>>         (SELECT ST_MakeLine(geom)::geography as split_line
>>          FROM (SELECT
>>         (ST_DumpPoints(
>>             ST_Segmentize(
>>                 ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
>>                 600
>>             )::geometry
>>         )).geom
>>     ) s1),
>>         600
>>     )
>> ORDER BY elevation DESC limit 1;
>> 
>> It took some fiddling to find a syntax that PostgreSQL would accept, but eventually that's what I came up with. Unfortunately, far from improving performance, it killed it - the query went from 22 seconds to several minutes (EXPLAIN ANALYZE has yet to return a result). Looking at the query execution plan shows, at least partially, why:
>> 
>>                                   QUERY PLAN                                  
>> ------------------------------------------------------------------------------
>>  Limit  (cost=17119748.98..17119748.98 rows=1 width=4)
>>    InitPlan 1 (returns $0)
>>      ->  Aggregate  (cost=17.76..17.77 rows=1 width=32)
>>            ->  Result  (cost=0.00..5.25 rows=1000 width=32)
>>    ->  Sort  (cost=17119731.21..17171983.43 rows=20900890 width=4)
>>          Sort Key: data.elevation DESC
>>          ->  Seq Scan on data  (cost=0.00..17015226.76 rows=20900890 width=4)
>>                Filter: st_dwithin(location, $0, '600'::double precision)
>> (8 rows)
>> 
>> So apparently it is now doing a sequential scan on data rather than using the index. And, of course, sorting 20 million rows is not trivial either. Did I do something wrong with forming the query?
>> 
>> -----------------------------------------------
>> Israel Brewster
>> Systems Analyst II
>> Ravn Alaska
>> 5245 Airport Industrial Rd
>> Fairbanks, AK 99709
>> (907) 450-7293
>> -----------------------------------------------
>> 
>>> 
>>> On Thu, Jan 5, 2017 at 9:45 AM, Israel Brewster <israel(at)ravnalaska(dot)net> wrote:
>>> I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of latitude (numeric), longitude (numeric), and elevation (integer) data, along with a PostGIS (2.3.0) geometry column (location), running on a CentOS 6.8 box with 64GB RAM and a RAID10 SSD data drive. I'm trying to get the maximum elevation along a path, for which I've come up with the following query (using one particular path as an example):
>>> 
>>> SELECT elevation
>>> FROM data
>>> WHERE ST_DWithin(location, ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'), 600)
>>> ORDER BY elevation DESC LIMIT 1;
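>>> 
>>> For context, location carries a GiST index (location_gix in the plan below), presumably created along these lines:
>>> 
>>> CREATE INDEX location_gix ON data USING GIST (location);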
>>> 
>>> The EXPLAIN ANALYZE output of this particular query (https://explain.depesz.com/s/heZ) shows:
>>> 
>>>                                                                                                                                                                       QUERY PLAN                                                                                                                                                                      
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Limit  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 rows=1 loops=1)
>>>    ->  Sort  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.837..22653.837 rows=1 loops=1)
>>>          Sort Key: elevation DESC
>>>          Sort Method: top-N heapsort  Memory: 25kB
>>>          ->  Index Scan using location_gix on data  (cost=0.42..4.82 rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
>>>                Index Cond: (location && '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
>>>                Filter: (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography, '600'::double precision, true))
>>>                Rows Removed by Filter: 4934534
>>>  Planning time: 0.741 ms
>>>  Execution time: 22653.906 ms
>>> (10 rows)
>>> 
>>> So it is using the index properly, but still takes a good 22 seconds to run, most of which appears to be in the Index Scan.
>>> 
>>> Is there any way to improve this, or is this about as good as it gets with this number of rows? I was planning to use this for a real-time display - punch in a couple of points, get some information about the route between them, including maximum elevation - but at 22 seconds for the longer routes, that doesn't make for the best user experience.
>>> 
>>> It's perhaps worth noting that the example above is most likely a worst-case scenario. I would expect the vast majority of routes to be significantly shorter, and I want to say the shorter routes query much faster [testing needed]. That said, the faster the better, even for short routes :-)
>>> -----------------------------------------------
>>> Israel Brewster
>>> Systems Analyst II
>>> Ravn Alaska
>>> 5245 Airport Industrial Rd
>>> Fairbanks, AK 99709
>>> (907) 450-7293
>>> -----------------------------------------------