From: | Rémi Cura <remi(dot)cura(at)gmail(dot)com> |
---|---|
To: | Nathan Clayton <nathanclayton(at)gmail(dot)com> |
Cc: | PostgreSQL General <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Indexing large table of coordinates with GiST |
Date: | 2015-01-16 08:40:24 |
Message-ID: | CAJvUf_sQ1ZENaGtQbY-fx_mS-yJKcNKwCFUW2MnU7=YHgWFywg@mail.gmail.com |
Lists: | pgsql-general |
Please let me have one more guess ^^
Third guess: you are using a topology (nodes are indexed by node_id).
- If this is the case, you could use PostGIS topology.
- The gain is that with this topology model you store *shared linestrings*,
not shared points.
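In case it helps, here is a minimal sketch of what loading shared edges into
PostGIS topology could look like ('osm_topo' and the ways(geom) table are
names I made up, adapt them to your schema):

-- Assumes the postgis_topology extension is installed.
-- 'osm_topo' and ways(geom) are illustrative names only.
SELECT topology.CreateTopology('osm_topo', 4326);

-- Load every way; shared segments become shared edges in the topology.
SELECT topology.TopoGeo_AddLineString('osm_topo', geom, 0)
FROM   ways;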
More seriously, from what you say it seems possible to use pg_pointcloud
with your data, if the following assumption is correct:
*When querying by other attributes, the points you get are roughly in the
same area (at least the area is a significant subset of the total area).*
To be perfectly clear: if for a node with node_id *N* you can expect that
the node with node_id *N+1* is spatially close to node N, then you can use
pg_pointcloud and it will be effective.
Then the solution could be: partition your points spatially (i.e., from
your billions of points, create a few million groups of points, with a
grid, clustering, whatever).
Then create an index on each group of points' bounding box.
Then create an index (GiST) on range(node_id) for each group of points.
... create indexes for other attributes: on range(attribute).
Then you can query your data effectively, and the index size will fit into
RAM (about 1 GB for 8 million patches in my case).
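If it helps, here is a minimal sketch of what the patch table, the grid
partitioning and the indexes could look like. All table, column and
dimension names are mine, and I assume a pointcloud schema with SRID 4326
and dimensions x, y, node_id registered as pcid 1 in pointcloud_formats:

-- Requires the pointcloud and pointcloud_postgis extensions.
CREATE TABLE group_of_points (
    group_id   bigserial PRIMARY KEY,
    patch      pcpatch(1),              -- pcid 1 assumed: dimensions x, y, node_id
    bbox       geometry(Polygon, 4326), -- precomputed bounding box of the patch
    node_range numrange                 -- [min(node_id), max(node_id)] of the patch
);

-- Grid partitioning: one patch per ~0.05 degree cell (tune the cell size),
-- reading from a hypothetical nodes(lon, lat, node_id) table.
INSERT INTO group_of_points (patch, bbox, node_range)
SELECT pa, PC_EnvelopeGeometry(pa), nr
FROM (
    SELECT PC_Patch(PC_MakePoint(1, ARRAY[lon, lat, node_id::float8])) AS pa,
           numrange(min(node_id)::numeric, max(node_id)::numeric, '[]') AS nr
    FROM   nodes
    GROUP  BY floor(lon / 0.05), floor(lat / 0.05)
) s;

-- GiST indexes on the per-patch bounding box and node_id range; with a few
-- million patches instead of billions of points they stay small enough for RAM.
CREATE INDEX group_of_points_bbox_gist   ON group_of_points USING gist (bbox);
CREATE INDEX group_of_points_nrange_gist ON group_of_points USING gist (node_range);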
The query would then be (see the sketch below):
- first, get the groups of points of potential interest
(WHERE ST_Intersects(group_of_points.bbox, your_polygon) AND
group_of_points.range(node_id) && numrange(123, 678) AND other attribute
filtering)
- second, from the selected groups of points, extract the actual points
and do the fine filtering you need
(WHERE ST_Intersects(ST_MakePoint(point.X, point.Y, point.Z), your_polygon)
AND node_id BETWEEN 123 AND 678 ...)
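As a sketch, the two-step query could look like this (same made-up names as
above; I assume your polygons live in a polygons(polygon_id, geom) table):

-- Step 1: coarse filter on the patches (the GiST indexes do the work).
-- Step 2: explode the surviving patches into points and filter exactly.
WITH poly AS (
    SELECT geom FROM polygons WHERE polygon_id = 42
),
candidate_groups AS (
    SELECT g.patch
    FROM   group_of_points g, poly
    WHERE  ST_Intersects(g.bbox, poly.geom)
      AND  g.node_range && numrange(123, 678, '[]')   -- inclusive, to match BETWEEN
)
SELECT PC_Get(pt, 'node_id') AS node_id,
       PC_Get(pt, 'x')       AS lon,
       PC_Get(pt, 'y')       AS lat
FROM   candidate_groups c,
       LATERAL PC_Explode(c.patch) AS pt,             -- one row per point in the patch
       poly
WHERE  ST_Intersects(ST_SetSRID(ST_MakePoint(PC_Get(pt, 'x'), PC_Get(pt, 'y')), 4326),
                     poly.geom)
  AND  PC_Get(pt, 'node_id') BETWEEN 123 AND 678;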
If the assumption is correct, it works well. For instance, all the billions
of points I use also have a timestamp (equivalent to your node_id, I would
say); I frequently query on time ranges and it is as fast as a spatial
query (that is, on the order of milliseconds).
To give you an order of magnitude of the work involved, it would take me a
couple of hours to put your data into pg_pointcloud (computing time would
be about 12 hours, multi-processed, absolutely all inclusive).
Cheers,
Rémi-C
2015-01-16 1:18 GMT+01:00 Nathan Clayton <nathanclayton(at)gmail(dot)com>:
>
> On 1/15/2015 12:36 PM, Daniel Begin wrote:
>
>>
>> Thanks, there are a lot of potential ways to resolve this problem!
>>
>> For Rob, here is a bit of context concerning my IT environment…
>>
>> Windows 7 64b Desktop, running with an Intel i7 core and 16GB ram. The
>> PostgreSQL 9.3 database is stored on a 3TB external drive (USB-3 connection
>> with write cache enabled and backup battery) and a temp_tablespaces is
>> pointing to a 1TB internal drive.
>>
>> Now, let me answer/question the given proposals in the order I received
>> them…
>>
>> 1-Andy, I will set maintenance_work_mem as large as I can unless someone
>> points to an important caveat.
>>
>> 2-Vick, partitioning the table could have been very interesting. However,
>> I will have to query the table using both the node ID (which could have
>> provided a nice partition criterion) and/or the node location (find nodes
>> within a polygon). I am not familiar with table partitioning but I suspect
>> I can’t create a spatial index on a table that has been partitioned (split
>> into multiple tables that inherit from the "master" table). Am I right?
>>
>> 3-Rémi, so many rows do not necessarily mean either raster or point
>> cloud (but it’s worth asking!-). As I mentioned previously, I must be able
>> to query the table not only using node locations (coordinates) but also
>> using the few other fields the table contains (but mainly node IDs). So, I
>> don’t think it could work, unless you tell me otherwise?
>>
>> 4-Paul, the node distribution is all over the world but mainly over
>> inhabited areas. However, if I had to define a limit of some sort, I would
>> use the dateline. Concerning spatial queries, I will want to find nodes
>> that are within the boundary of irregular polygons (stored in another
>> table). Is querying on irregular polygons compatible with geohashing?
>>
>> Regards,
>>
>> Daniel
>>
>>
> Provided you have an integer primary key on both your node tables and
> polygon tables, would it make sense to preprocess the overlaps and have a
> many-to-many table with the node-id and polygon-id? Depending on the speed
> in which data is ingested, you could easily build triggers to run after
> inserts/updates to keep the table updated, or you could create a globally
> unique autoincrement field that tracks revisions and update everything
> after a given high-water mark.
>
> Lookups and joins would be using integers and should give you much better
> performance than searching through the polygons.
>
> For the many-to-many table, something like the following (you can obviously
> parse it out into smaller batches on the insert if you need to, so you don't
> blow up your memory usage; if needed you can have two tables partitioned on
> either the node-id or the polygon-id to speed up lookups, as this table has
> the potential to carry many times the records in either table - the worst
> case would be a Cartesian join if all nodes fall within all polygons):
>
> create table node_polygon (
> node_id bigint not null,
> polygon_id bigint not null,
> primary key (node_id, polygon_id)
> );
>
> insert into node_polygon (node_id, polygon_id)
> select
> node_id,
> polygon_id
> from
> node
> inner join polygon
> on node.point <@ polygon.shape;
>
> create index ix_node_polygon_polygon on node_polygon (polygon_id);
>
>
>