From: Erik Darling <edarling80(at)gmail(dot)com>
To: James David Smith <james(dot)david(dot)smith(at)gmail(dot)com>
Cc: PGSQL-Novice <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Update with last known location?
Date: 2014-01-29 16:02:01
Message-ID: CAO+EYwKJBtNKEiO_MpSWjAkrJg2i40xNn_oPSUFKTC8XXjQYzw@mail.gmail.com
Lists: pgsql-novice

I would again suggest using a CTE to contain each dataset, so you can check
that your selects are distilling them correctly, and then a final query to
join them. You can then either update your data directly through the CTE(s)
or insert the results into another table for further testing. I think you'll
find this method presents the data a bit more ergonomically for analysis.
http://www.postgresql.org/docs/9.3/static/queries-with.html
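
For example, something along these lines might do it (a rough, untested
sketch using the test table and columns from your queries, and assuming that
ppid plus point_time uniquely identifies a row):

WITH gaps AS (              -- rows that still need a geometry
    SELECT ppid, point_time
    FROM test
    WHERE the_geom IS NULL
),
last_known AS (             -- for each gap, the latest earlier row that has one
    SELECT DISTINCT ON (g.ppid, g.point_time)
           g.ppid,
           g.point_time,
           t.the_geom AS last_geom
    FROM gaps g
    JOIN test t
      ON t.ppid = g.ppid
     AND t.point_time < g.point_time
     AND t.the_geom IS NOT NULL
    ORDER BY g.ppid, g.point_time, t.point_time DESC
)
UPDATE test
SET the_geom = lk.last_geom
FROM last_known lk
WHERE test.ppid = lk.ppid
  AND test.point_time = lk.point_time
  AND test.the_geom IS NULL;

Running the two CTEs as a plain SELECT first, or inserting their output into
a scratch table, lets you eyeball the matches before the UPDATE touches
anything.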
On Wed, Jan 29, 2014 at 10:45 AM, James David Smith <
james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> Hi Erik/all,
>
> I just tried that, but it's tricky. The 'extra' data is indeed coming
> from the right side of the join, but it's hard to select only the max
> from it. Maybe it's possible but I've not managed to do it. Here is
> where I am, which is so very close.
>
> SELECT
> DISTINCT(a.ppid, a.point_time, a.the_geom) as row_that_needs_geom_updating,
> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
> last_known_position_time
> FROM
> test a
> INNER JOIN
> (SELECT ppid,
> point_time,
> the_geom
> FROM test
> WHERE the_geom IS NOT NULL) b
> ON b.point_time < a.point_time
> AND a.ppid = b.ppid
> WHERE a.the_geom IS NULL;
>
> As you can see in the attached screen-print, the output is the rows that I want.
> However, I've had to use DISTINCT to stop the duplication. Also, I've
> not managed to pull through 'the_geom' from the JOIN. I'm not sure
> how. Anyone?
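
One way to carry the_geom through (untested sketch): instead of combining
DISTINCT with max(), rank the joined rows per gap with row_number() and keep
only the latest one:

SELECT ppid, point_time, last_known_position_time, last_known_geom
FROM (
    SELECT a.ppid,
           a.point_time,
           b.point_time AS last_known_position_time,
           b.the_geom   AS last_known_geom,
           row_number() OVER (PARTITION BY a.ppid, a.point_time
                              ORDER BY b.point_time DESC) AS rn
    FROM test a
    JOIN test b
      ON b.ppid = a.ppid
     AND b.point_time < a.point_time
     AND b.the_geom IS NOT NULL
    WHERE a.the_geom IS NULL
) x
WHERE rn = 1;

The rn = 1 filter keeps the row with the greatest b.point_time for each gap,
and its the_geom comes along with it.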
>
> But it's kind of working. :-)
>
> Worst case if I can't figure out how to solve this in one query I'll
> have to store the result of the above, and then use it as a basis for
> another query I think.
>
> Thanks
>
> James
>
>
>
> On 29 January 2014 12:56, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> > I would try partitioning the second time you call row_number, perhaps by
> > ID, and then selecting the MAX() from that, since I think the excess data
> > you're referring to is coming from the right side of your join.
> >
> > On Jan 29, 2014 7:23 AM, "James David Smith" <
> james(dot)david(dot)smith(at)gmail(dot)com>
> > wrote:
> >>
> >> On 28 January 2014 23:15, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
> >> wrote:
> >> > On 29/01/14 11:00, Kevin Grittner wrote:
> >> >>
> >> >> James David Smith <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>
> >> >>> Given the data is so large I don't want to be taking the data out
> >> >>> to a CSV or whatever and then loading it back in. I'd like to do
> >> >>> this within the database using SQL. I thought I would be able to
> >> >>> do this using a LOOP to be honest.
> >> >>
> >> >> I would be amazed if you couldn't do this with a single UPDATE
> >> >> statement. I've generally found declarative forms of such work to
> >> >> be at least one order of magnitude faster than going to either a PL
> >> >> or a script approach. I would start by putting together a SELECT
> >> >> query using window functions and maybe a CTE or two to list all the
> >> >> primary keys which need updating and the new values they should
> >> >> have. Once that SELECT was looking good, I would put it in the
> >> >> FROM clause of an UPDATE statement.
> >> >>
> >> >> That should work, but if you are updating a large percentage of the
> >> >> table, I would go one step further before running this against the
> >> >> production tables. I would put a LIMIT on the above-mentioned
> >> >> SELECT of something like 10000 rows, and script a loop that
> >> >> alternates between the UPDATE and a VACUUM ANALYZE on the table.
> >> >>
> >> >> --
> >> >> Kevin Grittner
> >> >> EDB: http://www.enterprisedb.com
> >> >> The Enterprise PostgreSQL Company
> >> >>
> >> >>
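
An untested sketch of the batched form Kevin describes, using the test table
from this thread (the 10000-row batch size is arbitrary, and the EXISTS guard
is my addition so that rows with no earlier known location don't keep getting
picked up):

WITH batch AS (
    SELECT ppid, point_time
    FROM test t
    WHERE the_geom IS NULL
      AND EXISTS (SELECT 1
                  FROM test p
                  WHERE p.ppid = t.ppid
                    AND p.point_time < t.point_time
                    AND p.the_geom IS NOT NULL)
    LIMIT 10000
)
UPDATE test
SET the_geom = (SELECT p.the_geom
                FROM test p
                WHERE p.ppid = test.ppid
                  AND p.point_time < test.point_time
                  AND p.the_geom IS NOT NULL
                ORDER BY p.point_time DESC
                LIMIT 1)
FROM batch b
WHERE test.ppid = b.ppid
  AND test.point_time = b.point_time;

-- then: VACUUM ANALYZE test;  and repeat until the UPDATE reports 0 rows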
> >> > James, you might consider dropping as many indexes on the table as you
> >> > safely can, and rebuilding them after the mass update. If you have lots
> >> > of such indexes, you will find this approach to be a lot faster.
> >> >
> >> >
> >> > Cheers,
> >> > Gavin
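
For instance (hypothetical index name and definition, only to show the shape
of the drop/rebuild step):

DROP INDEX IF EXISTS test_the_geom_idx;   -- drop indexes the UPDATE would churn
-- ... run the mass UPDATE here ...
CREATE INDEX test_the_geom_idx ON test USING gist (the_geom);
VACUUM ANALYZE test;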
> >>
> >> Hi all,
> >>
> >> Thanks for your help and assistance. I think that window functions,
> >> and in particular PARTITION BY, are 100% the way to go. I've
> >> been concentrating on a SELECT statement for now and am close but not
> >> quite close enough. The below query gets all the data I want, but
> >> *too* much. What I've essentially done is:
> >>
> >> - Select all the rows that don't have any geom information
> >> - Join them with all rows before this point that *do* have geom
> >> information.
> >> - Before doing this join, use partition to generate row numbers.
> >>
> >> The attached screen grab shows the result of my query below.
> >> Unfortunately this is generating a lot of joins that I don't want. This
> >> won't be practical when doing it with 75,000 people.
> >>
> >> Thoughts and code suggestions very much appreciated... if needed I
> >> could put together some SQL to create an example table?
> >>
> >> Thanks
> >>
> >> SELECT row_number() OVER (PARTITION BY test.point_time ORDER BY
> >> test.point_time) as test_row,
> >> test.ppid as test_ppid,
> >> test.point_time as test_point_time,
> >> test.the_geom as test_the_geom,
> >> a.ppid as a_ppid,
> >> a.point_time as a_point_time,
> >> a.the_geom as a_the_geom,
> >> a.a_row
> >> FROM test
> >> LEFT JOIN (
> >> SELECT the_geom,
> >> ppid,
> >> point_time,
> >> row_number() OVER (ORDER BY ppid, point_time) as a_row
> >> FROM test
> >> WHERE the_geom IS NOT NULL) a
> >> ON a.point_time < test.point_time
> >> AND a.ppid = test.ppid
> >> WHERE test.the_geom IS NULL
> >> ORDER BY test.point_time;
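
One way to stop that fan-out (untested sketch, needs PostgreSQL 9.3 or later):
a LATERAL subquery that fetches only the single most recent earlier row with a
geometry for each gap, instead of joining to every earlier row:

SELECT t.ppid,
       t.point_time,
       a.point_time AS last_known_time,
       a.the_geom   AS last_known_geom
FROM test t
LEFT JOIN LATERAL (
    SELECT point_time, the_geom
    FROM test
    WHERE ppid = t.ppid
      AND point_time < t.point_time
      AND the_geom IS NOT NULL
    ORDER BY point_time DESC
    LIMIT 1
) a ON true
WHERE t.the_geom IS NULL
ORDER BY t.ppid, t.point_time;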
> >>
> >>
> >>
> >
>