From: Erik Darling <edarling80(at)gmail(dot)com>
To: James David Smith <james(dot)david(dot)smith(at)gmail(dot)com>
Cc: PGSQL-Novice <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Update with last known location?
Date: 2014-01-30 15:45:15
Message-ID: CAO+EYw+XX=Lye3Vwv2uGBLQsrSE4Le8OdfJh-XnMjtg9oTO3oQ@mail.gmail.com
Lists: pgsql-novice
That solution looks like it totally works. Cool.

I was running it against a Redshift instance (the only thing available to me
at work), which doesn't let me write correlated subqueries like that. That's
why I was playing with LAG() instead.
An error occurred when executing the SQL command:
SELECT
data.ppid,
data.point_time,
CASE
WHEN data.the_geom IS NULL
THEN (
--Get all locations ...
ERROR: This type of correlated subquery pattern is not supported yet [SQL
State=0A000]
Execution time: 0.09s
1 statement(s) failed.
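For what it's worth, the gap-fill can also be written without the correlated subquery that Redshift rejects, using only plain window functions: a running count of non-NULL geoms assigns each NULL row to the group of its most recent known location, and MAX() over that group fills it in. Below is a minimal sketch of the idea; it uses Python's sqlite3 purely as a stand-in engine so it can run anywhere, with table and column names borrowed from the test_data example later in this thread (the shortened data set here is illustrative, not the full sample):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE test_data(ppid INTEGER, point_time TEXT, the_geom INTEGER);
INSERT INTO test_data VALUES
  (1, '2012-01-01 07:00', 1),
  (1, '2012-01-01 07:01', 1),
  (1, '2012-01-01 07:02', 1),
  (1, '2012-01-01 07:03', NULL),  -- should become 1
  (1, '2012-01-01 07:04', NULL),  -- should become 1
  (1, '2012-01-01 07:05', 5),
  (1, '2012-01-01 07:06', NULL);  -- should become 5
""")

rows = con.execute("""
SELECT ppid, point_time,
       -- each group holds one known geom followed by its trailing NULLs,
       -- so MAX() over the group is the last known location
       max(the_geom) OVER (PARTITION BY ppid, grp) AS filled_geom
FROM (
  SELECT ppid, point_time, the_geom,
         -- running count of non-NULL geoms: stays flat across NULL rows,
         -- so NULL rows share a group with the last non-NULL row
         count(the_geom) OVER (PARTITION BY ppid
                               ORDER BY point_time) AS grp
  FROM test_data
)
ORDER BY ppid, point_time
""").fetchall()

print([r[2] for r in rows])  # [1, 1, 1, 1, 1, 5, 5]
```

The same query shape should run on Redshift as-is, since it needs nothing beyond COUNT() and MAX() as window functions; some engines also offer LAST_VALUE(... IGNORE NULLS) for this, but that is worth verifying against your cluster's version.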
On Thu, Jan 30, 2014 at 10:28 AM, James David Smith <
james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> You're a star Erik. Thanks again. I'll see how this goes.
>
> I posted the question on Stack Exchange earlier too, by the way, and
> someone replied with the code below. So I'm going to try both. Cheers
> again.
>
> SELECT
> data.ppid,
> data.point_time,
> CASE
> WHEN data.the_geom IS NULL
> THEN (
> --Get all locations with an earlier time stamp for that ppid
> SELECT geom.the_geom
> FROM test_data geom
> WHERE data.ppid = geom.ppid
> AND geom.point_time < data.point_time
> AND geom.the_geom IS NOT NULL
> AND NOT EXISTS (
> -- Cull all but the most recent one
> SELECT *
> FROM test_data cull
> WHERE cull.ppid = geom.ppid
> AND geom.the_geom IS NOT NULL
> AND cull.point_time < data.point_time
> AND cull.point_time > geom.point_time
> AND cull.the_geom IS NOT NULL
> )
> )
> ELSE data.the_geom
> END
> FROM test_data data
>
> On 30 January 2014 15:24, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> > Hi James,
> >
> > This is pretty close, but I have to get back to work for the time being.
> > Feel free to mess with it. If you don't come up with a fully working
> > solution, I can look at it later tonight (US EST). From your test data,
> > only one row is missing the correct geom.
> >
> > WITH C (id, pt, gm, dr) as (
> > SELECT ppid, point_time, the_geom,
> > dense_rank() over (partition by ppid, the_geom order by ppid,
> > point_time) as dr
> > FROM test_data
> > )
> > SELECT id, pt, gm,
> > CASE WHEN gm IS NULL THEN
> > LAG(gm, cast(c.DR as int) ) OVER (PARTITION BY id ORDER BY id, pt)
> > ELSE gm END as gm2
> > FROM C
> > ORDER BY id, pt, gm
> >
> >
> >
> > On Thu, Jan 30, 2014 at 8:58 AM, James David Smith
> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >>
> >> All,
> >>
> >> Here's a SQL fiddle of my problem:
> >>
> >> http://sqlfiddle.com/#!15/77157
> >>
> >> Thanks
> >>
> >> James
> >>
> >> On 30 January 2014 11:19, James David Smith
> >> <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> > Hi Erik / all,
> >> >
> >> > I don't think that will work, as what happens if one of the people has
> >> > two missing periods of time within their day?
> >> >
> >> > I've made a self-contained example of my problem below. Would you mind
> >> > trying to produce in the code below what you think I should do? Or if
> >> > anyone else fancies having a go then please do.
> >> >
> >> > I very much appreciate your help by the way. Thank you. I'm really at
> >> > a loss with this. :-(
> >> >
> >> > James
> >> > --------------------------------------
> >> >
> >> > DROP TABLE test_data;
> >> >
> >> > CREATE TABLE test_data(
> >> > ppid integer,
> >> > point_time timestamp without time zone,
> >> > the_geom integer);
> >> >
> >> > INSERT INTO test_data VALUES
> >> > ('1', '2012-01-01 07:00', '1'),
> >> > ('1', '2012-01-01 07:01', '1'),
> >> > ('1', '2012-01-01 07:02', '1'),
> >> > ('1', '2012-01-01 07:03', NULL), -- null should be replaced with 1
> >> > ('1', '2012-01-01 07:04', NULL), -- null should be replaced with 1
> >> > ('1', '2012-01-01 07:05', '5'),
> >> > ('1', '2012-01-01 07:06', '5'),
> >> > ('1', '2012-01-01 07:07', '5'),
> >> > ('1', '2012-01-01 07:08', NULL), -- null should be replaced with 5
> >> > ('1', '2012-01-01 07:09', NULL), -- null should be replaced with 5
> >> > ('1', '2012-01-01 07:10', NULL), -- null should be replaced with 5
> >> > ('1', '2012-01-01 07:11', NULL), -- null should be replaced with 5
> >> > ('2', '2013-05-02 07:12', '24'),
> >> > ('2', '2013-05-02 07:13', '24'),
> >> > ('2', '2013-05-02 07:14', '24'),
> >> > ('2', '2013-05-02 07:15', NULL), -- null should be replaced with 24
> >> > ('2', '2013-05-02 07:16', NULL), -- null should be replaced with 24
> >> > ('2', '2013-05-02 07:17', '44'),
> >> > ('2', '2013-05-02 07:18', '44'),
> >> > ('2', '2013-05-02 07:19', NULL), -- null should be replaced with 44
> >> > ('2', '2013-05-02 07:20', '4'),
> >> > ('2', '2013-05-02 07:21', '4'),
> >> > ('2', '2013-05-02 07:22', '4');
> >> >
> >> > WITH missing_geoms AS (
> >> > SELECT ppid,
> >> > point_time,
> >> > the_geom
> >> > FROM test_data
> >> > WHERE the_geom IS NULL)
> >> > ---
> >> > ,filled_geoms AS (
> >> > SELECT ppid,
> >> > point_time,
> >> > the_geom
> >> > FROM test_data
> >> > WHERE the_geom IS NOT NULL)
> >> > ---
> >> > ,partitioned AS (
> >> > SELECT missing_geoms.ppid as missing_geoms_ppid,
> >> > missing_geoms.point_time as missing_geoms_point_time,
> >> > missing_geoms.the_geom as missing_geoms_the_geom,
> >> > filled_geoms.ppid as filled_geoms_ppid,
> >> > filled_geoms.point_time as filled_geoms_point_time,
> >> > filled_geoms.the_geom as filled_geoms_the_geom
> >> > FROM missing_geoms
> >> > LEFT JOIN filled_geoms
> >> > ON filled_geoms.point_time < missing_geoms.point_time
> >> > AND filled_geoms.ppid = missing_geoms.ppid
> >> > ORDER BY missing_geoms_ppid,
> >> > missing_geoms_point_time)
> >> > ---
> >> > SELECT *
> >> > FROM partitioned;
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On 29 January 2014 22:10, Erik Darling <edarling80(at)gmail(dot)com> wrote:
> >> >> Hi James,
> >> >>
> >> >> Yeah, that looks like the right place to me.
> >> >>
> >> >>
> >> >> On Wed, Jan 29, 2014 at 1:14 PM, James David Smith
> >> >> <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>>
> >> >>> Hi Erik,
> >> >>>
> >> >>> Do you mean in this section of the SQL?
> >> >>> .....
> >> >>> filled_geoms AS (
> >> >>> SELECT
> >> >>> ppid,
> >> >>> point_time,
> >> >>> the_geom
> >> >>> FROM
> >> >>> hybrid_location
> >> >>> WHERE
> >> >>> the_geom IS NOT NULL)
> >> >>> ...
> >> >>>
> >> >>> Thanks
> >> >>>
> >> >>> James
> >> >>>
> >> >>> On 29 January 2014 17:57, Erik Darling <edarling80(at)gmail(dot)com>
> >> >>> wrote:
> >> >>> > Hi James,
> >> >>> >
> >> >>> > I think you're still stuck with sort of unnecessary ('too much')
> >> >>> > data coming from the right side of your left join. If so, one
> >> >>> > option I would consider is using DENSE_RANK() the way you use
> >> >>> > ROW_NUMBER(), in the filled_geoms table. If you partition by id
> >> >>> > and order by date
> >> >>> > descending,
> >> >>> > you
> >> >>> > can do an additional d_rank = 1 filter to only get the most recent
> >> >>> > activity.
> >> >>> > I believe this is what you want to set your NULL values to, no?
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Wed, Jan 29, 2014 at 12:41 PM, James David Smith
> >> >>> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>> >>
> >> >>> >> On 29 January 2014 16:02, Erik Darling <edarling80(at)gmail(dot)com>
> >> >>> >> wrote:
> >> >>> >> > I would re-suggest using a CTE to contain each dataset to
> >> >>> >> > ensure your
> >> >>> >> > selects are distilling them correctly, and then using a final
> >> >>> >> > query
> >> >>> >> > to
> >> >>> >> > join
> >> >>> >> > them. You can then either update your data directly through the
> >> >>> >> > CTE(s),
> >> >>> >> > or
> >> >>> >> > insert the results to another table to do some further
> >> >>> >> > testing. I think
> >> >>> >> > you'll find this method presents the data a bit more
> >> >>> >> > ergonomically
> >> >>> >> > for
> >> >>> >> > analysis.
> >> >>> >> >
> >> >>> >> > http://www.postgresql.org/docs/9.3/static/queries-with.html
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > On Wed, Jan 29, 2014 at 10:45 AM, James David Smith
> >> >>> >> > <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>> >> >>
> >> >>> >> >> Hi Erik/all,
> >> >>> >> >>
> >> >>> >> >> I just tried that, but it's tricky. The 'extra' data is indeed
> >> >>> >> >> coming
> >> >>> >> >> from the right side of the join, but it's hard to select only
> >> >>> >> >> the
> >> >>> >> >> max
> >> >>> >> >> from it. Maybe it's possible but I've not managed to do it.
> >> >>> >> >> Here is where I am, which is so very close.
> >> >>> >> >> where I am, which is so very close.
> >> >>> >> >>
> >> >>> >> >> SELECT
> >> >>> >> >> DISTINCT(a.ppid, a.point_time, a.the_geom) as
> >> >>> >> >> row_that_needs_geom_updating,
> >> >>> >> >> max(b.point_time) OVER (PARTITION BY a.ppid, a.point_time) as
> >> >>> >> >> last_known_position_time
> >> >>> >> >> FROM
> >> >>> >> >> test a
> >> >>> >> >> INNER JOIN
> >> >>> >> >> (SELECT ppid,
> >> >>> >> >> point_time,
> >> >>> >> >> the_geom
> >> >>> >> >> FROM test
> >> >>> >> >> WHERE the_geom IS NOT NULL) b
> >> >>> >> >> ON b.point_time < a.point_time
> >> >>> >> >> AND a.ppid = b.ppid
> >> >>> >> >> WHERE a.the_geom IS NULL;
> >> >>> >> >>
> >> >>> >> >> If you see the attached screen-print, the output is the rows
> >> >>> >> >> that I want.
> >> >>> >> >> However I've had to use DISTINCT to stop the duplication. Also
> >> >>> >> >> I've
> >> >>> >> >> not managed to pull through 'the_geom' from the JOIN. I'm not
> >> >>> >> >> sure
> >> >>> >> >> how. Anyone?
> >> >>> >> >>
> >> >>> >> >> But it's kind of working. :-)
> >> >>> >> >>
> >> >>> >> >> Worst case if I can't figure out how to solve this in one
> >> >>> >> >> query I'll
> >> >>> >> >> have to store the result of the above, and then use it as a
> >> >>> >> >> basis
> >> >>> >> >> for
> >> >>> >> >> another query I think.
> >> >>> >> >>
> >> >>> >> >> Thanks
> >> >>> >> >>
> >> >>> >> >> James
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> On 29 January 2014 12:56, Erik Darling <edarling80(at)gmail(dot)com>
> >> >>> >> >> wrote:
> >> >>> >> >> > I would try partitioning the second time you call
> >> >>> >> >> > row_number, perhaps by ID, and then selecting the MAX() from
> >> >>> >> >> > that, since I think the 'too much data' you're referring to
> >> >>> >> >> > is coming from the right side of your join.
> >> >>> >> >> >
> >> >>> >> >> > On Jan 29, 2014 7:23 AM, "James David Smith"
> >> >>> >> >> > <james(dot)david(dot)smith(at)gmail(dot)com>
> >> >>> >> >> > wrote:
> >> >>> >> >> >>
> >> >>> >> >> >> On 28 January 2014 23:15, Gavin Flower
> >> >>> >> >> >> <GavinFlower(at)archidevsys(dot)co(dot)nz>
> >> >>> >> >> >> wrote:
> >> >>> >> >> >> > On 29/01/14 11:00, Kevin Grittner wrote:
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> James David Smith <james(dot)david(dot)smith(at)gmail(dot)com> wrote:
> >> >>> >> >> >> >>
> >> >>> >> >> >> >>> Given the data is so large I don't want to be taking
> >> >>> >> >> >> >>> the data out to a CSV or whatever and then loading it
> >> >>> >> >> >> >>> back in. I'd like to do this within the database using
> >> >>> >> >> >> >>> SQL. I thought I would be able to do this using a LOOP
> >> >>> >> >> >> >>> to be honest.
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> I would be amazed if you couldn't do this with a single
> >> >>> >> >> >> >> UPDATE statement. I've generally found declarative forms
> >> >>> >> >> >> >> of such work to be at least one order of magnitude faster
> >> >>> >> >> >> >> than going to either a PL or a script approach. I would
> >> >>> >> >> >> >> start by putting together a SELECT
> >> >>> >> >> >> >> query using window functions and maybe a CTE or two to
> >> >>> >> >> >> >> list
> >> >>> >> >> >> >> all
> >> >>> >> >> >> >> the
> >> >>> >> >> >> >> primary keys which need updating and the new values they
> >> >>> >> >> >> >> should
> >> >>> >> >> >> >> have. Once that SELECT was looking good, I would put it
> >> >>> >> >> >> >> in
> >> >>> >> >> >> >> the
> >> >>> >> >> >> >> FROM clause of an UPDATE statement.
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> That should work, but if you are updating a large
> >> >>> >> >> >> >> percentage
> >> >>> >> >> >> >> of
> >> >>> >> >> >> >> the
> >> >>> >> >> >> >> table, I would go one step further before running this
> >> >>> >> >> >> >> against
> >> >>> >> >> >> >> the
> >> >>> >> >> >> >> production tables. I would put a LIMIT on the
> >> >>> >> >> >> >> above-mentioned
> >> >>> >> >> >> >> SELECT of something like 10000 rows, and script a loop
> >> >>> >> >> >> >> that
> >> >>> >> >> >> >> alternates between the UPDATE and a VACUUM ANALYZE on
> >> >>> >> >> >> >> the table.
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> --
> >> >>> >> >> >> >> Kevin Grittner
> >> >>> >> >> >> >> EDB: http://www.enterprisedb.com
> >> >>> >> >> >> >> The Enterprise PostgreSQL Company
> >> >>> >> >> >> >>
> >> >>> >> >> >> >>
> >> >>> >> >> >> > James, you might consider dropping as many indexes on the
> >> >>> >> >> >> > table
> >> >>> >> >> >> > as
> >> >>> >> >> >> > you
> >> >>> >> >> >> > safely can, and rebuilding them after the mass update. If
> >> >>> >> >> >> > you have lots of such indexes, you will find this approach
> >> >>> >> >> >> > to be a lot faster.
> >> >>> >> >> >> >
> >> >>> >> >> >> >
> >> >>> >> >> >> > Cheers,
> >> >>> >> >> >> > Gavin
> >> >>> >> >> >>
> >> >>> >> >> >> Hi all,
> >> >>> >> >> >>
> >> >>> >> >> >> Thanks for your help and assistance. I think that window
> >> >>> >> >> >> functions, and in particular the PARTITION BY clause, are
> >> >>> >> >> >> 100% the way to go. I've
> >> >>> >> >> >> been concentrating on a SELECT statement for now and am
> >> >>> >> >> >> close but not quite close enough. The below query gets all
> >> >>> >> >> >> the data I want, but *too* much. What I've essentially done
> >> >>> >> >> >> is:
> >> >>> >> >> >>
> >> >>> >> >> >> - Select all the rows that don't have any geom information
> >> >>> >> >> >> - Join them with all rows before this point that *do* have
> >> >>> >> >> >> geom
> >> >>> >> >> >> information.
> >> >>> >> >> >> - Before doing this join, use partition to generate row
> >> >>> >> >> >> numbers.
> >> >>> >> >> >>
> >> >>> >> >> >> The attached screen grab shows the result of my query
> >> >>> >> >> >> below. Unfortunately this is generating a lot of joins that
> >> >>> >> >> >> I don't want. This won't be practical when doing it with
> >> >>> >> >> >> 75,000 people.
> >> >>> >> >> >>
> >> >>> >> >> >> Thoughts and code suggestions very much appreciated... if
> >> >>> >> >> >> needed
> >> >>> >> >> >> I
> >> >>> >> >> >> could put together some SQL to create an example table?
> >> >>> >> >> >>
> >> >>> >> >> >> Thanks
> >> >>> >> >> >>
> >> >>> >> >> >> SELECT row_number() OVER (PARTITION BY test.point_time
> >> >>> >> >> >> ORDER
> >> >>> >> >> >> BY
> >> >>> >> >> >> test.point_time) as test_row,
> >> >>> >> >> >> test.ppid as test_ppid,
> >> >>> >> >> >> test.point_time as test_point_time,
> >> >>> >> >> >> test.the_geom as test_the_geom,
> >> >>> >> >> >> a.ppid as a_ppid,
> >> >>> >> >> >> a.point_time as a_point_time,
> >> >>> >> >> >> a.the_geom as a_the_geom,
> >> >>> >> >> >> a.a_row
> >> >>> >> >> >> FROM test
> >> >>> >> >> >> LEFT JOIN (
> >> >>> >> >> >> SELECT the_geom,
> >> >>> >> >> >> ppid,
> >> >>> >> >> >> point_time,
> >> >>> >> >> >> row_number() OVER (ORDER BY ppid, point_time) as a_row
> >> >>> >> >> >> FROM test
> >> >>> >> >> >> WHERE the_geom IS NOT NULL) a
> >> >>> >> >> >> ON a.point_time < test.point_time
> >> >>> >> >> >> AND a.ppid = test.ppid
> >> >>> >> >> >> WHERE test.the_geom IS NULL
> >> >>> >> >> >> ORDER BY test.point_time;
> >> >>> >> >> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> Hi Erik / all,
> >> >>> >>
> >> >>> >> So I think I've managed to re-write my queries using CTEs. The
> >> >>> >> below code now does get me the data that I want from this. But to
> >> >>> >> do so it is going to create a frankly huge table in the bit of the
> >> >>> >> SQL where it makes the table called 'partitioned'. My rough guess
> >> >>> >> is that it'll have to make a table of about 100 billion rows in
> >> >>> >> order to get the data I need (about 108 million rows).
> >> >>> >>
> >> >>> >> Could someone please glance through it for me and suggest how to
> >> >>> >> write
> >> >>> >> it more efficiently?
> >> >>> >>
> >> >>> >> Thanks
> >> >>> >>
> >> >>> >> James
> >> >>> >>
> >> >>> >> WITH missing_geoms AS (
> >> >>> >> SELECT ppid,
> >> >>> >> point_time,
> >> >>> >> the_geom
> >> >>> >> FROM hybrid_location
> >> >>> >> WHERE the_geom IS NULL)
> >> >>> >> -----------------
> >> >>> >> ,filled_geoms AS (
> >> >>> >> SELECT ppid,
> >> >>> >> point_time,
> >> >>> >> the_geom
> >> >>> >> FROM hybrid_location
> >> >>> >> WHERE the_geom IS NOT NULL)
> >> >>> >> ----------------
> >> >>> >> ,partitioned AS (
> >> >>> >> SELECT missing_geoms.ppid,
> >> >>> >> missing_geoms.point_time,
> >> >>> >> missing_geoms.the_geom,
> >> >>> >> filled_geoms.ppid,
> >> >>> >> filled_geoms.point_time,
> >> >>> >> filled_geoms.the_geom,
> >> >>> >> row_number() OVER (PARTITION BY missing_geoms.ppid,
> >> >>> >> missing_geoms.point_time
> >> >>> >> ORDER BY missing_geoms.ppid,
> >> >>> >> missing_geoms.point_time,
> >> >>> >> filled_geoms.ppid,
> >> >>> >> filled_geoms.point_time DESC)
> >> >>> >> FROM missing_geoms
> >> >>> >> LEFT JOIN filled_geoms
> >> >>> >> ON filled_geoms.point_time < missing_geoms.point_time
> >> >>> >> AND filled_geoms.ppid = missing_geoms.ppid)
> >> >>> >> --------------
> >> >>> >> SELECT *
> >> >>> >> FROM partitioned
> >> >>> >> WHERE row_number = 1;
> >> >>> >>
> >> >>> >> James
> >> >>> >
> >> >>> >
> >> >>
> >> >>
> >
> >
>
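As an editorial footnote: Kevin Grittner's suggestion upthread, a LIMITed UPDATE alternated with VACUUM ANALYZE, amounts to a small driver loop. Here is a sketch of that batching pattern only; the table name `t`, the fill value `0`, and the batch size are placeholders, and sqlite3 stands in so the loop is runnable anywhere (on PostgreSQL the VACUUM ANALYZE would be issued between batches, outside any transaction block):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, the_geom INTEGER)")
# 25 rows that all need their geom filled in
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, None) for i in range(25)])

BATCH = 10  # rows per UPDATE; Kevin suggested something like 10000 in practice
total = 0
while True:
    # Fix at most BATCH rows that still need a geom (0 is a placeholder value)
    cur = con.execute("""
        UPDATE t SET the_geom = 0
        WHERE id IN (SELECT id FROM t WHERE the_geom IS NULL LIMIT ?)
    """, (BATCH,))
    con.commit()
    if cur.rowcount == 0:
        break  # nothing left to update
    total += cur.rowcount
    # On PostgreSQL: run "VACUUM ANALYZE t" here, between batches

remaining = con.execute(
    "SELECT count(*) FROM t WHERE the_geom IS NULL").fetchone()[0]
print(total, remaining)  # 25 0
```

The point of the alternation is that each batch's dead row versions get cleaned up and the planner's statistics refreshed before the next batch, which keeps a mass update from bloating the table.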