From: Marc Eberhard <eberhardma(at)googlemail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Duplicate deletion optimizations
Date: 2012-01-06 22:20:35
Message-ID: CAPaGL54Ard6mjBawyR7ExO0LxGUWpYQK-7Rh-TaPq5vTdoESJQ@mail.gmail.com
Lists: pgsql-performance
On 6 January 2012 20:38, Samuel Gendler <sgendler(at)ideasculptor(dot)com> wrote:
> On Fri, Jan 6, 2012 at 12:22 PM, Marc Eberhard <eberhardma(at)googlemail(dot)com>
> wrote:
>> On 6 January 2012 20:02, Samuel Gendler <sgendler(at)ideasculptor(dot)com> wrote:
>> > Have you considered doing the insert by doing a bulk insert into a temp
>> > table and then pulling rows that don't exist across to the final table
>> > in one query and updating rows that do exist in another query? I did a
>> > very brief scan of the SO thread and didn't see it suggested. Something
>> > like this:
>> >
>> > update stats_5mn set count = count + t.count
>> > from temp_table t
>> > where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record
>> > and stats_5mn.output_id = t.output_id;
>> >
>> > insert into stats_5mn
>> > select * from temp_table t
>> > where not exists (
>> >   select 1 from stats_5mn s
>> >   where s.t_value = t.t_value and s.t_record = t.t_record
>> >     and s.output_id = t.output_id
>> > );
>> >
>> > drop table temp_table;
>>
>> Am I right to assume that the update/insert needs to be placed into a
>> begin / end transaction block if such batch uploads might happen
>> concurrently? Doesn't seem to be the case for this question here, but
>> I like the solution and wonder if it works under more general
>> circumstances.
>
>
> yes, assuming you are concerned about making the insertion atomic.
> Obviously, a failure in the second query after success in the 1st query
> would be problematic outside of a transaction, since any attempt to repeat
> the entire operation would result in repeated updates.
True, but I was more concerned about concurrency, where a second
upsert inserts an element between the update and insert of the first.
That element would then be skipped by the first upsert, as it is
neither updated (it doesn't exist at that point in time) nor inserted
(it does exist at that later point). Or would that be impossible anyway?
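If it can happen, one way to close that window might be to serialize
competing upserts behind an explicit lock inside the transaction
(untested sketch, reusing the table and column names from the example
above; SHARE ROW EXCLUSIVE conflicts with itself and with the ROW
EXCLUSIVE lock that plain INSERT/UPDATE statements take):

BEGIN;
-- Block other writers for the duration of the upsert, so no row can
-- appear between the UPDATE and the INSERT.
LOCK TABLE stats_5mn IN SHARE ROW EXCLUSIVE MODE;

UPDATE stats_5mn SET count = count + t.count
FROM temp_table t
WHERE stats_5mn.t_value = t.t_value
  AND stats_5mn.t_record = t.t_record
  AND stats_5mn.output_id = t.output_id;

INSERT INTO stats_5mn
SELECT * FROM temp_table t
WHERE NOT EXISTS (
  SELECT 1 FROM stats_5mn s
  WHERE s.t_value = t.t_value
    AND s.t_record = t.t_record
    AND s.output_id = t.output_id
);

COMMIT;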
>> What's the overhead of creating and dropping a temporary table? Is it
>> only worth doing this for a large number of inserted/updated elements?
>> What if the number of inserts/updates is only a dozen at a time for a
>> large table (>10M entries)?
>
> pretty minimal, but enough that doing a handful of rows at a time probably
> wouldn't be worth it. You'd surely get index usage on a plain insert in
> such a case, so I'd probably just use an upsert stored proc for doing small
> numbers of rows - unless you are doing large numbers of inserts, just a few
> at a time. In that case, I'd try to accumulate them and then do them in
> bulk. Those are tough questions to answer without a specific context. My
> real answer is 'try it and see.' You'll always get an answer that is
> specific to your exact circumstance that way.
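For small numbers of rows I guess that would be something along the
lines of the retry-loop upsert function from the PostgreSQL docs
(rough, untested sketch; the column types are guesses and it assumes a
unique constraint on (t_value, t_record, output_id), without which the
unique_violation retry has nothing to catch):

CREATE OR REPLACE FUNCTION upsert_stats_5mn(p_t_value   bigint,
                                            p_t_record  timestamptz,
                                            p_output_id integer,
                                            p_count     bigint)
RETURNS void AS $$
BEGIN
    LOOP
        -- First try to update an existing row.
        UPDATE stats_5mn
           SET count = count + p_count
         WHERE t_value = p_t_value
           AND t_record = p_t_record
           AND output_id = p_output_id;
        IF found THEN
            RETURN;
        END IF;
        -- Not there yet, so try to insert it. A concurrent insert of the
        -- same key raises unique_violation; in that case retry the UPDATE.
        BEGIN
            INSERT INTO stats_5mn (t_value, t_record, output_id, count)
            VALUES (p_t_value, p_t_record, p_output_id, p_count);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- loop and retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;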
It's a fairly tricky problem. I have a number of sensors producing
energy data about every 5 minutes, but at random intervals of between
1 and 15 minutes. I can't change that, as that's the way the hardware
of the sensors works. These feed into another unit, which accumulates
them and forwards them in batches over the Internet to my PostgreSQL
database server every few minutes (again at random times outside my
control and with random batch sizes). To make things worse, if the
Internet connection between the unit and the database server fails and
later recovers, the unit sends the latest data first to provide a
quick update to the current values, and then sends the backlog of
stored values. Thus, data do not always arrive in correct time order.
At the moment I only look at the latest data for each sensor, and these
should be as close to real time as possible. Thus, collecting data for
some time to get a larger batch for an update isn't really an option.
What I want to do, and this is where the upsert problem starts, is to
build a table with energy values at fixed times. These should be
calculated as a linear interpolation between the nearest reported
values from the sensors. Please note each sensor is reporting a
measured energy value (not instant power), which always increases
monotonically with time. To compare the performance of the different
devices that are measured, I need to have the energy values at the
same time and not at the random times when the sensors report. This
also allows the calculation of average power for the devices by taking
the difference of the energy values over longer periods, like 30
minutes.
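In other words, for a fixed grid time t between the two nearest
readings (t0, e0) and (t1, e1) of a sensor, the interpolated energy
would be something like (all names are just placeholders):

e0 + (e1 - e0) * EXTRACT(EPOCH FROM (t  - t0))
               / EXTRACT(EPOCH FROM (t1 - t0))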
What I simply haven't got my head around is how to do this in an
efficient way. When new values arrive, the table of interpolated
values needs to be updated. For some times, there will already be
values in the table, but for other times there won't. Thus, the
upsert. If there was a communication failure, the reported sensor
times will go backwards as the most recent is transmitted first until
the backlog is cleared. In that case the interpolation table will be
populated with intermediate values from the first insert with the
latest timestamp and then these values will be refined by the backlog
data as they trickle in. Under normal circumstances, reported
timestamps will be monotonically increasing and the interpolation
table will simply extend to later times. There are more reads from the
interpolation table than updates, as there are many clients watching
the data live via a COMET web frontend (or rather, there will be once
I get this working).
I could try to code all of this in the application code (Tomcat
servlets in my case), but I'd much rather find an elegant way to let
the database server populate the interpolation table from the inserted
sensor values. I can find the nearest relevant entries in the
interpolation table to be upserted by using date_trunc() on the
timestamp of the sensor value. But I then also need to find the
closest sensor values in the database with an earlier and possibly
later timestamp around the fixed times in the interpolation table.
Sometimes a new value will result in an update and sometimes not.
Sometimes a new value needs to be added to the interpolation table and
sometimes not.
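Finding the bracketing readings for one fixed time is probably the
easy part. A rough sketch of what I have in mind (readings(sensor_id,
ts, energy) is just a placeholder for the raw sensor table, and $1/$2
stand for the sensor id and the fixed grid timestamp):

SELECT lo.ts AS t0, lo.energy AS e0, hi.ts AS t1, hi.energy AS e1
  FROM (SELECT ts, energy FROM readings
         WHERE sensor_id = $1 AND ts <= $2
         ORDER BY ts DESC LIMIT 1) lo,
       (SELECT ts, energy FROM readings
         WHERE sensor_id = $1 AND ts >  $2
         ORDER BY ts ASC  LIMIT 1) hi;

No row is returned while the grid time isn't bracketed yet; otherwise
the result can feed the interpolation expression above and the
update/insert pattern from the start of the thread.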
I know I'm pushing SQL a bit hard with this type of problem, but doing
it in the application logic would result in quite a few round trips
between the database server and the application code. It's sort of an
intellectual challenge for me to see how much I can offload onto the
database server. Thus, my interest in batch upserts.
Another reason is that I don't want to hold any state or intermediate
data in the application code. I want this in the database, as it is
much better at storing things persistently than my own code could ever
be. It was designed to do that properly, after all!
> By the way, there is definitely a difference between creating a temp table
> and creating a table temporarily. See the postgres docs about temp tables
Yes, I'm aware of that and meant a temporary/temp table, but being
old-fashioned I prefer the long form, which is also valid syntax.
From the docs (v9.1.2): CREATE ... { TEMPORARY | TEMP } ... TABLE
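For example (ON COMMIT DROP is only useful inside a transaction block,
as the table is dropped again at commit):

CREATE TEMPORARY TABLE temp_table (LIKE stats_5mn) ON COMMIT DROP;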
Thanks,
Marc