Re: BUG #13846: INSERT ONCONFLICTconsumessequencersonconflicts

From: Paul <paul(at)salesintel(dot)com>
To: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)heroku(dot)com>, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13846: INSERT ONCONFLICTconsumessequencersonconflicts
Date: 2016-05-06 21:39:00
Message-ID: 572d0f00.e33a320a.57866.ffffdf07@mx.google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Just for clarity on the real financial problem of usability,

In our situation, we have ‘event’ tables, and related ‘category’ tables; I’m simplifying and generalizing a bit. The event tables capture streams of activity. Each record has several columns that categorize each event. Sometimes new categories come along from the raw event data source, but several categories would never have more than a couple hundred discreet values, if that. We also need to filter, group, and aggregate the events along some of those categories.

So structure is something like this (although with more category columns)

CREATE TABLE event (
id BIGSERAL NOT NULL,
cat_a_id SMALLINT NOT NULL, /* would have to be BIGINT if using IOC */
cat_b_id SMALLINT NOT NULL, /* would have to be BIGINT if using IOC */
measure FLOAT,
CONSTRAINT pk_event PRIMARY KEY (id)
);
CREATE INDEX ix_event1 ON event USING BTREE(cat_a_id);
CREATE INDEX ix_event2 ON event USING BTREE(cat_b_id, cat_a_id);

CREATE TABLE cat_a (
id SMALLSERIAL NOT NULL, /* would have to be BIGSERIAL if using IOC */
name TEXT NOT NULL,
CONSTRAINT pk_cat_a PRIMARY KEY (id)
);
CREATE UNIQUE INDEX ux_cat_a ON EVENT USING BTREE(name);

CREATE TABLE cat_b (
id SMALLSERIAL NOT NULL, /* would have to be BIGSERIAL if using IOC */
name TEXT NOT NULL,
CONSTRAINT pk_cat_b PRIMARY KEY (id)
);
CREATE UNIQUE INDEX ux_cat_b ON EVENT USING BTREE(name);

Some ‘raw’ input source records might look like this
“so-reg”, “open”, 10
“so-reg”, “stalled”, 1
“no-reg”, “stalled”, 1
“unknown’, “ongoing”, 1

As part of ingesting the raw event source into the event table, we do an upsert on the category tables, get the category’s id, then store the id in the respective category column in the event table.

Just in a few months, one of our event tables, having 9 category type columns, is already up over 200M records, and may hit a 1B in a few more months, and maybe in a year or so over 4B.

If we had continued to use INSERT ON CONFLICT [IOC], the cat_a_id and cat_b_id columns in the event table would have to be changed from SMALLINT to BIGINT, just because the related id sequences from the cat_a and cat_b tables are getting consumed on every insert into the event table, even though 99.9% of the time nothing is actually being inserted into the cat_a and cat_b tables.

We’re running our stuff in a cloud host, so we’re paying for memory, cpu, SSD storage, HDD storage, and network bandwidth (backups) on a monthly basis.

So if using IOC, our one event table with 9 category columns at 200M records goes from 9 * 2 = 18 bytes-per-record to 9 * 8 = 72 bytes-per-record. Or, from 3.6GB to 14.4GB! At a 1B events 18GB to 72GB!! And that doesn’t include the indexes on the event table for certain category columns and combinations, and that’s just one table in one replica.

56GB extra means we can fit way less of the event rows into memory, it takes up way more expensive SSD space (and HDD for backup), uses way more bandwidth and just takes longer for backups and replication, and the CPU when doing sorts and grouping etc has to compare lots more bits. This can add up to thousands of dollars or more over time.

Also, if we had kept using IOC and made the category columns just INT, would we have had to convert 4B records from INT to BIGINT sometime next year. Ouch!!

So, are problem with using IOC isn’t at all technical, it’s financial! Very expensive just to save a few lines of SQL…

I’m kinda curious what the initial use cases for IOC were? If it actually was for UPSERTs, under what conditions? A huge amount of data is just events and categories; i.e. tweets, emails, sensor readings, purchase orders, etc., and region, person, device, product-brand, etc. Ingesting this kind of data is a prime use case for UPSERTs.. Just curious what problem IOC is solving?

Anyways, none of this is meant to be at all a knock on anyone contributing to PG.. It’s an absolutely amazing amount of high quality value for free.. Thanks to all of you.

From: Paul
Sent: Friday, May 6, 2016 2:06 PM
To: David G. Johnston
Cc: Andres Freund; Peter Geoghegan; pgsql-bugs
Subject: RE: [BUGS] BUG #13846: INSERT ONCONFLICTconsumessequencersonconflicts

Hi Dave,
Thanks for the informative reply. We absolutely agree it’s not a bug in implementation, but maybe a bug in conceptualization with regard to usability. Not knowing the initial motive for the feature, we assumed it was meant for an intuitively simpler form of the rather common upsert pattern, but it seems INSERT ON CONFLICT was meant for some other purpose that we don’t quite understand. We stopped using and went back to a manual approach, which also works great because PostgreSQL is just kinda great 😊.

If any related enhancements where to be done, our ‘holy grail’ would be an implementation of the MERGE statement, like this, or this, or this… We can at least ask for it right?? 😉

Thanks for all your guys’ hard work

-p

From: David G. Johnston
Sent: Friday, May 6, 2016 1:02 PM
To: Paul
Cc: Andres Freund; Peter Geoghegan; pgsql-bugs
Subject: Re: [BUGS] BUG #13846: INSERT ON CONFLICTconsumessequencersonconflicts

On Wed, Jan 6, 2016 at 5:14 PM, Paul <paul(at)salesintel(dot)com> wrote:
If I had a BEFORE INSERT trigger, it would only execute for records that were actually going to be inserted. However, you’re now stating that the INSERT ON CONFLICT will execute a BEFORE INSERT trigger even for records that are not inserted?? That doesn’t seem quite logical, and contrary to ‘old fashioned’ upsert logic. Does this also mean that the BEFORE UPDATE trigger is always called as well, or is it never called?

​This seems to boil down to the two possible ways of manually implementing UPSERT:

UPDATE, if not present, INSERT
INSERT, if failing, UPDATE

In the later the before insert trigger fires and influences whether the insert​
 
​fails.  In the former you are already pretty certain the insert will ​succeed because the UPDATE found no records.

We've implemented INSERT, if failing UPDATE.  The insert has to be attempted and right now there is no concept of targeted partial deferrability when constructing the record to be inserted.

To solve this situation it is likely that some form of "UPDATE ON MISSING INSERT" would need to be designed.  The insert portion would specify "DEFAULT" for sequence columns and would execute nextval() only if the ON MISSING portion is executed.

Fundamentally, the difference is that ON MISSING is considerably less complicated than ON CONFLICT.  What is wanted here is an ON MISSING interpretation but what we've implemented is ON CONFLICT.  It seems that the hackers are in agreement that our implementation of ON CONFLICT is consistent with its definition.  That it doesn't efficiently solve problems better handled by ON MISSING - while unfortunate - doesn't constitute a bug: only an opportunity for future enhancement.

David J.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Casey Wireman 2016-05-07 01:15:57 Re: BUG #14126: INSERT ON CONFLICT DO NOTHING auto increments serial primary key when no insert happens.
Previous Message Paul 2016-05-06 20:06:18 Re: BUG #13846: INSERT ON CONFLICTconsumessequencersonconflicts