From: Mario Splivalo <mario(dot)splivalo(at)megafon(dot)hr>
To: pgsql-performance(at)postgresql(dot)org
Subject: SELECT INTO large FKyed table is slow
Date: 2010-11-28 11:46:11
Message-ID: 4CF24103.8030209@megafon.hr
Lists: pgsql-performance

The database for monitoring certain drone statuses is quite simple:
CREATE TABLE samples (
sample_id integer not null primary key,
sample_timestamp timestamp not null default now()
);
CREATE TABLE drones (
drone_id integer not null primary key,
drone_log_notice character varying,
drone_coordinates point not null,
drone_temperature float,
drone_pressure float
);
CREATE TABLE drones_history (
drone_id integer not null,
sample_id integer not null,
drone_log_notice character varying,
drone_temperature float,
drone_pressure float,
constraint drones_history_pk primary key (drone_id, sample_id),
constraint drones_history_fk__samples foreign key (sample_id)
references samples(sample_id),
constraint drones_history_fk__drones foreign key (drone_id) references
drones(drone_id)
);
Every ten to twenty minutes I receive a CSV file with the statuses of most of
the drones. The CSV file also includes data for new drones, when they're put
into use. When I receive new data I load the whole CSV file into the database,
then call a stored procedure that 'deals' with that data.
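For reference, the staging table the CSV gets loaded into (temporary.drones,
used throughout below) looks roughly like this - the exact column types
aren't important for the question:

CREATE TABLE temporary.drones (
drone_id integer not null,
log_notice character varying,
coord_x float,
coord_y float,
temp float,
press float
);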
So far I have around 6000 samples, around 160k drones, and drones_history
has around 25M rows.
The CSV file contains around 15k-20k 'rows', mostly data about existing
drones. Every now and then (on every 5th-10th CSV insert) there is data for
around 1000-5000 new drones.
Here is what I do in the stored procedure, after I've COPYed the data from
the CSV into the temporary.drones table.
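The COPY itself is nothing special; something along these lines, with the
real file path and CSV options left out:

COPY temporary.drones (drone_id, log_notice, coord_x, coord_y, temp, press)
FROM '/path/to/received_file.csv' WITH CSV;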
First, inside the procedure, I create a temporary table that holds the rows
for the new drones:
CREATE TEMPORARY TABLE tmpNew ON COMMIT DROP AS
SELECT drone_id, log_notice, coord_x, coord_y, temp, press
FROM temporary.drones WHERE NOT EXISTS (SELECT 1 FROM public.drones WHERE
public.drones.drone_id = temporary.drones.drone_id);
This is done in milliseconds, even if the count of new drones is large (I've
tested it with 10k new drones, although in real-life operation I'd never get
more than 5k new drones per CSV).
INSERT INTO public.drones (drone_id, drone_log_notice, drone_coordinates,
drone_temperature, drone_pressure)
SELECT drone_id, log_notice, point(coord_x, coord_y) as drone_coordinates,
temp, press FROM tmpNew;
INSERT INTO public.drones_history (sample_id, drone_id,
drone_log_notice, drone_temperature, drone_pressure)
SELECT a_sample_id, drone_id, log_notice, temp, press FROM tmpNew;
This is also done in milliseconds.
Now I 'update' the data for the existing drones, and fill in the history
table for those drones. First I create a temporary table with just the
changed rows:
CREATE TEMPORARY TABLE tmpUpdate ON COMMIT DROP AS
SELECT a_sample_id, t.drone_id, t.log_notice, t.temp, t.press
FROM temporary.drones t
JOIN public.drones p
ON t.drone_id = p.drone_id
WHERE p.drone_log_notice != t.log_notice OR p.drone_temperature != t.temp
OR p.drone_pressure != t.press;
Now, that part is also fast. I usually have around 100-1000 drones that
changed 'state', but sometimes even half of the drones (around 50k) change
state, and creating tmpUpdate still takes no more than ten to twenty
milliseconds.
This is the slow part:
INSERT INTO drones_history (sample_id, drone_id, drone_log_notice,
drone_temperature, drone_pressure)
SELECT * FROM tmpUpdate;
For 100 rows this takes around 2 seconds. For 1000 rows this takes
around 40 seconds. For 5000 rows this takes around 5 minutes.
For 50k rows this takes around 30 minutes! This is where I start to lag
behind, because I get a new CSV every 10 minutes or so.
And the last part is to update the actual drones table:
UPDATE public.drones p
SET drone_log_notice = t.log_notice, drone_temperature = t.temp,
drone_pressure = t.press
FROM temporary.drones t
WHERE t.drone_id = p.drone_id
AND (t.log_notice != p.drone_log_notice OR t.temp != p.drone_temperature
OR t.press != p.drone_pressure);
This is also very fast; even when almost half the table is updated, the
UPDATE takes around 10 seconds. Usually it's around 30-50 ms.
The machine I'm doing this on has 4 GB of RAM and dual Xeon CPUs at 3 GHz,
with two SAS drives in a mirror, capable of around 100 MB/s sequential r/w
(I know that by itself means little, but just to give an idea).
The database is around 2 GB in size (per pg_database_size). When I
dump/recreate the database things speed up a bit, but after half a day of
operation the INSERTs are slow again.
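(The size figure comes from something like:

SELECT pg_size_pretty(pg_database_size(current_database()));
)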
When I do a dump/restore of the database I get around 40-50 MB/sec
reading/writing from the disk (COPYing data, PK/FK constraint creation), but
when that INSERT gets stuck, io-wait goes through the roof - iostat shows
that Postgres is mostly reading from the disks, at around 800-1024 kB/sec.
I've set shared_buffers to 256MB, work_mem to 96MB, wal_buffers to 16MB and
checkpoint_segments to 16. I've turned off autovacuum; instead I do an
analyze/vacuum after each insert job is done, after TRUNCATEing the
temporary.drones table.
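That per-job maintenance amounts to more or less this (the exact set of
tables I vacuum varies, but these are the main ones):

TRUNCATE temporary.drones;
VACUUM ANALYZE public.drones;
VACUUM ANALYZE public.drones_history;
VACUUM ANALYZE public.samples;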
Out of despair I tried setting fsync=off, but that gave only a small
performance improvement.
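For completeness, the relevant bits of postgresql.conf currently look like
this:

shared_buffers = 256MB
work_mem = 96MB
wal_buffers = 16MB
checkpoint_segments = 16
autovacuum = off
#fsync = off        # only as the experiment mentioned above, normally left on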
When I remove the foreign key constraints (drones_history_fk__samples and
drones_history_fk__drones) but leave the primary key on drones_history, then
that INSERT, even for 50k rows, takes no more than a second.
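Removing and later re-adding them amounts to something like this:

ALTER TABLE drones_history DROP CONSTRAINT drones_history_fk__samples;
ALTER TABLE drones_history DROP CONSTRAINT drones_history_fk__drones;
-- run the INSERT ... SELECT * FROM tmpUpdate here
ALTER TABLE drones_history ADD CONSTRAINT drones_history_fk__samples
FOREIGN KEY (sample_id) REFERENCES samples(sample_id);
ALTER TABLE drones_history ADD CONSTRAINT drones_history_fk__drones
FOREIGN KEY (drone_id) REFERENCES drones(drone_id);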
So, my question is - is there anything I can do to make INSERTs into a table
with foreign keys faster? Or, since all the reference checking is done inside
the procedure that loads the data, should I abandon those constraints
entirely?
Mario