From: | Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr> |
---|---|
To: | "S Grannis" <sjg(at)email(dot)com>, pgsql-general(at)postgresql(dot)org |
Cc: | <pgsql-admin(at)postgresql(dot)org> |
Subject: | Performance HOWTO - pseudo counter example |
Date: | 2002-04-29 15:45:08 |
Message-ID: | 200204291745.08315.jm.poure@freesurf.fr |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin pgsql-general |
Le Jeudi 25 Avril 2002 20:54, S Grannis a écrit :
> I think our work-around for now will be to SELECT the column we wish to
> analyze into a flat file and then run a Perl script to do the actual
> counting.
Dear all,
I wrote a small howto to solve S Grannis performance questions on Count()
function. The idea is to create and maintain a pseudo-count table using
triggers and PLpgSQL.
Unfortunately, I could not test the PLpgSQL scripts in 16 Million records for
lack of space reason (only on 1 Million records). Code is included to
generate fake test data. Could someone help me test the howto on 16 million
records?
Thank you for your feedback,
Cheers,
Jean-Michel
*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under PostgreSQL license
*************************************************************************
This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL
using PLpgSQL and triggers.
1) Performance background
This small howto is insprired in reply to an email on
pgsql-general(at)postgresql(dot)org complaining about PostgreSQL speed.
The user needs to run COUNT statements on a large database of 65.000.000
records. The table structure is basically as follows:
CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4
);
In our example, data_yd is a year value between 1950 and 2050.
The user needs to run the following query:
SELECT COUNT (data_yd) FROM data WHERE data_yd = foo_year.
where foo_year is a date between 1950 and 2050.
The query takes more than two hours to execute on a double-processor computer
running PostgreSQL and GNU/Linux. The proposed solution creates a
pseudo-counter on PostgreSQL using PLpgSQL and triggers. The aim is to return
a result in 0.005 second. Initilisation itself of the pseudo-counter table
should take less than 30 minutes.
1) INSTALLATION
a) Database creation
Open a terminal windows, connect as 'postgres' user:
root(at)localhost>su postgres
Create an empty database:
postgresql(at)localhost>psql template1;
template1=\CREATE DATABASE pseudo_counter;
template1=\q
b) PLpgSQL declaration
PLpgSQL is compiled by default in PostgreSQL. But you should enable PLpgSQL on
the database itself:
postgresql(at)localhost>CREATELANG plpgsql pseudo_counter
c) Data table
We first need to create the table stucture:
CREATE TABLE "data" (
"data_oid" serial8,
"data_yd" int4,
"data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;
CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);
And create a PLpgSQL function to add fake records:
CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
tStart timestamp ;
BEGIN
tStart = now ();
IF ($1>0) AND ($2 >0) AND ($3 >0) AND ($3>$2) THEN
FOR iLoop in 1 .. $1 LOOP
INSERT INTO data (data_yd)
VALUES (
int8 (random () * ($3-$2) +$2)
);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';
To insert 16 million records with a year range between 1950 and 2050, enter:
SELECT init_fakedata(16000000, 1950, 2050); which should take a while... and
fill mor than 3.2 Gb on disc.
If you cannot wait that long :
For testing, insert 126.953 records :
SELECT init_fakedata(126953, 1950, 2050);
This takes 40s on my server.
Then, repeat 8 times:
INSERT INTO data (data_yd)
SELECT data_yd FROM data;
This should produce 64999936 fake records more quickly as no random function
is
used.
Enter:
CHECKPOINT;
VACUUM ANALYSE;
to clear data cache and update statistics.
d) Pseudo-count table
Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
"count_oid" serial,
"count_year" int4,
"count_value" int4 DEFAULT 0
) WITH OIDS;
CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);
... and initialize it with the required data (values in the 1950 - 2050 range)
:
CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
iLoop int4;
BEGIN
IF (($1>0) AND ($2>0) AND ($2>=$1)) THEN
FOR iLoop in $1 .. $2 LOOP
INSERT INTO pseudo_count (count_year)
VALUES (iLoop);
END LOOP;
RETURN ''t'';
ELSE
RETURN ''f'';
END IF;
END;
' LANGUAGE 'plpgsql';
Example :
SELECT init_pseudocount(1950, 2050) ;
will create the required records for years 1900 to 2100.
e) PLpgSQL function and trigger
Pseudo count is handled using a single trigger 'tg_data' running 'tg_data()'
function. TG_OP is used to catch the trigger context ('insert', 'update' or
'delete').
CREATE FUNCTION "tg_data"() RETURNS "opaque" AS 'DECLARE
rec record;
BEGIN
IF (TG_OP=''UPDATE'') THEN
IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'')
THEN
UPDATE pseudo_count
SET count_value = count_value +1
WHERE count_year = new.data_yd
AND count_value >= 0;
END IF;
IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'')
THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = new.data_yd
AND count_value > 0;
END IF;
IF (old.data_yd <> new.data_yd) THEN
UPDATE pseudo_count
SET count_value = count_value -1
WHERE count_year = old.data_yd
AND count_value > 0;
UPDATE pseudo_count
SET count_value = count_value + 1
WHERE count_year = new.data_yd
AND count_value >= 0 ;
END IF;
END IF;
IF (TG_OP=''DELETE'') THEN
UPDATE pseudo_count
SET count_value = count_value - 1
WHERE count_year = old.data_yd
AND count_value >= 0 ;
END IF;
IF (TG_OP=''UPDATE'') THEN
RETURN new;
ELSE
RETURN old;
END IF;
END;
' LANGUAGE 'plpgsql';
CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data" FOR EACH
ROW EXECUTE PROCEDURE tg_data();
2) USAGE
a) Initialisation
Set "pseudo_countenabled" flag on:
UPDATE data
SET data_counterenabled = 't'
WHERE data_counterenabled = 'f'
b) Run pseudo-count queries
Instead of :
SELECT COUNT (data_yd)
FROM data
WHERE data_yd = foo_year.
you now can run:
SELECT count_value
FROM pseudo_count
WHERE pseudo_date = foo_year
The anwer comes in 0.005 second.
c) Limits
Before loading large amount of data, triggers on table 'data' should be
dropped and recreated afterwards.
From | Date | Subject | |
---|---|---|---|
Next Message | Mike Baker | 2002-04-29 20:42:07 | ALTER TABLE hangs |
Previous Message | berger, ralf | 2002-04-29 14:15:55 | Re: How do I get encrypted password access from a cgi app |
From | Date | Subject | |
---|---|---|---|
Next Message | Uros Gruber | 2002-04-29 16:34:04 | Strange indexing |
Previous Message | Steve Lane | 2002-04-29 15:37:54 | Re: Postgres utils chewing RAM |