From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-20 10:00:19
Message-ID: 94095vcb2p7db03jmbhnu1agh3k2iqrodf@4ax.com
Lists: pgsql-performance
On Wed, 19 Feb 2003 10:38:54 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>Nested Loop: 53508.86 msec
>Merge Join: 113066.81 msec
>Hash Join: 439344.44 msec
Chantal,
you might have reached the limit of what Postgres (or any other
database?) can do for you with these data structures. Time for
something completely different: Try calculating the counts in
advance.
CREATE TABLE occ_stat (
did INT NOT NULL,
gid INT NOT NULL,
cnt INT NOT NULL
) WITHOUT OIDS;
CREATE INDEX occ_stat_dg ON occ_stat(did, gid);
CREATE INDEX occ_stat_gd ON occ_stat(gid, did);
There is *no* UNIQUE constraint on (did, gid). You get the numbers
you're after by
SELECT did, sum(cnt) AS cnt
FROM occ_stat
WHERE gid = 'whatever'
GROUP BY did
ORDER BY cnt DESC;
occ_stat is initially loaded by
INSERT INTO occ_stat
SELECT did, gid, count(*)
FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
GROUP BY did, gid;
Doing it in chunks
WHERE sid BETWEEN a::bigint AND b::bigint
might be faster.
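Spelled out, one such chunk would look like the following (the bounds 0 and 999999 are placeholders; pick ranges that fit your actual sid distribution):

```sql
-- one chunk of the initial load; repeat with successive sid ranges
INSERT INTO occ_stat
SELECT did, gid, count(*)
FROM g_o INNER JOIN d_o ON (g_o.sid = d_o.sid)
WHERE g_o.sid BETWEEN 0::bigint AND 999999::bigint
GROUP BY did, gid;
```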
You have to block any INSERT/UPDATE/DELETE activity on d_o and g_o
while you do the initial load. If it takes too long, see below for
how to do it in the background; hopefully the load task will catch up
some day :-)
Keeping occ_stat current:
CREATE RULE d_o_i AS ON INSERT
TO d_o DO (
INSERT INTO occ_stat
SELECT NEW.did, g_o.gid, 1
FROM g_o
WHERE g_o.sid = NEW.sid);
CREATE RULE d_o_d AS ON DELETE
TO d_o DO (
INSERT INTO occ_stat
SELECT OLD.did, g_o.gid, -1
FROM g_o
WHERE g_o.sid = OLD.sid);
On UPDATE do both. Create a set of similar rules for g_o.
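A sketch of what "do both" could look like on d_o (the rule name d_o_u is made up; it subtracts the old counts and adds the new ones in a single rule):

```sql
CREATE RULE d_o_u AS ON UPDATE
TO d_o DO (
    INSERT INTO occ_stat
    SELECT OLD.did, g_o.gid, -1
    FROM g_o
    WHERE g_o.sid = OLD.sid;
    INSERT INTO occ_stat
    SELECT NEW.did, g_o.gid, 1
    FROM g_o
    WHERE g_o.sid = NEW.sid);
```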
These rules will create a lot of duplicates on (did, gid) in occ_stat.
Updating existing rows and inserting only new combinations might seem
obvious, but this method has concurrency problems (cf. the thread
"Hard problem with concurrency" on -hackers). So occ_stat calls for
reorganisation from time to time:
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
CREATE TEMP TABLE t (did INT, gid INT, cnt INT) WITHOUT OIDS;
INSERT INTO t
SELECT did, gid, sum(cnt)
FROM occ_stat
GROUP BY did, gid
HAVING count(*) > 1;
DELETE FROM occ_stat
WHERE t.did = occ_stat.did
AND t.gid = occ_stat.gid;
INSERT INTO occ_stat SELECT * FROM t;
DROP TABLE t;
COMMIT;
VACUUM ANALYZE occ_stat; -- very important!!
Now this should work, but the rules could kill INSERT/UPDATE/DELETE
performance. Depending on your rate of modifications you might be
forced to push the statistics calculation to the background.
CREATE TABLE d_o_change (
sid BIGINT NOT NULL,
did INT NOT NULL,
cnt INT NOT NULL
) WITHOUT OIDS;
... ON INSERT TO d_o DO (
INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));
... ON DELETE TO d_o DO (
INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1));
... ON UPDATE TO d_o
WHERE OLD.sid != NEW.sid OR OLD.did != NEW.did
DO both
And the same for g_o.
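Spelled out for d_o, the UPDATE rule might look like this (rule name assumed):

```sql
CREATE RULE d_o_change_u AS ON UPDATE
TO d_o
WHERE OLD.sid <> NEW.sid OR OLD.did <> NEW.did
DO (
    INSERT INTO d_o_change VALUES (OLD.sid, OLD.did, -1);
    INSERT INTO d_o_change VALUES (NEW.sid, NEW.did, 1));
```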
You need a task that periodically scans [dg]_o_change and does ...
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT <any row (or some rows) from x_o_change>;
INSERT INTO occ_stat <see above>;
DELETE <the selected row(s) from x_o_change>;
COMMIT;
Don't forget to VACUUM!
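One possible shape for that loader transaction, with the placeholders filled in (the temp table name and batch size are my own choices, not part of the original sketch):

```sql
BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- grab a batch of pending changes; the LIMIT is arbitrary
CREATE TEMP TABLE batch AS
    SELECT sid, did, cnt FROM d_o_change LIMIT 100;
-- turn each change row into occ_stat increments/decrements
INSERT INTO occ_stat
SELECT b.did, g_o.gid, b.cnt
FROM batch b INNER JOIN g_o ON (g_o.sid = b.sid);
-- remove the processed rows (assumes no exact duplicate rows in
-- d_o_change; otherwise process one row at a time)
DELETE FROM d_o_change
WHERE EXISTS (SELECT 1 FROM batch b
              WHERE b.sid = d_o_change.sid
                AND b.did = d_o_change.did
                AND b.cnt = d_o_change.cnt);
DROP TABLE batch;
COMMIT;
```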
If you invest a little more work, I guess you can combine the
reorganisation into the loader task ...
I have no idea whether this approach is better than what you have now.
With a high INSERT/UPDATE/DELETE rate it may lead to a complete
performance disaster. You have to try ...
Servus
Manfred