From: | Γιωργος Βαλκανας <lebiathan(at)gmail(dot)com> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | "SELECT .. WHERE NOT IN" query running for hours |
Date: | 2011-01-07 02:36:52 |
Message-ID: | AANLkTi=ir8MdWX3ecbhfx3AFp1Uv7DWkLG68S26VLkRM@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Hi all,
I'm using postgres 8.4.2 on a Ubuntu Linux machine.
I have several tables, one of which is named Document, which of course
represents information I need about my documents. I also have another
table, similar to the first one, called Doc2. The schema of both tables is
the following:
CREATE TABLE "Document"
(
docid integer NOT NULL DEFAULT nextval('doc_id_seq'::regclass),
hwdocid character varying(511) NOT NULL,
pubdate bigint,
finished boolean DEFAULT false,
"location" character varying(200),
title tsvector,
description tsvector,
"content" text,
CONSTRAINT pk_docid PRIMARY KEY (docid),
CONSTRAINT hwdocid_uniq UNIQUE (hwdocid)
)
WITH (
OIDS=FALSE
);
The hwdocid in this occasion is no longer than 12 characters. The reason for
being 511 max, is because the same schema is used by other applications.
What i wish to do is dump contents from Doc2 to Document, provided that
the hwdocid from Doc2 is not present in Document (as the entries will be
similar). Doc2 contains ~100000 rows while Document contains ~1000000.
Now, I wrote a simple query to do this, which is the following:
INSERT INTO "Document" ( hwdocid, pubdate, finished, "location", title,
description, "content" )
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
After running for about half an hour in pgadmin3, I stopped the execution,
since I saw that
what I was doing was pretty dumb, as with every insert the Document would
increase (and I
know beforehand that data from Doc2 contain unique hwdocid values). At first
I thought that each
INSERT creates a new transaction, which is why it was taking so long. So I
though I should do
something else..
So, I though that I should dump the documents I want to a temp table and
then simply insert them in
the Document table. Before that, I wanted to see however, how many documents
I was trying to
insert (as an indication of why it took so long). So I simply did the select
part for those documents.
SELECT *
FROM "Doc2" d2
WHERE d2.hwdocid NOT IN (
SELECT d.hwdocid
FROM "Document" d
)
I submitted the query again and let it run. After running for 5 hours, I
stopped the query and submitted
the "explain query". After running for ~10 minutes, I also stopped the query
explanation phase. So I
re-wrote the query as:
SELECT hwdocid, pubdate, finished, "location", title, description, "content"
FROM "Doc2" d2
WHERE NOT EXISTS (
SELECT d.hwdocid
FROM "Document" d
WHERE d.hwdocid = d2.hwdocid
)
and asked for the explanation, which was:
Hash Anti Join (cost=72484.24..90988.89 rows=1 width=317) (actual
time=3815.471..9063.184 rows=63836 loops=1)
Hash Cond: ((d2.hwdocid)::text = (d.hwdocid)::text)
-> Seq Scan on "Doc2" d2 (cost=0.00..5142.54 rows=96454 width=317) (actual
time=0.016..186.781 rows=96454 loops=1)
-> Hash (cost=56435.22..56435.22 rows=949922 width=12) (actual
time=3814.968..3814.968 rows=948336 loops=1)
-> Seq Scan on "Document" d (cost=0.00..56435.22 rows=949922 width=12)
(actual time=0.008..1926.191 rows=948336 loops=1)
Total runtime: 9159.050 ms
I then submitted it normally and got a result back in ~5-6 seconds.
So my questions are:
1) Why is it taking *so* long for the first query (with the "NOT IN" ) to do
even the simple select?
2) The result between the two queries should be the same. Since I am not
even returned an explanation, could someone
make a (wild) guess on what is the "NOT IN" statement doing (trying to do)
that is taking so long?
3) My intuition would be that, since there exists a unique constraint on
hwdocid, which implies the existence of an index,
this index would be used. Isn't that so? I mean, since it is a unique field,
shouldn't it just do a sequential scan on Doc2
and then simply query the index if the value exists? What am I getting
wrong?
Thank you very much in advance!
Regards,
George Valkanas
From | Date | Subject | |
---|---|---|---|
Next Message | marc47marc47 | 2011-01-07 03:57:29 | Re: How to turn autovacuum prevent wrap around run faster? |
Previous Message | Jeff Janes | 2011-01-06 23:02:36 | Re: Wrong docs on wal_buffers? |