From: | John A Meinel <john(at)arbash-meinel(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>, pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Performance problems testing with Spamassassin 3.1.0 |
Date: | 2005-08-04 15:08:09 |
Message-ID: | 42F22F59.7090901@arbash-meinel.com |
Views: | Whole Thread | Raw Message | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Tom Lane wrote:
> Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net> writes:
>
>> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>> LOOP
>> _token := intokenary[i];
>> INSERT INTO bayes_token_tmp VALUES (_token);
>> END LOOP;
>
>
>> UPDATE
>> bayes_token
>> SET
>> spam_count = greatest_int(spam_count + inspam_count, 0),
>> ham_count = greatest_int(ham_count + inham_count , 0),
>> atime = greatest_int(atime, 1000)
>> WHERE
>> id = inuserid
>> AND
>> (token) IN (SELECT intoken FROM bayes_token_tmp);
>
>
> I don't really see why you think that this path is going to lead to
> better performance than where you were before. Manipulation of the
> temp table is never going to be free, and IN (sub-select) is always
> inherently not fast, and NOT IN (sub-select) is always inherently
> awful. Throwing a pile of simple queries at the problem is not
> necessarily the wrong way ... especially when you are doing it in
> plpgsql, because you've already eliminated the overhead of network
> round trips and repeated planning of the queries.
So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?
Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?
What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)
I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).
I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows
missing).
If found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).
What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.
Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.
Is this generally true, or did I just get lucky on my data?
John
=:->
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>
Attachment | Content-Type | Size |
---|---|---|
random_exists.txt | text/plain | 2.6 KB |
sequential_exists_test.txt | text/plain | 2.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Matthew Schumacher | 2005-08-04 16:13:35 | Re: Performance problems testing with Spamassassin 3.1.0 |
Previous Message | Tom Lane | 2005-08-04 14:37:02 | Re: Performance problems testing with Spamassassin 3.1.0 |