Re: Performance problems testing with Spamassassin 3.1.0

From: John A Meinel <john(at)arbash-meinel(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance problems testing with Spamassassin 3.1.0
Date: 2005-08-04 15:08:09
Message-ID: 42F22F59.7090901@arbash-meinel.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Tom Lane wrote:
> Matthew Schumacher <matt(dot)s(at)aptalaska(dot)net> writes:
>
>> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>> LOOP
>> _token := intokenary[i];
>> INSERT INTO bayes_token_tmp VALUES (_token);
>> END LOOP;
>
>
>> UPDATE
>> bayes_token
>> SET
>> spam_count = greatest_int(spam_count + inspam_count, 0),
>> ham_count = greatest_int(ham_count + inham_count , 0),
>> atime = greatest_int(atime, 1000)
>> WHERE
>> id = inuserid
>> AND
>> (token) IN (SELECT intoken FROM bayes_token_tmp);
>
>
> I don't really see why you think that this path is going to lead to
> better performance than where you were before. Manipulation of the
> temp table is never going to be free, and IN (sub-select) is always
> inherently not fast, and NOT IN (sub-select) is always inherently
> awful. Throwing a pile of simple queries at the problem is not
> necessarily the wrong way ... especially when you are doing it in
> plpgsql, because you've already eliminated the overhead of network
> round trips and repeated planning of the queries.

So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?

Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?

What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).

I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows
missing).

If found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).

What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.

Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.

Is this generally true, or did I just get lucky on my data?

John
=:->

>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

Attachment Content-Type Size
random_exists.txt text/plain 2.6 KB
sequential_exists_test.txt text/plain 2.6 KB

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Matthew Schumacher 2005-08-04 16:13:35 Re: Performance problems testing with Spamassassin 3.1.0
Previous Message Tom Lane 2005-08-04 14:37:02 Re: Performance problems testing with Spamassassin 3.1.0