Re: Planner: rows=1 after "similar to" where condition.

From: "Joris Dobbelsteen" <Joris(at)familiedobbelsteen(dot)nl>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Planner: rows=1 after "similar to" where condition.
Date: 2008-03-04 22:32:46
Message-ID: E4953B65D9E5054AA6C227B410C56AA9C3D4@exchange1.joris2k.local
Lists: pgsql-general

>-----Original Message-----
>From: pgsql-general-owner(at)postgresql(dot)org
>[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Joris
>Dobbelsteen
>Sent: Monday, 25 February 2008 17:08
>To: Tom Lane
>Cc: Gregory Stark; Scott Marlowe; pgsql-general(at)postgresql(dot)org
>Subject: Re: [GENERAL] Planner: rows=1 after "similar to"
>where condition.
>
>>-----Original Message-----
>>From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>>Sent: Monday, 25 February 2008 16:34
>>To: Joris Dobbelsteen
>>Cc: Gregory Stark; Scott Marlowe; pgsql-general(at)postgresql(dot)org
>>Subject: Re: [GENERAL] Planner: rows=1 after "similar to"
>>where condition.
>>
>>"Joris Dobbelsteen" <Joris(at)familiedobbelsteen(dot)nl> writes:
>>> "Bitmap Heap Scan on log_syslog syslog (cost=11168.32..16988.84
>>> rows=1
>>> width=221) (actual time=11145.729..30067.606 rows=212 loops=1)"
>>> " Recheck Cond: (((program)::text = 'amavis'::text) AND
>>> ((facility)::text = 'mail'::text))"
>>> " Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~
>>> '***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed
>>[A-Za-z0-9]+,
>>> [][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>,
>>> (Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+,
>>> queued.as: [^ ,]+, [0-9]+ ms)$'::text))"
>>
>>It's not too surprising that you'd get a small selectivity
>estimate for
>>such a long regexp; the default estimate is just based on the
>amount of
>>fixed text in the pattern, and you've got a lot.
>>
>>If you increase the stats target for the column to 100 or
>more then it
>>will try actually applying the regexp to all the histogram entries.
>>That might or might not give you a better estimate.
>
>I will try that; expect results back within a few days (I'll
>let it collect a better sample set). Unfortunately the regex
>is not so much for narrowing down the selection as for
>guaranteeing the format of the messages.
>You seem to consider the common case differently, and I can
>agree for the most part; my use case just differs from the
>expected one. That said, might a less aggressive selectivity
>estimation for long strings work better in the common case?

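(For reference: roughly the commands for raising the per-column target,
a sketch using the table/column names from the plans below rather than
a copy of my session:)

    -- Let ANALYZE sample more values for the message text column and
    -- build a larger histogram / MCV list.
    ALTER TABLE log_syslog ALTER COLUMN text SET STATISTICS 100;

    -- Rebuild the statistics (I ran a full VACUUM ANALYZE).
    VACUUM ANALYZE log_syslog;
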
A new test case (after a fresh VACUUM ANALYZE with the statistics
target for the text column set to 100):
Around 5288 rows out of 4.3 million match.

"Bitmap Heap Scan on log_syslog syslog (cost=17777.94..53522.27 rows=1
width=226) (actual time=41661.354..92719.083 rows=5288 loops=1)"
" Recheck Cond: (((program)::text = 'amavis'::text) AND
((facility)::text = 'mail'::text))"
" Filter: (((priority)::text = 'notice'::text) AND ((text)::text ~
'***:^(?:amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\) Passed [A-Za-z0-9]+,
[][0-9\\.]* <[^<>]+> -> <[^<>]+>, Message-ID: <[^<>]+>,
(Resent-Message-ID: <[^<>]+>, |)mail.id: [^ ,]+, Hits: [-+0-9\\.,]+,
queued.as: [^ ,]+, [0-9]+ ms)$'::text))"
" -> BitmapAnd (cost=17777.94..17777.94 rows=15279 width=0) (actual
time=4641.009..4641.009 rows=0 loops=1)"
" -> Bitmap Index Scan on "IX_log_syslog_program"
(cost=0.00..2908.86 rows=113370 width=0) (actual time=2913.718..2913.718
rows=113897 loops=1)"
" Index Cond: ((program)::text = 'amavis'::text)"
" -> Bitmap Index Scan on "IX_log_syslog_facility"
(cost=0.00..14868.57 rows=591426 width=0) (actual
time=1715.591..1715.591 rows=586509 loops=1)"
" Index Cond: ((facility)::text = 'mail'::text)"
"Total runtime: 92738.389 ms"

Unfortunately, Tom, the estimate is still rows=1: it seems the data
varies too much for matching values to end up in the histogram. So a
regex that is used NOT for selection but rather for enforcing the input
format apparently has to be written differently.
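(That can be checked against pg_stats, the standard statistics view;
a sketch, the actual output on my data is of course different:)

    -- Look at what ANALYZE stored for the column. With highly variable
    -- message text the histogram_bounds are only a sample of boundary
    -- values, so a long regex rarely matches any of them.
    SELECT n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'log_syslog' AND attname = 'text';
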
My alternative construction, with the regex inside substring() and a
"WHERE substring(...) IS NOT NULL" condition, gives a much better
estimate in these cases, and the result set is equivalent.
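(In query form the two variants look roughly like this; the column list
and the long pattern are abbreviated, and I use the two-argument POSIX
form of substring() here, so take it as a sketch of the shape rather
than the literal statements:)

    -- Variant 1: plain regex match. The planner estimates selectivity
    -- from the amount of fixed text in the (long) pattern, hence rows=1.
    SELECT *
    FROM log_syslog
    WHERE program = 'amavis' AND facility = 'mail' AND priority = 'notice'
      AND text ~ '^(?:amavis\\[[0-9]+\\]: ... [0-9]+ ms)$';

    -- Variant 2: wrap the regex in substring(). The planner only sees an
    -- "IS NOT NULL" condition, which gives a far less drastic estimate.
    SELECT *
    FROM log_syslog
    WHERE program = 'amavis' AND facility = 'mail' AND priority = 'notice'
      AND substring(text from 'amavis\\[[0-9]+\\]: ... [0-9]+ ms')
          IS NOT NULL;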

"Bitmap Heap Scan on log_syslog syslog (cost=17783.78..53966.33
rows=5844 width=226) (actual time=59095.076..110913.152 rows=5295
loops=1)"
" Recheck Cond: (((program)::text = 'amavis'::text) AND
((facility)::text = 'mail'::text))"
" Filter: (((priority)::text = 'notice'::text) AND
("substring"((text)::text, 'amavis\\[[0-9]+\\]: \\([0-9]+-[-0-9]+\\)
Passed \\"[A-Za-z0-9]+\\", [][0-9.]* <[^<>]+> -> <[^<>]+>, Message-ID:
<[^<>]+>, (Resent-Message-ID: <[^<>]+>, |)mail_id: [^ ,]+, Hits:
[-+0-9.,]+, queued_as: [^ ,]+, [0-9]+ ms'::text, '\\'::text) IS NOT
NULL))"
" -> BitmapAnd (cost=17783.78..17783.78 rows=15279 width=0) (actual
time=4003.657..4003.657 rows=0 loops=1)"
" -> Bitmap Index Scan on "IX_log_syslog_program"
(cost=0.00..2908.86 rows=113370 width=0) (actual time=1652.278..1652.278
rows=113939 loops=1)"
" Index Cond: ((program)::text = 'amavis'::text)"
" -> Bitmap Index Scan on "IX_log_syslog_facility"
(cost=0.00..14868.57 rows=591426 width=0) (actual
time=2339.943..2339.943 rows=586653 loops=1)"
" Index Cond: ((facility)::text = 'mail'::text)"
"Total runtime: 110921.978 ms"

Note: the few extra rows in the second run are due to the fact that this
is a live table that receives input continuously.

Concluding:
Your estimator is really great and gives pretty good estimates in
general; regular expressions seem to be the tricky exception.

A good guideline might be to NOT use a regex merely to force a format,
but only as a genuinely selective condition. I believe the estimator is
a bit too rough for long regexes.

- Joris
