Re: n_distinct off by a factor of 1000

From: Ron <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: n_distinct off by a factor of 1000
Date: 2020-06-23 12:51:23
Message-ID: 7879cb87-5a68-edec-9ec6-3f082d78ed86@gmail.com
Lists: pgsql-general

Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"?
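
Something along these lines would be my first check (table name taken from
your mail, so adjust to your schema; note that in 12, autovacuum does not
analyze the partitioned parent, only the partitions, so the parent needs a
manual ANALYZE):

  ANALYZE VERBOSE bigtable;

  SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
  WHERE relname LIKE 'bigtable%';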

On 6/23/20 7:42 AM, Klaudie Willis wrote:
> Friends,
>
> I run PostgreSQL 12.3 on Windows. I have just discovered a pretty
> significant problem with PostgreSQL and my data.  I have a large table,
> 500M rows, 50 columns, split into 3 partitions by year.  In addition to
> the primary key, one of the columns is indexed, and I do lookups on it.
>
> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
> limit 1000
>
> It responded well, with sub-second responses, and it used the index on
> that column.  However, when I changed it to:
>
> Select * from bigtable b where b.instrument_ref in (x,y,z,...)
> limit 10000 -- (notice 10K now)
>
> The planner decided to do a full table scan of the entire 500M row table!
> And that did not work very well.  At first I had no clue why it did so,
> but when I disabled sequential scans the query returned immediately.  I
> should not have to do that, though.
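
For the archives, the quick way to see what the planner is up to is something
like this (identifiers as in your mail, the ids are placeholders):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM bigtable b
  WHERE b.instrument_ref IN (1, 2, 3)  -- placeholder ids
  LIMIT 10000;

  -- session-level experiment only, not a fix:
  SET enable_seqscan = off;
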
>
> I got my first hint of why this problem occurs when I looked at the
> statistics.  For the column in question, "instrument_ref", the statistics
> claim the following (default_statistics_target=500, and ANALYZE has been
> run):
>
> select n_distinct from pg_stats where attname like 'instr%_ref';  -- Result: 40,000
> select count(distinct instrument_ref) from bigtable;  -- Result: 33,385,922 (!!)
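
One thing to keep in mind when reading pg_stats: n_distinct is stored as a
negative value, meaning a fraction of the row count, whenever ANALYZE decides
the number of distinct values scales with the table size.  A rough way to see
the effective estimate is something like this (just a sketch, it ignores
schemas):

  SELECT s.tablename, s.attname,
         CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
              ELSE -s.n_distinct * c.reltuples
         END AS est_distinct
  FROM pg_stats s
  JOIN pg_class c ON c.relname = s.tablename
  WHERE s.attname = 'instrument_ref';
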
>
> That is an astonishing difference of almost 1000X.
>
> When the planner thinks there are only 40K distinct values, it makes
> sense to switch to a table scan in order to fill the limit of 10,000.
> But it is wrong, very wrong, and the query takes hundreds of seconds
> instead of a few.
>
> I have tried increasing the statistics target to 5000, and it helps, but
> it only reduces the error to 100X.  Still crazy high.
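
For what it's worth, you can raise the target for just that column instead of
globally, e.g. (assuming the column really is bigtable.instrument_ref):

  ALTER TABLE bigtable ALTER COLUMN instrument_ref SET STATISTICS 5000;
  ANALYZE bigtable;
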
>
> I understand that this is a known problem.  I have read previous posts
> about it, but I have never seen anyone report such a large difference.
>
> I have considered these fixes:
> - hardcode the statistics to a particular ratio of the total number of rows
> - randomize the rows more, so that the sample does not suffer from page
>   clustering.  However, this probably has other implications.
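
Your first option is supported directly: a negative n_distinct setting is
read as a ratio of the row count, so something along these lines (the value
is only an example) pins the estimate until you change or reset it:

  ALTER TABLE bigtable ALTER COLUMN instrument_ref SET (n_distinct = -0.05);
  ANALYZE bigtable;  -- the override is applied at the next ANALYZE
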
>
> Feel free to comment :)
>
>
> K
>

--
Angular momentum makes the world go 'round.
