Re: inaccurate stats on large tables

From: "Matt Smiley" <mss(at)rentrak(dot)com>
To: <mukhyala(dot)kiran(at)gene(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: inaccurate stats on large tables
Date: 2008-09-08 16:16:16
Message-ID: 48C4ED52.D078.0028.0@rentrak.com
Lists: pgsql-performance

Hi Kiran,

You gave great info on your problem.

First, is this the query you're actually trying to speed up, or is it a simplified version? It looks like the optimizer has already chosen the best execution plan for the given query. Since the query has no joins, we only have to consider access paths. You're fetching 58221/37909009 = 0.15% of the rows, so a sequential scan is clearly inappropriate. A basic index scan is likely to incur extra scattered I/O, so a bitmap index scan is favored.
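If you want to see what the alternative access paths would cost on your hardware, you can nudge the planner away from the bitmap scan and compare. This is only a sketch; the where clause below is a stand-in built from the columns discussed later in this mail, so substitute your actual query:

begin;
-- plan as the planner chose it (expect a bitmap index scan):
explain analyze select * from paliasorigin where origin_id=20 and tax_id=9606;
-- discourage the bitmap scan to see the plain index scan instead:
set local enable_bitmapscan = off;
explain analyze select * from paliasorigin where origin_id=20 and tax_id=9606;
-- discourage index scans too, to see the sequential scan:
set local enable_indexscan = off;
explain analyze select * from paliasorigin where origin_id=20 and tax_id=9606;
rollback;  -- the "set local" settings only last for this transaction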

To improve on this query's runtime, you could try any of the following:

- Reorganize the data to reduce this query's scattered I/O, i.e. cluster the table on "paliasorigin_search3_idx" rather than "paliasorigin_alias_casefold_idx" (see the sketch after this list). Bear in mind, this may adversely affect other queries.

- Increase the cache hit frequency by ensuring the underlying filesystem cache has plenty of RAM (Linux normally gives the filesystem cache whatever RAM is otherwise free) and checking that other concurrent queries aren't polluting the cache. Consider adding RAM if you think the working set of blocks required by most queries is larger than the combined Postgres and filesystem caches. If processes other than the database do I/O on this machine, consider them cache consumers, too.

- Restructure the table, partitioning along a column that would be useful for pruning whole partitions for your painful queries. In this case, origin_id or tax_id seems like a good bet, but again, consider other queries against this table. 38 million rows probably makes your table around 2 GB (guessing about 55 bytes/row). Depending on the size and growth rate of the table, it may be time to consider partitioning. Out of curiosity, what runtime are you typically seeing from this query? The explain-analyze ran in 113 ms, which I'm guessing is the effect of caching, not the runtime you're trying to improve.

- Rebuild the indexes on this table. Under certain use conditions, btree indexes can get horribly bloated. Rebuilding the indexes returns them to their most compact and balanced form. For example: reindex index "paliasorigin_search3_idx"; Apart from the locking and CPU usage during the rebuild, this has no negative consequences, so I'd try this before something drastic like partitioning. First review the current size of the index for comparison: select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));
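To make the suggestions above concrete, here is a rough sketch of the commands involved. The index and column names come from your mail, but the partition layout is purely made up, so treat this as a template rather than something to paste in:

-- reorganize the heap around the index the painful query uses
-- (8.3 syntax; older releases use: cluster paliasorigin_search3_idx on paliasorigin):
cluster paliasorigin using paliasorigin_search3_idx;

-- check how well the table and its indexes are served from the Postgres buffer cache:
select heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
from pg_statio_user_tables where relname = 'paliasorigin';

-- a minimal inheritance-based partitioning skeleton: one child table per tax_id
-- (or range of ids), plus an insert trigger/rule and constraint_exclusion = on
-- so the planner can prune partitions:
create table paliasorigin_tax_9606 (check (tax_id = 9606)) inherits (paliasorigin);

-- measure, rebuild, and re-measure the index:
select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));
reindex index paliasorigin_search3_idx;
select pg_size_pretty(pg_relation_size('paliasorigin_search3_idx'));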

Since you asked specifically about improving the row-count estimate, as the previous responder said, you should consider increasing the statistics target. This will help if individual columns are being poorly estimated, but not if the misestimate comes from joint variation across columns. In other words, the optimizer has no way to tell whether there is a logical relationship between columns A and B such that certain values of B only occur with certain values of A. Just judging from the names, it sounds like origin_id and tax_id might have a parent-child relationship, so I thought it was worth mentioning.
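If you want to see what the planner currently believes about those columns, pg_stats is worth a look (the view and its columns are standard; the table and column names are just the ones from your mail):

select attname, null_frac, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'paliasorigin' and attname in ('origin_id', 'tax_id');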

Do the columns individually have good estimates?
explain analyze select * from paliasorigin where origin_id=20;
explain analyze select * from paliasorigin where tax_id=9606;

If not, increase the statistics on that column, reanalyze the table, and recheck the selectivity estimate:
alter table paliasorigin alter column origin_id set statistics 20;
analyze paliasorigin;
explain analyze select * from paliasorigin where origin_id=20;
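If you're not sure what the column was using before, you can check the server-wide default and any per-column override first (a -1 in attstattarget means the column falls back to default_statistics_target):

show default_statistics_target;
select attname, attstattarget from pg_attribute
where attrelid = 'paliasorigin'::regclass and attname in ('origin_id', 'tax_id');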

Good luck!
Matt
