Re: 8.4.4, 9.0, and 9.1 Planner Differences

From: Anthony Presley <anthony(at)resolution(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>, Erik Van Gilder <evg(at)resolution(dot)com>
Subject: Re: 8.4.4, 9.0, and 9.1 Planner Differences
Date: 2011-10-22 23:12:51
Message-ID: CAO2AxyrxQ=4sD7OnUuz8AADBs5176M=dQ-MhFbWLGJ3aDj=kJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Oct 22, 2011 at 10:58 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Anthony Presley <anthony(at)resolution(dot)com> writes:
> > We have a dev machine running 9.0.1 (an i3 laptop, with a regular hard
> disk,
> > with 4GB of RAM, and a mostly untuned postgresql.conf file). The changed
> > lines are:
> > shared_buffers = 512MB
> > temp_buffers = 48MB
> > work_mem = 32MB
> > maintenance_work_mem = 348MB
> > checkpoint_segments = 10
> > effective_cache_size = 512MB
>
> > The same database is loaded onto a production server running 9.1.1 (dual
> QC
> > processors, RAID-10 SAS drives, 36GB of RAM), which replicates to a
> backup
> > server. This has a lot of changed properties:
> > shared_buffers = 8500MB
> > work_mem = 35MB
> > maintenance_work_mem = 512MB
> > wal_level = hot_standby
> > checkpoint_segments = 50
> > max_wal_senders = 3
> > wal_keep_segments = 144
> > random_page_cost = 1.0
> > effective_cache_size = 16384MB
> > effective_io_concurrency = 6
>
> That random_page_cost setting is going to have a huge effect on the
> planner's choices, and the larger effective_cache_size setting will
> likely affect plans too. I don't find it surprising in the least
> that you're getting different plan choices ... and even less so when
> your "dev" and "production" DBs aren't even the same major version.
> You might want to think about making your dev environment more like
> your production.
>

Tom - thanks for your input.

Upgrading to 9.1.1 on the dev box is certainly next on our list ... I like
to make sure that the dev team uses a MUCH slower box than the production
server, making sure that if the developers are making things fast for the
machines, it's really fast on the production box. For all of our queries
except this one, this strategy is "working".

> The same DB is loaded on both the production and the dev environment, and
> in
> > all cases (about 5000 distinct different queries), the production
> > environment is about 500x faster, except for one type of query (both
> > databases were loaded from the same pg_dump on an 8.4.4 database):
>
> > On the dev box, we have: http://explain.depesz.com/s/rwU - about
> 131
> > seconds
> > On the production box, we have: http://explain.depesz.com/s/3dt -
> > about .25 seconds
>
> Did you mislabel these? Because if you didn't, the numbers are right
> in line with what you say above. But anyway, the problem with the
> slower query appears to be poor rowcount estimates, leading the planner
> to use a nestloop join when it shouldn't. You haven't provided nearly
> enough context to let anyone guess why the estimates are off, other
> than boilerplate suggestions like making sure the tables have been
> ANALYZEd recently, and maybe increasing the statistics targets.
>

I *did* mis-label them. The 131 seconds is actually the production box.
IE:
production is ... http://explain.depesz.com/s/rwU

The .25 seconds is the development box. IE:
development is ... http://explain.depesz.com/s/3dt

I wasn't surprised that the plans are different. I was surprised that the
development box *spanked* the production system.

Here's the actual query:
select
preference0_.*
from
preference preference0_, location location1_, employee employee2_
where
preference0_.employee_id=employee2_.id
and preference0_.location_id=location1_.id
and location1_.corporation_id=41197
and employee2_.deleted='N'
and preference0_.deleted='N'
and
(preference0_.id not in (
select preference3_.id from preference preference3_, location
location4_, employee employee5_
where preference3_.employee_id=employee5_.id and
preference3_.location_id=location4_.id
and location4_.corporation_id=41197 and employee5_.deleted='N' and
preference3_.deleted='N'
and (preference3_.startDate>'2011-11-03 00:00:00' or
preference3_.endDate<'2011-11-02 00:00:00'))
) and
(preference0_.employee_id in
(select employee6_.id from employee employee6_ inner join app_user
user7_ on employee6_.user_id=user7_.id
inner join user_location userlocati8_ on user7_.id=userlocati8_.user_id,
location location9_
where userlocati8_.location_id=location9_.id and
userlocati8_.location_id=6800 and userlocati8_.deleted='N'
and location9_.deleted='N' and employee6_.deleted='N')
) order by preference0_.date_created;

I have tried setting the statistics on employee.user_id to be 100 and 1000,
and the rest are the default (100).

I've run both an "ANALYZE" and a "VACUUM ANALYZE" on the production system -
both "generally", and on each of the above tables (employee, app_user,
location, preference).

Here's an updated explain of the most recent attempt. About 5 minutes after
I analyzed them:
http://explain.depesz.com/s/G32

What else would I need to provide?

--
Anthony Presley

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Stefan Keller 2011-10-22 23:33:53 hstore query: Any better idea than adding more memory?
Previous Message Tom Lane 2011-10-22 15:58:26 Re: 8.4.4, 9.0, and 9.1 Planner Differences