Re: loop vs. aggregate was: update and group by/aggregate

From: Sam Mason <sam(at)samason(dot)me(dot)uk>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: loop vs. aggregate was: update and group by/aggregate
Date: 2008-08-27 13:47:24
Message-ID: 20080827134724.GH7271@frubble.xen.chris-lamb.co.uk

On Wed, Aug 27, 2008 at 02:58:18PM +0200, Ivan Sergio Borgonovo wrote:
> On Wed, 27 Aug 2008 12:45:42 +0100 Sam Mason <sam(at)samason(dot)me(dot)uk> wrote:
> > Have you been vacuuming (non-full) between runs? and as always,
> > are the stats reasonably up to date?
>
> there is autovacuum running regularly and I vacuumed full, analyzed
> and reindexed everything recently: that means that all changes to the
> DB were done BEFORE the maintenance stuff, but then I executed several
> selects on tables (including the ones involved).
> I tried to run the function and the statement 3 times one after the
> other... so they should have been in a similar situation.

I'd probably start by doing a normal vacuum, then run a command that
touches every row, then run a normal vacuum again. Between every
iteration you want to run a normal vacuum, otherwise the later runs will
see more garbage than the previous run (updating a row leaves a dead row
behind).
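
Something like this, with a made-up table name, is the cycle I mean
(note that VACUUM can't run inside a transaction block):

    -- reclaim dead rows left over from earlier runs
    VACUUM items;
    -- touch every row; this leaves one dead row behind per live row
    UPDATE items SET id = id;
    -- clean up the dead rows that UPDATE just created
    VACUUM items;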

> > > Can anybody explain why aggregates under perform so badly?
>
> > You could try just running the SELECT part to see how long the
> > aggregation takes.
>
> 33sec
> Surprising.

OK, so it's the UPDATE that's taking the time. I'm not experienced
enough to say why though.
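
If you want to see where the time goes, EXPLAIN ANALYZE will run the
statement and report actual timings; wrapping it in a transaction and
rolling back means the UPDATE doesn't stick (the statement below is
just a stand-in, not your actual query):

    BEGIN;
    EXPLAIN ANALYZE
      UPDATE items SET code = trim(code);
    ROLLBACK;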

> > It's quite easy to outsmart the planner with large amounts of
> > data, but it's surprising how well it does most of the time.
> > Generally you can just write whatever is obvious and the planner
> > will do something good with it. If it doesn't do the right thing
> > then you can worry about performance, rather than most languages
> > where you have to worry about performance from the start.
>
> I really thought that in this case the planner was going to outsmart
> me, since I think in that statement it could see a lot more scope for
> optimisation than I could, even knowing the nature of the data.

An experienced human thinking for several seconds (or probably minutes
in this example) is almost always going to be better than a computer
thinking for a millisecond.

> > have you set work_mem to some obscenely big value?
>
> 32MB

OK. It's just that it was doing a hash aggregation involving 160MB of
data. That will cause it to spill to disk, and I'd have thought it would
prefer to do something else.
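
You can bump work_mem for a single session to test whether a bigger
value changes the plan (table and column names below are made up):

    -- raise work_mem for this session only
    SET work_mem = '256MB';
    -- see whether the planner still picks the same aggregation strategy
    EXPLAIN SELECT code, sum(qty) FROM items GROUP BY code;
    -- back to the server default
    RESET work_mem;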

> Tom suggested raising that number into the range of tens of MB for
> another problem. I saw the hot spot was at 16MB, and considering I
> was expecting the DB to get bigger I increased it to 32MB.

32MB seems a reasonable default on modern hardware.

> > You're trimming an extra closing square bracket (']') in one of the
> > trim statements and not in the other. If you just do it in one
> > place then you don't have to worry about inconsistency.
>
> Guess: that was a typo. regexp brain context switching :)

I do the same thing far too often, hence I tend to do query rewrites
like the one I did on yours, to prevent this (human bug/feature) from
happening.
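
The pattern is just to compute the value once in a subquery and refer
to it by name, so the character list is only written in one place
(names here are made up):

    -- the trim is written once, so both uses are guaranteed consistent
    SELECT clean, upper(clean)
    FROM (SELECT trim(both ' []' from code) AS clean
          FROM items) AS t;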

> I do know Knuth and I think I share his opinions. I don't know
> "modern programming people" and I'm alien to the concept of
> "fundamental good".

A lot of programming courses will try to teach you to remove all
constraints from your code, whether they're ever going to be hit or
not. In the real world, resources are finite and effort has to be
spent appropriately.

In my eyes this is the difference between computer science and software
engineering. Software engineering is just interested in making
something that works now; computer science is about pushing back the
boundaries of what's possible. Too often the two get confused.

> But well, I'm here to learn. Could you point me to some explanation
> on why it should be a "fundamental good" in DB context?

As in why using TEXT is good over a large VARCHAR? It's normally an
engineering choice. Some database systems optimize one or the other a
lot more, so people tend to prefer one for arbitrary reasons. In PG
it doesn't really matter in terms of performance, and you should use
whichever expresses the data you're trying to store appropriately.
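
For example, in PG these two columns are stored identically; the
varchar(n) version just adds a length check (a made-up table, purely
for illustration):

    CREATE TABLE notes (
        body_text    text,           -- unconstrained
        body_bounded varchar(4000)   -- same storage, plus a length limit
    );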

> I do think that most of the time it is worth it (and part of the
> problem) to make a reasonable forecast. If I can't really make one,
> or it requires too much effort to make a reasonable one, at least I
> start developing with an "unreasonable" constraint just to catch some
> errors earlier, then drop the constraint and leave a comment.

Yes; however, this requires experience of what's worked so far, and
that is difficult to teach.
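
For what it's worth, that workflow looks something like this (names
are made up): add a deliberately tight check early on, then drop it
once the real range of the data is understood:

    -- deliberately tight, to surface bad data early in development
    ALTER TABLE items
        ADD CONSTRAINT items_code_len_chk CHECK (length(code) <= 32);

    -- later, once the real range is known (leave a comment saying why!)
    ALTER TABLE items DROP CONSTRAINT items_code_len_chk;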

Sam
