Quick Links

Re: batch update query performance

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Hans Drexler <Hans(dot)Drexler(at)humaninference(dot)com>
Cc:	"pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: batch update query performance
Date:	2014-04-14 00:02:04
Message-ID:	CAMkU=1xdrL_xTJ_QCUhxb5bHMJwpsTcwkmTDSvFYcDBH1VKQ6w@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On Tue, Apr 8, 2014 at 6:21 AM, Hans Drexler <
Hans(dot)Drexler(at)humaninference(dot)com> wrote:

> Dear Jeff, Albe and Heikki,
>
> Let me start by thanking you for your time. It is really nice to have a
> real supportive community. Thank you.
>
> After reading the answers, we decided to do an experiment with a
> fillfactor of 40% and dropping the index on the is_grc_002 field (but
> retaining the other indexes.) The experiment showed a reduction in
> run-time to ~125 seconds. That is almost 25 times faster than it was. We
> are now doing more tests to verify this fix. We will send a SOLVED
> message when the fix is verified (unless you state to not bother...)
>
> We think we understand why the improvement works. Let me state our
> understanding here. Please comment if we got it wrong.
>
> Index entries point to record pages. An update on a row results in a new
> row instance. If the new instance can be written in the same page as the
> old instance, then no indexes need to be updated because the index still
> points to the correct page. (Unless the update itself modifies an
> indexed value). By specifying a fillfactor of 40%, there will be room
> for an updated version of each row in the page.
>

This is mostly correct. The index entry does not have *just* a page, it
also has an offset to a slot on that page. However, once it gets to the
page there is a mechanism for chaining slots together, so you can still
find the new version given the slot of an older version on the same page.
(If there were a way to have the index store *just* the page, then it
would be even more useful for HOT, as then only the indexes for the values
actually changed would need to get updates, as opposed to now where every
index needs to be updated if any index needs to be updated. But that would
have other trade-offs)

> We assume (sorry) that vacuuming the table will release the space of the
> old rows, so that we can again do an update query and reuse the freed up
> space in the pages.
>

Once no transaction can possibly be interested in the old version, then a
vacuum can free it up for reuse. In the special case of old HOT-updated
tuples, any other process that happens to visit the page can also clean
them up once they are old enough, not just vacuums.

But if someone has a long running transaction open, even if that
transaction never has and never will touch the table being vacuumed, it
will still prevent the space from being reused.

>
>
> Jeff, answering your question: The update is done after each cycle. It
> will actually also update rows that were already updated before. We
> realize this is actually wasteful.
>
>
> So we might change
>
> update t67cdi_nl_cmp_descr set is_grc_002='Y'
>
> to
>
> update t67cdi_nl_cmp_descr set is_grc_002='Y' where is_grc_002 is null
>

> This will avoid creating new records for records that where already
> changed before. This might give us additional speed improvement.
>

That will probably help a lot. The HOT update code is smart enough to
realize that changing from 'Y' to 'Y' does not prevent the HOT update from
working, but it still needs to find room on the same page for a new copy of
the tuple or else it cannot use the HOT mechanism anyway. Once you add
this restriction to the where clause, you might find that it is better to
leave the index in place and put up with the index updates for those rows
which actually do need to be updated, rather than keep dropping the and
rebuilding the index. It would depend on what proportion of the table is
getting updated each time.

Cheers,

Jeff

In response to

Re: batch update query performance at 2014-04-08 13:21:25 from Hans Drexler

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Borodin Vladimir	2014-04-14 09:46:42	Checkpoint distribution
Previous Message	Andrew W. Gibbs	2014-04-13 00:12:18	Re: query against large table not using sensible index to find very small amount of data