From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
Cc: | Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information) |
Date: | 2014-06-19 05:09:00 |
Message-ID: | CA+HiwqGO9RM5ak2kVMTjbYKNthf5oEE7TM3cM_zY1uVWmG8iYg@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 01/22/2014 03:39 AM, Tomas Vondra wrote:
>>
>> What annoys me a bit is the huge size difference between the index
>> updated incrementally (by a sequence of INSERT commands), and the index
>> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
>> (2288 vs. 2035 MB), but is there a chance to improve this?
>
>
> Hmm. What seems to be happening is that pending item list pages that the
> fast update mechanism uses are not getting recycled. When enough list pages
> are filled up, they are flushed into the main index and the list pages are
> marked as deleted. But they are not recorded in the FSM, so they won't be
> recycled until the index is vacuumed. Almost all of the difference can be
> attributed to deleted pages left behind like that.
>
> So this isn't actually related to the packed posting lists patch at all. The
> patch just makes the bloat more obvious, because it makes the actual size of
> the index, excluding deleted pages, smaller. But it can be observed on git
> master as well:
>
> I created a simple test table and index like this:
>
> create table foo (intarr int[]);
> create index i_foo on foo using gin(intarr) with (fastupdate=on);
>
> I filled the table like this:
>
> insert into foo select array[-1] from generate_series(1, 10000000) g;
>
> postgres=# \d+i
>                   List of relations
>  Schema | Name | Type  | Owner  |  Size  | Description
> --------+------+-------+--------+--------+-------------
>  public | foo  | table | heikki | 575 MB |
> (1 row)
>
> postgres=# \di+
>                       List of relations
>  Schema | Name  | Type  | Owner  | Table |  Size  | Description
> --------+-------+-------+--------+-------+--------+-------------
>  public | i_foo | index | heikki | foo   | 251 MB |
> (1 row)
>
> I wrote a little utility that scans all pages in a gin index, and prints out
> the flags indicating what kind of a page it is. The distribution looks like
> this:
>
>      19 DATA
>    7420 DATA LEAF
>   24701 DELETED
>       1 LEAF
>       1 META
>
> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list pages
> have been marked as deleted, but unfortunately that didn't help. The problem
> is that the FSM is organized into a three-level tree, and
> RecordFreeIndexPage only updates the bottom level. The upper levels are not
> updated until the FSM is vacuumed, so the pages are still not visible to
> GetFreeIndexPage calls until the next vacuum. The simplest fix would be to
> add a call to IndexFreeSpaceMapVacuum after flushing the pending list, per
> attached patch. I'm slightly worried about the performance impact of the
> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
> the upper levels of the FSM in a retail fashion instead.
>
Did you pursue this any further?
You recently added a number of TODO items related to GIN indexes; would it be
worth adding this one to the list?
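
To make the FSM behaviour quoted above concrete, here is a rough toy model.
It is only a sketch, not PostgreSQL code: the real FSM is a three-level tree
of pages, whereas this is a single small binary max-tree in an array, and
every name in it (ToyFsm, toy_record_free_leaf_only, toy_vacuum, and so on)
is made up for illustration. It shows why a page recorded only at the bottom
level stays invisible to a root-down search until the upper levels are
rebuilt, and what a "retail" update that propagates upward would look like.

```c
#include <assert.h>
#include <string.h>

/* Toy model only: one binary max-tree in an array, NOT PostgreSQL's FSM. */
#define TOY_LEAVES     8                    /* index pages tracked (hypothetical) */
#define TOY_NODES      (2 * TOY_LEAVES - 1) /* internal nodes plus leaves */
#define TOY_FIRST_LEAF (TOY_LEAVES - 1)     /* array index of the first leaf */

typedef struct ToyFsm
{
    unsigned char node[TOY_NODES];          /* 0 = nothing free below, 1 = free */
} ToyFsm;

static unsigned char toy_max(unsigned char a, unsigned char b)
{
    return a > b ? a : b;
}

static void toy_init(ToyFsm *fsm)
{
    memset(fsm->node, 0, sizeof(fsm->node));
}

/* Record a free page in the bottom level only; upper levels are untouched. */
static void toy_record_free_leaf_only(ToyFsm *fsm, int page)
{
    assert(page >= 0 && page < TOY_LEAVES);
    fsm->node[TOY_FIRST_LEAF + page] = 1;
}

/* "Retail" variant: record the page, then fix up each ancestor up to the root. */
static void toy_record_free_retail(ToyFsm *fsm, int page)
{
    int i;

    assert(page >= 0 && page < TOY_LEAVES);
    i = TOY_FIRST_LEAF + page;
    fsm->node[i] = 1;
    while (i > 0)
    {
        i = (i - 1) / 2;
        fsm->node[i] = toy_max(fsm->node[2 * i + 1], fsm->node[2 * i + 2]);
    }
}

/* Stand-in for a full FSM vacuum: recompute every internal node bottom-up. */
static void toy_vacuum(ToyFsm *fsm)
{
    int i;

    for (i = TOY_FIRST_LEAF - 1; i >= 0; i--)
        fsm->node[i] = toy_max(fsm->node[2 * i + 1], fsm->node[2 * i + 2]);
}

/* Search from the root down; returns a free page number, or -1 if none is visible. */
static int toy_find_free_page(const ToyFsm *fsm)
{
    int i = 0;

    if (fsm->node[0] == 0)
        return -1;                          /* root says "nothing free" */
    while (i < TOY_FIRST_LEAF)
    {
        int left = 2 * i + 1;

        i = (fsm->node[left] > 0) ? left : left + 1;
    }
    return i - TOY_FIRST_LEAF;
}
```

In this model, a page recorded with toy_record_free_leaf_only() is not found
by toy_find_free_page() until toy_vacuum() has run, while
toy_record_free_retail() makes it findable immediately at the cost of a walk
up the tree per recorded page instead of a scan of the whole structure.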
--
Amit