Re: HEAD seems to generate larger WAL regarding GIN index

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HEAD seems to generate larger WAL regarding GIN index
Date: 2014-04-14 14:18:50
Message-ID: CAHGQGwFNZ3X6Zrt6Ok-WAdR198gdfTomkYaCRwnhfSh2VCviqA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
>> On 15/03/14 20:27, Heikki Linnakangas wrote:
>>> That said, I didn't expect the difference to be quite that big when you're
>>> appending to the end of the table. When the new entries go to the end of the
>>> posting lists, you only need to recompress and WAL-log the last posting
>>> list, which is max 256 bytes long. But I guess that's still a lot more WAL
>>> than in the old format.
>>>
>>> That could be optimized, but I figured we can live with it, thanks to the
>>> fastupdate feature. Fastupdate allows amortizing that cost over several
>>> insertions. But of course, you explicitly disabled that...
>>
>> In a concurrent update environment, fastupdate as it is in 9.2 is not really
>> useful. It may be that you can bulk up insertion, but you have no control
>> over who ends up paying the debt. Doubling the amount of wal from
>> gin-indexing would be pretty tough for us, in 9.2 we generate roughly 1TB
>> wal / day, keeping it
>> for some weeks to be able to do PITR. The wal are mainly due to gin-index
>> updates as new data is added and needs to be searchable by users. We do run
>> gzip that cuts it down to 25-30% before keeping the for too long, but
>> doubling this is going to be a migration challenge.
>>
>> If fast-update could be made to work in an environment where we both have
>> users searching the index and manually updating it and 4+ backend processes
>> updating the index concurrently then it would be a good benefit to gain.
>>
>> the gin index currently contains 70+ million records with and average
>> tsvector of 124 terms.
>
> Should we try to install some hack around fastupdate for 9.4? I fear
> the divergence between reasonable values of work_mem and reasonable
> sizes for that list is only going to continue to get bigger. I'm sure
> there's somebody out there who has work_mem = 16GB, and stuff like
> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
> appeal of large values.

Controlling the threshold of the size of pending list only by GUC doesn't
seem reasonable. Users may want to increase the threshold only for the
GIN index which can be updated heavily, and decrease it otherwise. So
I think that it's better to add new storage parameter for GIN index to control
the threshold, or both storage parameter and GUC.

Regards,

--
Fujii Masao

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2014-04-14 14:40:15 Re: Autonomous Transaction (WIP)
Previous Message Michael Paquier 2014-04-14 14:13:49 Re: Patch: add psql tab completion for event triggers