Re: Freeze avoidance of very large table.

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Freeze avoidance of very large table.
Date: 2015-04-06 17:07:47
Message-ID: 5522BD63.80808@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4/6/15 11:12 AM, Sawada Masahiko wrote:
> On Mon, Apr 6, 2015 at 10:17 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
>> On 4/6/15 1:46 AM, Sawada Masahiko wrote:
>>>
>>> On Sun, Apr 5, 2015 at 8:21 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>>
>>>> On Sat, Apr 4, 2015 at 3:10 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 4/3/15 12:59 AM, Sawada Masahiko wrote:
>>>>>>
>>>>>>
>>>>>> + case HEAPTUPLE_LIVE:
>>>>>> + case HEAPTUPLE_RECENTLY_DEAD:
>>>>>> + case HEAPTUPLE_INSERT_IN_PROGRESS:
>>>>>> + case HEAPTUPLE_DELETE_IN_PROGRESS:
>>>>>> + if
>>>>>> (heap_prepare_freeze_tuple(tuple.t_data, freezelimit,
>>>>>> +
>>>>>> mxactcutoff, &frozen[nfrozen]))
>>>>>> +
>>>>>> frozen[nfrozen++].offset
>>>>>> = offnum;
>>>>>> + break;
>>>>>
>>>>>
>>>>>
>>>>> This doesn't seem safe enough to me. Can't there be tuples that are
>>>>> still
>>>>> new enough that they can't be frozen, and are still live?
>>>>
>>>>
>>>>
>>>> Yep. I've set a table to read only while it contained unfreezable
>>>> tuples,
>>>> and the tuples remain unfrozen yet the read-only action claims to have
>>>> succeeded.
>>>>
>>>>
>>>>>
>>>>> Somewhat related... instead of forcing the freeze to happen
>>>>> synchronously,
>>>>> can't we set this up so a table is in one of three states? Read/Write,
>>>>> Read
>>>>> Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would simply change to
>>>>> the
>>>>> appropriate state, and all the vacuum infrastructure would continue to
>>>>> process those tables as it does today. lazy_vacuum_rel would become
>>>>> responsible for tracking if there were any non-frozen tuples if it was
>>>>> also
>>>>> attempting a freeze. If it discovered there were none, AND the table was
>>>>> marked as ReadOnly, then it would change the table state to Frozen and
>>>>> set
>>>>> relfrozenxid = InvalidTransactionId and relminxid = InvalidMultiXactId.
>>>>> AT_SetReadWrite could change relfrozenxid to it's own Xid as an
>>>>> optimization. Doing it that way leaves all the complicated vacuum code
>>>>> in
>>>>> one place, and would eliminate concerns about race conditions with still
>>>>> running transactions, etc.
>>>>
>>>>
>>>>
>>>> +1 here as well. I might want to set tables to read only for reasons
>>>> other
>>>> than to avoid repeated freezing. (After all, the name of the command
>>>> suggests it is a general purpose thing) and wouldn't want to
>>>> automatically
>>>> trigger a vacuum freeze in the process.
>>>>
>>>
>>> Thank you for comments.
>>>
>>>> Somewhat related... instead of forcing the freeze to happen
>>>> synchronously, can't we set this up so a table is in one of three states?
>>>> Read/Write, Read Only, Frozen. AT_SetReadOnly and AT_SetReadWrite would
>>>> simply change to > the appropriate state, and all the vacuum infrastructure
>>>> would continue to process those tables as it does today. lazy_vacuum_rel
>>>> would become responsible for tracking if there were any non-frozen tuples if
>>>> it was also attempting > a freeze. If it discovered there were none, AND the
>>>> table was marked as ReadOnly, then it would change the table state to Frozen
>>>> and set relfrozenxid = InvalidTransactionId and relminxid =
>>>> InvalidMultiXactId. AT_SetReadWrite > could change relfrozenxid to it's own
>>>> Xid as an optimization. Doing it that way leaves all the complicated vacuum
>>>> code in one place, and would eliminate concerns about race conditions with
>>>> still running transactions, etc.
>>>
>>>
>>> I agree with 3 status, Read/Write, ReadOnly and Frozen.
>>> But I'm not sure when we should do to freeze tuples, e.g., scan whole
>>> tables.
>>> I think that the any changes to table are completely
>>> ignored/restricted if table is marked as ReadOnly table,
>>> and it's accompanied by freezing tuples, just mark as ReadOnly.
>>> Frozen table ensures that all tuples of its table completely has been
>>> frozen, so it also needs to scan whole table as well.
>>> e.g., we should need to scan whole table at two times. right?
>>
>>
>> No. You would be free to set a table as ReadOnly any time you wanted,
>> without scanning anything. All that setting does is disable any DML on the
>> table.
>>
>> The Frozen state would only be set by the vacuum code, IFF:
>> - The table state is ReadOnly *at the start of vacuum* and did not change
>> during vacuum
>> - Vacuum ensured that there were no un-frozen tuples in the table
>>
>> That does not necessitate 2 scans.
>>
>
> I understood this comcept, and have question as I wrote below.
>
>>>> +1 here as well. I might want to set tables to read only for reasons
>>>> other than to avoid repeated freezing. (After all, the name of the command
>>>> suggests it is a general purpose thing) and wouldn't want to automatically
>>>> trigger a
>>>> vacuum freeze in the process.
>>>>
>>>> There is another possibility here, too. We can completely divorce a
>>>> ReadOnly mode (which I think is useful for other things besides freezing)
>>>> from the question of whether we need to force-freeze a relation if we create
>>>> a
>>>> FrozenMap, similar to the visibility map. This has the added advantage of
>>>> helping freeze scans on relations that are not ReadOnly in the case of
>>>> tables that are insert-mostly or any other pattern where most pages stay
>>>> all-frozen.
>>>> Prior to the visibility map this would have been a rather daunting
>>>> project, but I believe this could piggyback on the VM code rather nicely.
>>>> Anytime you clear the VM you clearly must clear the FrozenMap as well. The
>>>> logic for
>>>> setting the FM is clearly different, but that would be entirely
>>>> self-contained to vacuum. Unlike the VM, I don't see any point to marking
>>>> special bits in the page itself for FM.
>>>
>>>
>>> I was thinking this idea (FM) to avoid freezing all tuples actually.
>>> As you said, it might not be good idea (or overkill) that the reason
>>> why settings table to read only is avoidance repeated freezing.
>>> I'm attempting to try design FM to avoid freezing relations as well.
>>> Is it enough that each bit of FM has information that corresponding
>>> pages are completely frozen on each bit?
>>
>>
>> If I'm understanding your implied question correctly, I don't think there
>> would actually be any relationship between FM and marking ReadOnly. It would
>> come into play if we wanted to do the Frozen state, but if we have the FM,
>> marking an entire relation as Frozen becomes a lot less useful. What's going
>> to happen with a VACUUM FREEZE once we have FM is that vacuum will be able
>> to skip reading pages if they are all-visible *and* the FM shows them as
>> frozen, whereas today we can't use the VM to skip pages if scan_all is true.
>>
>> For simplicity, I would start out with each FM bit representing a single
>> page. That means the FM would be very similar in operation to the VM; the
>> only difference would be when a bit in the FM was set. I would absolutely
>> split this into 2 patches as well; one for ReadOnly (and skip the Frozen
>> status for now), and one for FM.
>> When I looked at the VM code briefly it occurred to me that it might be
>> quite difficult to have 1 FM bit represent multiple pages. The issue is the
>> locking necessary between VACUUM and clearing a FM bit. In the VM that's
>> handled by the cleanup lock, but that will only work at a page level. We'd
>> need something to ensure that nothing came in and performed DML while the
>> vacuum code was getting ready to set a FM bit. There's probably several ways
>> this could be accomplished, but I think it would be foolish to try and do
>> anything about it in the initial patch. Especially because it's only
>> supposition that there would be much benefit to having multiple pages per
>> bit.
>>
>
> Yes, I will separate the patch into two patches.
>
> I'd like to confirm about whether what I'm thinking is correct here.
> In first version of patch, each FM bit represent a single page is
> imply whether the all tuple of the page completely has been frozen, it
> would be one patch.

Yes.

> The second patch adds 3 states and read-only table which disable to

Actually, I would start simply with ReadOnly and ReadWrite.

As I understand it, the goal here is to prevent huge amounts of periodic
freeze work due to XID wraparound. I don't think we need the Freeze
state to accomplish that.

With a single bit per page in the Frozen Map, checking a 800GB table
would require reading a mere 100MB of FM. That's pretty tiny, and
largely accomplishes the goal.

Obviously it would be nice to eliminate even that 100MB read, but I
suggest you leave that for a 3rd patch. I think you'll find that just
getting the first 2 accomplished will be a significant amount of work.

Also, note that you don't really even need the ReadOnly patch. As long
as you're not actually touching the table at all the FM will eventually
read as everything is frozen; that gets you 80% of the way there. So I'd
suggest starting with the FM, then doing ReadOnly, and only then
attempting to add the Frozen state.

> any write to table. The trigger which changes state from Read/Write to
> Read-Only is ALTER TABLE SET READ ONLY. And the trigger changes from
> Read-Only to Frozen is vacuum only when the table has been marked as
> Read-Only at vacuum is started *and* the vacuum did not any freeze
> tuple(including skip the page refer to FM). If we support FM, we would
> be able to avoid repeated freezing whole table even if the table has
> not been marked as Read-Only.
>
> In order to change state to Frozen, we need to do VACUUM FREEZE or
> wait for running of auto vacuum. Generally, the threshold of cutoff
> xid is different between VACUUM (and autovacuum) and VACUUM FREEZE. We
> would not expect to change status using by explicit vacuum and
> autovacuum. Inevitably, we would need to do both command ALTER TABLE
> SET READ ONLY and VACUUM FREEZE to change state to Frozen.
> I think that we should also add DDL which does both freezing tuple and
> changing state in one pass, like ALTER TABLE SET READ ONLY WITH FREEZE
> or ALTER TABLE SET FROZEN.
>
> Regards,
>
> -------
> Sawada Masahiko
>

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message ktm@rice.edu 2015-04-06 17:29:51 Re: Freeze avoidance of very large table.
Previous Message David Steele 2015-04-06 16:35:48 Re: Auditing extension for PostgreSQL (Take 2)