From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Cc: | Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <ringerc(at)ringerc(dot)id(dot)au>, Harold A(dot) Giménez <harold(dot)gimenez(at)gmail(dot)com> |
Subject: | Re: [PERFORM] DELETE vs TRUNCATE explanation |
Date: | 2012-07-15 00:10:18 |
Message-ID: | CAMkU=1yLXvODRZZ_=fgrEeJfk2tvZPTTD-8n8BwrAhNz_WBT0A@mail.gmail.com |
Lists: | pgsql-hackers pgsql-performance |
On Thu, Jul 12, 2012 at 9:55 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> I've moved this thread from performance to hackers.
>
> The topic was poor performance when truncating lots of small tables
> repeatedly on test environments with fsync=off.
>
> On Thu, Jul 12, 2012 at 6:00 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> I think the problem is in the Fsync Absorption queue. Every truncate
>> adds a FORGET_RELATION_FSYNC to the queue, and processing each one of
>> those leads to a sequential scan of the checkpointer's pending ops hash
>> table, which is quite large. It is almost entirely full of other
>> requests which have already been canceled, but it still has to dig
>> through them all. So this is essentially an N^2 operation.
...
>
>> I'm not sure why we don't just delete the entry instead of marking it
>> as cancelled. It looks like the only problem is that you can't delete
>> an entry other than the one just returned by hash_seq_search. Which
>> would be fine, as that is the entry that we would want to delete;
>> except that mdsync might have a different hash_seq_search open, and so
>> it wouldn't be safe to delete.
The attached patch addresses this problem by deleting the entry when
it is safe to do so, and flagging it as canceled otherwise.
I thought of using has_seq_scans to determine when it is safe, but
dynahash.c does not make that function public, and I was afraid it
might be too slow, anyway.
So instead I used a static variable, plus the knowledge that the only
time there are two scans on the table is when mdsync starts one and
then calls RememberFsyncRequest indirectly. There is one other place
that does a seq scan, but there is no way for control to pass from
that loop to reach RememberFsyncRequest.
I've added code to disclaim the scan if mdsync errors out. I don't
think that this should be a problem, because at that point the scan object
is never going to be used again, so if its internal state gets screwed
up it shouldn't matter. However, I wonder if it should also call
hash_seq_term, otherwise the pending ops table will be permanently
prevented from expanding (this is a pre-existing condition, not to do
with my patch). Since I don't know what can make mdsync error out
without being catastrophic, I don't know how to test this out.
One concern is that if the ops table ever does become bloated, it can
never recover while under load. The bloated table will cause mdsync
to take a long time to run, and as long as mdsync is in the call stack
the antibloat feature is defeated--so we have crossed a tipping point
and cannot get back. I don't see that occurring in the current use
case, however. With my current benchmark, the anti-bloat is effective
enough that mdsync never takes very long to execute, so a virtuous
circle exists.
As an aside, the comments in dynahash.c seem to suggest that one can
always delete the entry returned by hash_seq_search, regardless of the
existence of other sequential searches. I'm pretty sure that this is
not true. Also, shouldn't this contract about when one is allowed to
delete entries be in the hsearch.h file, rather than the dynahash.c
file?
Also, I still wonder if it is worth remembering fsync requests (under
fsync=off) that may or may not ever take place. Is there any
guarantee that we can make by doing so, that couldn't be made
otherwise?
Cheers,
Jeff
Attachment | Content-Type | Size |
---|---|---|
FsyncRequest_delete_v1.patch | application/octet-stream | 4.5 KB |