Re: .ready and .done files considered harmful

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "dipesh(dot)pandit(at)gmail(dot)com" <dipesh(dot)pandit(at)gmail(dot)com>
Cc: "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "jeevan(dot)ladhe(at)enterprisedb(dot)com" <jeevan(dot)ladhe(at)enterprisedb(dot)com>, "sfrost(at)snowman(dot)net" <sfrost(at)snowman(dot)net>, "andres(at)anarazel(dot)de" <andres(at)anarazel(dot)de>, "hannuk(at)google(dot)com" <hannuk(at)google(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: .ready and .done files considered harmful
Date: 2021-09-07 17:28:45
Message-ID: CAE7DCDB-80E9-454E-A825-CB62496FB652@amazon.com
Lists: pgsql-hackers

On 9/7/21, 1:42 AM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> I was thinking that the multiple-files approach would work
> efficiently, but the patch still runs directory scans every 64
> files. As Robert mentioned, it is still O(N^2). I'm not sure of the
> reason for the limit, but if it is to lower memory consumption or
> the cost of sorting, we can resolve that issue by taking the
> trying-the-next approach and ignoring the case of having many gaps
> (discussed below). If it is to force periodic checking for
> out-of-order files, almost the same can be achieved by running a
> directory scan every 64 files in the trying-the-next approach (and
> we would suffer O(N^2) again). On the other hand, if archiving is
> delayed by several segments, the multiple-files method might reduce
> the cost of scanning the status directory, but that won't matter
> since the directory contains only a few files. (I think it might be
> better not to take the trying-the-next path if we find only a few
> files in a directory scan.) The multiple-files approach reduces the
> number of directory scans if there are many gaps in the WAL file
> sequence. Although theoretically the last max_backend(+alpha?)
> segments could be written out of order, I suppose in reality we only
> have gaps among the several latest files. I'm not sure, though..
>
> In short, the trying-the-next approach seems to me to be the way to
> go, for the reason that it is simpler but can cover the possible
> failures with almost the same measures as the multiple-files
> approach.

Thanks for chiming in. The limit of 64 in the multiple-files-per-
directory-scan approach was mostly arbitrary. My earlier testing [0]
with different limits didn't reveal any significant difference, but
using a higher limit might yield a small improvement when there are
several hundred thousand .ready files. IMO increasing the limit isn't
really worth it for this approach. For 500,000 .ready files, you'd
ordinarily need 500,000 directory scans, since each scan archives a
single file. When 64 files are archived for each directory scan, you
need ~8,000 directory scans. With 128 files per directory scan, you
need ~4,000. With 256, you need ~2,000. The difference between 8,000
directory scans and 500,000 is quite significant. The difference
between 2,000 and 8,000 isn't nearly as significant in comparison.
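
As a quick back-of-the-envelope check of those numbers (a throwaway
Python calculation using the file count and batch sizes from above,
where each directory scan is assumed to archive up to "batch" files):

    import math

    n_ready = 500_000
    for batch in (1, 64, 128, 256):
        scans = math.ceil(n_ready / batch)
        print(f"batch={batch:3d} -> ~{scans:,} directory scans")

    # batch=  1 -> ~500,000 directory scans
    # batch= 64 -> ~7,813 directory scans
    # batch=128 -> ~3,907 directory scans
    # batch=256 -> ~1,954 directory scans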

Nathan

[0] https://www.postgresql.org/message-id/3ECC212F-88FD-4FB2-BAF1-C2DD1563E310%40amazon.com
