Re: UDP buffer drops / statistics collector

From: Tim Kane <tim(dot)kane(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: UDP buffer drops / statistics collector
Date: 2017-04-20 07:38:42
Message-ID: CADVWZZ+TDwv4rnL8hTme6A8YYAm7CcN8zapDkFh6S6pcEKHVjA@mail.gmail.com
Lists: pgsql-general

Ok, fixed it! :D
Posting here for future me (and others like me).

It would seem (having not read the kernel source) that increasing the kernel
receive buffer sizes (rmem_default / rmem_max) does *not* take effect for any
processes that are *already* bound or listening on a port/socket. I had
previously assumed this was a global kernel buffer; perhaps not.
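
For future reference, the sequence that finally worked was roughly the below -
the buffer values and data directory are placeholders, not recommendations:

# raise the default and max socket receive buffers (example values only)
# only sockets created *after* this change pick up the new default
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.rmem_max=16777216

# then restart the cluster so the stats collector re-creates its UDP socket
pg_ctl restart -D /path/to/data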

I had been trying to manage these buffer sizes to resolve the UDP drop
issues, but I had not at any time restarted the statistics collector
process. I restarted the cluster in a moment of last resort (something I
had tried numerous times *before* playing with the buffer sizes) and lo and
behold!! no more buffer drops!

Problem solved.
The pgss_query_texts.stat file still wants to live in the default *pg_stat_tmp*
directory, whether by design or not.. but that's a non-issue for me now.

Thanks for listening :)

On Wed, Apr 19, 2017 at 7:36 PM Tim Kane <tim(dot)kane(at)gmail(dot)com> wrote:

> Well, this is frustrating..
> The buffer drops are still occurring - so I thought it worth trying to use a
> ramdisk and set *stats_temp_directory* accordingly.
>
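> Roughly what I did, for anyone curious - the mount point and size here are
> just what I happened to pick:
>
> # create a ramdisk and hand it to postgres
> mkdir -p /var/run/pg_stats_tmp
> mount -t tmpfs -o size=256M tmpfs /var/run/pg_stats_tmp
> chown postgres:postgres /var/run/pg_stats_tmp
>
> # in postgresql.conf:
> #   stats_temp_directory = '/var/run/pg_stats_tmp'
>
> # stats_temp_directory only needs a reload, not a restart
> pg_ctl reload -D /path/to/data
>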
> I've reloaded the instance, and can see that the stats directory is now
> being populated in the new location. *Except* - there is one last file (
> pgss_query_texts.stat) that continues to be updated in the *old* pg_stat_tmp
> path.. Is that supposed to happen?
>
>
> Fairly similar to this guy (but not quite the same).
>
> https://www.postgresql.org/message-id/D6E71BEFAD7BEB4FBCD8AE74FADB1265011BB40FC749@win-8-eml-ex1.eml.local
>
> I can see the packets arriving and being consumed by the collector.. and,
> the collector is indeed updating in the new stats_temp_directory.. just not
> for that one file.
>
>
> It also failed to resolve the buffer drops.. At this point, I'm not sure I
> expected it to. They tend to occur semi-regularly (every 8-13 minutes) but
> I can't correlate them with any kind of activity (and if I'm honest, it's
> possibly starting to drive me a little bit mad).
>
>
>
>
> On Tue, Apr 18, 2017 at 2:53 PM Tim Kane <tim(dot)kane(at)gmail(dot)com> wrote:
>
>> Okay, so I've run an strace on the collector process during a buffer drop
>> event.
>> I can see evidence of a recvfrom loop pulling in a *maximum* of 142kb.
>>
>> While I had already increased rmem_max, it would appear this is not
>> being observed by the kernel.
>> rmem_default is set to 124kb, which would explain the above read maxing
>> out just slightly beyond this (presuming a ring buffer filling up behind
>> the read).
>>
>> I'm going to try increasing rmem_default and see if it has any positive
>> effect.. (and then investigate why the kernel doesn't want to consider
>> rmem_max)..
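>>
>> For anyone following along, this is roughly how I've been poking at it
>> (the pid below is whatever your stats collector process happens to be):
>>
>> # watch the collector's reads during a drop event
>> strace -p <collector_pid> -e trace=recvfrom
>>
>> # the effective receive buffer (rb) on its UDP socket
>> ss -u -a -m
>>
>> # per-socket drop counter is the last column here
>> cat /proc/net/udp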
>>
>>
>>
>>
>>
>> On Tue, Apr 18, 2017 at 8:05 AM Tim Kane <tim(dot)kane(at)gmail(dot)com> wrote:
>>
>>> Hi all,
>>>
>>> I'm seeing sporadic (but frequent) UDP buffer drops on a host that so
>>> far I've not been able to resolve.
>>>
>>> The drops are originating from postgres processes, and from what I know
>>> - the only UDP traffic generated by postgres should be consumed by the
>>> statistics collector - but for whatever reason, it's failing to read the
>>> packets quickly enough.
>>>
>>> Interestingly, I'm seeing these drops occur even when the system is
>>> idle.. but every 15 minutes or so (not consistently enough to isolate any
>>> particular activity) we'll see in the order of ~90 packets dropped at a
>>> time.
>>>
>>> I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on
>>> the same hardware)
>>>
>>>
>>> If it's relevant.. there are two instances of postgres running (and
>>> consequently, 2 instances of the stats collector process) though 1 of those
>>> instances is most definitely idle for most of the day.
>>>
>>> In an effort to try to resolve the problem, I've increased (x2) the UDP
>>> recv buffer sizes on the host - but it seems to have had no effect.
>>>
>>> cat /proc/sys/net/core/rmem_max
>>> 1677216
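>>>
>>> For reference, the drops themselves are visible in the kernel's UDP counters:
>>>
>>> # look for 'packet receive errors' / 'receive buffer errors' under Udp:
>>> netstat -su
>>>
>>> # the default buffer size, for comparison
>>> cat /proc/sys/net/core/rmem_default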
>>>
>>> The following parameters are configured
>>>
>>> track_activities on
>>> track_counts on
>>> track_functions none
>>> track_io_timing off
>>>
>>>
>>> There are approximately 80-100 connections at any given time.
>>>
>>> It seems that the issue started a few weeks ago, around the time of a
>>> reboot on the given host... but it's difficult to know what (if anything)
>>> has changed, or why :-/
>>>
>>>
>>> Incidentally... the documentation doesn't seem to have any mention of
>>> UDP whatsoever. I'm going to use this as an opportunity to dive into the
>>> source - but perhaps it's worth improving the documentation around this?
>>>
>>> My next step is to try disabling track_activities and track_counts to
>>> see if they improve matters any, but I wouldn't expect these to generate
>>> enough data to flood the UDP buffers :-/
>>>
>>> Any ideas?
>>>
>>>
>>>
>>>
