Quick Links

Re: POC: Extension for adding distributed tracing - pg_tracing

From:	Anthonin Bonnefoy <anthonin(dot)bonnefoy(at)datadoghq(dot)com>
To:	Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>
Subject:	Re: POC: Extension for adding distributed tracing - pg_tracing
Date:	2023-12-07 14:36:12
Message-ID:	CAO6_XqrTwFTakufrMgG5Ha9gz1ZHUO4WwqVTOUOez2cS0XKUag@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

Thanks for the review!

> ```
> +-- Worker can take some additional time to end and report their spans
> +SELECT pg_sleep(0.2);
> + pg_sleep
> +----------
> +
> +(1 row)
> ```
>
> Pretty sure this will fail on buildfarm from time to time. Maybe this
> is a case when it is worth using TAP tests? Especially considering the
> support of pg_tracing.notify_channel.

The sleep was unnecessary as the leader waits for all workers to finish
and report. All workers should have added their spans in the shared buffer
before the leader and I removed the sleep from the test. I haven't seen a recent
failure on this test since then.

I did add a TAP test for the notify part since it seems to be the only way
to test it.

> ```
> + Only superusers can change this setting.
> + This parameter can only be set at server start.
> ```
>
> I find this confusing. If the parameter can only be set at server
> start doesn't it mean that whoever has access to postgresql.conf can
> change it?

The 'only superusers...' part was incorrect. This setting was needed to compute
the size of the shmem request so modifying it would require a server start.
Though for this parameter, I've realised that I could use the global's
max_parallel_workers
instead, removing the need to have a max_traced_parallel_workers parameter.

> Perhaps TABLE command should be added to the list, since it's
> basically a syntax sugar for SELECT:

Added.

> Obviously pg_tracing.drop_on_full_buffer = on will not work properly
> with pg_tracing.notify_threshold because there is a race condition
> possible. By the time one receives and processes the notification the
> buffer can be already cleaned.

That's definitely a possibility. Lowering the notification threshold can
lower the chance of the buffer being cleared when handling the notification
but the race can still happen.

For my use cases, spans should be sent as soon as possible and old spans
are less valuable than new spans but other usage would want the opposite,
thus the parameters to control the behavior.

> This makes me think that 1) pg_tracing.drop_on_full_buffer = on could
> be not the best default value

I don't have a strong opinion on the default behaviour, I will switch to
keep buffer on full.

> 2) As a user I would like a support of ring buffer mode. Even if
> implementing this mode is out of scope of this particular patch, I
> would argue that instead of drop_on_full_buffer boolean parameter we
> need something like:
>
> pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
> perhaps_other_future_modes

Having a buffer ring was also my first approach but it adds some complexity
and I've switched to the full drop + truncate query file for the first
iteration.
Ring buffer would definitely be a better experience, I've switched the
parameter to an enum with only keep_on_full and drop_on_full for the time being.

> Do you think it possible to document how much shared memory does the
> extension consume depending on the parameters? Alternatively maybe
> it's worth providing a SQL function that returns this value? As a user
> I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
> are consumed.

Good point. I've updated the doc to give some approximations with a query
to detail the memory usage.

> I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
> redundant and only complicates the interface. This is way below the
> noise floor formed from network delays, process switching by the OS,
> noisy neighbours under the same supervisor, time slowdown by ntpd,
> etc. I don't recall ever needing spans with better than 1 us
> precision.
>
> On top of that if you have `span_start_ns` you should probably rename
> `duration` to `duration_ns` for consistency. I believe it would be
> more convenient to have `duration` in microseconds. Nothing prevents
> us from using fractions of microseconds (e.g. 12.345 us) if necessary.

I was definitely on the fence on this. I've initially tried to match the
precision provided by query instrumentation but that definitely introduced
a lot of complexity. Also, while trace standard specify that start and end
should be in nanoseconds, it turns out that the nanosecond precision is
generally ignored.

I've switched to a start+end in TimestampTz, dropping the ns precision.
Tests had to be rewritten to have a stable order with spans having that can
have the same start/end.

I've also removed some spans: ExecutorStart, ExecutorEnd, PostParse.
Those spans generally had a sub microsecond duration and didn't bring much
value except for debugging pg_tracing itself. To avoid the confusion of
having spans with a 0us duration, removing them altogether seemed better.
ExecutorFinish also has a sub microsecond duration unless it has a nested
query (triggers) so I'm only emitting them in those cases.

> Negative span ids, for sure, are possible, but I don't recall seeing
> many negative ids in my life. I suggest not to surprise the user
> unless this is really necessary.

That one is gonna be difficult. Traceid and spanid are 8 bytes so I'm exporting
them using bigint. PostgreSQL doesn't support unsigned bigint thus the negative
span ids. An alternative would be to use a bytea but bigint are easier to read
and interact with.

The latest patch contains changes around the nanosecond precision drop, test
updates to make them more stable and simplification around span management.

Regards,
Anthonin

On Fri, Nov 17, 2023 at 3:34 PM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
>
> Hi,
>
> Thanks for the updated patch!
>
> > Some small changes, mostly around making tests less flaky:
> > - Removed the top_span and nested_level in the span output, those were
> > mostly used for debugging
> > - More tests around Parse span in nested queries
> > - Remove explicit queryId in the tests
>
> ```
> +-- Worker can take some additional time to end and report their spans
> +SELECT pg_sleep(0.2);
> + pg_sleep
> +----------
> +
> +(1 row)
> ```
>
> Pretty sure this will fail on buildfarm from time to time. Maybe this
> is a case when it is worth using TAP tests? Especially considering the
> support of pg_tracing.notify_channel.
>
> ```
> + Only superusers can change this setting.
> + This parameter can only be set at server start.
> ```
>
> I find this confusing. If the parameter can only be set at server
> start doesn't it mean that whoever has access to postgresql.conf can
> change it?
>
> ```
> + <varname>pg_tracing.track_utility</varname> controls whether spans
> + should be generated for utility statements. Utility commands are
> + all those other than <command>SELECT</command>,
> <command>INSERT</command>,
> + <command>UPDATE</command>, <command>DELETE</command>, and
> <command>MERGE</command>.
> + The default value is <literal>on</literal>.
> ```
>
> Perhaps TABLE command should be added to the list, since it's
> basically a syntax sugar for SELECT:
>
> ```
> =# create table t(x int);
> =# insert into t values (1), (2), (3);
> =# table t;
> x
> ---
> 1
> 2
> 3
> (3 rows)
> ```
>
> This part worries me a bit:
>
> ```
> + <listitem>
> + <para>
> + <varname>pg_tracing.drop_on_full_buffer</varname> controls whether
> + span buffer should be emptied when <varname>pg_tracing.max_span</varname>
> + spans is reached. If <literal>off</literal>, new spans are dropped.
> + The default value is <literal>on</literal>.
> + </para>
> + </listitem>
> + </varlistentry>
> ```
>
> Obviously pg_tracing.drop_on_full_buffer = on will not work properly
> with pg_tracing.notify_threshold because there is a race condition
> possible. By the time one receives and processes the notification the
> buffer can be already cleaned. This makes me think that 1)
> pg_tracing.drop_on_full_buffer = on could be not the best default
> value 2) As a user I would like a support of ring buffer mode. Even if
> implementing this mode is out of scope of this particular patch, I
> would argue that instead of drop_on_full_buffer boolean parameter we
> need something like:
>
> pg_tracing.buffer_mode = keep_on_full | drop_on_full | ring_buffer |
> perhaps_other_future_modes
>
> ```
> + <para>
> + The module requires additional shared memory proportional to
> + <varname>pg_tracing.max_span</varname>. Note that this
> + memory is consumed whenever the module is loaded, even if
> + no spans are generated.
> + </para>
> ```
>
> Do you think it possible to document how much shared memory does the
> extension consume depending on the parameters? Alternatively maybe
> it's worth providing a SQL function that returns this value? As a user
> I would like to know whether 10 MB, 500 MB or 1.5 GB of shared memory
> are consumed.
>
> ```
> + The start time of a span has two parts: <literal>span_start</literal> and
> + <literal>span_start_ns</literal>. <literal>span_start</literal> is
> a Timestamp
> + with time zone providing up a microsecond precision.
> <literal>span_start_ns</literal>
> + provides the nanosecond part of the span start.
> <literal>span_start + span_start_ns</literal>
> + can be used to get the start with nanosecond precision.
> ```
>
> I would argue that ns ( 1 / 1_000_000_000 sec! ) precision is
> redundant and only complicates the interface. This is way below the
> noise floor formed from network delays, process switching by the OS,
> noisy neighbours under the same supervisor, time slowdown by ntpd,
> etc. I don't recall ever needing spans with better than 1 us
> precision.
>
> On top of that if you have `span_start_ns` you should probably rename
> `duration` to `duration_ns` for consistency. I believe it would be
> more convenient to have `duration` in microseconds. Nothing prevents
> us from using fractions of microseconds (e.g. 12.345 us) if necessary.
>
> ```
> +=# select trace_id, parent_id, span_id, span_start, duration, span_type, ...
> + trace_id | parent_id | span_id |
> ...
> +----------------------+----------------------+----------------------+----
> + 479062278978111363 | 479062278978111363 | 479062278978111363 ...
> + 479062278978111363 | 479062278978111363 | 479062278978111363 ...
> + 479062278978111363 | 479062278978111363 | -7732425595282414585 ...
> ```
>
> Negative span ids, for sure, are possible, but I don't recall seeing
> many negative ids in my life. I suggest not to surprise the user
> unless this is really necessary.
>
> --
> Best regards,
> Aleksander Alekseev

Attachment	Content-Type	Size
v11-0001-Add-pg_tracing-extension-to-contrib.patch	application/octet-stream	315.2 KB

In response to

Re: POC: Extension for adding distributed tracing - pg_tracing at 2023-11-17 14:34:37 from Aleksander Alekseev

Responses

Re: POC: Extension for adding distributed tracing - pg_tracing at 2023-12-11 21:51:17 from Craig Ringer
Re: POC: Extension for adding distributed tracing - pg_tracing at 2024-01-06 15:30:03 from vignesh C

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	John Scalia	2023-12-07 14:39:42	I’ve come across what I think is a bug
Previous Message	Matthias van de Meent	2023-12-07 14:27:10	Re: initdb caching during tests