Re: Coccinelle for PostgreSQL development [1/N]: coccicheck.py

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Coccinelle for PostgreSQL development [1/N]: coccicheck.py
Date: 2025-03-10 07:08:09
Message-ID: CA+14427LS9jKY2FuLwE5extZHK5DDzjFa_YG88yQ3k16WZ=How@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 2, 2025 at 2:26 PM Mats Kindahl <mats(at)timescale(dot)com> wrote:

> On Sat, Jan 18, 2025 at 8:44 PM Mats Kindahl <mats(at)timescale(dot)com> wrote:
>
>> On Tue, Jan 14, 2025 at 4:19 PM Aleksander Alekseev <
>> aleksander(at)timescale(dot)com> wrote:
>>
>>> IMO the best solution would be re-submitting all the patches to this
>>> thread. Also please make sure the patchset is registered on the
>>> nearest open CF [1] This will ensure that the patchset is built on our
>>> CI (aka cfbot [2]) and will not be lost.
>>>
>>> [1]: https://commitfest.postgresql.org/
>>> [2]: http://cfbot.cputube.org/
>>>
>>>
>>
> Hi all,
>
> Here is a new set of patches rebased on the latest version of the
> Postgres. I decided to just include the semantic patches in each patch
> since it is trivial to generate the patch and build using:
>
>
> ninja coccicheck-patch | patch -d .. -p1 && ninja
>
>
> I repeat the description from the previous patch set and add comments
> where things have changed, but I have also added two semantic patches,
> which are described last.
>
> For those of you that are not aware of it: Coccinelle is a tool for
>> pattern matching and text transformation for C code and can be used for
>> detection of problematic programming patterns and to make complex,
>> tree-wide patches easy. It is aware of the structure of C code and is
>> better suited to make complicated changes than what is possible using
>> normal text substitution tools like Sed and Perl. I've noticed it's been
>> used at a few cases way back to fix issues.[1]
>>
>> Coccinelle have been successfully been used in the Linux project since
>> 2008 and is now an established tool for Linux development and a large
>> number of semantic patches have been added to the source tree to capture
>> everything from generic issues (like eliminating the redundant A in
>> expressions like "!A || (A && B)") to more Linux-specific problems like
>> adding a missing call to kfree().
>>
>> Although PostgreSQL is nowhere the size of the Linux kernel, it is
>> nevertheless of a significant size and would benefit from incorporating
>> Coccinelle into the development. I noticed it's been used in a few cases
>> way back (like 10 years back) to fix issues in the PostgreSQL code, but I
>> thought it might be useful to make it part of normal development practice
>> to, among other things:
>>
>> - Identify and correct bugs in the source code both during development
>> and review.
>> - Make large-scale changes to the source tree to improve the code based
>> on new insights.
>> - Encode and enforce APIs by ensuring that function calls are used
>> correctly.
>> - Use improved coding patterns for more efficient code.
>> - Allow extensions to automatically update code for later PostgreSQL
>> versions.
>>
>> To that end, I created a series of patches to show how it could be used
>> in the PostgreSQL tree. It is a lot easier to discuss concrete code and I
>> split it up into separate messages since that makes it easier to discuss
>> each individual patch. The series contains code to make it easy to work
>> with Coccinelle during development and reviews, as well as examples of
>> semantic patches that capture problems, demonstrate how to make large-scale
>> changes, how to enforce APIs, and also improve some coding patterns.
>>
>> The first three patches contain the coccicheck.py script and the
>> integration with the build system (both Meson and Autoconf).
>>
>> # Coccicheck Script
>>
>> It is a re-implementation of the coccicheck script that the Linux kernel
>> uses. We cannot immediately use the coccicheck script since it is quite
>> closely tied to the Linux source code tree and we need to have something
>> that both supports Autoconf and Meson. Since Python seems to be used more
>> and more in the tree, it seems to be the most natural choice. (I have no
>> strong opinion on what language to use, but think it would be good to have
>> something that is as platform-independent as possible.)
>>
>> The intention is that we should be able to use the Linux semantic patches
>> directly, so it supports the "Requires" and "Options" keywords, which can
>> be used to require a specific version of spatch(1) and add options to the
>> execution of that semantic patch, respectively.
>>
>
> I have added support for using multiple jobs similar to how "make -jN"
> works. This is also supported by the autoconf and ninja builds
>
>
>> # Autoconf support
>>
>> The changes to Autoconf modifies the configure.ac and related files (in
>> particular Makefile.global.in). At this point, I have deliberately not
>> added support for pgxs so extensions cannot use coccicheck through the
>> PostgreSQL installation. This is something that we can add later though.
>>
>> The semantic patches are expected to live in cocci/ directory under the
>> root and the patch uses the pattern cocci/**/*.cocci to find all semantic
>> patches. Right now there are no subdirectories for the semantic patches,
>> but this might be something we want to add to create different categories
>> of scripts.
>>
>> The coccicheck target is used in the same way as for the Linux kernel,
>> that is, to generate and apply all patches suggested by the semantic
>> patches, you type:
>>
>> make coccicheck MODE=patch | patch -p1
>>
>> Linux as support for a few more variables: V to set the verbosity, J to
>> use multiple jobs for processing the semantic patches, M to select a
>> different directory to apply the semantic patches to, and COCCI to use a
>> single specific semantic patch rather than all available. I have not added
>> support for this right now, but if you think this is valuable, it should be
>> straightforward to add.
>>
>> I used autoconf 2.69, as mentioned in configure.ac, but that generate a
>> bigger diff than I expected. Any advice here is welcome.
>>
>
> Using the parameter "JOBS" allow you to use multiple jobs, e.g.:
>
> make coccicheck MODE=patch JOBS=4 | patch -p1
>
>
>
>> # Meson Support
>>
>> The support for Meson is done by adding three coccicheck targets: one for
>> each mode. To apply all patches suggested by the semantic patches using
>> ninja (as is done in [2]), you type the following in the build directory
>> generated by Meson (e.g., the "build/" subdirectory).
>>
>> ninja coccicheck-patch | patch -p1 -d ..
>>
>> If you want to pass other flags you have to set the SPFLAGS environment
>> variable when calling ninja:
>>
>> SPFLAGS=--debug ninja coccicheck-report
>>
>
> If you want to use multiple jobs, you use something like this:
>
> JOBS=4 ninja coccicheck-patch | patch -d .. -p1
>
>
>
>> # Semantic Patch: Wrong type for palloc()
>>
>> This is the first example of a semantic patch and shows how to capture
>> and fix a common problem.
>>
>> If you use an palloc() to allocate memory for an object (or an array of
>> objects) and by mistake type something like:
>>
>> StringInfoData *info = palloc(sizeof(StringInfoData*));
>>
>> You will not allocate enough memory for storing the object. This semantic
>> patch catches any cases where you are either allocating an array of objects
>> or a single object that do not have corret types in this sense, more
>> precisely, it captures assignments to a variable of type T* where palloc()
>> uses sizeof(T) either alone or with a single expression (assuming this is
>> an array count).
>>
>> The semantic patch is overzealous in the sense that using the wrong
>> typedef will suggest a change (this can be seen in the patch). Although the
>> sizes of these are the same, it is probably be better to just follow the
>> convention of always using the type "T*" for any "palloc(sizeof(T))" since
>> the typedef can change at any point and would then introduce a bug.
>> Coccicheck can easily fix this for you, so it is straightforward to enforce
>> this. It also simplifies other automated checking to follow this convention.
>>
>> We don't really have any real bugs as a result from this, but we have one
>> case where an allocation of "sizeof(LLVMBasicBlockRef*)" is allocated to an
>> "LLVMBasicBlockRef*", which strictly speaking is not correct (it should be
>> "sizeof(LLVMBasicBlockRef)"). However, since they are both pointers, there
>> is no risk of incorrect allocation size. One typedef usage that does not
>> match.
>>
>> # Semantic Patch: Introduce palloc_array() and palloc_object() where
>> possible
>>
>> This is an example of a large-scale refactoring to improve the code.
>>
>> For PostgreSQL 16, Peter extended the palloc()/pg_malloc() interface in
>> commit 2016055a92f to provide more type-safety, but these functions are not
>> widely used. This semantic patch captures and replaces all uses of palloc()
>> where palloc_array() or palloc_object() could be used instead. It
>> deliberately does not touch cases where it is not clear that the
>> replacement can be done.
>>
>
> # Semantic Patch: replace code with pg_cmp_*
>
> This is an example of a large-scale refactoring to improve the code.
>
> In commit 3b42bdb4716 and 6b80394781c overflow-safe comparison functions
> were introduced, but they are not widely used. This semantic patch
> identifies some of the more common cases and replaces them with calls to
> the corresponding pg_cmp_* function.
>
> The patches give a few instructions of improvement in performance when
> checking with Godbolt. It's not much, but since it is so easy to apply
> them, it might still be worthwhile.
>
> # Semantic Patch: Replace dynamic allocation of StringInfo with
> StringInfoData
>
> Use improved coding patterns for more efficient code.
>
> This semantic patch replaces uses of StringInfo with StringInfoData where
> the info is dynamically allocated but (optionally) freed at the end of the
> block. This will avoid one dynamic allocation that otherwise has to be
> dealt with.
>
> For example, this code:
>
> StringInfo info = makeStringInfo();
> ...
> appendStringInfo(info, ...);
> ...
> return do_stuff(..., info->data, ...);
>
> Can be replaced with:
>
> StringInfoData info;
> initStringInfo(&info);
> ...
> appendStringInfo(&info, ...);
> ...
> return do_stuff(..., info.data, ...);
>
> It does not do a replacement in these cases:
>
> - If the variable is assigned to an expression. In this case, the
> pointer can "leak" outside the function either through a global variable or
> a parameter assignment.
> - If an assignment is done to the expression. This cannot leak the
> data, but could mean a value-assignment of a structure, so we avoid this
> case.
> - If the pointer is returned.
>
> The cases that this semantic patch fixed when I uploaded the first version
> of the other patches seems to have been dealt with, but having it as part
> of the code base prevents such cases from surfacing again.
>
>
>> [1]: https://coccinelle.gitlabpages.inria.fr/website/
>> [2]: https://www.postgresql.org/docs/current/install-meson.html
>>
> --
> Best wishes,
> Mats Kindahl, Timescale
>

Hi all,

There was a problem with the meson.build file causing errors in the build
farm (because spatch is not installed), so here is a new set of patches.
Only the one with the meson.build has changed, but I am unsure how patches
are picked up, so adding a new version of all files here.
--
Best wishes,
Mats Kindahl, Timescale

Attachment Content-Type Size
0001-Add-initial-coccicheck-script.v4.patch text/x-patch 8.4 KB
0003-Add-meson-build-for-coccicheck.v4.patch text/x-patch 4.6 KB
0004-Semantic-patch-for-sizeof-using-palloc.v4.patch text/x-patch 1.7 KB
0002-Create-coccicheck-target-for-autoconf.v4.patch text/x-patch 9.1 KB
0005-Semantic-patch-for-palloc_array-and-palloc_object.v4.patch text/x-patch 4.2 KB
0007-Semantic-patch-to-use-stack-allocated-StringInfoData.v4.patch text/x-patch 4.4 KB
0006-Semantic-patch-for-pg_cmp_-functions.v4.patch text/x-patch 4.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shubham Khanna 2025-03-10 07:08:34 Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
Previous Message Shubham Khanna 2025-03-10 07:06:22 Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.