Re: SELECT creates millions of temp files in a single directory

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Peter <pmc(at)citylink(dot)dinoex(dot)sub(dot)org>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: SELECT creates millions of temp files in a single directory
Date: 2022-04-23 20:15:21
Message-ID: CAKFQuwZcRXCKceEckh4zrSpqJ_RA4=Dzx4Msw2hcpByz4XyWRA@mail.gmail.com
Lists: pgsql-general

On Sat, Apr 23, 2022 at 1:00 PM Peter <pmc(at)citylink(dot)dinoex(dot)sub(dot)org> wrote:

> In modern versions of postgres a simple SELECT writes a couple
> of millions of individual temp files into a single directory under
> pgsql_tmp.
> I know of no filesystem that would take such lightly, and even
> ZFS gets some problems with such extremely long directories.
>
>
Your running commentary detracts greatly from the problem you are trying to
describe and demonstrate.

> What is the rationale in this behaviour and how is it supposed to
> be handled?
>

There are these things called bugs. This may be one of those; bugs, by
definition, are not intentional.

>
> Database size has not increased, postgres memory configuration has
> not been changed, only postgres versions were gradually upgraded
> from 8 to 12.
>

v12 what?

>
> Using these "parallel workers" was not my idea, they came creeping
> along unsolicited with some version upgrade.
>

Well, in theory at least parallelism should be a positive improvement. Why
would we want to require our users to opt-in to the feature?

> Switching them OFF deliberately, makes the first query five times
> faster (5 minutes instead of 25 minutes), and makes the second
> query use only 25'000 temp files and successfully deliver 25'000
> result rows (instead of getting stuck with a million temp files),
> apparently one temp file per result row now.
>
>

In your situation maybe they do not help. Whether that is a fundamental
limitation of parallelism in low-memory conditions or whether it's specific
to this query, I have no clue. But aren't you glad we had the foresight to
allow for the parallelism to be disabled in the case of bugs and/or
situations where it does prove to be harmful?
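For reference, the knobs in question are ordinary planner settings, so
parallelism can be switched off per session or per statement without
touching postgresql.conf. A minimal sketch (the parameters are real
PostgreSQL GUCs; the values shown are illustrative, not recommendations):

```sql
-- Disable parallel query for the current session:
SET max_parallel_workers_per_gather = 0;

-- Or disable it for a single statement only, inside a transaction;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK:
BEGIN;
SET LOCAL max_parallel_workers_per_gather = 0;
-- ... run the problematic SELECT here ...
COMMIT;

-- Separately, a per-process cap on temp-file usage can contain
-- runaway spills; the query errors out instead of filling the disk
-- (value is illustrative):
SET temp_file_limit = '10GB';
```

Setting max_parallel_workers_per_gather to 0 removes Gather nodes from
plans entirely, which is the usual way to rule parallelism in or out when
comparing timings.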

> So,
> 1. it appears that these "parallel workers" are utterly
> counter-efficient whenever the working set does not fit into
> fast storage, and they need be switched off.
>

Not sure you've "proven" this, but you can do so for your setup.

> 2. something with this anti-hash-whatever must be wrong. 25'000
> temp files does still not appear to be a good thing. But,
> delivering code that, by default, allows in excess of a million
> files be written in a single directory, that is just wrong.
>

Probably. It would help if you can provide a self-contained demonstration
that others can then verify and debug (or explain).

> Checking web resources:
>
> * It seems now the normal behaviour to write millions of files,
> and people seem to just accept this:
> https://stackoverflow.com/q/61696008

If they aren't willing to post to our -bugs or other mailing lists there
isn't much we can do for them.

>
> * Tables with some 50 million rows seem now to be considered a "high
> row count":
> https://www.postgresql.org/message-id/38E9456A-7841-4F13-B72B-FD3137591972%40gmail.com
> They were not considered a "high row count" back in 2007; they were
> just normal then, and did run fine on machines with 1/50 of the
> memory. :(
>

I don't see how this has any relevance.

Generalizing to "people" from three or four examples is pointless. And,
regardless, none of those people are committers for the project, whose
opinions are really the only ones that matter, because they control whether
something gets fixed or not.

> People seem to have been brainwashed by Web-Services and OLTP,
> and now think the working set must always fit in memory. But this
> is only one possible usecase, it is not the exclusive only one.
>
>
Again, your running commentary is providing zero, or negative, value here.

David J.
