Re: BUG #18177: certain queries under certain contexts take multiple orders of magnitude longer compared to v10

From: Robert Leach <rleach(at)princeton(dot)edu>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18177: certain queries under certain contexts take multiple orders of magnitude longer compared to v10
Date: 2023-11-03 17:49:43
Message-ID: F8C38955-0B08-4992-80C2-36DCC37B7BD6@princeton.edu
Lists: pgsql-bugs

I rearranged the order of the responses in this email based on the importance of what I've learned...

>>> select * from pg_stat_user_tables where relid = '"DataRepo_peakdata"'::regclass;
>>
>> When I try to execute this command, I get:
>>
>> psycopg2.errors.UndefinedTable: relation "DataRepo_peakdata" does not exist
>
> You'd need to be connected to the same database as that table exists
> in. Your EXPLAIN output had extra formatting in it. I had assumed the
> quotes were there because the table was created with some upper-case
> characters. PostgreSQL folds unquoted identifiers to lower-case.
> Perhaps you can try without the quotes if you still get that when
> connected to the correct database.

So I finally worked out that only this seems to do the right thing:

cursor.execute("select * from pg_stat_user_tables where relid = '\"DataRepo_peakdata\"'::regclass;")

Before, I was trying:

cursor.execute("""select * from pg_stat_user_tables where relid = '"DataRepo_peakdata"'::regclass;""")

and that apparently didn't work.
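
For what it's worth, I think the less error-prone way to do this from psycopg2 is to bind the double-quoted identifier as a parameter and let the driver handle the escaping. A minimal sketch (the connection settings are just a placeholder for ours):

import psycopg2

conn = psycopg2.connect("dbname=our_database")  # placeholder connection settings
cursor = conn.cursor()
# Bind the double-quoted identifier as a parameter instead of hand-escaping it
cursor.execute(
    "select * from pg_stat_user_tables where relid = %s::regclass;",
    ('"DataRepo_peakdata"',),
)
print(cursor.fetchone())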

Anyway, here's what I see:

Slow case (when having run tests from a separate class first):

After the animal/sample load and before the peak data load:
(2911633, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the *slow* query:
(2914197, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)

These are the speeds of the 4 queries in the test:
pg query time: 0.001s
pg count time: 0.141s
carbon count time: 303.786s
nitrogen count time: 300.561s (everything runs slower on my laptop at home, so it's over 200s)

Usually slow case (same as above, but the first time I tried it, it went randomly fast):

After the animal/sample load and before the peak data load:
(2911633, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the *slow* query:
(2913556, 'public', 'DataRepo_peakdata', 9, 201, 882, 882, 595, 0, 0, 0, 0, 595, 0, 595, None, None, None, None, 0, 0, 0, 0)

Note, in this particular instance, the queries went FAST:
pg query time: 0.002s
pg count time: 0.066s
carbon count time: 0.036s
nitrogen count time: 0.031s

Always Fast case (when only running that one test):

After the animal/sample load and before the peak data load:
(2912274, 'public', 'DataRepo_peakdata', 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, None, None, None, 0, 0, 0, 0)
After the peak data load and before the query:
(2912915, 'public', 'DataRepo_peakdata', 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, None, None, None, 0, 0, 0, 0)

>>> Nothing looks particularly bug like so far. It seems the
>>> pg_class.reltuples estimate for this relation is way out.
>>
>> If this is operating as intended, there are a couple questions I have:
>>
>> 1. In postgres 10, the query always runs in under 1 second, regardless of the context. In postgres 13, depending on the context, why does it almost always take over 200 seconds?
>
> The reason the query is slow is that the planner thinks that there's
> only 1 row in the DataRepo_peakdata table. In reality, when you
> executed that query there were 7896 rows.
>
> Because the planner thinks there's 1 row, it opts to use the fastest
> method available to join two relations where one side has just 1 row,
> i.e. Nested Loop. Unfortunately, it's a risky plan as
> underestimations have quadratic repercussions. In the v13 query, this
> comes out at this point in the plan:
>
> -> Nested Loop (cost=0.62..37.47 rows=1 width=8)
> (actual time=10492.455..202323.935 rows=8 loops=1)
> Join Filter: ("DataRepo_peakgroup".id =
> "DataRepo_peakdata".peak_group_id)
> Rows Removed by Join Filter: 7888
> -> Index Scan using
> "DataRepo_peakdata_peak_group_id_4dd87f4a" on "DataRepo_peakdata"
> (cost=0.25..8.26 rows=1 width=8) (actual time=0.017..7.149 rows=7896
> loops=1)
> -> Nested Loop (cost=0.38..29.20 rows=1 width=4)
> (actual time=2.942..25.621 rows=1 loops=7896)
>
> You can see the planner thought the index scan would produce 1 row.
> 7896 were found during execution.
>
> Most of the time of the entire query can be accounted for by this plan
> fragment. The average execution time for 1 loop of that 2nd Nested
> Loop quoted above is 25.621 milliseconds. If you multiply that number
> by the number of loops (7896), you get 202303.416 milliseconds.
>
> Because the scan on DataRepo_peakdata contains no clauses, the row
> estimate here is derived from pg_class.reltuples for that index.
> VACUUM and ANALYZE adjust this pg_class column after they run so that
> the planner has a recent estimate on the number of records in the
> table. The estimated tuples for a clause-less scan of an index don't
> come directly from pg_class.reltuples. What happens is it looks at
> relpages and reltuples to calculate the average tuples per page then
> we apply that average over the *current* number of pages in the index.
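
Just to make sure I follow that, my paraphrase of the estimate, with made-up numbers, would be something like:

reltuples, relpages = 100.0, 1    # what pg_class recorded at the last VACUUM/ANALYZE
current_pages = 4                 # pages the index occupies now
estimated_rows = reltuples / relpages * current_pages    # -> 400.0

i.e. the tuples-per-page recorded in pg_class, scaled up to the index's current size. Correct me if I've got that wrong.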

7896 is the exact number of records that are loaded right before the query. The load is of an Excel sheet that has 94 rows and 84 columns:

84 * 94 = 7896

I will point out that the loading itself is not very fast. Those 7896 records (along with other records) have always taken at least 80 seconds to load, and the animal/sample loading before that always takes at least 10 seconds. It's only the queries that run after those loads complete that are unexpectedly slow.

> It would be interesting to see what pg_class reltuples and relpages is
> set to for this index.

If you have a query for me, I can send you the output.
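
If it helps, my guess at what you're after (cribbing from the pg_stat_user_tables query above and using the index name from the EXPLAIN output) is:

cursor.execute("select relname, reltuples, relpages from pg_class where oid = '\"DataRepo_peakdata_peak_group_id_4dd87f4a\"'::regclass;")
print(cursor.fetchone())

If you meant something different, just send me the exact query.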

>> 2. In postgres 13, the query runs in under 1 second if I run just the one test that (starting from an empty database) loads data and runs the query. BUT, if I run that one test after having run tests in another python/django test class (which loads data, runs various tests, then destroys all the data that that test class loaded), that query takes over 200 seconds to run. Why would that be?
>
> I don't know what these tests do. If they update some data in the
> tables then perhaps the index becomes bloated and that results in the
> number of pages in the index changing which affects the estimate of
> the number of tuples in the index.
>
> It would be good to get the output of
> pg_relation_size('DataRepo_peakdata_peak_group_id_4dd87f4a') before
> and after the tests run.

All the tests I've been looking at just insert data.

I tried executing "pg_relation_size('DataRepo_peakdata_peak_group_id_4dd87f4a')" as a command, but I get a syntax error.
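
My guess is that it needs to be wrapped in a SELECT, since pg_relation_size is a function rather than a standalone command, and that the mixed-case index name needs the inner double quotes per the earlier discussion, i.e. something like:

cursor.execute("select pg_relation_size('\"DataRepo_peakdata_peak_group_id_4dd87f4a\"'::regclass);")
print(cursor.fetchone())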

> Perhaps it goes away after autovacuum runs ANALYZE on the table again.
> I suspect this problem depends on the timing of autovacuum and suspect
> you might see the tests running faster more often if you changed
> autovacuum_naptime to 1 second. You'd probably see them more
> consistently running slower if you set autovacuum_naptime to some
> value higher than it's set to right now.

Based on a hunch from your comment about connecting to the database, I tried connecting specifically to our database before running the vacuum. I thought that might explain why the vacuum hadn't seemed to speed things up (and that the earlier cases where I did see a speed improvement were just the random fast runs I sometimes get without changing anything). But connecting to our database to do the vacuum doesn't seem to work: whenever I attempt it, the vacuum command just seems to hang forever.
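
For reference, my understanding is that a manual VACUUM issued from psycopg2 has to be run in autocommit mode, because VACUUM can't run inside a transaction block and psycopg2 opens one by default. A minimal sketch:

conn.autocommit = True    # VACUUM can't run inside a transaction block
cursor = conn.cursor()
cursor.execute('VACUUM ANALYZE "DataRepo_peakdata";')

Whether any of that relates to the hang I'm seeing, I don't know.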

Rob
