Re: Bad Estimate for multi tenant database queries

From: Peter Grman <peter(dot)grman(at)gmail(dot)com>
To: Michael Lewis <mlewis(at)entrata(dot)com>
Cc: PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Bad Estimate for multi tenant database queries
Date: 2019-09-03 22:10:43
Message-ID: CACF7Wx0hwFVdd2bBTB6BF2h6bfLemGuHoDnHS0SOtz=oSF6mnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Michael,

I digged a bit deeper and found an even simpler query, which can perfectly
embody that problem:

explain (ANALYZE, FORMAT text)
select *
from "Reservation"."Reservation" r
join "Rates"."TimeSliceDefinition" t on r."TimeSliceDefinitionId" = t."Id"
where r."AccountCode" = 'OXHV' and t."AccountCode" = 'OXHV'

Nested Loop (cost=448.56..6918.67 rows=54 width=657) (actual
time=1.207..26.874 rows=9322 loops=1)
-> Bitmap Heap Scan on "TimeSliceDefinition" t (cost=4.39..39.99
rows=14 width=131) (actual time=0.013..0.029 rows=14 loops=1)
Recheck Cond: ("AccountCode" = 'OXHV'::text)
Heap Blocks: exact=7
-> Bitmap Index Scan on
"IX_TimeSliceDefinition_AccountCode_EntityId" (cost=0.00..4.39 rows=14
width=0) (actual time=0.009..0.009 rows=14 loops=1)
Index Cond: ("AccountCode" = 'OXHV'::text)
-> Bitmap Heap Scan on "Reservation" r (cost=444.17..491.21 rows=12
width=526) (actual time=1.021..1.755 rows=666 loops=14)
Recheck Cond: (("TimeSliceDefinitionId" = t."Id") AND
("AccountCode" = 'OXHV'::text))
Heap Blocks: exact=4858
-> BitmapAnd (cost=444.17..444.17 rows=12 width=0) (actual
time=0.980..0.980 rows=0 loops=14)
-> Bitmap Index Scan on
"IX_Reservation_TimeSliceDefinitionId" (cost=0.00..13.82 rows=187 width=0)
(actual time=0.057..0.057 rows=692 loops=14)
Index Cond: ("TimeSliceDefinitionId" = t."Id")
-> Bitmap Index Scan on
"IX_Reservation_AccountCode_EntityId" (cost=0.00..427.72 rows=9507
width=0) (actual time=0.980..0.980 rows=9327 loops=13)
Index Cond: ("AccountCode" = 'OXHV'::text)
Planning Time: 0.353 ms
Execution Time: 27.311 ms

Above the query with wrong estimates (factor ~200x off) and below the query
with correct estimates:

explain (ANALYZE, FORMAT text)
select *
from "Reservation"."Reservation" r
join "Rates"."TimeSliceDefinition" t on r."TimeSliceDefinitionId" = t."Id"
where r."AccountCode" = 'OXHV'

Hash Join (cost=557.83..12519.98 rows=9507 width=657) (actual
time=3.290..15.174 rows=9322 loops=1)
Hash Cond: (r."TimeSliceDefinitionId" = t."Id")
-> Bitmap Heap Scan on "Reservation" r (cost=430.10..12367.25 rows=9507
width=526) (actual time=1.931..10.821 rows=9322 loops=1)
Recheck Cond: ("AccountCode" = 'OXHV'::text)
Heap Blocks: exact=4666
-> Bitmap Index Scan on "IX_Reservation_AccountCode_EntityId"
(cost=0.00..427.72 rows=9507 width=0) (actual time=1.398..1.398 rows=9327
loops=1)
Index Cond: ("AccountCode" = 'OXHV'::text)
-> Hash (cost=96.77..96.77 rows=2477 width=131) (actual
time=1.312..1.313 rows=2511 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 466kB
-> Seq Scan on "TimeSliceDefinition" t (cost=0.00..96.77
rows=2477 width=131) (actual time=0.004..0.550 rows=2511 loops=1)
Planning Time: 1.394 ms
Execution Time: 15.641 ms

Given that the AccountCode should be actually the same, the in all
referenced rows, it's really just to double check, in case we have cross
tenant (cross account) references. - The extra "and t."AccountCode" =
'OXHV'" is added by the ORM as a safety net, we'd like to keep it that way,
but postgres shouldn't consider it for the row estimates.

I've tried creating the following statistics:

CREATE STATISTICS MT_ReservationBucket on "AccountCode", "DepartureUtc",
"ArrivalUtc" from "Reservation"."Reservation";
CREATE STATISTICS MT_ReservationDepartureUtc on "AccountCode",
"DepartureUtc" from "Reservation"."Reservation";
CREATE STATISTICS MT_ReservationArrivalUtc on "AccountCode", "ArrivalUtc"
from "Reservation"."Reservation";
CREATE STATISTICS MT_ReservationNoShowFeeId on "AccountCode", "NoShowFeeId"
from "Reservation"."Reservation";
CREATE STATISTICS MT_ReservationTimeSliceDefinitionId on "AccountCode",
"TimeSliceDefinitionId" from "Reservation"."Reservation";
CREATE STATISTICS MT_TimeSliceDefinition on "AccountCode", "Id" from
"Rates"."TimeSliceDefinition";
CREATE STATISTICS MT_NoShowFee on "AccountCode", "Id" from
"Reservation"."NoShowFee";

I tried creating indexes on Id+AccountCode and
TimeSliceDefinitionId+AccountCode.

I tried setting up a foreign key using both columns instead of just one:

alter table "Reservation"."Reservation"
add constraint
"FK_Reservation_TimeSliceDefinition_TimeSliceDefinitionId_test"
foreign key ("AccountCode", "TimeSliceDefinitionId") references
"Rates"."TimeSliceDefinition" ("AccountCode", "Id")
on delete restrict;

I also tried switching default_statistics_target to 10000 and running full
"analyze" again afterwards, with those statistics and other indexes, but
nothing had any effect. (Maybe the row estimate grew to 55 instead of 54.)

At the end, I've found the following presentation:
https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2124/slides/122/Towards%20more%20efficient%20query%20plans%20(2).pdf
- with a reference to this discussion:
https://www.postgresql.org/message-id/flat/3fcfd5e5-6849-34e6-22ab-1b62d191bedb%402ndquadrant.com#d61504c511d4b437505a05fa50047019

If I understood that discussion correctly (and that's a big IF), that
feature in question might be able to solve my problem. - Am I correct? Or
did I completely missunderstand it? Is there anything I can do until that
is released? (We're currently on Postgres 11 on RDS)

Thanks for your time!
Peter

On Tue, Sep 3, 2019 at 8:40 PM Michael Lewis <mlewis(at)entrata(dot)com> wrote:

> CREATE STATISTICS MT_ReservationDepartureUtc (ndistinct) on "AccountCode",
>> "DepartureUtc" from "Reservation"."Reservation";
>> CREATE STATISTICS MT_ReservationArrivalUtc (ndistinct) on "AccountCode",
>> "ArrivalUtc" from "Reservation"."Reservation";
>> CREATE STATISTICS MT_ReservationNoShowFeeId (ndistinct) on "AccountCode",
>> "NoShowFeeId" from "Reservation"."Reservation";
>> CREATE STATISTICS MT_ReservationTimeSliceDefinitionId (ndistinct) on
>> "AccountCode", "TimeSliceDefinitionId" from "Reservation"."Reservation";
>>
>> but that didn't help either
>>
>
> Did you try with 'dependencies' for the statistics_kind or only ndistinct?
> What was default_statistics_target set to at the time you created the
> extended statistics? I am not sure if that value is used, but I would
> assume so.
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message James Sewell 2019-09-03 22:45:57 Re: Upgrade 96 -> 11
Previous Message Kumar, Virendra 2019-09-03 22:08:48 Re: Running a Simple Update Statement Fails, Second Time Suceeds.