Re: Very slow 101-feeling design/query..

From: Wells Oliver <wells(dot)oliver(at)gmail(dot)com>
To: Doug Reynolds <mav(at)wastegate(dot)net>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Very slow 101-feeling design/query..
Date: 2021-12-11 22:52:07
Message-ID: CAOC+FBVNzwYqWdfUO9ca9=SRPyNwqg2swknpCaG8uDBvnA1FeQ@mail.gmail.com
Lists: pgsql-admin

That's an interesting point. On the one hand, it's the data set we've been
given (UUIDs are the vendor's choice); on the other hand, we could assign an
integer/bigint identifier per UUID and use that instead for joining
purposes.
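
Something like this mapping table is what I'm picturing (names here are
illustrative, not our actual schema):

CREATE TABLE play_ids (
    play_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    play_uuid uuid NOT NULL UNIQUE
);

-- backfill once from the vendor data, then join on the 8-byte surrogate key
INSERT INTO play_ids (play_uuid)
SELECT play_uuid FROM plays;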

On Sat, Dec 11, 2021 at 2:50 PM Doug Reynolds <mav(at)wastegate(dot)net> wrote:

> Keep in mind that using UUIDs as an ID on large data sets will use double
> the memory and index size. Obviously, if that is set in stone, you
> probably can't readily change it, but I'd think twice before using that for
> billions of rows for joining.
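>
> A quick way to see the per-key difference (the UUID literal is arbitrary):
>
> SELECT pg_column_size('a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'::uuid) AS uuid_bytes,  -- 16
>        pg_column_size(1::bigint) AS bigint_bytes;                                   -- 8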
>
> Sent from my iPhone
>
> On Dec 11, 2021, at 2:38 PM, Wells Oliver <wells(dot)oliver(at)gmail(dot)com> wrote:
>
> 
> Yeah, I can take a look at that. The other issue is that even SELECT *
> FROM vw_joints WHERE game_id = 123 is slower than I want it to be: the view
> joins joints to plays on play_uuid, while the index that matches the filter
> is on game_id in plays.
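>
> For what it's worth, I've been looking at the plan with something like this
> (bypassing the view to rule it out):
>
> EXPLAIN (ANALYZE, BUFFERS)
> SELECT j.*
> FROM plays p
> JOIN joints j USING (play_uuid)
> WHERE p.game_id = 123;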
>
> On Sat, Dec 11, 2021 at 11:22 AM Mladen Gogala <gogala(dot)mladen(at)gmail(dot)com>
> wrote:
>
>> On 12/11/21 13:32, Wells Oliver wrote:
>>
>> We have hundreds of millions of joint locations across hundreds of
>> thousands of plays across thousands of games. We also have a few hundred
>> games with plays in them, but no joint measurements. The joint data is a
>> measurement of body joints through time (up to 300x a second).
>>
>> So my idea was the plays table contains the game identifier (there is no
>> root game table), and the joints table refers to the plays table by a play
>> UUID.
>>
>> As an example: One game contains 231 plays. There are 396,144 rows of
>> joint data.
>>
>> I partitioned the joints data by month because of the volume.
>>
>> The end result is that using the joints view (which joins to plays) is
>> just dead, grinding slow, and I'm trying to think of alternatives that
>> stay within Postgres. Maybe BigQuery and/or Redshift is more of an
>> appropriate solution, but I don't want to go there quite yet.
>>
>> Just trying to optimize this design and open to ideas. Thanks.
>>
>> On Sat, Dec 11, 2021 at 7:58 AM Michel SALAIS <msalais(at)msym(dot)fr> wrote:
>>
>>> Hi
>>>
>>> Using a view which joins two tables in order to access data from only one
>>> of them is not a good idea: you pay for the join even when it is not
>>> necessary. So if you just want distinct game_id values and you don't care
>>> whether a play_uuid appears in joints, you can speed this up by querying
>>> the table plays directly.
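>>>
>>> For example (assuming game_id exists only on plays):
>>>
>>> -- no join at all for the distinct list of games
>>> SELECT DISTINCT game_id FROM plays;
>>>
>>> -- if you only want games that actually have joint data, a semi-join is
>>> -- still cheaper than materializing the whole view
>>> SELECT DISTINCT p.game_id
>>> FROM plays p
>>> WHERE EXISTS (SELECT 1 FROM joints j WHERE j.play_uuid = p.play_uuid);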
>>>
>>>
>>>
>>> *Michel SALAIS*
>>>
>>> *De :* Wells Oliver <wells(dot)oliver(at)gmail(dot)com>
>>> *Envoyé :* vendredi 10 décembre 2021 23:53
>>> *À :* pgsql-admin <pgsql-admin(at)postgresql(dot)org>
>>> *Objet :* Very slow 101-feeling design/query..
>>>
>>>
>>>
>>> This feels very 101 but I feel like it should be much faster:
>>>
>>>
>>>
>>> A table "joints" with a PK of play_uuid, target_id, joint_seq,
>>> joint_timestamp.
>>>
>>>
>>>
>>> "joints" is partitioned using RANGE on joint_timestamp for monthly
>>> partitions 1/1 - 2/1, 2-1 - 3/1, etc.
>>>
>>>
>>>
>>> "joints" has an FK where play_uuid refers to table "plays" and
>>> column "play_uuid" where "play_uuid" is the PK.
>>>
>>>
>>>
>>> "plays" additionally has an indexed column game_id.
>>>
>>>
>>>
>>> "joints" has 1133,932,391 rows across 12 monthly partitions for 2021,
>>> and "plays has 585,627 rows. We made a view called "vw_joints" which just
>>> does:
>>>
>>>
>>>
>>> SELECT * FROM joints JOIN plays USING (play_uuid);
>>>
>>>
>>>
>>> Then doing:
>>>
>>>
>>>
>>> SELECT DISTINCT game_id FROM vw_joints
>>>
>>>
>>>
>>> Takes 35-45 minutes, which seems nuts. We use this kind of design in a
>>> few different places to normalize things, but it comes at the cost of
>>> these agonizingly slow (and seemingly dead simple) queries.
>>>
>>>
>>>
>>> Is there any optimization to do here beyond flattening the table and
>>> de-normalizing the data? Is the partitioning causing the slowness here? I
>>> feel like partitioning is creating some difficulty...
>>> --
>>>
>>
>> Well, you can create a trigger which fires whenever a row is inserted or
>> deleted and maintains the game_id and the row count in a separate table.
>> That is the usual solution for the problem you described. Also, if there
>> is a small number of games, you can use hash indexes instead of B-tree
>> indexes.
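>>
>> A rough sketch of that trigger approach (all table and function names
>> here are hypothetical):
>>
>> CREATE TABLE game_joint_counts (
>>     game_id     integer PRIMARY KEY,
>>     joint_count bigint NOT NULL DEFAULT 0
>> );
>>
>> CREATE OR REPLACE FUNCTION bump_joint_count() RETURNS trigger AS $$
>> BEGIN
>>     IF TG_OP = 'INSERT' THEN
>>         -- resolve the game via plays, then upsert the per-game count
>>         INSERT INTO game_joint_counts (game_id, joint_count)
>>         SELECT p.game_id, 1 FROM plays p WHERE p.play_uuid = NEW.play_uuid
>>         ON CONFLICT (game_id) DO UPDATE
>>             SET joint_count = game_joint_counts.joint_count + 1;
>>     ELSIF TG_OP = 'DELETE' THEN
>>         UPDATE game_joint_counts g
>>            SET joint_count = g.joint_count - 1
>>           FROM plays p
>>          WHERE p.play_uuid = OLD.play_uuid AND g.game_id = p.game_id;
>>     END IF;
>>     RETURN NULL;
>> END;
>> $$ LANGUAGE plpgsql;
>>
>> -- row-level triggers on partitioned tables work on PG 11 and later
>> CREATE TRIGGER joints_count_trg
>> AFTER INSERT OR DELETE ON joints
>> FOR EACH ROW EXECUTE FUNCTION bump_joint_count();
>>
>> -- and, for a small set of game_id values, the hash-index idea:
>> CREATE INDEX plays_game_id_hash ON plays USING hash (game_id);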
>>
>>
>> --
>>
>> Mladen Gogala
>> Database Consultant
>> Tel: (347) 321-1217
>> https://dbwhisperer.wordpress.com
>>
>>
>
> --
> Wells Oliver
> wells(dot)oliver(at)gmail(dot)com <wellsoliver(at)gmail(dot)com>
>
>

--
Wells Oliver
wells(dot)oliver(at)gmail(dot)com <wellsoliver(at)gmail(dot)com>
