Re: Help Optimizing a Summary Query

From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Help Optimizing a Summary Query
Date: 2014-12-11 21:31:29
Message-ID: CAO_YK0W=A0a+e1_JvPQLFzO=+naNRnPi8GP-0YZx2qu8A7JAPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Dec 11, 2014 at 6:52 PM, Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
wrote:

> Thanks Arthur. I don't think there is as big a different between BIGINT
> and INTEGER as you think there is. In fact with an extended filesystem you
> might not see any difference at all.
>
> As I put in the first emal I am using a GIST index on user.name.
>
> I was really more interested in the LEFT OUTER JOINs vs EXISTS queries and
> if there was a better alternative I had not considered.
>
> On Tue, Dec 9, 2014 at 11:44 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
>
>> On Tue, Dec 9, 2014 at 4:18 PM, Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
>> wrote:
>>
>>> I'm sorry, I missed a JOIN on the second variation. It is:
>>>
>>> SELECT u.id, u.name, u.imageURL, u.bio,
>>> CASE
>>> WHEN f.friend_id IS NOT NULL THEN 'isFriend'
>>> WHEN s.to_id IS NOT NULL THEN 'hasSentRequest'
>>> WHEN r.to_id IS NOT NULL THEN 'hasReceivedRequest'
>>> ELSE 'none'
>>> END AS 'friendStatus',
>>> (SELECT COUNT(1) AS d
>>> FROM friends f1
>>> JOIN friends f2 ON f1.fiend_id = f2.friend_id
>>> WHERE f1.user_id = 33 AND f2.user_id = u.id)
>>> FROM users u
>>> *LEFT OUTER JOIN friends f ON f.user_id = 33 AND f.friend_id = u.id
>>> <http://u.id>*
>>> LEFT OUTER JOIN friend_requests s ON s.to_id = 33 AND s.from_id = u.id
>>> LEFT OUTER JOIN friend_requests r ON r.to_id = u.id AND r.from_id = 33
>>> WHERE u.id != 33 AND u.name LIKE '%John%' ORDER BY u.name;
>>>
>>>
>>> On Tue, Dec 9, 2014 at 10:15 AM, Robert DiFalco <
>>> robert(dot)difalco(at)gmail(dot)com> wrote:
>>>
>>>> I have users, friends, and friend_requests. I need a query that
>>>> essentially returns a summary containing:
>>>>
>>>> * user (name, imageURL, bio, ...)
>>>> * Friend status (relative to an active user)
>>>> * Is the user a friend of the active user?
>>>> * Has the user sent a friend request to the active user?
>>>> * Has the user received a friend request from the active user?
>>>> * # of mutualFriends
>>>> * Exclude the active user from the result set.
>>>>
>>>> So I have mocked this up two ways but both have complicated query plans
>>>> that will be problematic with large data sets. I'm thinking that my lack of
>>>> deep SQL knowledge is making me miss the obvious choice.
>>>>
>>>> Here's my two query examples:
>>>>
>>>> SELECT u.id, u.name, u.imageURL, u.bio,
>>>> CASE
>>>> WHEN EXISTS(SELECT 1 FROM friends f WHERE f.user_id = 33 AND
>>>> f.friend_id = u.id) THEN 'isFriend'
>>>> WHEN EXISTS(SELECT 1 FROM friend_requests s WHERE s.to_id = 33
>>>> AND s.from_id = u.id) THEN 'hasSentRequest'
>>>> WHEN EXISTS(SELECT 1 FROM friend_requests r WHERE r.to_id = u.id
>>>> AND r.from_id = 33) THEN 'hasReceivedRequest'
>>>> ELSE 'none'
>>>> END AS "friendStatus",
>>>> (SELECT COUNT(1)
>>>> FROM friends f1
>>>> JOIN friends f2 ON f1.friend_id = f2.friend_id
>>>> WHERE f1.user_id = 33 AND f2.user_id = u.id) AS mutualFriends
>>>> FROM users u
>>>> WHERE u.id != 33 AND u.name LIKE 'John%' ORDER BY u.name;
>>>>
>>>> SELECT u.id, u.name, u.imageURL, u.bio,
>>>> CASE
>>>> WHEN f.friend_id IS NOT NULL THEN 'isFriend'
>>>> WHEN s.to_id IS NOT NULL THEN 'hasSentRequest'
>>>> WHEN r.to_id IS NOT NULL THEN 'hasReceivedRequest'
>>>> ELSE 'none'
>>>> END AS 'friendStatus',
>>>> (SELECT COUNT(1) AS d
>>>> FROM friends f1
>>>> JOIN friends f2 ON f1.fiend_id = f2.friend_id
>>>> WHERE f1.user_id = 33 AND f2.user_id = u.id)
>>>> FROM users u
>>>> LEFT OUTER JOIN friend_requests s ON s.to_id = 33 AND s.from_id = u.id
>>>> LEFT OUTER JOIN friend_requests r ON r.to_id = u.id AND r.from_id = 33
>>>> WHERE u.id != 33 AND u.name LIKE 'John%' ORDER BY u.name;
>>>>
>>>> 33 is just the id of the active user I am using for testing. The WHERE
>>>> clause could be anything. I'm just using "u.name" here but I'm more
>>>> concerned about the construction of the result set than the WHERE clause.
>>>> These have more or less similar query plans, nothing that would change
>>>> things factorially. Is this the best I can do or am I missing the obvious?
>>>>
>>>> Here are the tables:
>>>>
>>>>
>>>> CREATE TABLE users (
>>>> id BIGINT,
>>>> name VARCHAR,
>>>> imageURL VARCHAR
>>>> created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
>>>> phone_natl BIGINT, /* National Phone Number */
>>>> country_e164 SMALLINT, /* E164 country code */
>>>> email VARCHAR(255),
>>>> PRIMARY KEY (id),
>>>> UNIQUE (email),
>>>> UNIQUE (phone_natl, country_e164)
>>>> );
>>>>
>>>>
>>>> CREATE TABLE friends (
>>>> user_id BIGINT,
>>>> friend_id BIGINT,
>>>> PRIMARY KEY (user_id, user_id),
>>>> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
>>>> FOREIGN KEY (friend_id) REFERENCES users(id) ON DELETE CASCADE
>>>> );
>>>> CREATE INDEX idx_friends_friend ON friends(friend_id);
>>>>
>>>> CREATE TABLE friend_requests (
>>>> from_id BIGINT,
>>>> to_id BIGINT,
>>>> created TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
>>>> PRIMARY KEY (from_id, user_id),
>>>> FOREIGN KEY (from_id) REFERENCES users(id) ON DELETE CASCADE,
>>>> FOREIGN KEY (to_id) REFERENCES users(id) ON DELETE CASCADE
>>>> );
>>>> CREATE INDEX idx_friend_requests_to ON friend_requests(to_id);
>>>>
>>>> Let me know if you guys need anything else.
>>>>
>>>>
>>>
>>
>> Hello Robert, none of your schemas worked for me, here's a clean version
>>
>> CREATE TABLE users (
>> id BIGINT,
>> name VARCHAR,
>> imageURL VARCHAR,
>> created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
>> phone_natl BIGINT,
>> country_e164 SMALLINT,
>> email VARCHAR(255),
>> PRIMARY KEY (id),
>> UNIQUE (email),
>> UNIQUE (phone_natl, country_e164)
>> );
>>
>>
>> CREATE TABLE friends (
>> user_id BIGINT,
>> friend_id BIGINT,
>> PRIMARY KEY (user_id, friend_id),
>> FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
>> FOREIGN KEY (friend_id) REFERENCES users(id) ON DELETE CASCADE
>> );
>> CREATE INDEX idx_friends_friend ON friends(friend_id);
>>
>> CREATE TABLE friend_requests (
>> from_id BIGINT,
>> to_id BIGINT,
>> created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
>> PRIMARY KEY (from_id, to_id),
>> FOREIGN KEY (from_id) REFERENCES users(id) ON DELETE CASCADE,
>> FOREIGN KEY (to_id) REFERENCES users(id) ON DELETE CASCADE
>> );
>> CREATE INDEX idx_friend_requests_to ON friend_requests(to_id);
>>
>> I may look into the query later but here's some thoughts
>>
>> * you want an index to speed up name searches on user.name, I suggest
>> pg_trgm + ILIKE as a starting point.
>> * you really expect more than a billion users? If not (or not in the near
>> future) use INT instead. This will save you a significant amount of memory
>> and eventually buy you time later.
>> * as long as you don't hit the disk for the queries you'll be fine, so
>> make sure you have enough memory or use read-slaves with smaller working
>> sets.
>>
>>
>
Really? I double checked and there isn't any mention about GIST in your
first email.
As for the Int/BigInt, It's not a big deal (from 5% to 20% memory savings
on indexes), but it was worth mentioning.

You may try to materialize the author friend list (used in the
common-friend count) with a CTE.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message David G Johnston 2014-12-11 22:55:37 Re: Help Optimizing a Summary Query
Previous Message Robert DiFalco 2014-12-11 20:52:37 Re: Help Optimizing a Summary Query