From: | Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> |
---|---|
To: | Patrick Clery <patrick(at)phpforhire(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Comparing user attributes with bitwise operators |
Date: | 2004-09-16 08:11:00 |
Message-ID: | 41494A94.5080800@familyhealth.com.au |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Sounds like you want a many-to-many table that maps user_ids to match_ids
Then you can put an index over (user_id, match_id) and the search will
be very fast.
Chris
Patrick Clery wrote:
> I'm working on a dating/personals/match-making site, that has used many
> different methods of "match-making", that all seem to be very slow. One I am
> attempting now that seems to be an efficient method of storage, but not the
> best for indexing, is using bitwise operators to compare one person's profile
> to another's.
>
> This method seems to be very fast on a small scale, but I am dealing with a
> large user-base here, in excess of 200,000 users that will be executing this
> search function every time they login (the search results of their profile
> will appear on the main page after they have logged in). I've opted to use
> "label tables" for each possible set of answers. (i.e: Marital Status)
>
> For this table, the set of bits -- bit(5) -- are represented as such:
>
> +-----+------------+
> | Bit | Meaning |
> +-----+------------+
> | 1 | single |
> | 2 | separated |
> | 3 | married |
> | 4 | divorced |
> | 5 | widowed |
> +-----+------------+
>
> Here's the structure of the marital status table:
>
> # \d marital_matrix
> Table "public.marital_matrix"
> Column | Type | Modifiers
> -----------+----------------+-----------------------------------------------------------------------
> member_id | integer | not null default
> nextval('public.marital_matrix_member_id_seq'::text)
> status | bit varying(5) | not null default (0)::bit(5)
> p_status | bit varying(5) | not null default (0)::bit(5)
> Indexes:
> "marital_matrix_pkey" PRIMARY KEY, btree (member_id)
> "idx_marital_matrix" btree ((status::"bit" & p_status::"bit"))
> "idx_marital_matrix_single" btree ((status::"bit" & p_status::"bit"))
> "idx_marital_p_status" btree (p_status)
> "idx_marital_status" btree (status)
> Foreign-key constraints:
> "$1" FOREIGN KEY (member_id) REFERENCES members(member_id) ON DELETE
> CASCADE DEFERRABLE INITIALLY DEFERRED
>
> To give you an idea of the selectivity (NOTE: there are only 50,000 rows, a
> smaller sample than what I will actually be using):
>
> datingsite=> select count(*),status,p_status from marital_matrix group by
> status,p_status;
> count | status | p_status
> -------+--------+----------
> 89 | 00001 | 00000
> 1319 | 00010 | 00000
> 2465 | 00100 | 00000
> 1 | 00100 | 11111
> 46117 | 10000 | 00000
>
> here is the user I'll be comparing against, which has selected that he be
> matched with any but married people:
>
> datingsite=> SELECT * FROM marital_matrix WHERE member_id = 21;
> member_id | status | p_status
> -----------+--------+----------
> 21 | 10000 | 11011
> (1 row)
>
>
>
>
> Here are a few possible comparison methods I can think of (NOTE: tests were
> run on a 2.4Ghz Intel CPU w/ 256M RAM on FreeBSD 4.10:
>
>
> METHOD 1: Any bit that is set in p_status (prefered marital status) of the
> searching user should be set in the potential match's marital status. This is
> the method I'd like to improve, if possible. Running the query twice didn't
> produce a different runtime.
>
> EXPLAIN ANALYZE
> SELECT
> m2.member_id
> FROM
> marital_matrix m1, marital_matrix m2
> WHERE
> m1.member_id = 21 AND
> m2.status & m1.p_status != B'00000';
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.00..2357.79 rows=49742 width=4) (actual
> time=18.062..708.469 rows=47525 loops=1)
> Join Filter: ((("inner".status)::"bit" & ("outer".p_status)::"bit") <>
> B'00000'::"bit")
> -> Index Scan using marital_matrix_pkey on marital_matrix m1
> (cost=0.00..5.01 rows=1 width=9) (actual time=0.035..0.045 rows=1 loops=1)
> Index Cond: (member_id = 21)
> -> Seq Scan on marital_matrix m2 (cost=0.00..1602.91 rows=49991 width=13)
> (actual time=17.966..255.529 rows=49991 loops=1)
> Total runtime: 905.694 ms
> (6 rows)
>
>
> METHOD 2: Specifying the value (I don't think this would make a difference,
> but I'll post anyways):
>
> EXPLAIN ANALYZE
> SELECT
> member_id
> FROM
> marital_matrix
> WHERE
> status & B'11011' != B'00000';
>
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
> Seq Scan on marital_matrix (cost=0.00..1852.87 rows=49742 width=4) (actual
> time=18.113..281.101 rows=47525 loops=1)
> Filter: (((status)::"bit" & B'11011'::"bit") <> B'00000'::"bit")
> Total runtime: 480.836 ms
> (3 rows)
>
>
> METHOD 3: Checking for one bit only. This is definitely not a "real world"
> example and unacceptable since the p_status column can and will have multiple
> bits. For categories other than "Marital Status", such as "Prefered Hair
> Color", the users are likely to select multiple bits (they choose all that
> apply). This query does use the index, but is still not very fast at all:
>
> EXPLAIN ANALYZE
> SELECT
> member_id
> FROM
> marital_matrix m1
> WHERE
> status & B'10000' = B'10000';
> QUERY
> PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using idx_marital_matrix_single on marital_matrix m1
> (cost=0.00..903.59 rows=250 width=4) (actual time=0.042..258.907 rows=46117
> loops=1)
> Index Cond: (((status)::"bit" & B'10000'::"bit") = B'10000'::"bit")
> Total runtime: 451.162 ms
> (3 rows)
>
> METHOD 4: Using an IN statement. This method seems to be very fussy about
> using the index, and I have at some point made it use the index when there
> are less than 3 possibilites. Also, for fields other than Marital Status,
> users will be able to select many bits for their own profile, which means
> there would be many permutations:
>
> EXPLAIN ANALYZE
> SELECT
> member_id
> FROM
> marital_matrix
> WHERE
> status & B'11011' IN (B'10000',B'01000');
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on marital_matrix (cost=0.00..2602.73 rows=993 width=4) (actual
> time=17.845..288.279 rows=47525 loops=1)
> Filter: ((((status)::"bit" & B'11011'::"bit") = B'10000'::"bit") OR
> (((status)::"bit" & B'11011'::"bit") = B'01000'::"bit") OR (((status)::"bit"
> & B'11011'::"bit") = B'00010'::"bit") OR (((status)::"bit" & B'11011'::"bit")
> = B'00001'::"bit"))
> Total runtime: 488.651 ms
> (3 rows)
>
>
> Method 3 is the only one that used the index, but the only really acceptable
> method here is Method 1.
>
> My questions are...
> - Is there any hope in getting this to use an efficient index?
> - Any mathmaticians know if there is a way to reorder my bitwise comparison to
> have the operator use = and not an != (perhaps to force an index)? (AFAIK,
> the answer to the second question is no)
>
> If anyone could offer any performance tips here I'd really appreciate it. I
> imagine that having this schema wouldn't last an hour with the amount of CPU
> cycles it would be consuming on math operations.
>
> Also, I have read the thread that was posted here by Daniel in August:
> http://archives.postgresql.org/pgsql-performance/2004-08/msg00328.php
>
> I have spoke with Daniel on this issue and we both agree it's very difficult
> to find a solution that can scale to very large sites.
>
> I would very much appreciate any advice that some experienced users may have
> to offer me for such a situation. TIA
>
> Patrick
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Stark | 2004-09-16 08:44:29 | Re: Comparing user attributes with bitwise operators |
Previous Message | Patrick Clery | 2004-09-16 07:41:37 | Comparing user attributes with bitwise operators |