From: | Marcus Engene <mengpg2(at)engene(dot)se> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: where in (select array) |
Date: | 2008-11-20 11:45:53 |
Message-ID: | 49254DF1.1050400@engene.se |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Richard Huxton wrote:
> Marcus Engene wrote:
>
>> Hi List,
>>
>> I have the might_like table that contains products a user might like if
>> he likes the present one (item).
>>
>> CREATE TABLE might_like
>> (
>> item INTEGER NOT NULL
>> ,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
>> ,child INTEGER NOT NULL
>> )
>> WITHOUT OIDS;
>>
>> CREATE INDEX might_like_x1 ON might_like(item);
>>
>> Since there are (will be) houndreds of thousands of items, and 20+ might
>> like items, i thought it would be nice to reduce the set to 1/20th by
>> using a vector.
>>
>> CREATE TABLE might_like_vector
>> (
>> item INTEGER NOT NULL
>> ,created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
>> ,child_arr INTEGER[]
>> )
>> WITHOUT OIDS;
>>
>
> You haven't reduced the set at all, you've just turned part of it
> sideways. You might gain something on your search, but I'm guessing
> you've not tested it.
>
> Hmm - the attached script generates 100,000 items and 10 liked ones for
> each (well for the first 99,990 it says you like the next 10 items).
> They're all given different timestamps at day intervals which means
> you'll end up with 6 or seven matches for you sample query.
>
Sorry, I was a bit unclear. This is run on a web server. The might like
rows are generated if they are not available for an item when the
corresponding page is generated. The one row per might-like-item is
online since yesterday and has when this is written 1/2 M rows in it.
The caching is primarily initiated by a google indexer agent.
This might-like generation is expensive so caching them in the db is a
must and the used CPU of the web-server halfed when this caching had
been put live and cached the most commonly used items.
When the might-like data is read from the database, the generated
presentation html is put in memcached with a timeout of 1h (presently).
Memcached here is probably way overkill, but using it in situations like
this makes the site more scaleable to add webservers and postpones the
problem of clustering pg.
So with memcached I care less about saving a few mS in select latency
and more about postponing other approaching problems like having the
dbdump manageble. Right now it's a 100MB gzipped dump, which is very
manageable, so where it's possible I'd like to keep the data compact. I
imagine it's cheaper disk & dump wise to do the array thing in this and
some other similar usages, and therefore it would be nice to have a
non-ugly usage pattern. Also, we're going to collect usage statistics to
further optimize the behavior of the site, and I'm really worried about
these millions of rows.
If this is a bad approach to the problem I'm very interested to hear
this. Regardless of the best approach, a "proper" solution to the
subquery in would be super appreciated too :)
Thanks for your answer!
Best regards,
Marcus
From | Date | Subject | |
---|---|---|---|
Next Message | Sam Mason | 2008-11-20 11:49:16 | Re: date range query help |
Previous Message | Sam Mason | 2008-11-20 11:29:20 | Re: return MAX and when it happened |