From: | Florian Weimer <fweimer(at)bfk(dot)de> |
---|---|
To: | "Jon Stewart" <jonathan(dot)l(dot)stewart(at)gmail(dot)com> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Creating large database of MD5 hash values |
Date: | 2008-04-11 17:04:00 |
Message-ID: | 82skxsqq67.fsf@mid.bfk.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
* Jon Stewart:
>> BYTEA is slower to load and a bit inconvenient to use from DBI, but
>> occupies less space on disk than TEXT or VARCHAR in hex form (17 vs 33
>> bytes with PostgreSQL 8.3).
> Can you clarify the "slower to load" point? Where is that pain point
> in the postgres architecture?
COPY FROM needs to read 2.5 bytes on average, instead 2, and a complex
form of double-decoding is necessary.
> Storing the values in binary makes intuitive sense to me since the
> data is twice as dense, thus getting you more bang for the buck on
> comparisons, caching, and streaming reads. I'm not too concerned about
> raw convenience, as there's not going to be a lot of code around my
> application.
The main issue is that you can't use the parameter-providing version
of $sth->execute (or things like $sth->selectarray, $sth->do), you
must use explicit binding by parameter index in order to specify the
type information.
> The idea is that you have named sets of hash values, and hash values
> can be in multiple sets.
The ID step is only going to help you if your sets are very large and
you use certain types of joins, I think. So it's better to
denormalize in this case (if that's what you were alluding to in your
original post).
> The big operations will be to calculate the unions, intersections, and
> differences between sets. That is, I'll load a new set into the
> database and then see whether it has anything in common with another
> set (probably throw the results into a temp table and then dump it
> out).
In this case, PostgreSQL's in-memory bitmap indices should give you
most of the effect of your hash <-> ID mapping anyway.
> I will also periodically run queries to determine the size of
> the intersection of two sets for all pairs of sets (in order to
> generate some nice graphs).
I think it's very difficult to compute that efficiently, but I haven't
thought much about it. This type of query might benefit from your
hash <-> ID mapping, however, because the working set is smaller.
--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
From | Date | Subject | |
---|---|---|---|
Next Message | Craig Ringer | 2008-04-11 18:03:43 | Re: Performance increase with elevator=deadline |
Previous Message | Matthew | 2008-04-11 16:40:02 | Re: Performance increase with elevator=deadline |