Re: My honours project - databases using dynamically attached entity-properties

From: Eddie Stanley <eddiewould(at)paradise(dot)net(dot)nz>
To: David Fetter <david(at)fetter(dot)org>
Cc: Edward Stanley <Edward(dot)Stanley(at)mcs(dot)vuw(dot)ac(dot)nz>, Richard Huxton <dev(at)archonet(dot)com>, Sean Utt <sean(at)strateja(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Andrew Hammond <andrew(dot)george(dot)hammond(at)gmail(dot)com>, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: My honours project - databases using dynamically attached entity-properties
Date: 2007-03-14 19:24:59
Message-ID: 45F84C0B.9030306@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

David Fetter wrote:
> On Wed, Mar 14, 2007 at 12:07:45PM +1300, Edward Stanley wrote:
>
>> On Wed, 14 Mar 2007, David Fetter wrote:
>>
>>> On Tue, Mar 13, 2007 at 05:54:34PM +0000, Richard Huxton wrote:
>>>
>>>> David Fetter wrote:
>>>>
>>>>> On Tue, Mar 13, 2007 at 02:21:37PM +0000, Richard Huxton wrote:
>>>>>
>>>>>> David Fetter wrote:
>>>>>>
>>>>>>> On Tue, Mar 13, 2007 at 09:31:45AM +0000, Richard Huxton wrote:
>>>>>>>
>>>>>>>> * Another good example is the "questionnaire".
>>>>>>>>
>>>>>>> With all due respect, this is a solved problem *without EAV or
>>>>>>> run-time DDL*. The URL below has one excellent approach to this.
>>>>>>>
>>>>>>> <http://www.varlena.com/GeneralBits/110.php>
>>>>>>>
>>>>>> Which broadly speaking was the solution I used for my
>>>>>> questionnaire, except I had a restricted set of types so basically
>>>>>> just coerced them to text and side-stepped the inheritance issue.
>>>>>> To the extent that it's dynamic, it's still just EAV though.
>>>>>>
>>>>> That's precisely the difference between the above solution and
>>>>> yours, and it's the difference between a good design and one that
>>>>> will come up and bit you on the as^Hnkle.
>>>>>
>>>> It's still basically EAV (either approach). The key fault with EAV
>>>> is that the tables have no semantic meaning - answer_int contains
>>>> number of oranges, days since birth and the price of a tube ticket
>>>> in pennies.
>>>>
>>> Stuffing all of those into an answer_int is *precisely* what the end
>>> user must not do. That's pilot error.
>>>
>>>
>>>> Now, with a questionnaire that might not matter because everything
>>>> is an "answer" and you're not necessarily going to do much more than
>>>> count/aggregate it.
>>>>
>>> See above.
>>>
>>>
>>>>>> It doesn't remove the need for run-time DDL if you allow users to
>>>>>> add their own questions.
>>>>>>
>>>>> Sure it does. When a user, who should be talking with you, wants
>>>>> to ask a new kind of question, that's the start of a discussion
>>>>> about what new kind(s) of questions would be generally applicable
>>>>> in the questionnaire schema. Then, when you come to an agreement,
>>>>> you roll it into the new schema, and the whole system gets an
>>>>> improvement.
>>>>>
>>>> Fine, but if you're not letting the user extend the system, then
>>>> it's not really addressing Edward's original posting, is it?
>>>>
>>> It's my contention that Edward's original idea is ill-posed. SQL is
>>> just fine for doing this kind of thing, and it's *not that hard*.
>>>
>>>
>>>> If the user's talking to me, I might as well just write the DDL
>>>> myself - it's the talk that'll take the time, not writing a dozen
>>>> lines of SQL.
>>>>
>>> It's the talk that's the important part. Machines are really bad at
>>> seeing the broader picture. In the attempt to "save" a few minutes'
>>> discussion, he's trying to borrow that time from a system asked to do
>>> things that computers are inherently bad at doing, and every end user
>>> will pay that time back at a very high rate of interest. This is
>>> precisely the kind of false economy that so plagues software
>>> development and maintenance these days.
>>>
>>>
>>>> The interesting part of the problem (from a Comp-Sci point of view)
>>>> is precisely in automating part of that discussion. It's providing
>>>> an abstraction so that you don't end up with a mass of attributes
>>>> while still providing freedom to the user.
>>>>
>>> This freedom and efficiency you're talking about is better supplied,
>>> IMHO, by putting a standard DDL for questionnaires up on a pgfoundry
>>> or an SF.net. That way, improvements to the DDL get spread all over
>>> the world, and a very large amount of wheel reinvention gets avoided.
>>> Reusable components are a big chunk of both freedom and efficiency. :)
>>>
>>> Cheers,
>>> D
>>>
>> Maybe I should rethink the problem a bit - from the very brief
>> initial research I've done, it seems EAV schemas have two common
>> uses:
>>
>
>
>> 1) When new attributes have to be created on-the-fly
>> 2) When the number of possible properties for an entity greatly (orders of
>> magnitude) exceeds the number of properties any one entity is likely to have.
>>
>
> Um, no. The first use case is bad coding practice, and the second is
> a classic case for a join table, which is the standard way to handle
> M:N relationships.
>
>
>> I'm not sure about solving the first problem - there seems to be a lot of
>> debate around this. I can see reasons for and against allowing this. However
>> I think the second is a very real problem. One such example is a patient
>> record system.
>>
>> For each patient we have a table of common data (dob, sex, height, weight etc)
>> but as well as this a patient can present with many symptoms. This might be a
>> table of 40,000 possible symptoms.
>>
>
> Here's how I'd do that:
>
> CREATE TABLE patient (
> patient_id SERIAL PRIMARY KEY, /* for simplicity. Some
> combination of columns in the
> table would also have a UNIQUE
> NOT NULL constraint on it.
> */
> ...
> );
>
> CREATE TABLE symptom (
> symptom_id SERIAL PRIMARY KEY, /* See above. */
> ...
> );
>
> CREATE TABLE patient_presents_with (
> patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
> symptom_id INTEGER NOT NULL REFERENCES symptom(symptom_id),
> UNIQUE(patient_id, symptom_id)
> );
>
>
>> Lets say we want to run a query on these symptoms (using a boolean expression)
>>
>
> I'd use something like the following:
>
> SELECT
> p.patient_id,
> p.f_name,
> p.l_name,
> s.symptom_name,
> s.symptom_desc
> FROM
> patient p
> JOIN
> patient_presents_with ppw
> USING (patient_id)
> JOIN
> symptom s
> USING (symptom_id)
> WHERE
> s.symptom_name = ALL('foo','bar','baz')
> AND
> s.symptom_name = ANY('quux','fleeg');
>
>
Are the ALL and ANY functions new to Postgresql 8? I haven't met them
before.
Anyway this will work for some queries but not others.

What about the following expression? ('foo' && 'bar') | ('baz' && !
'quxx') | 'fleeg'
Maybe I have misunderstood how these functions work, but I don't think
they will handle anything but trivial examples of this problem.

>> to return the patient records which match the query string on the symptoms.
>>
>> (This turns out to be a very similar problem to the 'tags' example I first
>> presented) - assume a similar schema. With more than a couple of symptoms and
>> a complex tree, the resulting SQL can span pages.
>>
>
> Not really. See above :)
>
>
>> When I first started thinking about this project I believed the two problems
>> essentially to be the same class of problem, but this may not be the case.
>>
>
> EAV will bite you. It's not *that* much work to keep its from biting
> you. :)
>
> Cheers,
> D
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2007-03-14 21:41:49 Re: hash join hashtable size and work_mem
Previous Message Tom Lane 2007-03-14 17:48:44 Re: hash join hashtable size and work_mem