Re: dynamic schema modeling and performance

From: Rj Ewing <ewing(dot)rj(at)gmail(dot)com>
To: Vincent Elschot <vinny(at)xs4all(dot)nl>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: dynamic schema modeling and performance
Date: 2017-04-14 18:37:47
Message-ID: CAOSSsV3n4_kc=toCSbx0K9Qcu3cU9GFX0OC_PhrjgdjBHOZe+w@mail.gmail.com
Lists: pgsql-general

Some example queries are:

give me all attributes for entity1 where entity1.attribute1 > 1000 and
entity1.attribute15 = "someValue"

give me all attributes for entity1 where entity1.parentId = 1

give me all attributes for entity1 & parent_entity where entity1.attribute2
= "this"

Nothing too complicated.

Our application-level checks mostly consist of validating the uploaded data
against user-defined rules. They are fairly basic.

Scale-wise...

Nothing huge. In terms of a more traditional schema, where each entity was a
separate table, the largest table might contain 1,000,000 rows, with the vast
majority of tables under 10,000 rows. Another consideration is that a given
entity might have 30 attributes, but the data is sparse: the distribution might
look like 8 of the 30 are 100% populated (every row has a value), 4 of 30 are
80% populated, 15 are 50% populated, and the rest are less than 25% populated.

On Fri, Apr 14, 2017 at 11:23 AM, Vincent Elschot <vinny(at)xs4all(dot)nl> wrote:

>
> Op 14/04/2017 om 19:03 schreef Rj Ewing:
>
> We do know where we want to end up. We've had the application running for
> a while using a triple store db. We're looking to move away from the triple
> store due to performance issues. Our core concept is that each project can
> define a set of entities and their relations. Each entity has a set of
> attributes. We need to be able to efficiently run ad-hoc queries across
> multiple entities in a project, filtering via the attribute values.
>
> What kind of queries are you talking about?
>
>
> I think our business needs probably eliminate the possibility of data
> integrity at the db level. We currently do application level data
> validation.
>
>
> Application-level checks can only be done if you exclusively lock the
> database from before you start the check until the moment you no longer
> need the certainty.
> That's usually a no-go because it makes your data single-user for the
> duration of the check.
>
>
> Regarding EAV, is there a good way to do this? Most of what I read
> online says that EAV is a terrible idea and performance is lackluster.
> However, there are 6NF advocates who say that, done right, it is highly
> performant. Are there any articles on the correct way to implement EAV?
>
> Performance always depends entirely on what you are doing with it, and on
> what scale. This is something that you should experiment with.
>
>
> would jsonb or eav likely provide better query performance?
>
> Can't tell without more information about what you actually do with the
> data.
>
> But dynamic schemas can be just fine, as long as your application is 100%
> in control over what can and cannot be done,
> and as long as you use separate databases per customer/project/whatever.
> You will probably want to scale up at some point
> and move customers to different servers, so you might as well take that
> into account before you start.
>
>
>
> On Wed, Apr 12, 2017 at 7:43 AM, Merlin Moncure <mmoncure(at)gmail(dot)com>
> wrote:
>
>> On Tue, Apr 11, 2017 at 12:46 PM, Rj Ewing <ewing(dot)rj(at)gmail(dot)com> wrote:
>> > I'm looking for thoughts on the best way to handle dynamic schemas.
>> >
>> > The application I am developing revolves around user defined entities.
>> > Each entity is a tabular dataset with user defined columns and data
>> > types. Entities can also be related to each other through Parent-Child
>> > relationships. Some entities will be 100% user driven, while others
>> > (such as an entity representing a photo) will be partially user driven
>> > (all photo entities will have common fields + custom user additions).
>> >
>> > I was hoping to get opinions on whether postgresql would be a suitable
>> > backend. A couple of options I have thought of are:
>> >
>> > 1. Each entity is represented as a table in psql. The schema would be
>> > dynamically updated (with limits) when an entity mapping is updated. I
>> > believe that this would provide the best data constraints and allow the
>> > best data normalization. A concern I have is that there could be an
>> > enormous number of tables generated and the performance impacts this
>> > might have in the future. I could then run elasticsearch as a
>> > denormalized cache for efficient querying and full-text-search.
>> >
>> > 2. Use a nosql database. This provides the "dynamic" schema aspect. A
>> > concern here is the lack of relation support, thus leading to a more
>> > denormalized data structure and the potential for the data to become
>> > corrupted.
>> >
>> > Any opinions on the use of psql for this case, or other options would
>> > be greatly appreciated!
>>
>> Postgres can function as a nosql database -- you can use jsonb for
>> example to archive data in such a way that the data model can be
>> changed without making schema adjustments. Another way to do it is
>> EAV pattern as noted. These might be good strategies if you're not
>> sure where you want to end up.
>>
>> It really comes down to this: how formal do you want your data model
>> to be? Adding formality leads to performance optimizations, exposes
>> your data to the fantastic SQL language, and allows rigorous
>> assumptions to be made from external dependencies and trusted.
>> Formality also brings a degree of inflexibility since your data has to
>> be forced into predetermined structures.
>>
>> merlin
>>
>
