From: Igor Korot <ikorot01(at)gmail(dot)com>
To: me nefcanto <sn(dot)1361(at)gmail(dot)com>
Cc: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Question about querying distributed databases
Date: 2025-03-06 03:03:10
Message-ID: CA+FnnTzjSDE9E=TF56F-EAp6u=oPH2vmGNrjN50H53dXrev1MA@mail.gmail.com
Lists: pgsql-general
Hi,
On Wed, Mar 5, 2025, 8:44 PM me nefcanto <sn(dot)1361(at)gmail(dot)com> wrote:
> I once worked with a monolithic SQL Server database with more than 10
> billion records and about 8 Terabytes of data. A single backup took us more
> than 21 days. It was a nightmare. Almost everybody knows that scaling up
> has a ceiling, but scaling out has no boundaries.
>
But then you did the backup incrementally, correct?
That should not take the same amount of time...
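For reference, PostgreSQL 17 added native incremental backups; a rough sketch of that workflow (paths are placeholders, and the `summarize_wal = on` setting must be enabled on the server first):

```shell
# One-time full base backup (writes a backup_manifest alongside the data):
pg_basebackup --pgdata=/backups/full

# Later, an incremental backup relative to that manifest:
pg_basebackup --pgdata=/backups/incr1 \
    --incremental=/backups/full/backup_manifest

# At restore time, combine the chain into a usable data directory:
pg_combinebackup /backups/full /backups/incr1 -o /restore/data
```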
> Therefore I will never choose a monolithic database design unless it's a
> small project. But my examples are just examples. We predict 100 million
> records per year. So we have to design accordingly. And it's not just sales
> records. Many applications have requirements that are cheap data but vast
> in multitude. Consider a language-learning app that wants to store the
> known words of any learner. 10 thousand learners each knowing 2 thousand
> words means 20 million records. Convert that to 100 thousand learners each
> knowing 7 thousand words and now you almost have a billion records. Cheap,
> but necessary. Let's not dive into telemetry or time-series data.
>
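The row counts above check out; a quick back-of-envelope sketch (the bytes-per-row figure is an assumption for a narrow two-column table, not a measured Postgres number):

```python
# Back-of-envelope sizing for the "known words" example above.
learners = 100_000
words_per_learner = 7_000

rows = learners * words_per_learner        # 700 million rows
bytes_per_row = 48                         # assumed: tuple header + two bigints + alignment
approx_gib = rows * bytes_per_row / 1024**3

print(f"{rows:,} rows, ~{approx_gib:.0f} GiB before indexes")
# → 700,000,000 rows, ~31 GiB before indexes
```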
Can you try and see if 1 server with 3 different databases will do?
Having 1 table per database per server is too ugly.
Also please understand - every database is different, and so it works and
operates differently. What works well in one may not work well in another...
Thank you.
> We initially chose to break the database into smaller databases, because
> it seemed natural for our modularized monolith architecture. And it worked
> great for SQL Server. If we're small, we host them all on one server. If
> we get bigger, we can put heavy databases on separate machines.
>
> However, I don't have experience working with other types of database
> scaling. I have used table partitioning, but I have never used sharding.
>
> Anyway, that's why I asked you guys. However, encouraging me to go back to
> a monolith without giving solutions on how to scale is not helping. To be
> honest, I'm somewhat disappointed that the most advanced open source
> database does not support cross-database querying the way SQL Server
> does. But if it doesn't, it doesn't. Our team should either drop it as a
> choice or find a way (by asking the experts who built or use it) to
> design based on its features. That's why I'm asking.
>
> One thing that comes to my mind is to use custom types. Instead of
> storing data in ItemCategories and ItemAttributes, store them as arrays in
> the relevant tables in the same database. But then it seems to me that in
> this case, Mongo would become a better choice because I lose the relational
> nature and normalization somehow. What drawbacks have you experienced in
> that sense?
>
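A minimal sketch of that denormalized layout (table and column names are illustrative; ItemCategories and ItemAttributes are the tables named above):

```sql
-- Fold category links and key-value attributes into the products table
-- instead of keeping them in separate databases.
CREATE TABLE products (
    product_id   bigint PRIMARY KEY,
    name         text     NOT NULL,
    category_ids bigint[] NOT NULL DEFAULT '{}',  -- replaces ItemCategories rows
    attributes   jsonb    NOT NULL DEFAULT '{}'   -- replaces ItemAttributes rows
);

-- GIN indexes keep containment searches fast despite the denormalization:
CREATE INDEX ON products USING gin (category_ids);
CREATE INDEX ON products USING gin (attributes);

-- "Products in category 42 whose color attribute is red":
SELECT product_id
FROM   products
WHERE  category_ids @> ARRAY[42::bigint]
AND    attributes   @> '{"color": "red"}';
```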
> Regards
> Saeed
>
> On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> wrote:
>
>> On 3/5/25 04:15, me nefcanto wrote:
>> > Dear Laurenz, the point is that I think if we put all databases into
>> > one database, then we have blocked our growth in the future.
>>
>> How?
>>
>> > A monolith database can be scaled only vertically. We have had huge
>> > headaches in the past with SQL Server on Windows and a single database.
>> > But when you divide bounded contexts into different databases, then you
>> > have the chance to deploy each database on a separate physical machine.
>> > That means a lot in terms of performance. Please correct me if I am
>> > wrong.
>>
>> And you add the complexity of talking across machines, as well as
>> maintaining separate machines.
>>
>> >
>> > Let's put this physical restriction on ourselves that we have different
>> > databases. What options do we have? One option that comes to my mind is
>> > to store the IDs of the categories in the Products table. This means
>> > that I don't need FDW anymore. And databases can be on separate
>> > machines. I first query the categories database, get the category IDs,
>> > and then add a where clause to limit the product search. That could be
>> > an option. The array data type in Postgres is something that I think
>> > other RDBMSs do not have. Will that work? And how about attributes?
>> > Because attributes are more than a single ID. I should store the
>> > attribute key alongside its value. It's a key-value pair. What can I do
>> > for that?
>>
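The two-step pattern described above, sketched in SQL (names are illustrative; the application layer forwards the IDs between the two servers):

```sql
-- Step 1, on the categories server: collect the matching item IDs.
SELECT item_id
FROM   item_categories
WHERE  category_id = 42;

-- Step 2, on the products server: the application passes the IDs in,
-- so no FDW or cross-server join is needed.
SELECT p.product_id, p.name
FROM   products p
WHERE  p.product_id = ANY (ARRAY[101, 102, 103]);  -- IDs from step 1
```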
>> You seem to be going out of the way to make your life more complicated.
>>
>> The only way you are going to find an answer is to set up test cases and
>> experiment. My bet is that a single server with a single database and
>> multiple schemas is where you end up; after all, that is where you are
>> starting from.
>>
>>
>> >
>> > Thank you for sharing your time. I really appreciate it.
>> > Saeed
>> >
>> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe
>> > <laurenz(dot)albe(at)cybertec(dot)at> wrote:
>> >
>> > On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
>> > > That means a solid monolith database. We lose many goodies with that.
>> > > As a real-world example, right now we can import a single database
>> > > from the production to the development to test and troubleshoot data.
>> >
>> > Well, can't you import a single schema then?
>> >
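Per-schema import works with standard pg_dump/pg_restore flags; a sketch with placeholder database and schema names:

```shell
# Dump only the taxonomy schema from production, custom format:
pg_dump --schema=taxonomy --format=custom --file=taxonomy.dump production_db

# Restore it into a development database:
pg_restore --dbname=dev_db taxonomy.dump
```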
>> > > What if we host all databases on the same server and use FDW? What
>> > > happens in that case? Does it return 100 thousand records and join
>> > > in the memory?
>> >
>> > It will do just the same thing. The performance could be better
>> > because of the reduced latency.
>> >
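For reference, the FDW setup under discussion looks roughly like this with postgres_fdw (server, database, credential, and table names are placeholders, not from the thread):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER taxonomy_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'taxonomy');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER taxonomy_srv
    OPTIONS (user 'app_user', password 'app_password');

CREATE SCHEMA taxonomy;

-- Mirror the remote table definitions locally:
IMPORT FOREIGN SCHEMA public
    LIMIT TO (item_categories)
    FROM SERVER taxonomy_srv INTO taxonomy;

-- The join now reads like a local one; postgres_fdw pushes down
-- quals where it can, but cross-server joins may still ship rows.
SELECT p.product_id
FROM   products p
JOIN   taxonomy.item_categories c ON c.item_id = p.product_id
WHERE  c.category_id = 42;
```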
>> > > Because in SQL Server, when you perform a cross-database query
>> > > (not cross-server) the performance is extremely good, proving that
>> > > it does not return 100 thousand ItemId from Taxonomy.ItemCategories
>> > > to join with ProductId.
>> > >
>> > > Is that the same case with Postgres too, if databases are located
>> > > on one server?
>> >
>> > No, you cannot perform cross-database queries without a foreign
>> > data wrapper. I don't see a reason why the statement shouldn't
>> > perform as well as in SQL Server if you use schemas instead of
>> > databases.
>> >
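The schemas-instead-of-databases layout suggested here might look like this (schema and table names are illustrative):

```sql
-- One database, one schema per bounded context; the join stays local.
CREATE SCHEMA catalog;
CREATE SCHEMA taxonomy;

CREATE TABLE taxonomy.item_categories (
    item_id     bigint NOT NULL,
    category_id bigint NOT NULL,
    PRIMARY KEY (item_id, category_id)
);

CREATE TABLE catalog.products (
    product_id bigint PRIMARY KEY,
    name       text   NOT NULL
);

-- Cross-context query, no FDW and no network hop:
SELECT p.product_id, c.category_id
FROM   catalog.products p
JOIN   taxonomy.item_categories c ON c.item_id = p.product_id
WHERE  c.category_id = 42;
```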
>> > Yours,
>> > Laurenz Albe
>> >
>>
>> --
>> Adrian Klaver
>> adrian(dot)klaver(at)aklaver(dot)com
>>
>>