Quick Links

Re: Quesion about querying distributed databases

From:	me nefcanto <sn(dot)1361(at)gmail(dot)com>
To:	Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc:	Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Re: Quesion about querying distributed databases
Date:	2025-03-06 02:43:39
Message-ID:	CAEHBEOCpxASoNn=u21kaqOn1A-4YPy_mVfgkEjT3wRT5G4ycbg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

I once worked with a monolithic SQL Server database with more than 10
billion records and about 8 Terabytes of data. A single backup took us more
than 21 days. It was a nightmare. Almost everybody knows that scaling up
has a ceiling, but scaling out has no boundaries.

Therefore I will never choose a monolithic database design unless it's a
small project. But my examples are just examples. We predict 100 million
records per year. So we have to design accordingly. And it's not just sales
records. Many applications have requirements that are cheap data but vast
in multitude. Consider a language-learning app that wants to store the
known words of any learner. 10 thousand learners each knowing 2 thousand
words means 20 million records. Convert that to 100 thousand learners each
knowing 7 thousand words and now you almost have a billion records. Cheap,
but necessary. Let's not dive into telemetry or time-series data.

We initially chose to break the database into smaller databases, because it
seemed natural for our modularized monolith architecture. And it worked
great for SQL Server. If you're small, we host them all on one server. If
you get bigger, we can put heavy databases on separate machines.

However, I don't have experience working with other types of database
scaling. I have used table partitioning, but I have never used sharding.

Anyway, that's why I asked you guys. However, encouraging me to go back to
monolith without giving solutions on how to scale, is not helping. To be
honest, I'm somehow disappointed by how the most advanced open source
database does not support cross-database querying just like how SQL Server
does. But if it doesn't, it doesn't. Our team should either drop it as a
choice or find a way (by asking the experts who built it or use it) how to
design based on its features. That's why I'm asking.

One thing that comes to my mind, is to use custom types. Instead of storing
data in ItemCategories and ItemAttributes, store them as arrays in the
relevant tables in the same database. But then it seems to me that in this
case, Mongo would become a better choice because I lose the relational
nature and normalization somehow. What drawbacks have you experienced in
that sense?

Regards
Saeed

On Wed, Mar 5, 2025 at 7:38 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:

> On 3/5/25 04:15, me nefcanto wrote:
> > Dear Laurenz, the point is that I think if we put all databases into one
> > database, then we have blocked our growth in the future.
>
> How?
>
> > A monolith database can be scaled only vertically. We have had huge
> > headaches in the past with SQL Server on Windows and a single database.
> > But when you divide bounded contexts into different databases, then you
> > have the chance to deploy each database on a separate physical machine.
> > That means a lot in terms of performance. Please correct me if I am
> wrong.
>
> And you add the complexity of talking across machines, as well as
> maintaining separate machines.
>
> >
> > Let's put this physical restriction on ourselves that we have different
> > databases. What options do we have? One option that comes to my mind, is
> > to store the ID of the categories in the Products table. This means that
> > I don't need FDW anymore. And databases can be on separate machines. I
> > first query the categories database first, get the category IDs, and
> > then add a where clause to limit the product search. That could be an
> > option. Array data type in Postgres is something that I think other
> > RDBMSs do not have. Will that work? And how about attributes? Because
> > attributes are more than a single ID. I should store the attribute key,
> > alongside its value. It's a key-value pair. What can I do for that?
>
> You seem to be going out of the way to make your life more complicated.
>
> The only way you are going to find an answer is set up test cases and
> experiment. My bet is a single server with a single database and
> multiple schemas is where you end up, after all that is where you are
> starting from.
>
>
> >
> > Thank you for sharing your time. I really appreciate it.
> > Saeed
> >
> >
> >
> >
> >
> > On Wed, Mar 5, 2025 at 3:18 PM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at
> > <mailto:laurenz(dot)albe(at)cybertec(dot)at>> wrote:
> >
> > On Wed, 2025-03-05 at 14:18 +0330, me nefcanto wrote:
> > > That means a solid monolith database. We lose many goodies with
> that.
> > > As a real-world example, right now we can import a single database
> > > from the production to the development to test and troubleshoot
> data.
> >
> > Well, can't you import a single schema then?
> >
> > > What if we host all databases on the same server and use FDW. What
> > > happens in that case? Does it return 100 thousand records and join
> > > in the memory?
> >
> > It will do just the same thing. The performance could be better
> > because of the reduced latency.
> >
> > > Because in SQL Server, when you perform a cross-database query
> > > (not cross-server) the performance is extremely good, proving that
> > > it does not return 100 thousand ItemId from
> Taxonomy.ItemCategories
> > > to join with ProductId.
> > >
> > > Is that the same case with Postgres too, If databases are located
> > > on one server?
> >
> > No, you cannot perform cross-database queries without a foreign
> > data wrapper. I don't see a reason why the statement shouldn't
> > perform as well as in SQL Server if you use schemas instead of
> > databases.
> >
> > Yours,
> > Laurenz Albe
> >
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
>

In response to

Re: Quesion about querying distributed databases at 2025-03-05 16:08:51 from Adrian Klaver

Responses

Re: Quesion about querying distributed databases at 2025-03-06 03:03:10 from Igor Korot
Re: Quesion about querying distributed databases at 2025-03-06 06:22:38 from Ron Johnson
Re: Quesion about querying distributed databases at 2025-03-06 08:04:31 from Laurenz Albe
Re: Quesion about querying distributed databases at 2025-03-06 13:31:54 from Greg Sabino Mullane

Browse pgsql-general by date

	From	Date	Subject
Next Message	Igor Korot	2025-03-06 03:03:10	Re: Quesion about querying distributed databases
Previous Message	Igor Korot	2025-03-06 00:55:05	Re: Error on query execution