Re: creating a subset DB efficiently ?

From: hector vass <hector(dot)vass(at)gmail(dot)com>
To: David Gauthier <dfgpostgres(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: creating a subset DB efficiently ?
Date: 2024-03-09 14:56:07
Message-ID: CAJJx+iUCKn01CEXviePH+KwZ1Mu2qt4Z8quVOXjrpvEZsN46+g@mail.gmail.com
Lists: pgsql-general

On Fri, Mar 8, 2024 at 4:22 PM David Gauthier <dfgpostgres(at)gmail(dot)com> wrote:

> Here's the situation....
>
> - The DB contains data for several projects.
> - The tables of the DB contain data for all projects (data is not
> partitioned on project name or anything like that)
> - The "project" identifier (table column) exists in a few "parent" tables
> with many child... grandchild,... tables under them connected with foreign
> keys defined with "on delete cascade". So if a record in one of the parent
> tables is deleted, all of its underlying, dependent records get
> deleted too.
> - New projects come in, and old ones need to be removed and "archived" in
> DBs of their own. So there's a DB called "active_projects" and there's a
> DB called "project_a_archive" (identical metadata).
> - The idea is to copy the data for project "a" that's in "active_projects"
> to the "project_a_arhchive" DB AND delete the project a data out of
> "active_projects".
> - Leave "project_a_archive" up and running if someone needs to attach to
> that and get some old/archived data.
>
> The brute-force method I've been using is...
> 1) pg_dump "active_projects" to a (huge) file then populate
> "project_a_archive" using that (I don't have the privs to create database,
> IT creates an empty one for me, so this is how I do it).
> 2) go into the "project_a_archive" DB and run... "delete from par_tbl_1
> where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ",
> etc... leaving only project "a" data in the DB.
> 3) go into the "active_projects" DB and "delete from par_tbl_1 where
> project = 'a' ", etc... removing project "a" from the "active_projects" DB.
>
> Ya, not very elegant, it takes a long time and it takes a lot of
> resources. So I'm looking for ideas on how to do this better.
>
> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever. I
> fear it's because it's trying to journal everything in case I want to
> rollback. But this is just in the archive DB and I don't mind taking the
> risk if I can speed this up outside of a transaction. How can I run a
> delete command like this without the rollback recovery overhead ?
>

> (I don't have the privs to create database, IT creates an empty one for
> me, so this is how I do it).

That's a shame. You can do something similar with schemas and tablespaces:
- template your existing schema to create a new schema for the project
  (pg_dump -s)
- create a tablespace for this new project and schema (rough sketch below)
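
Something along these lines, as an untested sketch -- the names "arch_space"
and "project_a" and the path are made up here, and CREATE TABLESPACE needs
superuser:

    # DDL only, assuming the live tables sit in the public schema
    pg_dump -s -n public active_projects > project_template.sql

    # a tablespace whose location lives on cheaper disk
    psql -d active_projects -c "CREATE TABLESPACE arch_space LOCATION '/mnt/archive/pgdata'"
    psql -d active_projects -c "CREATE SCHEMA project_a"

    # adjust project_template.sql so objects are created in schema project_a
    # (and, if you like, TABLESPACE arch_space), then load it
    psql -d active_projects -f project_template.sql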

You can then move the physical tablespace to cheaper disk and use symbolic
links, or archive and/or back it up at the schema level with pg_dump -n
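
The schema-level dump/restore might look something like this (again just a
sketch with made-up names):

    # archive only that project's schema; custom format keeps it compact
    pg_dump -n project_a -Fc -f project_a.dump active_projects

    # restore it later into whatever empty database IT hands you
    pg_restore -d project_a_archive project_a.dump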

...as long as you don't put anything in the public schema, all you are
really sharing is roles; otherwise it's a bit like a separate database.
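
To get the project's rows into the per-project schema before deleting them
from the shared tables, one option is plain INSERT ... SELECT, reusing the
table names from your post (child tables need the same treatment, parents
first, so the foreign keys are satisfied):

    INSERT INTO project_a.par_tbl_1
        SELECT * FROM public.par_tbl_1 WHERE project = 'a';

    -- once everything is copied, the cascade takes the children with it
    DELETE FROM public.par_tbl_1 WHERE project = 'a';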
