Re: Modification of data in base folder and very large tables

From: Ogden Brash <info(at)litika(dot)com>
To: Jerry Sievers <gsievers19(at)comcast(dot)net>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Modification of data in base folder and very large tables
Date: 2019-10-10 05:01:35
Message-ID: CAFCR_K2+T=iS_oTTOEdK67Za7gN0-2+LeK+m7=vjw9Zu0Jsdtg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

First of all, thanks to Jeff for pointing out strace. I had not used it
before and it is quite informative. This is the rather depressing one
minute summary for the pg_restore:

% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
80.23 0.176089 65 2726 read
16.51 0.036226 12 3040 lseek
1.69 0.003710 11 326 write
0.79 0.001733 7 232 recvfrom
0.77 0.001684 18 96 96 openat
0.02 0.000035 12 3 1 futex
------ ----------- ----------- --------- --------- ----------------
100.00 0.219477 6423 97 total

The two pg_related processes which are doing 99% of the i/o on the machine
are spending less than 2% of all that time actually writing data. No wonder
it is going slowly!

Which takes me to Jerry's question as to whether there are any indexes. My
immediate off-the cuff answer would be "no". I deleted all indexes in the
target database before I did the pg_restore, and the tables do not have any
foreign keys or any other constraints or triggers. I did the restore as a
data only restore so that it would not try to recreate any tables.

However, after studying the details of the strace output and reading about
other people's experiences, I believe that the more accurate answer is that
I did not take into consideration the index that gets created as a result
of the primary key. In this case, the primary key is big int that is
automatically assigned by a sequence.

If each of the tables has about 3+ billion rows, the index is still going
to be pretty large and spread over many files. In the source database that
was backed up, the primary key sequence was sequentially assigned and
written, but as various posprocessing operations were applied and the rows
modified, the data, is probably in a relatively random evenly distributed
order. So I now believe that all the files that are being constantly
touched are not actually the files for the data rows, but the files for the
index, and as the system is reading the data it is jumping around
recreating the index for the primary key based on the random order of the
dta rows it reads.

Sound plausible? I'm still a postgres noob.

As an experiment, I am in the process of clustering the source database
tables by the primary key constraint. I am hoping that if I redo the
pg_dump after that, it will contain the records in more-or-less primary key
order and on the subsequent pg_restore it should not have to spend the vast
majority of the time on reading and seeking.

It is surprising to me that the cluster operations (which also have to
churn through the entire index and all the records) are going *much* faster
than pg_restore. At the rate they are going, they should be finished in
about 12 hours, whereas the pg_restore is nowhere close to finishing and
has been churning for 10 days.

On Wed, Oct 9, 2019 at 4:05 PM Jerry Sievers <gsievers19(at)comcast(dot)net> wrote:

> Ogden Brash <info(at)litika(dot)com> writes:
>
> > I have a question about the files in .../data/postgresql/11/main/
> > base, specifically in relation to very large tables and how they are
> > written.
> >
> > I have been attempting to restore a relatively large database with
> > pg_restore and it has been running for more than a week. (I also
> > have another thread about the same restore related to network vs.
> > local disk I/O)
> >
> > I ran the pg_restore in verbose mode using multiple jobs so I can
> > tell what has finished and what has not. The database had 66 tables
> > and most of them have been restored. Two of the tables were quite
> > large (billions of rows translating to between 1 and 2TB of data on
> > disk for those two tables) and those two tables are pretty much the
> > only things remaining that has not been reported as finished by
> > pg_restore.
> >
> > As the process has been going for a week, I have been tracking the
> > machine (a dedicated piece of hardware, non-virtualized) and have
> > been noticing a progressive slowdown (as tracked by iostat). There is
>
> Do the tables that are being loaded have any indexes on them?
>
> > nothing running on the machine besides postgresql and the server is
> > only doing this restore, nothing else. It is now, on average, running
> > at less than 25% of the speed that it was running four days ago (as
> > measured by rate of I/O).
> >
> > I started to dig into what was happening on the machine and I noticed
> > the following:
> >
> > iotop reports that two postgres processes (I assume each processing
> > one of the two tables that needs to be processed) are doing all the I
> > /O. That makes sense
> >
> > Total DISK READ : 1473.81 K/s | Total DISK WRITE : 617.30 K/s
> > Actual DISK READ: 1473.81 K/s | Actual DISK WRITE: 0.00 B/s
> > TID PRIO USER DISK READ DISK WRITE SWAPIN IO>
> > COMMAND
> > 6601 be/4 postgres 586.44 K/s 7.72 K/s 0.00 % 97.39 % postgres:
> > 11/main: postg~s thebruteff [local] COPY
> > 6600 be/4 postgres 887.37 K/s 601.87 K/s 0.00 % 93.42 % postgres:
> > 11/main: postg~s thebruteff [local] COPY
> > 666 be/3 root 0.00 B/s 7.72 K/s 0.00 % 5.73 % [jbd2/
> > sda1-8]
> > 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
> > 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 %
> > [kthreadd]
> > 4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker/
> > 0:0H]
> >
> > So, the next thing I though I would do is an "lsof" command for each
> > of those two processes to see what they were writing. That was a bit
> > of a surpise:
> >
> > # lsof -p 6600 | wc -l;
> > 840
> >
> > # lsof -p 6601 | wc -l;
> > 906
> >
> > Is that normal? That there be so many open file pointers? ~900 open
> > file pointers for each of the processes?
> >
> > The next I did was go to see the actual data files, to see how many
> > there are. In my case they are in postgresql/11/main/base/24576 and
> > there are 2076 files there. That made sense. However, I found that
> > when I list them by modification date I see something interesting:
> >
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.7
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.8
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.9
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.10
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.11
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.12
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.13
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.14
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.16
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.15
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.17
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.18
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.19
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.21
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.22
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.23
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.24
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.25
> > -rw------- 1 postgres postgres 1073741824 Oct 8 13:05 27083.26
> > -rw------- 1 postgres postgres 19062784 Oct 8 13:05 27082_fsm
> > -rw------- 1 postgres postgres 544489472 Oct 8 13:05 27077.34
> > -rw------- 1 postgres postgres 169705472 Oct 8 13:05 27082.72
> > -rw------- 1 postgres postgres 978321408 Oct 8 13:05 27083.27
> > -rw------- 1 postgres postgres 342925312 Oct 8 13:05 27076.88
> >
> > If you notice, the file size is capped at 1 GB and as the giant table
> > has grown it has added more files in this directory. However, the
> > mysterious thing to me is that it keeps modifying those files
> > constantly - even the ones that are completely full. So for the two
> > large tables it has been restoring all week, the modification time
> > for the ever growing list of files is being updating constantly.
> >
> > Could it be that thats why I am seeing a slowdown over the course of
> > the week - that for some reason as the number of files for the table
> > has grown, the system is spending more and more time seeking around
> > the disk to touch all those files for some reason?
> >
> > Does anyone who understands the details of postgresql's interaction
> > with the file system have an explanation for why all those files
> > which are full are being touched constantly?
> >
> >
> >
> >
>
> --
> Jerry Sievers
> Postgres DBA/Development Consulting
> e: postgres(dot)consulting(at)comcast(dot)net
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andrew Gierth 2019-10-10 12:05:24 Re: Modification of data in base folder and very large tables
Previous Message Jerry Sievers 2019-10-09 23:05:20 Re: Modification of data in base folder and very large tables