Re: Why could different data in a table be processed with different performance?

From: Vladimir Ryabtsev <greatvovan(at)gmail(dot)com>
To: f(dot)pardi(at)portavita(dot)eu
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Why could different data in a table be processed with different performance?
Date: 2018-09-24 22:28:15
Message-ID: CAMqTPqnUB54YXQutPKPWdoUKEZOnV58WndfTvZagrNLiTr6RHw@mail.gmail.com
Lists: pgsql-performance

> You can create 2 partial indexes and the planner will pick it up for you.
> (and the planning time will go a bit up).
I created two partial indexes and ensured the planner uses them, but the
result is still the same: no noticeable difference.
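
For reference, I created them along these lines (the index names, column name
and boundary value here are only placeholders, not my exact ones):

-- hypothetical sketch; the real names, column and split point differ
CREATE INDEX articles_id_low  ON articles (id) WHERE id <  50000000;
CREATE INDEX articles_id_high ON articles (id) WHERE id >= 50000000;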

> it is not unusual to have 1GB cache or more... and do not forget to drop
> the cache between tests + do a sync
I conducted several long runs of dd, so I am sure these numbers are fairly
accurate. However, what worries me is that dd measures sequential read speed,
while during my experiments Postgres may need to read from random places,
which would reduce the real read speed dramatically. I have a feeling this
could be the reason.
I also reviewed the import scripts and found that the import was done in
DESCENDING order of IDs. It was done that way to get the most recent records
sooner; maybe it caused some inefficiency in the storage... But again, it was
the same for both ranges.
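
By the way, to check whether the physical order of the table still follows the
IDs (i.e. whether a range scan reads the heap sequentially or randomly), I
suppose one can look at pg_stats (assuming the column is called id):

-- correlation near 1/-1: heap order follows the column;
-- near 0: the scan touches heap pages in random order
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'articles' AND attname = 'id';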

> - how big is your index?
pg_table_size('articles_pkey') = 1561 MB

> - how big is the table?
pg_table_size('articles') = 427 GB
pg_table_size('pg_toast.pg_toast_221558') = 359 GB
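
(If I read the docs correctly, pg_table_size already includes the TOAST table
but not the indexes, so the 427 GB above already contains the 359 GB of TOAST.
A breakdown can be obtained roughly like this:)

-- pg_relation_size: heap only; pg_table_size: heap + TOAST;
-- pg_total_relation_size: heap + TOAST + indexes
SELECT pg_size_pretty(pg_relation_size('articles'))       AS heap_only,
       pg_size_pretty(pg_table_size('articles'))          AS heap_plus_toast,
       pg_size_pretty(pg_total_relation_size('articles')) AS with_indexes;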

> - given the size of shared_buffers, almost 2M blocks should fit, but you
> say 2 consecutive runs still are hitting the disk. That's strange indeed
> since you are using way more than 2M blocks.
TBH, I cannot say I understand your calculation of the number of blocks...
But to clarify: consecutive runs with the SAME parameters do NOT hit the
disk; only the first one does, and subsequent runs read only from the buffer
cache.
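
One way to see this is EXPLAIN with the BUFFERS option: on a repeated run with
the same parameters the blocks show up as "shared hit" rather than "read"
(the query below is only a placeholder for my test query):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM articles
WHERE id BETWEEN 1000000 AND 1001000;   -- placeholder ID range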

> Did you check that perhaps are there any other processes or cronjobs (on
> postgres and on the system) that are maybe reading data and flushing out
> the cache?
I checked with iotop that nothing else reads intensively from any disk in
the system. And again, the result is 100% reproducible and depends on the ID
range only; if there were anything like that, I would have noticed some
fluctuations in the results.

> You can make use of pg_buffercache in order to see what is actually
> cached.
It seems there is no such view in my DB; could it be that the extension is
not installed?
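
If it is simply not installed, I suppose something like this should do
(requires appropriate privileges); the query is the usual "top cached
relations" one:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- relations with the most pages currently in shared_buffers
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;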

> - As Laurenz suggested (VACUUM FULL), you might want to move data around.
> You can try also a dump + restore to narrow the problem to data or disk
I launched VACUUM FULL, but it ran very slowly; according to my calculation
it might take 17 hours. Instead, I will try to copy the data into another
table with the same structure, or spin up another server, and will let you
know.
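
The copy I have in mind is roughly this (table and column names are
placeholders), rewriting the data in ascending ID order:

-- rough sketch: rebuild the data in a fresh table in ascending ID order
CREATE TABLE articles_copy (LIKE articles INCLUDING ALL);

INSERT INTO articles_copy
SELECT * FROM articles
ORDER BY id;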

> - You might also want to try to see the disk graph of Windows, while you
> are running your tests. It can show you if data (and good to know how much)
> is actually fetching from disk or not.
I wanted to do so, but I don't have access to the Hyper-V server; I will try
to request credentials from the admins.

Couple more observations:
1) The result of my experiment is almost unaffected by other load on the
server. Another user was running a query (over this table) reading at ~130
MB/s, while my query was reading at 1.8-2 MB/s.
2) iotop shows a higher IO% (~93-94%) for the slower read speed (though it is
not quite clear what this field means). The process from the example above
had ~55% IO at 130 MB/s, while my process had ~93% IO at ~2 MB/s.

Regards,
Vlad
