Re: Postgresql RDS DB Latency Chossing Hash join Plan

From: github kran <githubkran(at)gmail(dot)com>
To: Michael Lewis <mlewis(at)entrata(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Postgresql RDS DB Latency Chossing Hash join Plan
Date: 2019-02-17 17:03:26
Message-ID: CACaZr5StxFZ4pzzMozBkGA=g_a731BU72cZqCX_twCP55x7bYw@mail.gmail.com
Lists: pgsql-general pgsql-sql

On Thu, Feb 14, 2019 at 4:58 PM Michael Lewis <mlewis(at)entrata(dot)com> wrote:

> This is beyond my expertise except to say that if your storage is SSDs in
> AWS, then you definitely want random_page_cost close to the same as
> seq_page_cost (1 by default) assuming your data is likely to be in cache as
> discussed in the documentation
> <https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS>.
> As it says- "Reducing this value relative to seq_page_cost will cause the
> system to prefer index scans" as you saw. Changing the value on production
> would again depend on the storage type used, and how good the cache hit
> rate is.
>
> As far as I know, dropping old partitions should not be significantly
> impactful to the system other than no longer needing to store that data
> (cost, time for full backups, etc).
>
> Again, as I understand things, there is not a big impact from having old
> unused tables in terms of maintenance. They should be ignored by normal
> processes.
>
> Glad you got your issue resolved.
>
>
> *Michael Lewis*
>

Thanks for the feedback. You have been giving your thoughts and suggestions
since the beginning of this case, and it has been helpful. Based on your
suggestion to increase the default statistics target from 100, I realized it
was not set correctly initially: I had it set at the session level without
setting it on the partition tables. As a next step I set the statistics
target to 1000 on all of the partition tables we use from Nov 2018 to Feb
2019 and then ran ANALYZE. I am currently running a load test to see how the
DB performs now, and so far it is running better than before. (I have reset
all the previous changes except the statistics change.) I will keep you
posted after the test finishes.
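
In case it helps anyone following the thread, this is roughly what I ran on
each partition table (the table name is one of ours, but the column name is
only a placeholder; in practice I repeated this for the columns used in our
join and filter predicates):

-- column name is a placeholder; repeat for each column that matters
ALTER TABLE asset_info_2019_2_part1 ALTER COLUMN asset_id SET STATISTICS 1000;
ANALYZE asset_info_2019_2_part1;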

Questions:
1) Can I further increase the setting to 3000 and see how the system
behaves? How do I know the best value to use for my database in terms of
the sampling limit with the default statistics setting?
2) Apart from analyzing the tables, do I need to make any other changes
related to the statistics setting?
3) Also, work_mem is currently set to 4 MB and we have not played with
this value so far. For future needs, can I increase the work_mem setting?
(A sketch of how I am checking the current values is below.)
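
For reference, this is a rough sketch of how I am checking the current values
before changing anything (plain SHOW commands and a catalog lookup, nothing
Aurora specific; the table name is just one of ours as an example, and 64MB
is only an illustrative value):

SHOW default_statistics_target;
SHOW work_mem;

-- per-column overrides, if any (attstattarget = -1 means the default applies)
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'asset_info_2019_2_part1'::regclass
  AND attnum > 0
  AND NOT attisdropped;

-- if we do try a higher work_mem, I would test it per session first
SET work_mem = '64MB';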

Appreciate your reply.

Thanks

> On Thu, Feb 14, 2019 at 3:11 PM github kran <githubkran(at)gmail(dot)com> wrote:
>
>>
>>
>> On Thu, Feb 14, 2019 at 12:43 PM Michael Lewis <mlewis(at)entrata(dot)com>
>> wrote:
>>
>>> How many total rows in these tables? I am assuming these are partitions
>>> and those 100 million rows got distributed. If the data difference is
>>> significant, then you aren't guaranteed similar performance. You may want
>>> to follow more of the suggested steps on.
>>>
>>> https://wiki.postgresql.org/wiki/SlowQueryQuestions
>>>
>>>
>>> *Michael Lewis | Software Engineer*
>>> *Entrata*
>>> *c: **619.370.8697 <619-370-8697>*
>>>
>> Michael - Yes, correct, the data of 100 million rows is distributed to all
>> the partitions.
>> Finally I feel we have come to a conclusion after we changed
>> random_page_cost from 4 to 2 in the Non prod instance, and we see
>> improvements in the query, which used to take 1 minute and now takes 1-2
>> seconds.
>> That's correct, we have around 490 million rows in a few of our partition
>> tables. A new partition table is created every 7 days.
>>
>> We ran our API test, which hits the backend Aurora RDS PostgreSQL database,
>> and our query response times and requests/sec are better than before. Do
>> you recommend this setting on a Production instance? Right now we are
>> planning to implement this option of random_page_cost = 2. (We know it also
>> works if we turn off hash joins, but we don't want to implement that change
>> and would rather set random_page_cost to 2.)
>>
>> Questions:
>> 1) What is your recommendation on this? Can we modify this change on the
>> Prod instance, which is performing better today, or only keep this change
>> on the Non prod instance? (Actually we want to implement this change on
>> the Non Prod instance.)
>> 2) Are there any downsides to changing this value?
>> 3) Also, another question: we have 64 TB of storage and are running on an
>> r4.16xlarge EC2 instance. A few of our partitions are bigger and a few of
>> them are smaller. We have data from the last 9 months and are planning to
>> keep the data for close to 1 year, until May. Do you see any concerns with
>> this? Eventually we are thinking of archiving this data in the next 2
>> months by dropping older partitions.
>> 4) What could be the problems of keeping the data longer if there is
>> regular maintenance like VACUUM and other maintenance activities?
>>
>>
>>
>>>
>>> On Thu, Feb 14, 2019 at 8:48 AM github kran <githubkran(at)gmail(dot)com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Feb 13, 2019 at 11:38 AM Michael Lewis <mlewis(at)entrata(dot)com>
>>>> wrote:
>>>>
>>>>> I didn't see your email yesterday, sorry about that. Index scans
>>>>> instead of sequential scans and nested loop instead of hash join means that
>>>>> you have bad row count estimates on "Non prod Aurora RDS instance" as far
>>>>> as I can figure. Have you run commands like-
>>>>>
>>>>> analyze asset_info_2019_2_part4;
>>>>> analyze asset_info_2019_2_part2;
>>>>>
>>>>> etc? If data are very similar, indexes all exist, and
>>>>> default_statistics_target are the same, then you should be getting the same
>>>>> plans.
>>>>>
>>>>>
>>>>> *Michael Lewis | Software Engineer*
>>>>> *Entrata*
>>>>> *c: **619.370.8697 <619-370-8697>*
>>>>>
>>>> 1) Yes, I did VACUUM for all the tables like asset_info_2019_2_part1,
>>>> part2, part4 and also for location_info_xxx to remove any dead tuples,
>>>> and also rebuilt the indexes.
>>>> 2) REINDEX TABLE location_data_2018_12_part4; (likewise ran this for all
>>>> the tables, and also VACUUM for all the tables).
>>>> 3) The data in the Non prod instance is larger. One thing to mention
>>>> here: when we built the Non prod instance we copied a SNAPSHOT from the
>>>> Prod instance, inserted about 100 million rows on top of that, and then
>>>> did VACUUM and re-indexed the tables.
>>>>
>>>> I can't think of anything else we can do here, but let us know if you
>>>> need any more details on this problem. I am happy to share more details.
>>>>
>>>>
>>>>>
>>>>> On Wed, Feb 13, 2019 at 8:49 AM github kran <githubkran(at)gmail(dot)com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 12, 2019 at 12:55 PM github kran <githubkran(at)gmail(dot)com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 12, 2019 at 12:33 PM Michael Lewis <mlewis(at)entrata(dot)com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Did you update the stats by running ANALYZE on the tables involved,
>>>>>>>> or perhaps the entire database on the 'Non prod Aurora RDS instance'? Can
>>>>>>>> you share the two execution plans?
>>>>>>>>
>>>>>>>> *Michael Lewis | Software Engineer*
>>>>>>>> *Entrata*
>>>>>>>> *c: **619.370.8697 <619-370-8697>*
>>>>>>>>
>>>>>>>
>>>>>>> Here is the plan for both of the DB instances.
>>>>>>>
>>>>>>
>>>>>> Michael - Did you get any chance to look at this issue? Also we see
>>>>>> that a sequential scan is being done instead of an index scan.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 12, 2019 at 11:27 AM github kran <githubkran(at)gmail(dot)com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 12, 2019 at 7:07 AM github kran <githubkran(at)gmail(dot)com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 11, 2019 at 6:00 PM github kran <githubkran(at)gmail(dot)com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 11, 2019 at 3:29 PM Michael Lewis <
>>>>>>>>>>> mlewis(at)entrata(dot)com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Are default statistics target the same on both prod and AWS?
>>>>>>>>>>>> Have you analyzed all tables being used in this query to ensure stats are
>>>>>>>>>>>> up proper? If the optimizer is choosing a different plan, then the stats
>>>>>>>>>>>> must be different IMO.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Michael Lewis | Software Engineer*
>>>>>>>>>>>> *Entrata*
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply. I have verified a few of the tables and
>>>>>>>>>>> their default statistics targets and they seem to be the same, but
>>>>>>>>>>> is there anything in particular you want me to look at to
>>>>>>>>>>> differentiate the Prod and Non prod databases? (Also the DB
>>>>>>>>>>> instance size is the same, but there is a little more data in the
>>>>>>>>>>> Non prod Aurora RDS instance compared to the Prod instance.)
>>>>>>>>>>>
>>>>>>>>>>> Query used.
>>>>>>>>>>>> select * from pg_stats where tablename = 'tableName'
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 11, 2019 at 2:15 PM github kran <
>>>>>>>>>>>> githubkran(at)gmail(dot)com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Postgres Team,
>>>>>>>>>>>>> We are using the AWS Aurora RDS PostgreSQL 9.6.8 engine. We
>>>>>>>>>>>>> recently deleted a few million rows from the database and ran
>>>>>>>>>>>>> into an issue in one of our dev accounts where the
>>>>>>>>>>>>> DB was not normal after this deletion. We reindexed and vacuumed
>>>>>>>>>>>>> the entire database but we couldn't bring it back to the same
>>>>>>>>>>>>> state as earlier. So as a next step we deleted the database and
>>>>>>>>>>>>> recreated it by copying the snapshot from a production instance,
>>>>>>>>>>>>> and then did vacuuming and reindexing on the database.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After this the dev database seems to be in a better state than
>>>>>>>>>>>>> earlier, but we are seeing a few of our DB calls take more than 1
>>>>>>>>>>>>> minute when fetching data, and we observed that
>>>>>>>>>>>>> this is because the query plan was executing a hash join as part
>>>>>>>>>>>>> of the query, whereas a similar query on the prod instance does
>>>>>>>>>>>>> not do any hash join and returns faster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, we did not want to experiment by modifying the DB settings,
>>>>>>>>>>>>> such as setting enable_hashjoin to off or random_page_cost to 1,
>>>>>>>>>>>>> as we don't have these settings in the Prod instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note:
>>>>>>>>>>>>> The partition table sizes we have here are between 40 GB and 75
>>>>>>>>>>>>> GB, which is our normal size range; we have a new partition table
>>>>>>>>>>>>> for every 7 days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Appreciate your ideas on what we could be missing and what we can
>>>>>>>>>>>>> correct here to reduce the query latency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> githubKran
>>>>>>>>>>>>
