From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-10 19:24:10
Message-ID: CAAaqYe_w=Q2F9tM0EyUvLn=CrwM99S3w9LE=MgghxKa6SRwdJw@mail.gmail.com
Lists: pgsql-bugs
On Sun, Nov 10, 2019 at 12:29 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote:
> >On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >>
> >> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >> > So I should have run the earlier attached plan with VERBOSE, but
> >> > here's the interesting thing: the parallel hash node's seq scan node
> >> > outputs two columns: let's call them (from the redacted plan)
> >> > items.system_id and items.account_id. The first (system_id) is both
> >> > not null and unique; the second (account_id) definitely has massive
> >> > skew. I'm not very up-to-speed on how the hash building works, but I
> >> > would have (perhaps naïvely?) assumed that the first column being
> >> > unique would make the hash keys very likely not to collide in any
> >> > significantly skewed way. Am I missing something here?
> >>
> >> Hrm. So the compound key is unique then. I was assuming up until now
> >> that it had duplicates. The hashes of the individual keys are
> >> combined (see ExecHashGetHashValue()), so assuming there is nothing
> >> funky about the way citext gets hashed (and it's not jumping out at
> >> me), your unique keys should give you uniform hash values and thus
> >> partition size, and repartitioning should be an effective way of
> >> reducing hash table size. So now it sounds like you have a simple
> >> case of underestimation, but now I'm confused about how you got a
> >> 344MB hash table with work_mem = 150MB:
> >
>
> Why would the compound key matter? The join only does this
>
> Hash Cond: (group_details.group_number = items.system_id)
>
> so the only thing that really matters when it comes to skew is
> system_id, i.e. the citext column. The other column is used by
> aggregation, but that's irrelevant here, no?
Yeah, that seems right. I think I got confused because account_id is
used in another join (well, technically a correlated subquery, but
effectively a join in implementation).
But even so, system_id is unique and not null. So the above discussion
of how uniqueness affects this still holds, right?
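(As a sanity check on that, something like the sketch below should show
whether the hash values of system_id are anywhere close to skewed. It
assumes citext hashes like the lower-cased text value and uses the stock
hashtext() function; the real hashing goes through hash_any(), so this
is only an approximation, not something I've run yet.)

-- rough sketch: spread of system_id hash values across 32 notional buckets
-- (assumption: citext hashes like lower-cased text; hashtext() is the built-in text hash)
select abs(hashtext(lower(system_id))::bigint) % 32 as bucket, count(*)
from items
group by 1
order by 1;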
> >Looking at the source, citext's hash is a pretty standard call to
> >hash_any, so I don't see how that would lead to any oddities (barring
> >an intentional hash collision etc., but this is real data).
> >
> >Do you have any theories about where the underestimation is happening?
> >It knows the number of rows reasonably well. The one thing I'm not
> >sure about yet is the row width = 16. The account_id is a bigint, so
> >we can assume it knows the size properly there. The system_id being
> >citext...I think the 8 bytes it has for that is probably a bit low on
> >average, but I'm not sure yet by how much (I'm going to run a query to
> >find out). Hmm. Query came back, and average length is just shy of
> >7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
> >of why/what it would be underestimating (does it know about
> >uniqueness/non-uniqueness? could that be a factor?)
> >
>
> I think the easiest thing we can do is to run a couple of queries
> collecting useful stats, like
>
> -- the most frequent system_id values
> select system_id, count(*), sum(length(system_id)) from items
> group by system_id
> order by count(*) desc limit 100;
>
> -- the largest group by system_id
> select system_id, count(*), sum(length(system_id)) from items
> group by system_id
> order by sum(length(system_id)) desc limit 100;
>
> That should tell us if there's something off.
Since the column is unique, we don't need to do any grouping to get
stats; as noted earlier I queried and found that the average length
was just shy of 7.9 characters (though I happen to know the length
varies at least between 7 and 34).
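(For reference, roughly the query I used for that; a sketch, the exact
form may have differed:)

-- length distribution of the unique join key
select avg(length(system_id)) as avg_len,
       min(length(system_id)) as min_len,
       max(length(system_id)) as max_len
from items;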
> >I also don't know why it seems to regularly fail on the primary, but
> >not on the sync, unless we adjust the work_mem up. I've double-checked
> >all GUCs and the only differences are things related to replication
> >that you'd expect to be different on primary/replica.
> >
> >I know there are some things that execute differently on replicas, so
> >I assume it's something like that, but I don't know what specifically
> >would cause this here.
> >
>
> Hmm, I hadn't realized it fails on the primary more often. That's certainly
> strange; I don't have a very good idea why that could be happening. Do
> the query plans look the same? What does the EXPLAIN ANALYZE output look like?
So far they all seem to look very similar. The other possibility is
that it's just that the replica is behind (it's not a sync replica),
but this has happened enough times on different days that that
explanation feels pretty unsatisfactory to me.
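(If it helps, a quick way to rule out lag would be something like the
following on the replica; just a sketch, I haven't been capturing this
alongside the failures:)

-- run on the replica: how far behind replay is relative to now
select now() - pg_last_xact_replay_timestamp() as approx_replay_lag;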
As mentioned in my previous email, I'm slowly checking against various
work_mem values to see if I can determine 1) where it starts to fail only
on the primary and 2) whether the bucket stats start to grow before
the failure case.
So far I've seen it fail on both primary and sync at 500MB, 400MB,
300MB, and 200MB, and, interestingly, with today's data, succeed on
both at 150MB. That last point does imply data changing could have
something to do with the sync/replica difference I suppose. ...And, it
just failed on the replica at 150MB and succeeded on the primary at
150MB. So...probably data changing?
In all of those runs I haven't seen the bucket stats change.
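(For reference, this is roughly the loop I'm running on each node; the
join below is only a stand-in for the redacted query, using the hash
condition quoted above:)

-- sketch of a single run: set work_mem, rerun the join, and record the
-- Buckets / Batches / Memory Usage line from the Parallel Hash node
set work_mem = '150MB';
explain (analyze, verbose, buffers)
select gd.group_number, count(*)
from group_details gd
join items i on gd.group_number = i.system_id
group by gd.group_number;
reset work_mem;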
> >> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally
> >> 4096) Memory Usage: 344448kB
> >>
> >> And I'm confused about what was different when it wanted the crazy
> >> number of batches.
> >
> >I'm not quite sure how to find out; if you have any ideas, I'd love to
> >hear them. The one thing I can think of to try is to slowly increase
> >work_mem (since setting it to 500MB reproduced the error on the
> >replica) and see if the bucket info starts to trend up.
> >
>
> Chances are it actually happened to detect skew and disabled the growth,
> hence the table grew above work_mem. We should probably add info about
> this (skew, disabling growth) to EXPLAIN ANALYZE verbose mode.
Would skew be likely though if the column is unique?
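For what it's worth, here's the kind of stats check I'd use to see
whether the planner even has any skew to latch onto for that column
(hypothetical; assumes default statistics on items):

-- if system_id really is unique, n_distinct should come out as -1 and
-- most_common_vals/most_common_freqs should be empty
select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'items' and attname = 'system_id';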