From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 9.5: Memory-bounded HashAgg
Date: 2014-09-03 23:34:42
Message-ID: 5407A592.3060209@fuzzy.cz
Lists: pgsql-hackers
On 20.8.2014 20:32, Robert Haas wrote:
> On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> Being able to batch inner and outer relations in a matching way is
>> certainly one of the reasons why hashjoin uses that particular scheme.
>> There are other reasons, though - for example being able to answer 'Does
>> this group belong to this batch?' quickly, and automatically update
>> number of batches.
>>
>> I'm not saying the lookup is extremely costly, but I'd be very surprised
>> if it was as cheap as modulo on a 32-bit integer. Not saying it's the
>> dominant cost here, but memory bandwidth is quickly becoming one of the
>> main bottlenecks.
>
> Well, I think you're certainly right that a hash table lookup is more
> expensive than modulo on a 32-bit integer; so much is obvious. But if
> the load factor is not too large, I think that it's not a *lot* more
> expensive, so it could be worth it if it gives us other advantages.
Yes, that may be true. I'm not opposed to Jeff's approach in general -
it's certainly a nice solution for cases with fixed-size aggregate states.
But I still don't see how it could handle aggregates with a growing
aggregate state (which is the case that troubles me, because that's what
we see in our workloads).
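Just to be concrete about what I mean (a tiny illustration, not from any
patch): a fixed-size transition state, like the one behind sum(bigint) or
count(*), never grows, while array_agg() accumulates an ArrayBuildState
that grows with every input tuple.

#include "postgres.h"
#include "utils/array.h"

/* fixed-size state, e.g. sum(bigint) + count(*) - never grows */
typedef struct FixedAggState
{
    int64   sum;
    int64   count;
} FixedAggState;

/*
 * Growing state, e.g. what array_agg() builds via accumArrayResult():
 * it keeps every input value, so the per-group memory grows with the
 * number of tuples in the group - that's the case I'm worried about.
 */
typedef struct GrowingAggState
{
    ArrayBuildState *astate;    /* one datum appended per input tuple */
} GrowingAggState;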
> As I see it, the advantage of Jeff's approach is that it doesn't
> really matter whether our estimates are accurate or not. We don't
> have to decide at the beginning how many batches to do, and then
> possibly end up using too much or too little memory per batch if we're
> wrong; we can let the amount of memory actually used during execution
> determine the number of batches. That seems good. Of course, a hash
Yes. I think that maybe we could use Jeff's approach even for the
'growing aggregate state' case, assuming we can serialize the aggregate
states and release the memory properly.
First, the problem with the current hash table used in HashAggregate
(i.e. dynahash) is that it never actually frees memory - when you do
HASH_REMOVE, it only moves the entry to a list of free entries for future
reuse.
Imagine a workload where you initially see only 1 tuple for each group
before work_mem gets full. At that point you stop adding new groups, but
the existing ones keep growing. Even if you know how to serialize the
aggregate states (which we don't), you're in trouble, because the
per-group states are still small (only 1 tuple was passed to each group)
and most of the memory is stuck in dynahash.
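To illustrate (a minimal sketch, the names are mine and not from any
patch - the point is only what HASH_REMOVE does in dynahash):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct DemoEntry
{
    uint32  key;        /* hash key */
    int64   count;      /* some per-group state */
} DemoEntry;

static HTAB *
create_demo_table(MemoryContext cxt)
{
    HASHCTL ctl;

    MemSet(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(uint32);
    ctl.entrysize = sizeof(DemoEntry);
    ctl.hash = tag_hash;
    ctl.hcxt = cxt;

    return hash_create("demo hash", 1024, &ctl,
                       HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
}

static void
demo_remove(HTAB *htab, uint32 key)
{
    /*
     * The entry disappears from the table, but its memory only moves to
     * dynahash's internal freelist (to be reused by a later HASH_ENTER) -
     * it is not pfree'd back to ctl.hcxt, so the memory is never actually
     * released and memory-context-based accounting won't see it shrink.
     */
    (void) hash_search(htab, &key, HASH_REMOVE, NULL);
}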
> join can increase the number of batches on the fly, but only by
> doubling it, so you might go from 4 batches to 8 when 5 would really
> have been enough. And a hash join also can't *reduce* the number of
> batches on the fly, which might matter a lot. Getting the number of
> batches right avoids I/O, which is a lot more expensive than CPU.
Regarding the estimates, I don't see much difference between the two
approaches when handling this issue.
It's true that you can postpone deciding how many partitions (aka
batches) to create until work_mem is full, at which point you have more
information than at the very beginning - you know how many tuples you've
already seen, how many tuples you expect (which is, however, only an
estimate), etc. And you may use that to estimate the number of partitions
to create.
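Just to make that concrete, this is roughly what such an estimate might
look like (purely hypothetical, not code from either patch - note that it
leans on the very row-count estimate that is often wrong):

#include "postgres.h"

#include <math.h>

#include "miscadmin.h"      /* work_mem (in kB) */

/*
 * Hypothetical sketch: pick the number of partitions once work_mem fills
 * up, assuming the remaining input needs about the same amount of memory
 * per tuple as the part we've already seen.
 */
static int
estimate_npartitions(double tuples_seen, double tuples_expected,
                     Size mem_used)
{
    double  tuples_remaining = Max(tuples_expected - tuples_seen, 1.0);
    double  mem_needed;
    int     npartitions;

    mem_needed = mem_used * (tuples_remaining / Max(tuples_seen, 1.0));

    npartitions = (int) ceil(mem_needed / (work_mem * 1024.0));

    return Max(npartitions, 1);
}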
That however comes at a cost - it's not really a memory-bounded hash
aggregate, because you explicitly allow exceeding work_mem as more
tuples for existing groups arrive.
Also, nothing guarantees that the initial estimate of how many tuples
will be aggregated is correct - it's about as unreliable as the group
count estimate. So how exactly are you going to estimate the number of
partitions?
Considering this, I doubt that being able to choose an arbitrary number
of partitions (instead of only powers of 2) is really an advantage.
Reducing the number of partitions might matter, but in my experience most
estimation errors are underestimations, because we assume independence
where in practice the columns are correlated, etc.
I agree that getting the number of batches right is important, but OTOH
with hash joins, using more smaller batches is often significantly faster
than using a single large one. So it depends.
What I think we should prevent is under-estimating the number of batches,
because in that case you have to read the whole batch, write part of it
out again and then read that part again, instead of just writing each
tuple once (into one of the two files). Reading a tuple from a batch only
to write it to another batch is not really efficient, as illustrated
below.
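To spell out why - this is how the hashjoin-style batch number is
computed (a simplified sketch modeled on ExecHashGetBucketAndBatch in
nodeHash.c, not a verbatim copy):

#include "postgres.h"

/* nbatch is always a power of two */
static inline int
batch_for_hash(uint32 hashvalue, int log2_nbuckets, int nbatch)
{
    return (int) ((hashvalue >> log2_nbuckets) & (nbatch - 1));
}

/*
 * When nbatch doubles from N to 2N, a tuple that used to fall into batch
 * b now falls either into batch b again or into batch b + N.  So if we
 * start with too few batches, the tuples already written to batch b have
 * to be read back and part of them rewritten into batch b + N - exactly
 * the read-only-to-rewrite I/O described above.
 */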
>>> But the situation here isn't comparable, because there's only one
>>> input stream. I'm pretty sure we'll want to keep track of which
>>> transition states we've spilled due to lack of memory as opposed to
>>> those which were never present in the table at all, so that we can
>>> segregate the unprocessed tuples that pertain to spilled transition
>>> states from the ones that pertain to a group we haven't begun yet.
>>
>> Why would that be necessary or useful? I don't see a reason for tracking
>> that / segregating the tuples.
>
> Suppose there are going to be three groups: A, B, C. Each is an
> array_agg(), and they're big, so only one of them will fit in work_mem at
> a time. However, we don't know that at the beginning, either because
> we don't write the code to try or because we do write that code but
> our cardinality estimates are way off; instead, we're under the
> impression that all three will fit in work_mem. So we start reading
> tuples. We see values for A and B, but we don't see any values for C
> because those all occur later in the input. Eventually, we run short
> of memory and cut off creation of new groups. Any tuples for C are
> now going to get written to a tape from which we'll later reread them.
> After a while, even that proves insufficient and we spill the
> transition state for B to disk. Any further tuples that show up for C
> will need to be written to tape as well. We continue processing and
> finish group A.
>
> Now it's time to do batch #2. Presumably, we begin by reloading the
> serialized transition state for group B. To finish group B, we must
> look at all the tuples that might possibly fall in that group. If all
> of the remaining tuples are on a single tape, we'll have to read all
> the tuples in group B *and* all the tuples in group C; we'll
> presumably rewrite the tuples that are not part of this batch onto a
> new tape, which we'll then process in batch #3. But if we took
> advantage of the first pass through the input to put the tuples for
> group B on one tape and the tuples for group C on another tape, we can
> be much more efficient - just read the remaining tuples for group B,
> not mixed with anything else, and then read a separate tape for group
> C.
OK, I understand the idea. However I don't think it makes much sense to
segregate every little group - that's a perfect fit for batching.
What might be worth segregating are the exceptionally large groups,
because those are what makes batching inefficient - for example, when a
group is larger than work_mem, it will result in a batch per group (even
if the remaining groups are tiny). But we have no way to identify such a
group, because we have no way to determine the size of its state.
What we might do is assume that the state size is proportional to the
number of tuples, and segregate only the largest groups. This can easily
be done with hashjoin-like batching - by adding ntuples, isSegregated and
skewBatchId to the AggHashEntry, and using the entry as a placeholder
(only the hash entry would be stored in the batch, while the actual state
etc. would be stored separately). This is a bit similar to how hashjoin
handles skew buckets; a rough sketch follows.
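Something along these lines (a hypothetical sketch only - the existing
per-group transition state array of AggHashEntryData in nodeAgg.c is
elided, this is not an actual patch):

#include "postgres.h"
#include "nodes/execnodes.h"

typedef struct AggHashEntryData
{
    TupleHashEntryData shared;  /* common header (existing field) */

    int64   ntuples;        /* tuples aggregated into this group so far */
    bool    isSegregated;   /* is this an exceptionally large group? */
    int     skewBatchId;    /* which batch holds the segregated state */

    /*
     * For a segregated group, only this small placeholder entry would be
     * written to the batch file; the (large) transition state itself
     * would be stored/spilled separately, so a single huge group cannot
     * blow up the whole batch.
     */

    /* ... existing per-group transition state array elided ... */
} AggHashEntryData;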
It's true that Jeff's approach handles this somewhat better, but at the
cost of not really bounding the memory consumed by HashAggregate.
Tomas