From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Shubham Khanna <khannashubham1197(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improve eviction algorithm in ReorderBuffer
Date: 2024-02-26 14:23:42
Message-ID: CAD21AoAPkLQ9C5SA8ngPzegyq1C36+hGX78FwduvM6kOyfVU9A@mail.gmail.com
Lists: pgsql-hackers
On Mon, Feb 26, 2024 at 6:43 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>
>
> On 2/26/24 07:46, Masahiko Sawada wrote:
> > On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>...
> >>
> >> overall design
> >> --------------
> >>
> >> As for the design, I agree with the approach of using a binaryheap to
> >> track transactions by size. When going over the thread history,
> >> describing the initial approach with only keeping "large" transactions
> >> above some threshold (e.g. 10%), I was really concerned that'll either
> >> lead to abrupt changes in behavior (when transactions move just around
> >> the 10%), or won't help with many common cases (with most transactions
> >> being below the limit).
> >>
> >> I was going to suggest some sort of "binning" - keeping lists for
> >> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
> >> evicting transactions from a list, i.e. based on approximate size. But
> >> if the indexed binary heap seems to be cheap enough, I think it's a
> >> better solution.
> >
> > I've also considered the binning idea. But it was not clear to me how
> > well it would work when most transactions fall into the same bin. For
> > example, if we need to free up 100kB of memory, we could end up
> > evicting 2,000 transactions consuming 50 bytes each instead of 100
> > transactions consuming 1,000 bytes each, meaning we would end up
> > serializing far more transactions. Also, I'm concerned about the cost
> > of maintaining the binning lists.
> >
>
> I don't think the list maintenance would be very costly - in particular,
> the lists would not need to be sorted by size. You're right that in some
> extreme cases we might evict the smallest transactions in the list, but I
> think on average we'd evict transactions of roughly average size, which
> seems OK for this use case.
>
> Anyway, I don't think we need to be distracted with this. I mentioned it
> merely to show it was considered, but the heap seems to work well
> enough, and in the end is even simpler because the complexity is hidden
> outside reorderbuffer.
>
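Just for reference, here is a rough standalone sketch of the binning
variant discussed above. Everything in it (the bin boundaries, the number
of bins, all names) is made up for illustration and is not taken from any
of the attached patches; it only shows why eviction from the largest
non-empty bin can fall back to evicting many tiny transactions when they
all land in the same bin:

/*
 * Illustrative sketch only: size-class bins (<1kB, 1-2kB, 2-4kB, ...)
 * with eviction from the largest non-empty bin.  All names are invented.
 */
#include <stddef.h>
#include <stdint.h>

#define NUM_SIZE_BINS 32

typedef struct BinnedTxn
{
    uint64      size;           /* bytes accumulated for this transaction */
    struct BinnedTxn *next;     /* next transaction in the same bin */
} BinnedTxn;

static BinnedTxn *size_bins[NUM_SIZE_BINS]; /* head of each bin's list */

/* Map a transaction size to a bin: roughly log2 of the size in kB. */
static int
txn_bin_index(uint64 size)
{
    int         bin = 0;
    uint64      kb = size / 1024;

    while (kb > 0 && bin < NUM_SIZE_BINS - 1)
    {
        kb >>= 1;
        bin++;
    }
    return bin;
}

/* On each size change, a transaction would be unlinked and re-linked here. */
static void
bin_insert(BinnedTxn *txn)
{
    int         bin = txn_bin_index(txn->size);

    txn->next = size_bins[bin];
    size_bins[bin] = txn;
}

/*
 * Pick a victim from the highest non-empty bin: its size is within a
 * factor of two of the true largest transaction, but if all transactions
 * sit in the same small bin we may have to evict a great many of them.
 */
static BinnedTxn *
pick_eviction_victim(void)
{
    for (int bin = NUM_SIZE_BINS - 1; bin >= 0; bin--)
    {
        if (size_bins[bin] != NULL)
            return size_bins[bin];
    }
    return NULL;                /* nothing left to evict */
}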
> >>
> >> The one thing I'm a bit concerned about is the threshold used to start
> >> using the binary heap - thresholds with binary decisions can easily
> >> lead to a "cliff" and robustness issues, i.e. an abrupt change in
> >> behavior with a significant change in runtime (e.g. you add/remove one
> >> transaction and the code takes a much more expensive path). The value
> >> (1024) seems rather arbitrary; I wonder if there's something to justify
> >> that choice.
> >
> > True. 1024 seems small to me. In my environment, I started to see a
> > big difference from around 40000 transactions, but it varies depending
> > on the environment and workload.
> >
> > I think the performance problem we're addressing doesn't normally
> > happen as long as all transactions being decoded are top-level
> > transactions; otherwise, we would also need to improve
> > ReorderBufferLargestStreamableTopTXN(). Given that, max_connections =
> > 1024 is a plausible value on some systems, and I've occasionally seen
> > such systems. On the other hand, I've seen more than 5000 in only a
> > few cases, and having more than 5000 transactions in the ReorderBuffer
> > seems unlikely without subtransactions. I think we can call that an
> > extreme case, though the number is still arbitrary.
> >
> > Or probably we can compute the threshold based on max_connections,
> > e.g., max_connections * 10. That way, we can ensure that users won't
> > incur the max-heap maintenance costs as long as they don't use
> > subtransactions.
> >
>
> Tying this to max_connections seems like an interesting option. It'd
> make this adaptive to a system. I haven't thought about the exact value
> (m_c * 10), but it seems better than arbitrary hard-coded values.
I've updated the patch accordingly, using MaxConnections for now. I've
also updated some comments and the commit messages, and added the
typedefs.list changes.
>
> >>
> >> In any case, I agree it'd be good to have some dampening factor, to
> >> reduce the risk of thrashing when a single transaction is added to or
> >> removed from the decoding.
> >>
> >>
> >> related stuff / GenerationContext
> >> ---------------------------------
> >>
> >> It's not the fault of this patch, but this reminds me I have some doubts
> >> about how the eviction interferes with using the GenerationContext for
> >> some of the data. I suspect we can easily get into a situation where we
> >> evict the largest transaction, but that doesn't actually reduce the
> >> memory usage at all, because the memory context blocks are shared with
> >> some other transactions and don't get 100% empty (so we can't release
> >> them). But it's actually worse, because GenerationContext does not even
> >> reuse this memory. So do we even gain anything by the eviction?
> >>
> >> The earlier patch versions also considered the age of the transaction,
> >> trying to evict the older ones first, and I think that was interesting.
> >> I think we may want to do something like this even with the binary heap.
> >
> > Thank you for raising this issue. This is one of the highest priority
> > items in my backlog. We've seen cases where logical decoding uses much
> > more memory than the logical_decoding_work_mem value [1][2] (e.g. it
> > used 4GB of memory even though logical_decoding_work_mem was 256kB). I
> > think that problem would still happen even with this improvement to
> > the eviction.
> >
> > I believe these are separate problems we can address, and evicting
> > large transactions first would still be the right strategy. We might
> > want to improve how we store changes in memory contexts. For example,
> > it might be worth having a per-transaction memory context so that
> > eviction can actually free memory blocks. We can discuss it in a
> > separate thread.
> >
>
> Yes, I think using a per-transaction context for large transactions might
> work. I don't think we want too many contexts, so we'd start with the
> shared context and then, at some point (when the transaction exceeds, say,
> 5% of the memory limit), switch it to a separate one.
>
> But that's a matter for a separate patch, so let's discuss elsewhere.
+1
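As a rough illustration of that idea, something like the sketch below
(assuming the usual reorderbuffer.c includes). It is only a sketch: the
txn->change_context field is hypothetical, the 5% cutoff is just the
number floated above, and an aset context is used here purely for
illustration since the allocator choice is exactly what we'd discuss in
that separate thread.

static MemoryContext
ChangeContextForTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    /* logical_decoding_work_mem is in kB */
    Size        limit = (Size) logical_decoding_work_mem * 1024;

    /*
     * Once a transaction grows past ~5% of the memory limit, give it its
     * own context (hypothetical txn->change_context field), so evicting
     * it can release whole memory blocks instead of leaving half-empty
     * blocks behind in the shared context.
     */
    if (txn->change_context == NULL && txn->size >= limit / 20)
        txn->change_context =
            AllocSetContextCreate(rb->context,
                                  "per-transaction changes",
                                  ALLOCSET_DEFAULT_SIZES);

    return txn->change_context != NULL ? txn->change_context :
        rb->change_context;
}

Evicting such a transaction could then release its memory wholesale by
deleting that private context.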
>
> >>
> >> For example, a system may be doing a lot of eviction / spilling with
> >> logical_decoding_work_mem=64MB, but setting 128MB may completely
> >> eliminate that. Of course, if there are large transactions, this may not
> >> be possible (the GUC would have to exceed RAM). But I don't think that's
> >> very common, the incidents that I've observed were often resolved by
> >> bumping the logical_decoding_work_mem by a little bit.
> >>
> >> I wonder if there's something we might do to help users to tune this. We
> >> should be able to measure the "peak" memory usage (how much memory we'd
> >> need to not spill), so maybe we could log that as a WARNING, similarly
> >> to checkpoints - there we only log "checkpoints too frequent, tune WAL
> >> limits", but perhaps we might do more here? Or maybe we could add the
> >> watermark to the system catalog?
> >
> > Interesting ideas.
> >
> > Statistics such as spill_count, shown in the pg_stat_replication_slots
> > view, already give users a hint to increase logical_decoding_work_mem.
> > In addition to that, having the high water mark in the view is an
> > interesting idea.
> >
>
> The spill statistics are useful, but I'm not sure it can answer the main
> question:
>
> How high would the memory limit need to be to not spill?
>
> Maybe there's something we can measure / log to help with this.
Right. I like the idea of the high watermark. The
pg_stat_replication_slots view would be the place to expose such
information. Since the reorder buffer evicts or streams transactions
based on logical_decoding_work_mem anyway, we would probably need to
compute the maximum amount of data that would have been in the reorder
buffer at any one point in time, assuming no transactions had been
evicted or streamed.
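Something like the following minimal sketch is what I have in mind. The
names are made up; the point is that the counter deliberately ignores
spilling and streaming, so its peak reflects the memory that would have
been needed to avoid them.

#include <stddef.h>

typedef struct ReorderBufferMemStats
{
    size_t      current_total;  /* bytes of changes added minus bytes
                                 * released at commit/abort; spilling and
                                 * streaming deliberately do NOT subtract */
    size_t      peak_total;     /* high watermark of current_total */
} ReorderBufferMemStats;

/* Called whenever a change is queued, regardless of later spilling. */
static inline void
track_change_added(ReorderBufferMemStats *stats, size_t sz)
{
    stats->current_total += sz;
    if (stats->current_total > stats->peak_total)
        stats->peak_total = stats->current_total;   /* new peak */
}

/* Called only when a transaction is fully replayed or discarded. */
static inline void
track_txn_finished(ReorderBufferMemStats *stats, size_t txn_total_sz)
{
    stats->current_total -= txn_total_sz;
}

The peak_total value is what could then be surfaced through
pg_stat_replication_slots as a sizing hint for logical_decoding_work_mem.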
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
  v7-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch (application/octet-stream, 16.4 KB)
  v7-0001-Make-binaryheap-enlargeable.patch (application/octet-stream, 3.1 KB)
  v7-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch (application/octet-stream, 16.1 KB)