Re: badly scaling performance with appending to bytea

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Gary Cowell <gary(dot)cowell(at)gmail(dot)com>
Cc: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: badly scaling performance with appending to bytea
Date: 2018-03-21 13:09:27
Message-ID: CAFj8pRBZ3FNTFtKo0+r+uPBKtVgnNUbiLM=y3qB0SOKfdc2c_A@mail.gmail.com
Lists: pgsql-performance

2018-03-21 14:04 GMT+01:00 Gary Cowell <gary(dot)cowell(at)gmail(dot)com>:

> Thank you Pavel for those ideas.
>
> I should probably have mentioned that we don't have access to the file
> system on the PostgreSQL server, as it's provided by the Amazon AWS RDS
> service.
>
> These functions look good when you can push the file to be loaded into
> the database file system.
>
> I'll see if it's possible to do this on the AWS PostgreSQL RDS service,
> but this sort of thing is usually not possible there.
>

The lo API doesn't need file access on the server.

https://www.postgresql.org/docs/9.2/static/lo-interfaces.html
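
For example, psql's \lo_import is built on this client-side interface: it
reads the file on the client and streams it over the connection, so nothing
has to be copied onto the RDS host first. A minimal round-trip sketch, not
taken from the thread, assuming a server with the built-in lo_get() (9.4 or
later); the path, the oid and the destination table build_attachment_final
are only placeholders:

\lo_import '/path/to/input.bin'
-- psql prints the oid of the new large object, e.g. "lo_import 16442"

-- materialize the whole large object as a single bytea value
insert into build_attachment_final (chunk)
select lo_get(16442);

-- the large object is no longer needed once the bytea copy is stored
select lo_unlink(16442);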

You can also use the lo_write function directly.
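
If the data is produced in chunks on the client rather than sitting in one
file, the server-side relatives of that call (lowrite(), and lo_put() from
9.4 on) can append each chunk to a large object without rewriting the bytes
already stored, which is what makes repeated chunk = chunk || newdata
updates so expensive. A rough sketch along those lines, reusing the
build_attachment staging table from the quoted message below;
build_attachment_final is again a placeholder:

do $$
declare
  loid oid;
  pos  bigint := 0;
  r    record;
begin
  -- start from a new, empty large object
  loid := lo_from_bytea(0, '\x'::bytea);

  -- append every staged chunk at the current end of the large object;
  -- build_attachment is the on-commit-drop temp table from the quoted message
  for r in select chunk from build_attachment order by seq loop
    perform lo_put(loid, pos, r.chunk);
    pos := pos + octet_length(r.chunk);
  end loop;

  -- convert once to bytea and store it (build_attachment_final is a placeholder)
  insert into build_attachment_final (chunk) values (lo_get(loid));

  -- the large object itself is no longer needed
  perform lo_unlink(loid);
end;
$$;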

> On 21 March 2018 at 12:59, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> >
> >
> > 2018-03-21 13:56 GMT+01:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
> >>
> >>
> >>
> >> 2018-03-21 13:03 GMT+01:00 Gary Cowell <gary(dot)cowell(at)gmail(dot)com>:
> >>>
> >>> We are trying to implement PostgreSQL code to load a large object
> >>> into a PostgreSQL bytea column in chunks, to avoid loading the whole
> >>> file into memory in the client.
> >>>
> >>> First attempt was to do
> >>>
> >>> update build_attachment set chunk = chunk || newdata ;
> >>>
> >>> This did not scale, and got significantly slower after 4000-5000
> >>> updates.
> >>>
> >>> The chunks are 4K in size, and I'm testing with a 128MB input file,
> >>> requiring 32,774 chunk updates.
> >>>
> >>> Next, I tried creating an aggregate (taken from Stack Overflow), thus:
> >>>
> >>> CREATE AGGREGATE bytea_agg(bytea) (SFUNC=byteacat,STYPE=bytea);
> >>>
> >>> I changed the code to insert the chunks into a temporary table:
> >>>
> >>> create temporary table build_attachment (seq bigserial primary key,
> >>> chunk bytea ) on commit drop;
> >>>
> >>> We then insert our 4K chunks into this, which takes very little time
> >>> (20 seconds for the 32,774 inserts).
> >>>
> >>> Here's an example though of trying to select the aggregate:
> >>>
> >>> gary=> \timing
> >>> Timing is on.
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 4000 \g output
> >>> Time: 13372.843 ms
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 8000 \g output
> >>> Time: 54447.541 ms
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 16000 \g output
> >>> Time: 582219.773 ms
> >>>
> >>> So those partial aggregates completed in somewhat acceptable times but
> >>> ...
> >>>
> >>> gary=> select bytea_agg(chunk order by seq) from build_attachment
> >>> where seq < 32000 \g output
> >>> This one hadn't completed after an hour; the PostgreSQL backend
> >>> process for my connection goes to 100% CPU and stays there, while
> >>> using little RAM and, oddly, doing little I/O.
> >>>
> >>> EXPLAINing these aggregate selects doesn't show anything useful.
> >>>
> >>> Am I doomed to not be able to update a bytea this way? Is there some
> >>> way I can tune this?
> >>>
> >>
> >> bytea is an immutable object without preallocation, so repeatedly
> >> updating a large value is very expensive.
> >>
> >> I think that using the LO API and then transforming to bytea will be
> >> much more effective.
> >>
> >> \lo_import path
> >>
> >> you can use this function:
> >>
> >> CREATE OR REPLACE FUNCTION attachment_to_bytea(attachment oid)
> >> RETURNS bytea AS $$
> >> DECLARE
> >>   fd integer;
> >>   size integer;
> >> BEGIN
> >>   fd := lo_open(attachment, 262144);  -- 262144 = INV_READ
> >>   size := lo_lseek(fd, 0, 2);         -- seek to the end to learn the size
> >>   PERFORM lo_lseek(fd, 0, 0);         -- rewind to the start
> >>   RETURN loread(fd, size);            -- read the whole object as bytea
> >> EXCEPTION WHEN undefined_object THEN
> >>   PERFORM lo_close(fd);
> >>   RETURN NULL;
> >> END;
> >> $$ LANGUAGE plpgsql STRICT SECURITY DEFINER
> >>    SET search_path = 'pg_catalog';
> >>
> >> Importing a file of about 44 MB took only a few seconds.
> >
> >
> > There is also a native function, lo_get:
> >
> > https://www.postgresql.org/docs/current/static/lo-funcs.html
> >
> >
> >>
> >> Regards
> >>
> >> Pavel
> >>
> >
>
>
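
If the server predates 9.4 (the interface documentation linked above is the
9.2 version) and lo_get() is therefore unavailable, the attachment_to_bytea()
helper quoted above fills the same role. A usage sketch with the same
placeholder path, oid and destination table as before:

\lo_import '/path/to/input.bin'
-- suppose psql again reports "lo_import 16442"

-- convert the imported large object to bytea with the quoted helper
insert into build_attachment_final (chunk)
select attachment_to_bytea(16442);

-- drop the large object once the bytea copy is stored
select lo_unlink(16442);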
