From: Joe Carlson <jwcarlson(at)lbl(dot)gov>
To: Rob Sargent <robjsargent(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: TEXT column > 1Gb
Date: 2023-04-12 14:59:13
Message-ID: 0A8EB6F5-57BC-4519-8E9A-DBB809CC09D5@lbl.gov
Lists: pgsql-general
I’m curious what you learned. I’ve been tripping over the buffer allocation issue both when splitting input text into chunks and when aggregating chunks in selects, so I’ve decided that I need to move this to the client side.
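Roughly what I have in mind (a sketch only; associated is the chunked table quoted below, and $1, $2, … are ordinary bind parameters) is to never let any single value or parameter approach the 1Gb limit:

-- the client computes each ~1Mb chunk locally and sends it as its own bind
-- parameter, one INSERT per chunk, so no protocol message gets near 1Gb
INSERT INTO associated (key_id, chunk, text_start, text_end, text_chunk)
VALUES ($1, $2, $3, $4, $5);

-- and on the way out, fetch the chunks individually and concatenate on the
-- client, rather than a string_agg() that would rebuild the >1Gb value server side
SELECT chunk, text_chunk FROM associated WHERE key_id = $1 ORDER BY chunk;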
The use case is genomics. Extracting substrings is common. So going to chunked storage makes sense.
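The nice part is that a substring lookup can stay server side, since both the arguments and the result are small and only the chunks overlapping the requested range need to be touched. Something like this untested sketch (the function name is a placeholder; it assumes 1Mb chunks numbered from 0 and 1-based character offsets):

CREATE FUNCTION getTextRange(p_key_id INTEGER, p_start INTEGER, p_len INTEGER) RETURNS TEXT AS $$
DECLARE
  chunk_size  CONSTANT INTEGER := 1000000;
  first_chunk INTEGER := (p_start - 1) / chunk_size;
  last_chunk  INTEGER := (p_start + p_len - 2) / chunk_size;
  joined TEXT;
BEGIN
  -- stitch together only the chunks overlapping [p_start, p_start + p_len - 1]
  SELECT string_agg(text_chunk, '' ORDER BY chunk) INTO joined
    FROM associated
   WHERE key_id = p_key_id
     AND chunk BETWEEN first_chunk AND last_chunk;
  -- trim to the exact range; joined starts at character first_chunk * chunk_size + 1
  RETURN substr(joined, p_start - first_chunk * chunk_size, p_len);
END;
$$ LANGUAGE plpgsql;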
And we have a menagerie of code to deal with. Legacy Perl for loading. Clients in Perl, Java, node, python. A whole zoo.
Thanks
> On Apr 11, 2023, at 10:51 AM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>
>
>> On 4/11/23 11:41, Joe Carlson wrote:
>> Hello,
>>
>> I’ve recently encountered the issue of trying to insert more than 1 Gb into a TEXT column. While the docs say TEXT is unlimited length, I had been unaware of the 1Gb buffer size limitations.
>>
>> We can debate whether or not saving something this big in a single column is a good idea (spoiler: it isn’t, but it’s not my design and, in fairness, it was not anticipated when the schema was designed). Regardless, I’d like to implement something that is not a major disruption and try to keep the mods on the server side. My first idea is to have a chunked associated table (in pseudo code)
>>
>> CREATE TABLE associated(key_id integer references main_table(key_id), chunk integer, text_start integer, text_end integer, text_chunk TEXT);
>>
>> And define functions for inserting and selecting by dividing into 1Mb chunks
>>
>> CREATE FUNCTION insertText(INTEGER,TEXT) RETURNS INTEGER AS $$
>> DECLARE
>> chunk INTEGER := 0;
>> key_id ALIAS for $1;
>> the_text ALIAS for $2;
>> text_chunk TEXT;
>> BEGIN
>> LOOP
>> text_chunk := substr(the_text,chunk*1000000+1,1000000);
>> IF length(text_chunk) = 0 THEN
>> EXIT;
>> END IF;
>> INSERT INTO associated(key_id,chunk,text_start,text_end,text_chunk) VALUES (key_id,chunk,chunk*1000000,(chunk*1000000+length(text_chunk)),text_chunk);
>> chunk := chunk + 1;
>> END LOOP;
>> RETURN chunk;
>> END;
>> $$ LANGUAGE plpgsql;
>>
>> This apparently runs into the same buffer size issue: I get an ‘invalid message length’ in the log file and the insert fails. I can see from adding notices in the code that I never enter the LOOP; I assume having function arguments > 1Gb is also a bad thing.
>>
>> I’d like to continue to keep the modifications on the server side. And I’d like to believe someone else has had this problem before. Any suggestions other than having the client do the chunking? Can I use a different language binding and get around the argument length limitations?
>>
>> Thanks
>>
>>
>>
> I've hit this same limitation in Java (when writing to the db). What is your stack in this case? Not sure my solution applies.
>
>