From: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> |
---|---|
To: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Joe Conway <mail(at)joeconway(dot)com> |
Subject: | Re: Faster methods for getting SPI results (460% improvement) |
Date: | 2017-01-24 03:23:07 |
Message-ID: | 4f11b9c9-4b2a-0552-faa7-24d255173679@BlueTreble.com |
Lists: | pgsql-hackers |
On 1/5/17 9:50 PM, Jim Nasby wrote:
> The * on that is there's something odd going on where plpython starts
> out really fast at this, then gets 100% slower. I've reached out to some
> python folks about that. Even so, the overall results from a quick test
> on my laptop are (IMHO) impressive:
>
>                   Old Code      New Code   Improvement
> Pure SQL          2 sec         2 sec
> plpython          12.7-14 sec   4-10 sec   ~1.3-3x
> plpython - SQL    10.7-12 sec   2-8 sec    ~1.3-6x
>
> Pure SQL is how long an equivalent query takes to run with just SQL.
> plpython - SQL is simply the raw python times minus the pure SQL time.
I finally got all the kinks worked out and did some testing with Python
3. Performance for my test [1] improved ~460% when returning a dict of
lists (as opposed to the current list of dicts). Based on previous
testing, I expect that using this method to return a list of dicts will
be about 8% slower. The inconsistency in results on 2.7 has to do with
how Python 2 handles ints.
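To make the two result shapes concrete, here is a minimal pure-Python sketch. The column names come from the test in [1]; the sample rows and the variable names are made up for illustration, and nothing here calls plpy or says anything about where the speedup comes from.

```python
# Hypothetical sample data standing in for an SPI result set.
columns = ["some_table_id", "some_field_name", "some_other_field_name"]
rows = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]

# Row-wise shape ("list of dicts"): one dict per row.
list_of_dicts = [dict(zip(columns, row)) for row in rows]

# Columnar shape ("dict of lists"): one list per column, keyed by column name.
dict_of_lists = {
    name: [row[i] for row in rows]
    for i, name in enumerate(columns)
}

# Row count in each shape; the second form is what the test in [1] returns.
row_count_row_wise = len(list_of_dicts)
row_count_columnar = len(dict_of_lists["some_table_id"])
```

Both shapes carry the same values; only the nesting order (row then column, or column then row) differs.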
Someone who's familiar with PL/Perl should take a look at this and see
if it would apply there. I've attached the SPI portion of this patch.
I think the last step here is to figure out how to support switching
between the current behavior and the "columnar" behavior of a dict of
lists. I believe the best way to do that is to add two optional
arguments to the execution functions: container=[] and members={}, and
then copy those to produce the output objects. That means you can get
the new behavior by doing something like:
plpy.execute('...', container={}, members=[])
Or, more interesting, you could do:
plpy.execute('...', container=pandas.DataFrame, members=pandas.Series)
since that's what a lot of people are going to want anyway.
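A rough pure-Python model of the "copy the prototypes" idea might look like the sketch below. `build_result` is a hypothetical helper, not part of plpy or the attached patch, and it only handles list and dict instances as prototypes; passing classes such as the pandas ones would need different handling that is not shown here.

```python
import copy


def build_result(columns, rows, container=None, members=None):
    """Hypothetical model of the proposed container/members arguments.

    Defaults mirror the proposal: container=[] and members={}.
    The prototypes are copied, never mutated.
    """
    result = copy.copy([] if container is None else container)
    member_proto = {} if members is None else members

    if isinstance(result, list):
        # Row-wise: each row becomes a copy of the member prototype (a dict).
        for row in rows:
            member = copy.copy(member_proto)
            member.update(zip(columns, row))
            result.append(member)
    elif isinstance(result, dict):
        # Columnar: each column becomes a copy of the member prototype (a list).
        for name in columns:
            result[name] = copy.copy(member_proto)
        for row in rows:
            for name, value in zip(columns, row):
                result[name].append(value)
    else:
        raise TypeError("this sketch only supports list or dict containers")
    return result


cols = ["a", "b"]
data = [(1, 2), (3, 4)]
row_wise = build_result(cols, data)                            # current behavior
columnar = build_result(cols, data, container={}, members=[])  # new behavior
```

In this model the columnar call must pass a list-like `members` prototype; combining a dict container with the default dict member would fail on `append`.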
In the future we could also add a GUC to change the default behavior.
Any concerns with that approach?
[1]:
> d = plpy.execute('SELECT s AS some_table_id, s AS some_field_name, s AS some_other_field_name FROM generate_series(1,{}) s'.format(iter) )
> return len(d['some_table_id'])
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Attachment | Content-Type | Size |
---|---|---|
spi_callback.patch | text/plain | 7.7 KB |