Quick Links

Re: Out of Memory and Configuration Problems (Big Computer)

From:	Tom Wilcox <hungrytom(at)googlemail(dot)com>
To:
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Out of Memory and Configuration Problems (Big Computer)
Date:	2010-05-28 18:39:56
Message-ID:	AANLkTimQR0EbXAeyXN3usdIQOrHos4MzHllKRcwWLRt1@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Oops. Sorry about that.

I am having this problem with multiple queries however I am confident that a
fair number may involve the custom plpython "normalise" function which I
have made myself. I didn't think it would be complicated enough to produce a
memory problem.. here it is:

-- Normalises common address words (i.e. 'Ground' maps to 'grd')
CREATE OR REPLACE FUNCTION normalise(s text) RETURNS text AS $$
ADDR_FIELD_DELIM = ' '

# Returns distinct list without null or empty elements
def distinct_str(list):
seen = set()
return [x for x in list if x not in seen and not seen.add(x) and x!=None
and len(x)>0]

# normalise common words in given address string
def normalise(match_data):
if match_data==None: return ''
import re
# Tokenise
toks = distinct_str(re.split(r'\s', match_data.lower()))
out = ''
for tok in toks:
## full word replace
if tok == 'house' : out += 'hse'+ADDR_FIELD_DELIM
elif tok == 'ground' : out += 'grd'+ADDR_FIELD_DELIM
elif tok == 'gnd' : out += 'grd'+ADDR_FIELD_DELIM
elif tok == 'front' : out += 'fnt'+ADDR_FIELD_DELIM
elif tok == 'floor' : out += 'flr'+ADDR_FIELD_DELIM
elif tok == 'floors' : out += 'flr'+ADDR_FIELD_DELIM
elif tok == 'flrs' : out += 'flr'+ADDR_FIELD_DELIM
elif tok == 'fl' : out += 'flr'+ADDR_FIELD_DELIM
elif tok == 'basement' : out += 'bst'+ADDR_FIELD_DELIM
elif tok == 'subbasement' : out += 'sbst'+ADDR_FIELD_DELIM
elif tok == 'bsmt' : out += 'bst'+ADDR_FIELD_DELIM
elif tok == 'lbst' : out += 'lower bst'+ADDR_FIELD_DELIM
elif tok == 'street' : out += 'st'+ADDR_FIELD_DELIM
elif tok == 'road' : out += 'rd'+ADDR_FIELD_DELIM
elif tok == 'lane' : out += 'ln'+ADDR_FIELD_DELIM
elif tok == 'rooms' : out += 'rm'+ADDR_FIELD_DELIM
elif tok == 'room' : out += 'rm'+ADDR_FIELD_DELIM
elif tok == 'no' : pass
elif tok == 'number' : pass
elif tok == 'and' : out += '&'+ADDR_FIELD_DELIM
elif tok == 'rear' : out += 'rr'+ADDR_FIELD_DELIM
elif tok == 'part' : out += 'pt'+ADDR_FIELD_DELIM
elif tok == 'south' : out += 's'+ADDR_FIELD_DELIM
elif tok == 'sth' : out += 's'+ADDR_FIELD_DELIM
elif tok == 'north' : out += 'n'+ADDR_FIELD_DELIM
elif tok == 'nth' : out += 'n'+ADDR_FIELD_DELIM
elif tok == 'west' : out += 'w'+ADDR_FIELD_DELIM
elif tok == 'wst' : out += 'w'+ADDR_FIELD_DELIM
elif tok == 'east' : out += 'e'+ADDR_FIELD_DELIM
elif tok == 'est' : out += 'e'+ADDR_FIELD_DELIM
elif tok == 'first' : out += '1st'+ADDR_FIELD_DELIM
elif tok == 'second' : out += '2nd'+ADDR_FIELD_DELIM
elif tok == 'third' : out += '3rd'+ADDR_FIELD_DELIM
elif tok == 'fourth' : out += '4th'+ADDR_FIELD_DELIM
elif tok == 'fifth' : out += '5th'+ADDR_FIELD_DELIM
elif tok == 'sixth' : out += '6th'+ADDR_FIELD_DELIM
elif tok == 'seventh' : out += '7th'+ADDR_FIELD_DELIM
elif tok == 'eighth' : out += '8th'+ADDR_FIELD_DELIM
elif tok == 'ninth' : out += '9th'+ADDR_FIELD_DELIM
elif tok == 'tenth' : out += '10th'+ADDR_FIELD_DELIM
elif tok == 'eleventh' : out += '11th'+ADDR_FIELD_DELIM
elif tok == 'twelfth' : out += '12th'+ADDR_FIELD_DELIM
elif tok == 'thirteenth' : out += '13th'+ADDR_FIELD_DELIM
elif tok == 'fourteenth' : out += '14th'+ADDR_FIELD_DELIM
elif tok == 'fifteenth' : out += '15th'+ADDR_FIELD_DELIM
elif tok == 'sixteenth' : out += '16th'+ADDR_FIELD_DELIM
elif tok == 'seventeenth' : out += '17th'+ADDR_FIELD_DELIM
elif tok == 'eighteenth' : out += '18th'+ADDR_FIELD_DELIM
elif tok == 'nineteenth' : out += '19th'+ADDR_FIELD_DELIM
elif tok == 'twentieth' : out += '20th'+ADDR_FIELD_DELIM
# numbers 0 - 20
elif tok == 'one' : out += '1'+ADDR_FIELD_DELIM
elif tok == 'two' : out += '2'+ADDR_FIELD_DELIM
elif tok == 'three' : out += '3'+ADDR_FIELD_DELIM
elif tok == 'four' : out += '4'+ADDR_FIELD_DELIM
elif tok == 'five' : out += '5'+ADDR_FIELD_DELIM
elif tok == 'six' : out += '6'+ADDR_FIELD_DELIM
elif tok == 'seven' : out += '7'+ADDR_FIELD_DELIM
elif tok == 'eight' : out += '8'+ADDR_FIELD_DELIM
elif tok == 'nine' : out += '9'+ADDR_FIELD_DELIM
elif tok == 'ten' : out += '10'+ADDR_FIELD_DELIM
elif tok == 'eleven' : out += '11'+ADDR_FIELD_DELIM
elif tok == 'twelve' : out += '12'+ADDR_FIELD_DELIM
elif tok == 'thirteen' : out += '13'+ADDR_FIELD_DELIM
elif tok == 'fourteen' : out += '14'+ADDR_FIELD_DELIM
elif tok == 'fifteen' : out += '15'+ADDR_FIELD_DELIM
elif tok == 'sixteen' : out += '16'+ADDR_FIELD_DELIM
elif tok == 'seventeen' : out += '17'+ADDR_FIELD_DELIM
elif tok == 'eighteen' : out += '18'+ADDR_FIELD_DELIM
elif tok == 'nineteen' : out += '19'+ADDR_FIELD_DELIM
elif tok == 'twenty' : out += '20'+ADDR_FIELD_DELIM
# town dictionary items
elif tok == 'borough' : pass
elif tok == 'city' : pass
elif tok == 'of' : pass
elif tok == 'the' : pass
# a few extras (from looking at voa)
elif tok == 'at' : pass
elif tok == 'incl' : pass
elif tok == 'inc' : pass
else: out += tok+ADDR_FIELD_DELIM
return out

return normalise(s)
$$ LANGUAGE plpythonu;

Here's the create script for the table from pgAdmin (I hope that will be
good enough instead of \d as I can't do that right now)..

-- Table: nlpg.match_data

-- DROP TABLE nlpg.match_data;

CREATE TABLE nlpg.match_data
(
premise_id integer,
usrn bigint,
org text,
sao text,
"level" text,
pao text,
"name" text,
street text,
town text,
pc postcode,
postcode text,
match_data_id integer NOT NULL DEFAULT
nextval('nlpg.match_data_match_data_id_seq1'::regclass),
addr_str text,
tssearch_name tsvector,
CONSTRAINT match_data_pkey1 PRIMARY KEY (match_data_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE nlpg.match_data OWNER TO postgres;
ALTER TABLE nlpg.match_data ALTER COLUMN "name" SET STATISTICS 10000;

-- Index: nlpg.index_match_data_mid

-- DROP INDEX nlpg.index_match_data_mid;

CREATE INDEX index_match_data_mid
ON nlpg.match_data
USING btree
(match_data_id);

-- Index: nlpg.index_match_data_pc

-- DROP INDEX nlpg.index_match_data_pc;

CREATE INDEX index_match_data_pc
ON nlpg.match_data
USING btree
(pc);

-- Index: nlpg.index_match_data_pid

-- DROP INDEX nlpg.index_match_data_pid;

CREATE INDEX index_match_data_pid
ON nlpg.match_data
USING btree
(premise_id);

-- Index: nlpg.index_match_data_tssearch_name

-- DROP INDEX nlpg.index_match_data_tssearch_name;

CREATE INDEX index_match_data_tssearch_name
ON nlpg.match_data
USING gin
(tssearch_name);

-- Index: nlpg.index_match_data_usrn

-- DROP INDEX nlpg.index_match_data_usrn;

CREATE INDEX index_match_data_usrn
ON nlpg.match_data
USING btree
(usrn);

As you can see, no FKs or triggers..

I am running: EXPLAIN ANALYZE UPDATE nlpg.match_data SET org = org;

However, as it should take around 90mins (if it is linear) then I thought I
would send this now and follow up with the results once it finishes. (Has
taken 2hours so far..)

Thanks very much for your help.

Tom

On 28 May 2010 17:54, "Bill Moran" <wmoran(at)potentialtech(dot)com> wrote:

In response to Tom Wilcox <hungrytom(at)googlemail(dot)com>:

> In addition, I have discovered that the update query that runs on each row
> of a 27million row ta...
You're not liable to get shit for answers if you omit the mailing list from
the conversation, especially since I know almost nothing about tuning
PostgreSQL installed on Windows.

Are there multiple queries having this problem? The original query didn't
have normalise() in it, and I would be highly suspicious that a custom
function may have a memory leak or other memory-intensive side-effects.
What is the code for that function?

For example, does:
UPDATE nlpg.match_data SET org = org WHERE match_data_id;
finish in a reasonable amount of time or exhibit the same out of memory
problem?

It'd be nice to see a \d on that table ... does it have any triggers or
cascading foreign keys?

And stop

Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/<http://people.collaborativefusion.com/%7Ewmoran/>

In response to

Re: Out of Memory and Configuration Problems (Big Computer) at 2010-05-28 16:54:49 from Bill Moran

Responses

Re: Out of Memory and Configuration Problems (Big Computer) at 2010-05-28 23:34:33 from Alban Hertroys
Re: Out of Memory and Configuration Problems (Big Computer) at 2010-05-29 00:43:48 from Tom Wilcox

Browse pgsql-general by date

	From	Date	Subject
Next Message	erobles	2010-05-28 18:45:38	Re: there is a way to deactivate type validation on 8.3.1????
Previous Message	Bill Moran	2010-05-28 18:34:31	Re: there is a way to deactivate type validation on 8.3.1????