Re: Searching for Duplicates and Hosed the System

From: Bill Thoen <bthoen(at)gisnet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Searching for Duplicates and Hosed the System
Date: 2007-08-20 23:08:35
Message-ID: 46CA1EF3.60403@gisnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Something is really screwy here. I tried what looked like a simpler task
than I tried when I started this message thread. The only unusual issue
here is that the table compliance_2006 is rather big (over 18 million
records). The table ers_regions is pretty small (about 3100 records) and
all the WHERE fields are indexed (except e.region). Here's the
not-too-complicated SQL:
SELECT a.* FROM compliance_2006 a, ers_regions e
WHERE a.fips_st_cd=e.fips_st
AND a.fips_cnty_cd=e.fips_cou
AND e.region=1;

I ran VACUUM ANALYZE just before I launched this and there were no other
postgress jobs running. I'm the only user as well. I also ran EXPLAIN
prior to the run and got this:

Nested Loop (cost=11.71..28800.34 rows=7219 width=584)
-> Seq Scan on ers_regions e (cost=0.00..71.90 rows=16 width=28)
Filter: (region = 1)
-> Bitmap Heap Scan on compliance_2006 a (cost=11.71..1788.76
rows=451 width=584)
Recheck Cond: ((a.fips_st_cd = "outer".fips_st) AND
(a.fips_cnty_cd = "outer".fips_cou))
-> Bitmap Index Scan on key_tract (cost=0.00..11.71 rows=451
width=0)
Index Cond: ((a.fips_st_cd = "outer".fips_st) AND
(a.fips_cnty_cd = "outer".fips_cou))
(7 rows)

So I launched it to see what would happen. This resulted in a very busy
disk drive for about an hour, and one by one things started dying on my
FC6 Linux machine. First the mouse disappeared, then the cursor in the
terminal window where I was running psql vanished, the clock stopped,
and the keyboard stopped responding. Meanwhile, the disk drive thrashed
on. Finally the screen saver kicked in and shortly thereafter I logged
back in only to see the word "Killed" sitting there on the last line and
all my machine's services were running again. Just no data output.

I looked in the log and saw this:
LOG: transaction ID wrap limit is 1073746500, limited by database
"postgres"
LOG: transaction ID wrap limit is 1073746500, limited by database
"postgres"
LOG: could not send data to client: Broken pipe

I guess this is what killed it, but does it make sense that such a
simple request wrapped around the transaction limit? Is the bad behavior
of my machine while running this query consistent with an error like
this or is the machine broken and that's what caused Postgresql to
crater? What should I set the wrap limit to? What suggestions would you
make for tracking down the exact problem and fixing it?

Any help would be appreciated....

- Bill Thoen

Tom Lane wrote:
> Bill Thoen <bthoen(at)gisnet(dot)com> writes:
>
>> Tom, here's the "explain" results: Does this help explain what went wrong?
>> (And yes, I think there will be a *lot* of groups.)
>>
>
>
>> explain select count(*) as count,fips_st_cd, fips_cnty_cd, farm_nbr,
>> tract_nbr, field_nbr,crop_cd,crop_status_cd,practice_cd,seq_nbr from
>> compliance_2006 group by fips_st_cd, fips_cnty_cd, farm_nbr, tract_nbr,
>> field_nbr,crop_cd,crop_status_cd,practice_cd,seq_nbr order by 1 desc;
>>
>
>
>> QUERY PLAN
>> --------------------------------------------------------
>> Sort (cost=15119390.46..15123902.54 rows=1804832 width=160)
>> Sort Key: count(*)
>> -> GroupAggregate (cost=13782933.29..14301822.43 rows=1804832
>> width=160)
>> -> Sort (cost=13782933.29..13828054.08 rows=18048318 width=160)
>> Sort Key: fips_st_cd, fips_cnty_cd, farm_nbr, tract_nbr,
>> field_nbr, crop_cd, crop_status_cd, practice_cd, seq_nbr
>> -> Seq Scan on compliance_2006 (cost=0.00..1039927.18
>> rows=18048318 width=160)
>> (6 rows)
>>
>
> Hmm ... no, actually, that shows the planner doing the right thing for
> lotsa groups: picking GroupAggregate instead of HashAggregate. The
> estimated number of groups is 1804832, which might or might not have
> much to do with reality but in any case seems enough to keep it away
> from HashAggregate.
>
> Do you have autovacuum running, or a scheduled cronjob that runs ANALYZE
> or VACUUM ANALYZE? The only theory I can think of at this point is that
> your database statistics are more correct now than they were when you
> had the problem.
>
> If you try the query again, does it behave more sanely?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org/
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Glaesemann 2007-08-20 23:09:49 Re: Using oid as pkey
Previous Message Joshua D. Drake 2007-08-20 22:56:14 Re: Using oid as pkey