Re: Join runs for > 10 hours and then fills up >1.3TB of disk space

From: "Claus Guttesen" <kometen(at)gmail(dot)com>
To: "kevin kempter" <kevin(at)kevinkempterllc(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Join runs for > 10 hours and then fills up >1.3TB of disk space
Date: 2008-05-16 07:15:14
Message-ID: b41c75520805160015s1d85e0a1ge6b2fa896426dc@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> I have a table with 9,961,914 rows in it (see the describe of
> bigtab_stats_fact_tmp14 below)
>
> I also have a table with 7,785 rows in it (see the describe of xsegment_dim
> below)
>
> I'm running the join shown below and it takes > 10 hours and eventually runs
> out of disk space on a 1.4TB file system
>
> I've included below a describe of both tables, the join and an explain plan,
> any help / suggestions would be much appreciated !
>
> I need to get this beast to run as quickly as possible (without filling up
> my file system)
>
>
> Thanks in advance...

What version of postgresql are you using? According to
http://www.postgresql.org/docs/8.2/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY
you may benefit from adjusting work_mem.

You also index segment_srcid (in table xsegment_dim) but if you search
for NULL and you have enough of those it defaults to a seq. scan:

Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 width=40)
> Filter: (segment_srcid IS NULL)

Maby you could insert some default value into segment_srcid (some
arbitrary large numbers) instead of NULL and then search for values
greater than??

You could also try to lower random_page_cost from default to 2.

> select
> f14.xpublisher_dim_id,
> f14.xtime_dim_id,
> f14.xlocation_dim_id,
> f14.xreferrer_dim_id,
> f14.xsite_dim_id,
> f14.xsystem_cfg_dim_id,
> f14.xaffiliate_dim_id,
> f14.customer_id,
> pf_dts_id,
> episode_id,
> sessionid,
> bytes_received,
> bytes_transmitted,
> total_played_time_sec,
> segdim.xsegment_dim_id as episode_level_segid
> from
> bigtab_stats_fact_tmp14 f14,
> xsegment_dim segdim
> where
> f14.customer_id = segdim.customer_srcid
> and f14.show_id = segdim.show_srcid
> and f14.season_id = segdim.season_srcid
> and f14.episode_id = segdim.episode_srcid
> and segdim.segment_srcid is NULL;
>
>
>
>
>
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118)
> Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND
> (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid =
> f14.show_id) AND (segdim.season_srcid = f14.season_id))
> -> Sort (cost=1570.35..1579.46 rows=3643 width=40)
> Sort Key: segdim.episode_srcid, segdim.customer_srcid, segdim.show_srcid,
> segdim.season_srcid
> -> Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 width=40)
> Filter: (segment_srcid IS NULL)
> -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126)
> Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id
> -> Seq Scan on bigtab_stats_fact_tmp14 f14 (cost=0.00..597355.74
> rows=9961874 width=126)
> (9 rows)
>
>
>
>
>
>
>
>
>
> # \d bigtab_stats_fact_tmp14
> Table "public.bigtab_stats_fact_tmp14"
> Column | Type | Modifiers
> --------------------------+-----------------------------+-----------
> pf_dts_id | bigint |
> pf_device_id | bigint |
> segment_id | bigint |
> cdn_id | bigint |
> collector_id | bigint |
> digital_envoy_id | bigint |
> maxmind_id | bigint |
> quova_id | bigint |
> website_id | bigint |
> referrer_id | bigint |
> affiliate_id | bigint |
> custom_info_id | bigint |
> start_dt | timestamp without time zone |
> total_played_time_sec | numeric(18,5) |
> bytes_received | bigint |
> bytes_transmitted | bigint |
> stall_count | integer |
> stall_duration_sec | numeric(18,5) |
> hiccup_count | integer |
> hiccup_duration_sec | numeric(18,5) |
> watched_duration_sec | numeric(18,5) |
> rewatched_duration_sec | numeric(18,5) |
> requested_start_position | numeric(18,5) |
> requested_stop_position | numeric(18,5) |
> post_position | numeric(18,5) |
> is_vod | numeric(1,0) |
> sessionid | bigint |
> create_dt | timestamp without time zone |
> segment_type_id | bigint |
> customer_id | bigint |
> content_publisher_id | bigint |
> content_owner_id | bigint |
> episode_id | bigint |
> duration_sec | numeric(18,5) |
> device_id | bigint |
> os_id | bigint |
> browser_id | bigint |
> cpu_id | bigint |
> xsystem_cfg_dim_id | bigint |
> xreferrer_dim_id | bigint |
> xaffiliate_dim_id | bigint |
> xsite_dim_id | bigint |
> xpublisher_dim_id | bigint |
> season_id | bigint |
> show_id | bigint |
> xsegment_dim_id | bigint |
> location_id | bigint |
> zipcode | character varying(20) |
> xlocation_dim_id | bigint |
> location_srcid | bigint |
> timezone | real |
> xtime_dim_id | bigint |
> Indexes:
> "bigtab_stats_fact_tmp14_idx1" btree (customer_id)
> "bigtab_stats_fact_tmp14_idx2" btree (show_id)
> "bigtab_stats_fact_tmp14_idx3" btree (season_id)
> "bigtab_stats_fact_tmp14_idx4" btree (episode_id)
>
>
>
>
>
>
> # \d xsegment_dim
> Table "public.xsegment_dim"
> Column | Type |
> Modifiers
> ----------------------+-----------------------------+-------------------------------------------------------------
> xsegment_dim_id | bigint | not null default
> nextval('xsegment_dim_seq'::regclass)
> customer_srcid | bigint | not null
> show_srcid | bigint | not null
> show_name | character varying(500) | not null
> season_srcid | bigint | not null
> season_name | character varying(500) | not null
> episode_srcid | bigint | not null
> episode_name | character varying(500) | not null
> segment_type_id | integer |
> segment_type | character varying(500) |
> segment_srcid | bigint |
> segment_name | character varying(500) |
> effective_dt | timestamp without time zone | not null default now()
> inactive_dt | timestamp without time zone |
> last_update_dt | timestamp without time zone | not null default now()
> Indexes:
> "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id)
> "seg1" btree (customer_srcid)
> "seg2" btree (show_srcid)
> "seg3" btree (season_srcid)
> "seg4" btree (episode_srcid)
> "seg5" btree (segment_srcid)
> "xsegment_dim_ix1" btree (customer_srcid)
>
>
>
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Simon Riggs 2008-05-16 07:38:30 Re: Join runs for > 10 hours and then fills up >1.3TB of disk space
Previous Message kevin kempter 2008-05-16 07:08:37 Re: Join runs for > 10 hours and then fills up >1.3TB of disk space