Quick Links

DML and column cound in aggregated subqueries

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	DML and column cound in aggregated subqueries
Date:	2016-10-31 11:02:46
Message-ID:	20161031110245.o4ujpioamf62uhdo@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

while hacking up nodeAgg.c to use a dedicated slot/desc for the
hashtable, an assert I placed brought up a weird case:

regression[29535][1]=# EXPLAIN VERBOSE update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
┌───────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ Update on public.bar (cost=42.75..109.25 rows=1130 width=20) │
│ -> Hash Join (cost=42.75..109.25 rows=1130 width=20) │
│ Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid │
│ Hash Cond: (bar.f1 = foo.f1) │
│ -> Seq Scan on public.bar (cost=0.00..32.60 rows=2260 width=14) │
│ Output: bar.f1, bar.f2, bar.ctid │
│ -> Hash (cost=40.25..40.25 rows=200 width=10) │
│ Output: foo.ctid, foo.f1 │
│ -> HashAggregate (cost=38.25..40.25 rows=200 width=10) │
│ Output: foo.ctid, foo.f1 │
│ Group Key: foo.f1 │
│ -> Seq Scan on public.foo (cost=0.00..32.60 rows=2260 width=10) │
│ Output: foo.ctid, foo.f1 │
└───────────────────────────────────────────────────────────────────────────────────────┘
(13 rows)

this doesn't look right. The ctid shouldn't be in the aggregate output -
after all it's pretty much meaningless here.

Note that in this case find_hash_columns(aggstate) will return two
columns, but node->numCols will be 1 (and thus node->grpOperators only
have one element).

So it seems we're doing something wrong here.

Casting a wider net: find_hash_columns() and it's subroutines seem like
pretty much dead code, including outdated comments (look for "SQL99
semantics"). The main explanation says: /* * Create a list of the tuple
columns that actually need to be stored in * hashtable entries. The
incoming tuples from the child plan node will * contain grouping
columns, other columns referenced in our targetlist and * qual, columns
used to compute the aggregate functions, and perhaps just * junk columns
we don't use at all. Only columns of the first two types * need to be
stored in the hashtable, and getting rid of the others can * make the
table entries significantly smaller. To avoid messing up Var *
numbering, we keep the same tuple descriptor for hashtable entries as
the * incoming tuples have, but set unwanted columns to NULL in the
tuples that * go into the table. but that seems bogus - when are we
referencing "other columns referenced in our targetlist and qual" from
representative tuple that aren't also grouped upon? This method would
select pretty arbitrary value for those, since we don't hash them, so
it'll just be the tuple first occurred?

Greetings,

Andres Freund

Responses

Re: DML and column cound in aggregated subqueries at 2016-10-31 13:35:57 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2016-10-31 12:31:32	Re: WAL consistency check facility
Previous Message	Emre Hasegeli	2016-10-31 10:56:50	Re: Improve output of BitmapAnd EXPLAIN ANALYZE