From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: WIP: [[Parallel] Shared] Hash |
Date: | 2017-03-26 21:47:26 |
Message-ID: | 20170326214726.sqk4zt2bbq5ofthd@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:
+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+ LWLock lock;
Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.
+ bool error; /* Error occurred flag. */
+ bool eof; /* End of file reached. */
+ int read_fileno; /* BufFile segment file number. */
+ off_t read_offset; /* Offset within segment file. */
Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.
+/* The main data structure in shared memory. */
"main data structure" isn't particularly meaningful.
+struct SharedTuplestore
+{
+ int reading_partition;
+ int nparticipants;
+ int flags;
Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?
+ Size meta_data_size;
What's this?
+ SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];
I'd add a comment here, that there's further data after participants.
+};
+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{
Hm. The name and it being backend-local are a bit conflicting.
+ int participant; /* My partitipant number. */
+ SharedTuplestore *sts; /* The shared state. */
+ int nfiles; /* Size of local files array. */
+ BufFile **files; /* Files we have open locally for writing. */
Shouldn't this mention that it's indexed by partition?
+ BufFile *read_file; /* The current file to read from. */
+ int read_partition; /* The current partition to read from. */
+ int read_participant; /* The current participant to read from. */
+ int read_fileno; /* BufFile segment file number. */
+ off_t read_offset; /* Offset within segment file. */
+};
+/*
+ * Initialize a SharedTuplestore in existing shared memory. There must be
+ * space for sts_size(participants) bytes. If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.
Any reason not to use flags that are compatible with tuplestore.c?
+ * Tuples that are stored may optionally carry a piece of fixed sized
+ * meta-data which will be retrieved along with the tuple. This is useful for
+ * the hash codes used for multi-batch hash joins, but could have other
+ * applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+ int my_participant_number,
+ Size meta_data_size,
+ int flags,
+ dsm_segment *segment)
+{
Not sure I like that the naming here has little in common with
tuplestore.h's api.
+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{
This needs docs.
+ SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+ MinimalTuple tuple = NULL;
+
+ for (;;)
+ {
...
+ /* Check if this participant's file has already been entirely read. */
+ if (participant->eof)
+ {
+ BufFileClose(accessor->read_file);
+ accessor->read_file = NULL;
+ LWLockRelease(&participant->lock);
+ continue;
Why are we closing the file while holding the lock?
+
+ /* Read the optional meta-data. */
+ eof = false;
+ if (accessor->sts->meta_data_size > 0)
+ {
+ nread = BufFileRead(accessor->read_file, meta_data,
+ accessor->sts->meta_data_size);
+ if (nread == 0)
+ eof = true;
+ else if (nread != accessor->sts->meta_data_size)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not read from temporary file: %m")));
+ }
+
+ /* Read the size. */
+ if (!eof)
+ {
+ nread = BufFileRead(accessor->read_file, &tuple_size, sizeof(tuple_size));
+ if (nread == 0)
+ eof = true;
Why is it legal to have EOF here, if metadata previously didn't have an
EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?
+ if (eof)
+ {
+ participant->eof = true;
+ if ((accessor->sts->flags & SHARED_TUPLESTORE_SINGLE_PASS) != 0)
+ SharedBufFileDestroy(fileset, accessor->read_partition,
+ accessor->read_participant);
+
+ participant->error = false;
+ LWLockRelease(&participant->lock);
+
+ /* Move to next participant's file. */
+ BufFileClose(accessor->read_file);
+ accessor->read_file = NULL;
+ continue;
+ }
+
+ /* Read the tuple. */
+ tuple = (MinimalTuple) palloc(tuple_size);
+ tuple->t_len = tuple_size;
Hm. Constantly re-allocing this doesn't strike me as a good idea (not to
mention that the API doesn't mention this is newly allocated). Seems
like it'd be a better idea to have a per-accessor buffer where this can
be stored in - increased in size when necessary.
- Andres
From | Date | Subject | |
---|---|---|---|
Next Message | Dmitry Dolgov | 2017-03-26 21:57:10 | Re: [PATCH] few fts functions for jsonb |
Previous Message | Corey Huinker | 2017-03-26 21:42:10 | Re: \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless) |