GH-45821: [C++][Compute] Grouper improvements #45822

pitrou · 2025-03-17T14:30:44Z

Rationale for this change

Add the following functionality to the Grouper class:

A Populate method that inserts new keys without returning any group ids
A Lookup method that finds keys among existing ones, without creating new group ids for unknown keys (in this case, a null group id is emitted instead)
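
To make the intended semantics concrete, here is a minimal sketch of the three operations on a toy grouper over string keys. It is not the Arrow Grouper API: `ToyGrouper`, its method signatures, and the use of `std::optional` to represent a null group id are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a grouper over string keys (hypothetical, not Arrow's class).
class ToyGrouper {
 public:
  // Insert unseen keys and return one group id per input key.
  std::vector<uint32_t> Consume(const std::vector<std::string>& keys) {
    std::vector<uint32_t> ids;
    ids.reserve(keys.size());
    for (const auto& key : keys) ids.push_back(InsertOrGet(key));
    return ids;
  }

  // Insert unseen keys, but do not produce any group ids.
  void Populate(const std::vector<std::string>& keys) {
    for (const auto& key : keys) InsertOrGet(key);
  }

  // Find keys among existing ones. Unknown keys yield a null id
  // (std::nullopt here) and never create a new group.
  std::vector<std::optional<uint32_t>> Lookup(
      const std::vector<std::string>& keys) const {
    std::vector<std::optional<uint32_t>> ids;
    ids.reserve(keys.size());
    for (const auto& key : keys) {
      auto it = map_.find(key);
      if (it == map_.end()) {
        ids.push_back(std::nullopt);
      } else {
        ids.push_back(it->second);
      }
    }
    return ids;
  }

  std::size_t num_groups() const { return map_.size(); }

 private:
  // Group ids are assigned densely in order of first appearance.
  uint32_t InsertOrGet(const std::string& key) {
    return map_.try_emplace(key, static_cast<uint32_t>(map_.size()))
        .first->second;
  }

  std::unordered_map<std::string, uint32_t> map_;
};
```

In this sketch, calling `Populate` followed by `Lookup` leaves the number of groups unchanged even when `Lookup` sees unknown keys, which is the property the description above asks for.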

Also enhance random tests for Grouper, by using different random seeds and by ensuring that some group keys appear statistically more than once.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

GitHub Issue: [C++][Compute] Make Grouper API more flexible #45821

github-actions · 2025-03-17T14:31:17Z

⚠️ GitHub issue #45821 has been automatically assigned in GitHub to PR creator.

pitrou · 2025-03-17T15:47:32Z

@zanmato1984 Does this look like a good idea?

pitrou · 2025-03-17T15:59:45Z

cpp/src/arrow/compute/row/grouper.cc

-      if (minibatch_size_ * 2 <= minibatch_size_max_) {
-        minibatch_size_ *= 2;
-      }
+      // XXX why not use minibatch_size_max_ from the start?
+      minibatch_size_ = std::min(minibatch_size_max_, 2 * minibatch_size_);


@zanmato1984 Would you know the answer to this XXX?

pitrou · 2025-03-17T16:26:42Z

@github-actions crossbow submit -g cpp

zanmato1984

Some initial comments.

cpp/src/arrow/compute/row/grouper.h

zanmato1984 · 2025-03-19T01:44:03Z

cpp/src/arrow/compute/row/grouper.cc

-      if (minibatch_size_ * 2 <= minibatch_size_max_) {
-        minibatch_size_ *= 2;
-      }
+      // XXX why not use minibatch_size_max_ from the start?
+      minibatch_size_ = std::min(minibatch_size_max_, 2 * minibatch_size_);


I don't. It doesn't seem to be necessary for either performance or memory profile.
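
For reference, the two forms in the diff are not strictly equivalent. They agree until the doubled size would exceed the cap; at that point the old form stops growing, while the new one clamps to the cap. A small sketch of both rules follows. The function names are hypothetical, and any concrete sizes fed to them are illustrative rather than taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Old rule: double only while the doubled size still fits under the cap.
inline int NextSizeOld(int size, int max_size) {
  if (size * 2 <= max_size) size *= 2;
  return size;
}

// New rule: double, then clamp to the cap.
inline int NextSizeNew(int size, int max_size) {
  return std::min(max_size, 2 * size);
}

// Sizes used over `steps` successive minibatches, starting from `initial`.
template <typename Next>
std::vector<int> GrowthSchedule(int initial, int max_size, int steps, Next next) {
  std::vector<int> sizes;
  int size = initial;
  for (int i = 0; i < steps; ++i) {
    sizes.push_back(size);
    size = next(size, max_size);
  }
  return sizes;
}
```

When the cap is the initial size times a power of two, both rules produce the same schedule; they differ only when the cap is not reachable by repeated doubling.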

zanmato1984 · 2025-03-19T02:58:09Z

cpp/src/arrow/compute/row/grouper.h

+                               int64_t length = -1) = 0;
+
+  /// Like Consume, but only populates the Grouper without returning the group ids.
+  virtual Status Populate(const ExecSpan& batch, int64_t offset = 0,


IIUC, when this is used to optimize the pivot_wider function, the pivot keys will first be populated into the grouper. However, pivot_wider is designed to accept the pivot keys as a std::vector of std::string specified in the function options. Will this be a problem for using this API?

My plan is to convert the std::vector<std::string> into an actual ArrayData (or ArraySpan perhaps). This will also help for #45732 (since we'll use the Cast kernels to cast from string to the actual key type).
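
As a rough illustration of the conversion step only, the sketch below turns string keys into typed keys and rejects any key that does not convert cleanly. It uses `std::from_chars` as a stand-in for the Cast kernels mentioned above and a plain `std::vector<int64_t>` in place of an ArrayData; `ParseKeys` and the choice of `int64_t` as the key type are assumptions, not part of the PR.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: convert string keys to int64 keys.
// Returns std::nullopt if any key is not a complete, valid integer.
inline std::optional<std::vector<int64_t>> ParseKeys(
    const std::vector<std::string>& keys) {
  std::vector<int64_t> out;
  out.reserve(keys.size());
  for (const auto& key : keys) {
    int64_t value = 0;
    const char* begin = key.data();
    const char* end = begin + key.size();
    auto [ptr, ec] = std::from_chars(begin, end, value);
    // Reject parse errors and trailing garbage such as "12abc".
    if (ec != std::errc() || ptr != end) return std::nullopt;
    out.push_back(value);
  }
  return out;
}
```

The typed keys produced this way could then be fed to a Populate-style call, so that later lookups compare values of the actual key type rather than strings.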

zanmato1984 · 2025-03-19T03:03:25Z

cpp/src/arrow/compute/row/grouper.cc

+      MapIterator it;
+    };
+
+    auto generate_keys = [&](auto&& lookup_key, auto&& visit_group,


Should we consider factoring these lambdas out as individual private member functions? This could avoid ConsumeImpl being too long.

I'll try to, but given they also access local state, I don't know how ergonomic it will end up.

I think I've improved this quite a bit now, can you take a look again?

I was planning to take a deeper look at the implementation details after your refactoring. Now that the refactoring is done (thanks for the update), I'll continue reviewing. Please allow me a bit more time :)

zanmato1984 · 2025-03-19T03:11:59Z

cpp/src/arrow/compute/row/grouper.cc

@@ -329,6 +330,8 @@ Status CheckAndCapLengthForConsume(int64_t batch_length, int64_t& consume_offset
  return Status::OK();
 }

+enum class GrouperMode { kPopulate, kConsume, kLookup };


Instead of having multiple grouper modes to instruct a single ConsumeImpl, do we want to organize the code so that each "mode" has its own implementation, with each implementation composed of a series of underlying common primitive functions?

Of course this is not a strong preference; I just feel that it might make the code more "plain".

Same answer as to the other comment: let me try this out :)

Hmm, I think this is going to get clumsy because the primitive functions would need to allocate and provide a number of temporary buffers (such as offsets_batch and key_bytes_batch). We could of course introduce dedicated structures to hold these.
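
To illustrate the trade-off discussed in this thread, here is a minimal sketch of the "one implementation per mode" layout, with a dedicated structure holding the temporary buffers shared by the primitives. Everything in it is hypothetical: `Scratch`, `ComposedGrouper`, the primitive names, and the `-1` null id are assumptions, and only the buffer names `offsets_batch` and `key_bytes_batch` are borrowed from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical scratch space reused by the primitives across calls.
struct Scratch {
  std::vector<int32_t> offsets_batch;  // start offset of each encoded key
  std::string key_bytes_batch;         // concatenated key bytes
};

class ComposedGrouper {
 public:
  static constexpr int64_t kNullId = -1;  // assumed null group id

  // Each "mode" is its own short function built from the same primitives.
  std::vector<int64_t> Consume(const std::vector<std::string>& keys) {
    EncodeKeys(keys);
    std::vector<int64_t> ids;
    for (std::size_t i = 0; i < keys.size(); ++i) ids.push_back(FindOrInsert(i));
    return ids;
  }

  void Populate(const std::vector<std::string>& keys) {
    EncodeKeys(keys);
    for (std::size_t i = 0; i < keys.size(); ++i) FindOrInsert(i);
  }

  std::vector<int64_t> Lookup(const std::vector<std::string>& keys) {
    EncodeKeys(keys);
    std::vector<int64_t> ids;
    for (std::size_t i = 0; i < keys.size(); ++i) ids.push_back(Find(i));
    return ids;
  }

  std::size_t num_groups() const { return map_.size(); }

 private:
  // Primitive 1: encode the batch of keys into the shared scratch buffers.
  void EncodeKeys(const std::vector<std::string>& keys) {
    scratch_.offsets_batch.assign(1, 0);
    scratch_.key_bytes_batch.clear();
    for (const auto& key : keys) {
      scratch_.key_bytes_batch += key;
      scratch_.offsets_batch.push_back(
          static_cast<int32_t>(scratch_.key_bytes_batch.size()));
    }
  }

  // Primitive 2: read back the i-th encoded key.
  std::string KeyAt(std::size_t i) const {
    const int32_t begin = scratch_.offsets_batch[i];
    const int32_t end = scratch_.offsets_batch[i + 1];
    return scratch_.key_bytes_batch.substr(begin, end - begin);
  }

  // Primitive 3: look up without inserting.
  int64_t Find(std::size_t i) const {
    auto it = map_.find(KeyAt(i));
    return it == map_.end() ? kNullId : it->second;
  }

  // Primitive 4: look up, inserting a new group if needed.
  int64_t FindOrInsert(std::size_t i) {
    return map_.try_emplace(KeyAt(i), static_cast<int64_t>(map_.size()))
        .first->second;
  }

  Scratch scratch_;
  std::unordered_map<std::string, int64_t> map_;
};
```

The sketch shows the cost pitrou points out: the primitives only stay small because the scratch structure carries the state that lambdas would otherwise capture from the enclosing function.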

pitrou · 2025-03-20T16:34:49Z

@github-actions crossbow submit -g cpp

github-actions · 2025-03-20T16:37:32Z

Revision: a3f1f32

Submitted crossbow builds: ursacomputing/crossbow @ actions-9efe9a9b5b

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-meson
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

github-actions bot added the Component: C++ and awaiting review labels Mar 17, 2025

pitrou force-pushed the grouper-improvements branch from 429fcaf to 2ee5a06 Compare March 17, 2025 15:26

pitrou marked this pull request as ready for review March 17, 2025 15:47

pitrou requested a review from zanmato1984 March 17, 2025 15:47

github-actions bot added the awaiting committer review label and removed the awaiting review label Mar 17, 2025

zanmato1984 reviewed Mar 19, 2025

uchenily mentioned this pull request Mar 19, 2025

[C++][Compute][Acero] Poor aggregate performance when there is a large number of batches on the build side #45847

Open

pitrou added 2 commits March 20, 2025 15:01

GH-45821: [C++][Compute] Grouper improvements

2508e12

Reduce the reliance on lambdas

a3f1f32

pitrou force-pushed the grouper-improvements branch from 2ee5a06 to a3f1f32 Compare March 20, 2025 16:20
