GH-45821: [C++][Compute] Grouper improvements #45822

Open: pitrou wants to merge 2 commits into main from grouper-improvements

pitrou (Member) commented Mar 17, 2025

Rationale for this change

Add the following functionality to the Grouper class:

  1. A Populate method that inserts new keys without returning any group ids
  2. A Lookup method that finds keys among existing ones, without creating new group ids for unknown keys (in this case, a null group id is emitted instead)

Also enhance random tests for Grouper, by using different random seeds and by ensuring that some group keys appear statistically more than once.
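
To make the intended semantics concrete, here is a minimal sketch using a toy, map-based grouper over string keys. It is not the Arrow Grouper API: the `ToyGrouper` class, its signatures, and the use of `std::nullopt` to stand in for a null group id are all assumptions made for illustration (C++17).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a grouper. Hypothetical; NOT the Arrow Grouper interface.
class ToyGrouper {
 public:
  // Consume: insert unseen keys and return one group id per input key.
  std::vector<uint32_t> Consume(const std::vector<std::string>& keys) {
    std::vector<uint32_t> ids;
    ids.reserve(keys.size());
    for (const auto& key : keys) ids.push_back(GetOrInsert(key));
    return ids;
  }

  // Populate: insert unseen keys, but do not return any group ids.
  void Populate(const std::vector<std::string>& keys) {
    for (const auto& key : keys) GetOrInsert(key);
  }

  // Lookup: never inserts; unknown keys yield std::nullopt (the "null" id).
  std::vector<std::optional<uint32_t>> Lookup(
      const std::vector<std::string>& keys) const {
    std::vector<std::optional<uint32_t>> ids;
    ids.reserve(keys.size());
    for (const auto& key : keys) {
      auto it = map_.find(key);
      if (it == map_.end()) {
        ids.push_back(std::nullopt);
      } else {
        ids.push_back(it->second);
      }
    }
    return ids;
  }

  uint32_t num_groups() const { return static_cast<uint32_t>(map_.size()); }

 private:
  // Group ids are assigned densely, in order of first appearance.
  uint32_t GetOrInsert(const std::string& key) {
    auto it = map_.try_emplace(key, static_cast<uint32_t>(map_.size())).first;
    return it->second;
  }

  std::unordered_map<std::string, uint32_t> map_;
};
```

In this sketch, calling `Populate` and then `Lookup` leaves the number of groups unchanged no matter which keys are looked up, which is the property the new methods are meant to provide.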

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

⚠️ GitHub issue #45821 has been automatically assigned in GitHub to PR creator.

pitrou force-pushed the grouper-improvements branch from 429fcaf to 2ee5a06 on March 17, 2025 15:26
pitrou marked this pull request as ready for review on March 17, 2025 15:47
pitrou requested a review from zanmato1984 on March 17, 2025 15:47
pitrou (Member Author) commented Mar 17, 2025

@zanmato1984 Does this look like a good idea?

Comment on lines -691 to +793
- if (minibatch_size_ * 2 <= minibatch_size_max_) {
-   minibatch_size_ *= 2;
- }
+ // XXX why not use minibatch_size_max_ from the start?
+ minibatch_size_ = std::min(minibatch_size_max_, 2 * minibatch_size_);
Member Author

@zanmato1984 Would you know the answer to this XXX?

Contributor

I don't. It doesn't seem to be necessary for either performance or memory profile.
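
For reference, a small sketch of how the old and new growth rules compare. The helper names `GrowOld` and `GrowNew` and any sizes passed to them are hypothetical; only the two update expressions come from the diff. Both rules give the same result while the doubled size still fits under the cap. They differ when doubling would overshoot it: the old rule keeps the current size, while the new one clamps to the cap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Old rule: double only if the doubled size does not exceed the cap.
int32_t GrowOld(int32_t size, int32_t max_size) {
  if (size * 2 <= max_size) {
    size *= 2;
  }
  return size;
}

// New rule: always double, then clamp to the cap.
int32_t GrowNew(int32_t size, int32_t max_size) {
  return std::min(max_size, 2 * size);
}
```

With a cap that is not a power-of-two multiple of the starting size (say 512 growing toward 1000), the old rule never reaches the cap, whereas the new rule does.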

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Mar 17, 2025
pitrou (Member Author) commented Mar 17, 2025

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

zanmato1984 (Contributor) left a comment

Some initial comments.

int64_t length = -1) = 0;

/// Like Consume, but only populates the Grouper without returning the group ids.
virtual Status Populate(const ExecSpan& batch, int64_t offset = 0,
Contributor

IIUC, when this is used to optimize the pivot_wider function, the pivot keys will first be populated into the grouper. However, pivot_wider is designed to accept the pivot keys as a std::vector of std::string specified in the function options. Will that be a problem for using this API?

Member Author

My plan is to convert the std::vector<std::string> into an actual ArrayData (or ArraySpan perhaps). This will also help for #45732 (since we'll use the Cast kernels to cast from string to the actual key type).

MapIterator it;
};

auto generate_keys = [&](auto&& lookup_key, auto&& visit_group,
Contributor

Should we consider factoring these lambdas out as individual private member functions? This could avoid ConsumeImpl being too long.

Member Author

I'll try to, but given they also access local state, I don't know how ergonomic it will end up.

Member Author

I think I've improved this quite a bit now, can you take a look again?

@@ -329,6 +330,8 @@ Status CheckAndCapLengthForConsume(int64_t batch_length, int64_t& consume_offset
return Status::OK();
}

enum class GrouperMode { kPopulate, kConsume, kLookup };
Contributor

Instead of having multiple grouper modes steer a single ConsumeImpl, do we want to organize the code so that each "mode" has its own implementation, with each implementation composed from a set of common underlying primitive functions?

Of course this is not a strong preference; I just feel it might make the code more "plain".

Member Author

Same answer as to the other comment: let me try this out :)

Member Author

Hmm, I think this is going to get clumsy because the primitive functions would need to allocate and provide a number of temporary buffers (such as offsets_batch and key_bytes_batch). We could of course introduce dedicated structures to hold these.
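
One way to picture the trade-off in this thread is a single routine parameterized on the mode, where the probing logic is written once and the mode only decides whether unknown keys are inserted and whether ids are emitted. This is a toy sketch over a `std::unordered_map`, not the PR's implementation: only the `GrouperMode` enum comes from the diff, while `RunImpl`, `KeyMap`, and the `kNullGroupId` sentinel are hypothetical names, and `if constexpr` assumes C++17.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class GrouperMode { kPopulate, kConsume, kLookup };

// Hypothetical sentinel standing in for a null group id.
constexpr int64_t kNullGroupId = -1;

using KeyMap = std::unordered_map<std::string, int64_t>;

// One shared loop; the compile-time mode selects the behavior:
//   kPopulate: insert unknown keys, emit nothing
//   kConsume:  insert unknown keys, emit one id per key
//   kLookup:   never insert, emit kNullGroupId for unknown keys
template <GrouperMode mode>
std::vector<int64_t> RunImpl(KeyMap& map, const std::vector<std::string>& keys) {
  std::vector<int64_t> out;
  for (const auto& key : keys) {
    auto it = map.find(key);
    if (it == map.end()) {
      if constexpr (mode == GrouperMode::kLookup) {
        out.push_back(kNullGroupId);
        continue;
      } else {
        // New ids are assigned densely, in order of first appearance.
        it = map.emplace(key, static_cast<int64_t>(map.size())).first;
      }
    }
    if constexpr (mode != GrouperMode::kPopulate) {
      out.push_back(it->second);
    }
  }
  return out;
}
```

Splitting this into per-mode functions built from shared primitives would require passing scratch state between them, which is the concern raised above; in this toy version the only shared state is the map itself, so the cost of that choice is not visible here.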

pitrou force-pushed the grouper-improvements branch from 2ee5a06 to a3f1f32 on March 20, 2025 16:20
pitrou (Member Author) commented Mar 20, 2025

@github-actions crossbow submit -g cpp

Revision: a3f1f32

Submitted crossbow builds: ursacomputing/crossbow @ actions-9efe9a9b5b

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions
