-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45821: [C++][Compute] Grouper improvements #45822
base: main
Are you sure you want to change the base?
Conversation
|
429fcaf
to
2ee5a06
Compare
@zanmato1984 Does this look like a good idea? |
if (minibatch_size_ * 2 <= minibatch_size_max_) { | ||
minibatch_size_ *= 2; | ||
} | ||
// XXX why not use minibatch_size_max_ from the start? | ||
minibatch_size_ = std::min(minibatch_size_max_, 2 * minibatch_size_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zanmato1984 Would you know the answer to this XXX?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't. It doesn't seem to be necessary for either performance or memory profile.
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments.
if (minibatch_size_ * 2 <= minibatch_size_max_) { | ||
minibatch_size_ *= 2; | ||
} | ||
// XXX why not use minibatch_size_max_ from the start? | ||
minibatch_size_ = std::min(minibatch_size_max_, 2 * minibatch_size_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't. It doesn't seem to be necessary for either performance or memory profile.
int64_t length = -1) = 0; | ||
|
||
/// Like Consume, but only populates the Grouper without returning the group ids. | ||
virtual Status Populate(const ExecSpan& batch, int64_t offset = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, when being utilized to optimize pivot_wider
function, the pivot keys will be firstly populated into the grouper. However the pivot_wider
is designed to accept the pivot keys as a std::vector
of std::string
specified in the function option. Will this be a problem for using this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My plan is to convert the std::vector<std::string>
into an actual ArrayData
(or ArraySpan
perhaps). This will also help for #45732 (since we'll use the Cast kernels to cast from string to the actual key type).
cpp/src/arrow/compute/row/grouper.cc
Outdated
MapIterator it; | ||
}; | ||
|
||
auto generate_keys = [&](auto&& lookup_key, auto&& visit_group, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we consider factoring these lambdas out as individual private member functions? This could avoid ConsumeImpl
being too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to, but given they also access local state, I don't know how ergonomic it will end up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've improved this quite a bit now, can you take a look again?
@@ -329,6 +330,8 @@ Status CheckAndCapLengthForConsume(int64_t batch_length, int64_t& consume_offset | |||
return Status::OK(); | |||
} | |||
|
|||
enum class GrouperMode { kPopulate, kConsume, kLookup }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having multiple grouper modes to instruct the only ConsumeImpl
, do we want to organize the code the way that each "mode" has its own implementation, and each implementation can be a composition of a series of the underlying common primitive functions.
Of course this is not a strong bias, I'm just feeling that it might make the code more "plain".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer as to the other comment: let my try this out :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think this is going to get clumsy because the primitive functions would need to allocate and provide a number of temporary buffers (such as offsets_batch
and key_bytes_batch
). We could of course introduce dedicated structures to hold these.
2ee5a06
to
a3f1f32
Compare
@github-actions crossbow submit -g cpp |
Revision: a3f1f32 Submitted crossbow builds: ursacomputing/crossbow @ actions-9efe9a9b5b |
Rationale for this change
Add the following functionality to the
Grouper
class:Populate
method that inserts new keys without returning any group idsLookup
method that finds keys among existing ones, without creating new group ids for unknown keys (in this case, a null group id is emitted instead)Also enhance random tests for
Grouper
, by using different random seeds and by ensuring that some group keys appear statistically more than once.Are these changes tested?
Yes.
Are there any user-facing changes?
No.