Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiDistribution #18

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from
Draft

Conversation

benjamin-lieser
Copy link
Member

  • Added a CHANGELOG.md entry

Summary

Some code related to #16

@benjamin-lieser
Copy link
Member Author

benjamin-lieser commented Mar 2, 2025

I put a &self receiver, like in Distribution @dhardy was there a specific reason why you proposed &mut self?

I am not sure if I want a different name, because it clashed with the one in Distribution and I would like structs to implement both if applicable. Making these things unambiguous feels often too cumbersome in Rust. And it can break existing code when doing use rand_distr::* or similar.

@dhardy
Copy link
Member

dhardy commented Mar 2, 2025

I put a &self receiver, like in Distribution @dhardy was there a specific reason why you proposed &mut self?

Sorry, it should be &self.

I am not sure if I want a different name, because it clashed with the one in Distribution and I would like structs to implement both if applicable. Making these things unambiguous feels often too cumbersome in Rust. And it can break existing code when doing use rand_distr::* or similar.

I'm not really sure of the answer to that. Any of these can work (from the POV of a dependency):

  • use rand_distr::MultiDistribution; — Do we want the multi module to be pub? Likely yes, in which case supporting this path is redundant.
  • use rand_distr::multi::Distribution; — Usable but potential for confusion and can make usage of both Distribution traits annoying
  • use rand_distr::multi::MultiDistribution; — Redundant naming, but avoids the above issues so probably the best choice

@benjamin-lieser
Copy link
Member Author

I was talking about the naming of sample because its the same as in distribution and makes the method ambiguous when both traits are in scope.

I guess I also prefer the last of the 3 options for the name of the trait.

@benjamin-lieser
Copy link
Member Author

Also, should we do this in a v0.6 branch? I guess there will be breaking changes with respect to the Dirichlet at the end and possible some v0.5.* releases.

@dhardy
Copy link
Member

dhardy commented Mar 3, 2025

I was talking about the naming of sample because its the same as in distribution and makes the method ambiguous when both traits are in scope.

This shouldn't be an issue unless a type implements both Distribution and MultiDistribution — but I doubt we'd want that.

Except possibly a symmetric distribution like Dirichlet which could be sampled in one or multiple dimensions? Or if we wanted to transparently support uncorrelated multi-dimensional sampling of 1D distributions like Normal? No, both these are likely bad ideas.

@benjamin-lieser
Copy link
Member Author

benjamin-lieser commented Mar 3, 2025

I was talking about the naming of sample because its the same as in distribution and makes the method ambiguous when both traits are in scope.

This shouldn't be an issue unless a type implements both Distribution and MultiDistribution — but I doubt we'd want that.

Except possibly a symmetric distribution like Dirichlet which could be sampled in one or multiple dimensions? Or if we wanted to transparently support uncorrelated multi-dimensional sampling of 1D distributions like Normal? No, both these are likely bad ideas.

I was thinking about both having MultiDistribution to sample into a buffer and still Distribution which returns a Vec. If someone would anyway allocate for each sample or does not care about allocations, the latter is more convenient.
But I could understand if this would lead to more confusion than it benefits people.

@dhardy
Copy link
Member

dhardy commented Mar 4, 2025

The point of using const generics was to avoid needing to allocate. If allocation is necessary anyway, having both sample_to_buf and sample(...) -> Vec<_> doesn't add much. Moreover, if we are going to support both styles of method, it should be done under the same trait in my opinion — we can implement sample automatically, provided we know the expected sample length.

Which leads to another point: we may want a sample_len or just len method.

So:

pub trait MultiDistribution<T> {
    fn sample_len(&self) -> usize;
    fn sample_to_buf(&self, buf: &mut [T]);
    fn sample(&self) -> Vec<T> where T: Default {
        let mut buf = Vec::new();
        buf.resize_with(self.sample_len(), T::default());
        self.sample_to_buf(&mut buf);
    }
}

That requires T: Default to support sample, which I think is reasonable.

@benjamin-lieser
Copy link
Member Author

benjamin-lieser commented Mar 4, 2025

We should decide if we want to keep the possibility to have multidim sampling without allocations or not. If we do not need it, we can ditch the const generics and have less code and then your proposed Trait would make a lot of sense.

I am still leaning toward keeping it, because I had a usecase where Multinomial samples where extremely time critical and it helps to save the allocations, especially in multithread where there is synchronization with malloc. But this might also be a niche usecase. (I would only sample once per Multinomial)

Edit: Your Trait is also still implementable for a const generic version and you can avoid all allocations. So I would go with this approach.

@dhardy
Copy link
Member

dhardy commented Mar 5, 2025

I thought we did decide to drop the const-generics approach for rand_distr?

Your use-case sounds fairly specific. Maybe there are further optimisations available when sampling only once.

@MortenLohne
Copy link

Would it be possible to have a non-const generic Dirichlet that can still be used without allocating? Without discussing Distribution traits for now, imagine the following API:

impl<F: Float> Dirichlet<F> {
    pub fn new(alpha: Vec<F>) -> Result<Dirichlet<F>, Error>;
    pub fn sample<R: Rng>(&self, rng: &mut R, output: &mut Vec<F>); // Users can re-use this vector to avoid allocations
    pub fn into_vec(self) -> Vec<F>; // Recovers the memory passed in with `new()`
}

This still requires the alloc feature, but so does the current (const generic) implementation, and afaict no one has presented such a use case yet.

For sample(), we could also generalize the output type to iter::Extend, if we want to allow collecting into other data structures.

@benjamin-lieser
Copy link
Member Author

Would it be possible to have a non-const generic Dirichlet that can still be used without allocating? Without discussing Distribution traits for now, imagine the following API:

impl<F: Float> Dirichlet<F> {
    pub fn new(alpha: Vec<F>) -> Result<Dirichlet<F>, Error>;
    pub fn sample<R: Rng>(&self, rng: &mut R, output: &mut Vec<F>); // Users can re-use this vector to avoid allocations
    pub fn into_vec(self) -> Vec<F>; // Recovers the memory passed in with `new()`
}

This still requires the alloc feature, but so does the current (const generic) implementation, and afaict no one has presented such a use case yet.

For sample(), we could also generalize the output type to iter::Extend, if we want to allow collecting into other data structures.

In the case of Dirichlet it needs an array of other distributions (Beta or Gamma), so this would not work. I would also say its a bit to complex of an API.

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on comment style below.

More significantly, the sample method now looks identical to Distribution::sample, hence your idea to use that trait likely makes more sense: we can automatically impl Distribution<Vec<T>> where T: Default.

This would also allow an explicit impl of Distribution<[T; N]> where appropriate (e.g. your mentioned const-generic Multinomial).

But, at this point, do we still want the MultiDistribution trait at all?

@@ -0,0 +1,30 @@
//! Contains Multi-dimensional distributions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the word 'contains' and remove the full-stop since this isn't a sentence.

Comment on lines +3 to +4
//! We provide a trait `MultiDistribution` which allows to sample from a multi-dimensional distribution without extra allocations.
//! All multi-dimensional distributions implement `MultiDistribution` instead of the `Distribution` trait.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally wrap comments at 80 chars width (sometimes up to 100 if the line already has a large indent).

The wording could be a little better, e.g.

The MultiDistribution trait allows sampling a multi-dimensional distribution to a pre-allocated buffer or to a new [Vec].

Comment on lines +9 to +11
/// This trait allows to sample from a multi-dimensional distribution without extra allocations.
/// For convenience it also provides a `sample` method which returns the result as a `Vec`.
pub trait MultiDistribution<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Items have a short one-line description, with additional details in new paragraphs.

Comment on lines +12 to +13
/// returns the length of one sample (dimension of the distribution)
fn sample_len(&self) -> usize;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalise the first letter of 'returns'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants