Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

petname --non-repeating applies randomness "unevenly" #61

Closed
allenap opened this issue Apr 7, 2022 · 5 comments · Fixed by #101
Closed

petname --non-repeating applies randomness "unevenly" #61

allenap opened this issue Apr 7, 2022 · 5 comments · Fixed by #101
Labels
enhancement future For a future major version; maybe a breaking change

Comments

@allenap
Copy link
Owner

allenap commented Apr 7, 2022

When running petname --non-repeating, for simplicity of implementation the word lists are shuffled only once at the start, then iterated through like a counter. For example, supposing we have 2 word lists:

a, b, c
x, y, z

The output of petname --non-repeating --stream might be:

c-y
a-y
b-y
c-z
a-z
b-z
c-x
a-x
b-x

Note that for the second word, we see all the ys, then al the zs, then the xs. It could be in any order, but they'll always be grouped. Note also that for the first word the order in which we see them is repeated.

It's easy to observe this:

$ petname --non-repeating --alliterate-with k --count 30
known-koala
key-koala
keen-koala
kind-koala
knowing-koala
known-killdeer
key-killdeer
keen-killdeer
kind-killdeer
knowing-killdeer
known-kid
key-kid
keen-kid
kind-kid
knowing-kid
known-kite
key-kite
keen-kite
kind-kite
knowing-kite
known-kiwi
key-kiwi
keen-kiwi
kind-kiwi
knowing-kiwi
known-kitten
key-kitten
keen-kitten
kind-kitten
knowing-kitten

My expectation is that there would be no obvious patterns like this.

@allenap
Copy link
Owner Author

allenap commented Apr 30, 2022

I'm not sure how exactly, but partial_shuffle may be useful.

@allenap
Copy link
Owner Author

allenap commented Apr 30, 2022

rust-random/rand#270 talks about generating random unique integers.

We can get the name at a particular index with something like:

pub fn at(petnames: &Petnames, index: u128, words: u8, separator: &str) -> String {
    let mut petname = String::new();
    let mut index = index;
    for list in Lists::new(petnames, words) {
        if !list.is_empty() {
            let idx = index % (list.len() as u128);
            let word = list[idx as usize];
            if petname.is_empty() {
                petname.push_str(word);
            } else {
                petname.push_str(separator);
                petname.push_str(word);
            }
            index = index.saturating_div(list.len() as u128)
        }
    }
    petname
}

We could plug a u128 non-repeating PRNG into this. We'd need to be careful that index is never greater than the cardinality of the product of the lists coming out of Lists::new(petnames, words) because wraparound would mean we repeat a word.

Can we fairly map a u128 to a range less than 0..=u128::MAX? If we artificially restricted cardinality to a power of 2 – e.g. by truncating each word list to a length of the nearest power of 2 – we could just mask out bits from the PRNG and that would be fair (well distributed?) I think.

We could keep asking the PRNG for numbers until it produces one in range, but this could take years, literally, so that's not going to work.

@allenap
Copy link
Owner Author

allenap commented Apr 30, 2022

rust-random/rand#270 (comment) is interesting in particular because it points to actual PRNGs that could work, like xoroshiro, for which there are implementations in Rust. I'm a bit confused by its description of equidistribution:

Every 64-bit generator of ours with n bits of state scrambled with * or ** is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the largest possible dimension). Generators based on the + or ++ scramblers are however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly 264 times in the output, except for a missing zero tuple. The same considerations apply to 32-bit generators.

Does that mean with a seed of at least 64 bits, we'll only ever see each 64-bit value once in the output? The "except for the zero tuple" bit could be ambiguous: will we see the zero tuple more than once? I don't really know how to read this information, so maybe it's not ambiguous at all, but I'll have to check with someone who knows.

@allenap
Copy link
Owner Author

allenap commented Apr 30, 2022

The mersenne_twister crate looks interesting. but I'm not sure if it repeats or not.

@allenap
Copy link
Owner Author

allenap commented Sep 12, 2023

I haven't had time to look into the ideas above any more. I'm also not sure this is a problem worth solving in this crate, so I'm probably going to remove the non-repeating iterator from 2.0. I had some last thoughts, recorded below, which could be useful for someone who wants to generate large numbers of randomly generated unique names, i.e. without the unevenness problem the current non-repeating iterator suffers from.


For shorter runs a Bloom filter could be used to keep a stream unique. Actually, for example, using the bloom crate about 3.1GiB would be needed to ensure u32::MAX unique items (this is as many as it will support anyway; playground link for the calculation). 3GiB can probably be spared if one really needs 4 billion unique names!

With this kind of post hoc filtering there's something to be aware of: when the cardinality of the petname stream is not well above the number of unique items wanted, it will get increasingly inefficient to find each new name as time goes on – if we're generating at random. Put another way: if we want 100 unique names and the stream has a cardinality of 100 too, when we ask for the 100th unique name we might generate 99 names that we've already seen before we find the one remaining not-yet-seen name. But even that is conservative: when generating at random it's possible we'll never generate that single last name that we know is possible.

@allenap allenap added the future For a future major version; maybe a breaking change label Sep 12, 2023
allenap added a commit that referenced this issue Sep 12, 2023
Fixes #61. This issue also documents the problems that led to this, thoughts
about how it could be resurrected, and how to work around it if this is
something you were using.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement future For a future major version; maybe a breaking change
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant