-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
petname --non-repeating
applies randomness "unevenly"
#61
Comments
I'm not sure how exactly, but |
rust-random/rand#270 talks about generating random unique integers. We can get the name at a particular index with something like: pub fn at(petnames: &Petnames, index: u128, words: u8, separator: &str) -> String {
let mut petname = String::new();
let mut index = index;
for list in Lists::new(petnames, words) {
if !list.is_empty() {
let idx = index % (list.len() as u128);
let word = list[idx as usize];
if petname.is_empty() {
petname.push_str(word);
} else {
petname.push_str(separator);
petname.push_str(word);
}
index = index.saturating_div(list.len() as u128)
}
}
petname
} We could plug a Can we fairly map a We could keep asking the PRNG for numbers until it produces one in range, but this could take years, literally, so that's not going to work. |
rust-random/rand#270 (comment) is interesting in particular because it points to actual PRNGs that could work, like xoroshiro, for which there are implementations in Rust. I'm a bit confused by its description of equidistribution:
Does that mean with a seed of at least 64 bits, we'll only ever see each 64-bit value once in the output? The "except for the zero tuple" bit could be ambiguous: will we see the zero tuple more than once? I don't really know how to read this information, so maybe it's not ambiguous at all, but I'll have to check with someone who knows. |
The mersenne_twister crate looks interesting. but I'm not sure if it repeats or not. |
I haven't had time to look into the ideas above any more. I'm also not sure this is a problem worth solving in this crate, so I'm probably going to remove the non-repeating iterator from 2.0. I had some last thoughts, recorded below, which could be useful for someone who wants to generate large numbers of randomly generated unique names, i.e. without the unevenness problem the current non-repeating iterator suffers from. For shorter runs a Bloom filter could be used to keep a stream unique. Actually, for example, using the bloom crate about 3.1GiB would be needed to ensure With this kind of post hoc filtering there's something to be aware of: when the cardinality of the petname stream is not well above the number of unique items wanted, it will get increasingly inefficient to find each new name as time goes on – if we're generating at random. Put another way: if we want 100 unique names and the stream has a cardinality of 100 too, when we ask for the 100th unique name we might generate 99 names that we've already seen before we find the one remaining not-yet-seen name. But even that is conservative: when generating at random it's possible we'll never generate that single last name that we know is possible. |
Fixes #61. This issue also documents the problems that led to this, thoughts about how it could be resurrected, and how to work around it if this is something you were using.
When running
petname --non-repeating
, for simplicity of implementation the word lists are shuffled only once at the start, then iterated through like a counter. For example, supposing we have 2 word lists:The output of
petname --non-repeating --stream
might be:Note that for the second word, we see all the ys, then al the zs, then the xs. It could be in any order, but they'll always be grouped. Note also that for the first word the order in which we see them is repeated.
It's easy to observe this:
My expectation is that there would be no obvious patterns like this.
The text was updated successfully, but these errors were encountered: