
Unbiased floating point number in unit interval #102

Closed

Conversation

@curiousleo (Collaborator) commented Apr 21, 2020

Based on http://allendowney.com/research/rand/downey07randfloat.pdf

Discussion

See #105.

Performance

Results show the time taken for 100000 iterations.

v1.1 from #111

pure/random/Float                        mean 30.63 ms  ( +- 3.753 ms  )
pure/randomR/unbounded/Float             mean 60.17 ms  ( +- 5.086 ms  )

pure/random/Double                       mean 54.18 ms  ( +- 3.741 ms  )
pure/randomR/unbounded/Double            mean 92.29 ms  ( +- 8.487 ms  )

current interface-to-performance (d03e325 specifically)

pure/random/Float                        mean 365.6 μs  ( +- 3.672 μs  )
pure/uniformR/unbounded/Float            mean 357.0 μs  ( +- 24.55 μs  )

pure/random/Double                       mean 454.0 μs  ( +- 4.255 μs  )
pure/uniformR/unbounded/Double           mean 408.9 μs  ( +- 6.579 μs  )

with this PR

pure/random/Float                        mean 420.7 μs  ( +- 4.416 μs  )
pure/uniformR/unbounded/Float            mean 504.7 μs  ( +- 5.863 μs  )

pure/random/Double                       mean 428.2 μs  ( +- 5.526 μs  )
pure/uniformR/unbounded/Double           mean 451.4 μs  ( +- 11.43 μs  )

@lehins (Collaborator) left a comment

I don't see anything obviously wrong in this PR, but I can't vouch for the math of it.
It would be nice to include references to material that explains why this all works.

@curiousleo (Collaborator, Author)

It would be nice to include references to material that explains why this all works.

Absolutely. @idontgetoutmuch and I are thinking of writing this up more formally somewhere. Downey's draft paper explains the underlying idea, but the exposition can be improved.

@peteroupc commented Apr 21, 2020

@christoph-conrads, since you have created the Rademacher Floating-Point Library, which is a more robust way to generate random floating-point numbers, please take a look at this and give your thoughts on whether the approach given by Downey here is actually robust in terms of—

  • generating all representable numbers in [0, 1),
  • avoiding rounding errors, and
  • maintaining a uniform distribution throughout the range.

@idontgetoutmuch (Owner)

@peteroupc I can't read template C++ but I think that it is highly likely to be the same as the Downey approach. Also, our implementation now passes the tests in the Rademacher Floating-Point Library but is about 3 times slower than just selecting random bits for the mantissa and fixing the exponent so that the number is in the binade [1,2). We need to write up the maths behind our / Downey's approach; I don't think comments in the code work very well for this.

@peteroupc

We need to write up the maths behind our / Downey's approach; I don't think comments in the code work very well for this.

Yes, I would like to see that. In addition, there should be something similar for generating any representable random floating-point number in ranges other than [0, 1). In particular, the algorithm used in the Rademacher FPL to achieve this for any range is anything but trivial, and I haven't been able to figure it out yet.

@lehins mentioned this pull request Apr 21, 2020
@curiousleo force-pushed the unbiased-floating-point-rebased branch from c3f5f47 to 7cab4e6 on April 23, 2020 11:59
@christoph-conrads commented Apr 23, 2020

The comments refer to Allen Downey's paper Generating Pseudo-random Floating-Point Values and not to the Haskell code.

@christoph-conrads, since you have created the Rademacher Floating-Point Library, which is a more robust way to generate random floating-point numbers, please take a look at this and give your thoughts on whether the approach given by Downey here is actually robust in terms of—

  • generating all representable numbers in [0, 1),

yes, in [0, 1] (closed interval!)

  • avoiding rounding errors, and

Yes, the algorithm uses only integer arithmetic.

  • maintaining a uniform distribution throughout the range

When we speak of generating uniformly distributed floating-point (FP) numbers, what we want to do is to generate FP numbers as if one

  • draws a truly uniformly distributed random real number and
  • rounds it to a FP number.

In Downey's algorithm, you are using round-to-nearest (see Figure 5 in his paper). That is, Downey's algorithm draws real numbers from the half-open interval [0,1), but the rounded representation lies in the closed interval [0,1].

I can't read template C++ but I think that it is highly likely to be the same as the Downey approach.

Not quite. The Rademacher Floating-Point Library

  • uses round-toward-zero in its mathematical model, and
  • never looks at individual bits.

Instead of looking at one random bit at a time, RFPL takes the raw pseudo-random number generator output x in [0, 2^b) and directly computes the exponent by computing

exponent = floor(log2(x))

In integer arithmetic this can be done cheaply because calculating floor(log2(x)) amounts to counting leading zeros; many CPUs have built-in instructions for this purpose and GCC offers __builtin_clz().

The code below is a quickly written benchmark comparing the two approaches. The RFPL approach is slightly faster on my x86_64 computer with GCC 9.2.0, but the extra implementation effort may not be worth it: you need only two bits on average, and the result is probably highly compiler- and machine-dependent.

Note that rfpl_rand() generates floats in the half-open interval [0,1) whereas downey_rand() computes floats in the closed interval [0,1] because of the branch condition mant == 0 && random_bit in line 76.

// This code attempts to demonstrate the performance differences between drawing
// one and multiple random bits at a time when computing floating-point values
// in the unit interval.
//
// Compile with (`-DNDEBUG` is essential: it disables the assertions):
//   c++ -Wextra -Wall -std=c++11 -pedantic single-vs-block.cpp
//       -O2 -march=native -DNDEBUG
//
// Author: Christoph Conrads (https://christoph-conrads.name)

#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <limits>
#include <random>

// this code uses the noinline attribute to prevent extreme compiler
// optimizations

// Code based on
// Allen B. Downey: "Generating Pseudo-random Floating-Point Values" (2007).
// URL: http://allendowney.com/research/rand/

typedef union box {
	float f;
	unsigned u;
} Box;

/* GET_BIT: returns a random bit.
 * For efficiency, bits are generated 64 at a time with std::mt19937_64
 * and handed out one by one.
 */
int get_bit (std::mt19937_64& gen) {
	int bit;
	static int bits = 0;
	static std::uint64_t x;

	if (bits == 0) {
		x = gen();
		bits = 64;
	}

	bit = x & 1;
	x = x >> 1;
	bits--;
	return bit;
}

/* RANDF: returns a random floating-point number in the closed range [0, 1],
 * including 0.0, subnormals, and 1.0
 */
float downey_rand(std::mt19937_64& gen) __attribute__((noinline));
float downey_rand(std::mt19937_64& gen) {
	int mant, exp, high_exp, low_exp;
	Box low, high, ans;

	low.f = 0.0;
	high.f = 1.0;

	/* extract the exponent fields from low and high */
	low_exp = (low.u >> 23) & 0xFF;
	high_exp = (high.u >> 23) & 0xFF;
	/* choose random bits and decrement exp until a 1 appears. the reason for
	 * subtracting one from high_exp is left as an exercise for the reader
	 */
	for(exp = high_exp-1; exp > low_exp; exp--) {
		if (get_bit(gen))
			break;
	}
	/* choose a random 23-bit mantissa */
	mant = gen() & 0x7FFFFF;
	/* if the mantissa is zero, half the time we should move to the next
	 * exponent range
	 */
	if(mant == 0 && get_bit(gen))
		exp++;

	/* combine the exponent and the mantissa */
	ans.u = (exp << 23) | mant;

	assert( std::isfinite(ans.f) );
	assert( ans.f >= 0 );
	assert( ans.f <= 1 );

	return ans.f;
}



float rfpl_rand(std::mt19937_64& gen) __attribute__((noinline));
float rfpl_rand(std::mt19937_64& gen) {
	Box low, high, ans;

	low.f = 0.0;
	high.f = 1.0;

	/* extract the exponent fields from low and high */
	auto low_exp = (low.u >> 23) & 0xFF;
	auto high_exp = (high.u >> 23) & 0xFF;
	// number of bits generated by the random number generator
	auto num_bits = 64u;
	auto exp = high_exp + 1;

	for(auto exp_max = high_exp; true; exp_max -= num_bits) {
		auto exp_min = (exp_max >= num_bits) ? (exp_max - num_bits) : 0u;
		auto x = gen();

		if(x == 0) {
			if(exp_max <= num_bits)
			{
				exp = low_exp;
				break;
			}

			continue;
		}

		// the built-ins have undefined behavior for x == 0
		assert(x != 0);

		auto k = __builtin_clzll(x);
		exp = exp_min + num_bits - k - 1;
		break;
	}

	assert(exp >= low_exp);
	assert(exp <= high_exp);

	/* choose a random 23-bit mantissa */
	auto mant = gen() & 0x7FFFFFu;

	/* combine the exponent and the mantissa */
	ans.u = (exp << 23) | mant;

	assert( std::isfinite(ans.f) );
	assert( ans.f >= 0 );
	assert( ans.f < 1 );

	return ans.f;
}


int main() {
	auto num_downey = std::uintmax_t{0};
	auto dummy_downey = float{0};
	auto num_rfpl = std::uintmax_t{0};
	auto dummy_rfpl = float{0};

	auto gen = std::mt19937_64();

	auto c_delta = 5 * CLOCKS_PER_SEC;
	auto c0 = std::clock();
	auto c1 = c0;

	// dummy code to avoid CPU frequency scaling
	auto dummy = float{1};
	auto eps = std::numeric_limits<float>::epsilon();

	for( ; c1 < c0 + c_delta; c1 = std::clock()) {
		dummy = (1 + eps) * dummy;
	}

	std::printf("CPU frequency scaling dummy: %8.2e\n", dummy);

	// measurement Downey approach
	c0 = std::clock();
	c1 = c0;

	for( ; c1 < c0 + c_delta; c1 = std::clock()) {
		for(auto i = 0u; i < 1000u * 1000u; ++i) {
			++num_downey;
			dummy_downey += downey_rand(gen);
		}
	}

	auto t_downey_sec = 1.0 * (c1 - c0) / CLOCKS_PER_SEC;

	std::printf("Downey: %8.2e per sec\n", 1.0 * num_downey / t_downey_sec);
	std::printf("Downey dummy: %8.2e\n", dummy_downey);

	// measurement Rademacher Floating-Point Library approach
	c0 = std::clock();
	c1 = c0;

	for( ; c1 < c0 + c_delta; c1 = std::clock()) {
		for(auto i = 0u; i < 1000u * 1000u; ++i) {
			++num_rfpl;
			dummy_rfpl += rfpl_rand(gen);
		}
	}

	auto t_rfpl_sec = 1.0 * (c1 - c0) / CLOCKS_PER_SEC;

	std::printf("RFPL: %8.2e per sec\n", 1.0 * num_rfpl / t_rfpl_sec);
	std::printf("RFPL dummy: %8.2e\n", dummy_rfpl);
}

@curiousleo force-pushed the unbiased-floating-point-rebased branch from 7cab4e6 to cd51766 on April 24, 2020 10:58
@curiousleo (Collaborator, Author) commented Apr 24, 2020

@christoph-conrads thanks for your comments!

In integer arithmetic this can be done cheaply because calculating floor(log2(x)) amounts to counting leading zeros; many CPUs have built-in instructions for this purpose and GCC offers __builtin_clz().

In Haskell there is countLeadingZeros, which is used in this implementation directly:

then return $ acc + countLeadingZeros w

If I understand it correctly, rademacher-fpl uses logarithms instead of clz because it aims to support non-power-of-2 PRNGs directly. The assumption in this implementation is that the underlying source of randomness produces random 32-bit or 64-bit words, which allows us to use countLeadingZeros to compute the required geometric distribution over the exponents.

uses round-toward-zero in its mathematical model

I'm curious about this one. What's the motivation for / what are the consequences of this choice?

@christoph-conrads

If I understand it correctly, rademacher-fpl uses logarithms instead of clz because it aims to support non-power-of-2 PRNGs directly.

std::log2p1() (from the C++20 draft, renamed to std::bit_width() in the meantime) is implemented using clz, see rademacher-fpl/bit.hpp. In rademacher-fpl/impl/uniform-int-distribution.hpp, line 40, RFPL provides its own std::uniform_int_distribution clone because it

  • benefits performance-wise if the underlying range is [0, 2^b) for some b.
  • works around GCC bug 80977.

uses round-toward-zero in its mathematical model

I'm curious about this one. What's the motivation for / what are the consequences of this choice?

Intuitively, it felt right to me that, given the exponent, all significands should have equal probability. That choice corresponds to rounding toward zero.

Consequences:

  • Given the exponent, all significands are equally likely.
  • Positive and negative floats behave perfectly symmetrically (they would not with, e.g., rounding toward positive infinity).
  • You must treat +0 and -0 as different numbers (Yes, RFPL can draw from [-0,+0]).
  • Given real numbers a < b that can be represented exactly in floating-point arithmetic, the generator returns FP values as if it had randomly drawn and rounded real values from
    • [a,b) if a ≥ 0,
    • (a',b] if b ≤ 0, where a' = nextafter(a, -∞) is the next smaller float after a,
    • (a', b) if a < 0, b > 0.

@idontgetoutmuch (Owner)

@christoph-conrads FYI

[nix-shell:~/random]$ ./a.out
CPU frequency scaling dummy: 2.13e+00
Downey: 6.49e+07 per sec
Downey dummy: 1.68e+07
RFPL: 7.19e+07 per sec
RFPL dummy: 1.68e+07

@idontgetoutmuch (Owner)

@curiousleo I think we can now close this?

@curiousleo (Collaborator, Author)

@curiousleo I think we can now close this?

I guess the approach taken in this PR is neither here nor there: while it does, I believe, give us all representable numbers in the unit interval as a starting point, we're still translating that into the target interval using floating-point operations.

Quoting from #113 (comment):

a + x * (b-a) does not work in floating-point (FP) arithmetic. Period. [...]
If you insist on this approach, you do not need to discuss intricacies of interval boundaries because the rounding error makes the discussion moot.

So I suppose this PR as it is isn't going to fly. Either we keep things "simple but subtly wrong" like most libraries out there, or we go all-in on generating arbitrary ranges with full coverage.

I intend to spend some time coming up with evil test cases to prod the current implementation and see in what ways it behaves surprisingly. That should help inform our next steps.

@christoph-conrads

@christoph-conrads FYI

[nix-shell:~/random]$ ./a.out
CPU frequency scaling dummy: 2.13e+00
Downey: 6.49e+07 per sec
Downey dummy: 1.68e+07
RFPL: 7.19e+07 per sec
RFPL dummy: 1.68e+07

Thank you for the measurement.
Are you using GCC 8 or 9 on x86-64? The performance might be the other way around with, say, GCC 7 on ARM.

@curiousleo (Collaborator, Author)

Out of curiosity, I have proved with sbv that a variant of this PR (without the carry bit; that is too much for z3) is indeed "dense" in [0,1). Running the following prints "Q.E.D.":

import Data.SBV

main :: IO ()
main = do
  proof <- prove $ do
    f <- forall "f"
    w32 <- exists "w32"
    w64a <- exists "w64a"
    w64b <- exists "w64b"
    constrain $ 0 .<= f .&& f .< 1
    return $ floatInUnitIntervalS w32 w64a w64b .== f
  putStrLn $ show proof

floatInUnitIntervalS :: SWord32 -> SWord64 -> SWord64 -> SFloat
floatInUnitIntervalS w32 w64a w64b =
  let maxExp = 126
      mantissaBits = 23
      carryMask = (2 :: Word32) ^ mantissaBits  -- unused: this variant omits the carry bit
      mantissaMask = (2 ^ mantissaBits) - 1

      m = w32 .&. mantissaMask
      da = sCountLeadingZeros w64a
      db = sCountLeadingZeros w64b
      d = ite (da .< 64) da (smin (64 + db) maxExp)
      e = sFromIntegral (maxExp - d) :: SWord32
  in
      sWord32AsSFloat $ (e `shiftL` mantissaBits) .|. m

That's not a very strong property, but it's something :)

@curiousleo (Collaborator, Author)

Closing; see #113 (comment) for context.
