Generating random distributions
By bmerry
TopCoder Member
One of the features that I have found interesting about the Marathon
Match format is that the data is hidden, yet you have a pretty good
idea of what it is going to look like. This is because all the cases
are randomly generated, and the problem always contains a detailed
specification of the generation method. While there
is sometimes a visualization tool that can generate test cases
for you, other times you'll have to do it yourself. In this tutorial,
we'll look at a few probability distributions and how to
generate values based on them.
I'll be discussing the issues in terms of C/C++ (my
language of choice). Accordingly, some of the initial, low-level stuff
won't be applicable to other languages that have more
user-friendly random number libraries. However, the later
material is applicable to all languages.
Uniform scalar and discrete distributions
Let's start with something simple: generating a single
variable. The basic building block is the uniform discrete
distribution, meaning that each of N possibilities is equally
likely. The approach that is often used when you want an
arbitrary number and aren't worried about things being exactly
uniform is rand() % N, but there are a few problems
with this. Firstly, some random number generators are badly
designed, with the low-order bits repeating quite quickly. To
be safe, I'd recommend that you use something like:
(int) (N * (rand() / (RAND_MAX + 1.0)))
This will first obtain a real number in the range [0, 1), then scale and truncate it. Another problem is that unless N
is a factor of RAND_MAX + 1, either method will
generate a small amount of bias, particularly if N is quite
large. Think about trying to make a choice among 5 options
using a 6-sided die: there is no way to map the 6 die rolls to
the 5 options in an unbiased way. The solution is the
metaphorical equivalent of rolling the die again if a 6 comes
up: we identify a useful range of random outputs (a multiple
of N), and if anything else comes up, we discard it. The
simplest solution is to reject any rand() that
is at least RAND_MAX - RAND_MAX % N. This uses as
much of the range as possible, and since this is always at
least half of it, the average number of calls to
rand() is at most 2 and usually far less.
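As a sketch, the rejection loop might look like the following (the function name random_int is my own):

#include <cstdlib>

// Return a uniform random integer in [0, N), with the bias removed
// by rejecting values of rand() in the leftover tail of the range.
int random_int(int N)
{
    int limit = RAND_MAX - RAND_MAX % N;   // limit is a multiple of N
    int r;
    do {
        r = rand();
    } while (r >= limit);                  // reject the biased tail, roll again
    return r % N;                          // now unbiased
}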
If your favourite language doesn't provide you with a way
to generate a random real number in the range [0, 1), you can
generate one as rand() / (RAND_MAX + 1.0). This
is not perfect, because it will only generate one of typically
2^31 values, but for most purposes that is good
enough.
If I'm writing any kind of data generator, I start by
writing utility functions to implement these algorithms, so
that I don't need to worry about the details when I'm doing
the interesting parts.
Non-uniform discrete distributions
Let's say that you have a loaded die, with specified
probabilities for each outcome. This can be simulated by
mapping each outcome to a suitable range of the [0, 1)
interval, then generating a uniform random number. For
example, suppose a die has probabilities of 0.1, 0.2, 0.1,
0.3, 0.15 and 0.15 of coming up 1, 2, 3, 4, 5 or 6. We can map
these events to the ranges [0, 0.1), [0.1, 0.3), [0.3, 0.4),
[0.4, 0.7), [0.7, 0.85) and [0.85, 1.0). We then generate a
random number in [0, 1) with a uniform (unbiased)
distribution, and check which range it lies in. If there are
many possible events, a binary search can accelerate this
lookup.
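As a minimal sketch of this technique (the names are mine; the cumulative values correspond to the die above):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Given the cumulative probabilities of the outcomes (last entry 1.0),
// draw a uniform number in [0, 1) and binary-search for its interval.
int sample_discrete(const std::vector<double> &cum)
{
    double r = rand() / (RAND_MAX + 1.0);   // uniform in [0, 1)
    // The first cumulative value strictly greater than r identifies
    // the range that r falls into.
    return std::upper_bound(cum.begin(), cum.end(), r) - cum.begin();
}

// For the loaded die above, the cumulative sums of
// 0.1, 0.2, 0.1, 0.3, 0.15, 0.15 are:
//   std::vector<double> cum = {0.1, 0.3, 0.4, 0.7, 0.85, 1.0};
//   int face = sample_discrete(cum) + 1;   // faces are numbered 1..6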
In some cases, the probabilities will change at run-time. A
real-world instance of this is in streaming data compression,
where the probability of seeing a piece of text is updated
depending on the number of times it has already been seen. A
binary
indexed tree is ideal for this type of problem.
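Here is a sketch of how a binary indexed tree supports this (the structure and its names are my own, assuming integer weights): updates and sampling both take O(log n).

#include <cstdlib>
#include <vector>

// A binary indexed (Fenwick) tree over outcome weights. add() changes a
// weight in O(log n); sample() finds the outcome whose cumulative weight
// range contains r, also in O(log n).
struct WeightTree
{
    int n;
    std::vector<long long> tree;   // standard 1-based Fenwick layout

    WeightTree(int n) : n(n), tree(n + 1, 0) {}

    // Add delta to the weight of outcome i (0-based).
    void add(int i, long long delta)
    {
        for (i++; i <= n; i += i & -i) tree[i] += delta;
    }

    // Sum of all weights.
    long long total() const
    {
        long long sum = 0;
        for (int i = n; i > 0; i -= i & -i) sum += tree[i];
        return sum;
    }

    // Given 0 <= r < total(), return the outcome whose cumulative
    // weight range contains r, by descending the implicit tree.
    int sample(long long r) const
    {
        int pos = 0, step;
        for (step = 1; 2 * step <= n; step *= 2) {}
        for (; step > 0; step /= 2)
            if (pos + step <= n && tree[pos + step] <= r)
            {
                pos += step;
                r -= tree[pos];
            }
        return pos;   // 0-based outcome index
    }
};

// Usage: draw a uniform r in [0, total()), then call sample(r).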
Non-uniform continuous distributions
The idea of the previous section can be generalized to
continuous distributions. As an example, let's suppose that
some number lies in the range [0, 1), but the probability
density at x is directly proportional to x (so
values close to 1 are more likely). First, let's find out
exactly what the probability density function is: it's going
to have the form p(x) = ax, but we need to
choose a so that the integral is 1. Doing a little
calculus shows that a = 2. Next, we want the cumulative
distribution function, which is the integral of the
probability density function; in this case c(x) =
x^2. To generate a random number with a given
distribution, the procedure is to start with a uniform random
number in the range [0, 1), then pass it through the
inverse of the cumulative distribution function. In
this case, we just take sqrt(r), where r is a
uniform random number in [0, 1).
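For this example, the whole recipe fits in a few lines (random_linear is my own name):

#include <cmath>
#include <cstdlib>

// Sample from p(x) = 2x on [0, 1): the cumulative distribution
// function is c(x) = x^2, so applying its inverse (sqrt) to a
// uniform sample gives the desired distribution.
double random_linear()
{
    double r = rand() / (RAND_MAX + 1.0);   // uniform in [0, 1)
    return sqrt(r);
}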
This works when the cumulative distribution function is both
known and reasonably easy to invert (although binary search
can always be used for the inversion). Unfortunately, the most
important distribution, the Gaussian or normal
distribution, has no closed-form formula for its cumulative
distribution function. Curiously, it is easier to generate
two normally distributed random numbers at once. The
following algorithm will do this. I won't provide a proof, but
if you want to try to prove it yourself, think about
integrating the bivariate Gaussian distribution in polar
form.
1. Pick two random numbers x and y uniformly from the interval [-1, 1].
2. Let r^2 = x^2 + y^2.
3. If r^2 > 1, go back to step 1.
4. Let u = sqrt(-2 log(r^2) / r^2) * x.
5. Let v = sqrt(-2 log(r^2) / r^2) * y.
6. u and v both have the standard normal distribution.
Although you get two random numbers out, I usually discard
one rather than try to keep it around for the next time I need
a random number. The numbers generated also have the standard
normal distribution, which has a mean of 0 and a standard
deviation of 1. If you need a different mean and/or standard
deviation, it is sufficient to scale and bias the number obtained.
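Putting the pieces together, here is a minimal sketch of the polar method (the function name and parameters are mine; I also reject the case r^2 = 0, where the logarithm is undefined):

#include <cmath>
#include <cstdlib>

// Generate a normally distributed number with the given mean and
// standard deviation using the polar method above. The second value
// (v) is simply discarded, as described in the text.
double random_gaussian(double mean, double stddev)
{
    double x, y, r2;
    do {
        x = 2.0 * (rand() / (RAND_MAX + 1.0)) - 1.0;   // uniform in [-1, 1)
        y = 2.0 * (rand() / (RAND_MAX + 1.0)) - 1.0;
        r2 = x * x + y * y;
    } while (r2 > 1.0 || r2 == 0.0);    // r2 == 0 would make log undefined
    double u = sqrt(-2.0 * log(r2) / r2) * x;   // standard normal
    return mean + stddev * u;                   // scale and bias
}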
Coming soon
So far we've seen how to generate random
distributions with a single variable. In part 2, we'll look at
generating random collections of elements, such as
permutations, combinations and subsets.