TDD for an algorithm involving randomness
I would like to try out test-driven development, but the project I am working on involves a lot of randomness and I am very unsure about how I can test it. Here is a toy example of the kind of algorithm I may want to write:
Write a function taking no argument and returning a list of random integers satisfying the following properties:
- Each integer is between 0 and 10
- The same number doesn’t appear twice
- The list is of length 3 90% of the time, and of length 4 10% of the time
- There is a 50% chance for the number 3 to appear
I do not need to test exact statistical distribution, but obviously I would like tests that will fail if someone completely removes the corresponding code.
I am using an external RNG that you can assume is correct, and I am pretty free in how to structure the code, so I can use dependency injection to have tests use a fake RNG instead, but I still don’t really see how that would help. For instance even if I always use the same seed for the tests, as soon as I refactor the algorithm to pick random numbers in a different order, all the tests become meaningless.
I guess that the first two points could be tested by generating many cases and checking that the constraints are satisfied, but that doesn’t really feel like TDD.
For the last two points, I’m thinking of having tests with different configurations, where for instance the 90% is either 100% or 0%, and then I can test if the length of the list is indeed 3 or 4. I guess it would work, but it seems maybe a bit weak.
Are there any guidelines or other techniques to use when applying TDD to algorithms involving randomness?
Solution 1:[1]
There are several ways you can go about a problem like this, and I may add another answer in the future, but the approach that I immediately found most compelling would be to combine test-driven development (TDD) with property-based testing.
You can do this in many languages, with various frameworks. Here, I'm going to use the original property-based testing library, QuickCheck.
The first two requirements translate directly to predicates that QuickCheck can exercise. The latter two translate into distribution tests - a more advanced feature of QuickCheck that John Hughes explains in this presentation.
Each one in turn.
Preliminaries
Before writing the first test, you're going to set up tests and import the appropriate libraries:
module RintsProperties where
import Test.Framework (Test)
import Test.Framework.Providers.QuickCheck2
import Test.QuickCheck
import Q72168364
where the System Under Test (SUT) is defined in the Q72168364 library. The SUT itself is an action called rints (for Random INTS):
rints :: IO [Int]
Since it's going to generate random numbers, it'll have to run in IO.
Image
The first requirement says something about the image of the SUT. This is easily expressed as a property:
testProperty "Each integer is between 0 and 10" $ \() -> ioProperty $ do
actual <- rints
return $
counterexample ("actual: " ++ show actual) $
all (\i -> 0 <= i && i <= 10) actual
If you ignore some of the ceremony involved with producing a useful assertion message and such, the central assertion is this:
all (\i -> 0 <= i && i <= 10) actual
which verifies that all integers i in actual are between 0 and 10.
In true TDD fashion, the simplest implementation that passes the test is this:
rints :: IO [Int]
rints = return []
Always return an empty list. While degenerate, it fulfils the requirement.
No duplicates
The next requirement also translates easily to a predicate:
testProperty "The same number does not appear twice" $ \() -> ioProperty $ do
actual <- rints
return $ nub actual === actual
nub removes duplicates, so this assertion states that nub actual (actual where duplicates are removed) should be equal to actual. This is only going to be the case if there are no duplicates in actual.
In TDD fashion, the implementation unfortunately doesn't change:
rints :: IO [Int]
rints = return []
In fact, when I wrote this property, it passed right away. If you follow the red-green-refactor checklist, this isn't allowed. You should start each cycle by writing a red test, but this one was immediately green.
The proper reaction should be to discard (or stash) that test and instead write another one - perhaps taking a cue from the Transformation Priority Premise to pick a good next test.
For instructional reasons, however, I will stick with the order of requirements as they are stated in the OP. Instead of following the red-green-refactor checklist, I modified rints in various ways to assure myself that the assertion works as intended.
Length distribution
The next requirement isn't a simple predicate, but rather a statement about the distribution of outcomes. QuickCheck's cover function enables that - a feature that I haven't seen in other property-based testing libraries:
testProperty "Length is and distribution is correct" $ \() -> ioProperty $ do
actual <- rints
let l = length actual
return $
checkCoverage $
cover 90 (l == 3) "Length 3" $
cover 10 (l == 4) "Length 4"
True -- Base property, but really, the distribution is the test
The way that cover works, it needs to have a 'base property', but here I simply return True - the base property always passes, meaning that the distribution is the actual test.
The two instances of cover state the percentage with which each predicate (l == 3 and l == 4) should appear.
Running the tests with the degenerate implementation produces this test failure:
Length is and distribution is correct: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
Only 0% Length 3, but expected 90%
As the message states, it expected 90% of Length 3 cases, but got 0%.
Again, following TDD, one can attempt to address the immediate error:
rints :: IO [Int]
rints = return [1,2,3]
This, however, now produces this test failure:
Length is and distribution is correct: [Failed]
*** Failed! Insufficient coverage (after 400 tests):
100.0% Length 3
Only 0.0% Length 4, but expected 10.0%
The property expects 10% Length 4 cases, but got 0%.
Perhaps the following is the simplest thing that could possibly work?
import System.Random.Stateful

rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  if 10 < p then return [1,2,3] else return [1,2,3,4]
Perhaps not quite as random as you'd expect, but it passes all tests.
More threes
The final (explicit) requirement is that 3 should appear 50% of the time. That's another distribution property:
testProperty "3 appears 50% of the times" $ \() -> ioProperty $ do
actual <- rints
return $
checkCoverage $
cover 50 (3 `elem` actual) "3 present" $
cover 50 (3 `notElem` actual) "3 absent"
True -- Base property, but really, the distribution is the test
Running all tests produces this test failure:
3 appears 50% of the times: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
100% 3 present
Only 0% 3 absent, but expected 50%
Not surprisingly, it says that the 3 present case happens 100% of the time.
In the spirit of TDD (perhaps a little undisciplined, but it illustrates what's going on), you may attempt to modify rints like this:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  if 10 < p then return [1,2,3] else return [1,2,4,5]
This, however, doesn't work because the distribution is still wrong:
3 appears 50% of the times: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
89% 3 present
11% 3 absent
Only 11% 3 absent, but expected 50%
Perhaps the following is the simplest thing that works. It's what I went with, at least:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  includeThree <- uniformM globalStdGen
  if 10 < p
    then if includeThree then return [1,2,3] else return [1,2,4]
    else if includeThree then return [1,2,3,4] else return [1,2,4,5]
Not elegant, and it still doesn't produce random numbers, but it passes all tests.
Random numbers
While the above covers all explicitly stated requirements, it's clearly unsatisfactory, as it doesn't really produce random numbers between 0 and 10.
This is typical of the TDD process. As you write tests and SUT and let the two interact, you discover that more tests are required than you originally thought.
To be honest, I wasn't sure what the best approach would be to 'force' generation of all numbers between 0 and 10. Now that I had the hammer of distribution tests, I wrote the following:
testProperty "All numbers are represented" $ \() -> ioProperty $ do
actual <- rints
return $
checkCoverage $
cover 5 ( 0 `elem` actual) " 0 present" $
cover 5 ( 1 `elem` actual) " 1 present" $
cover 5 ( 2 `elem` actual) " 2 present" $
cover 5 ( 3 `elem` actual) " 3 present" $
cover 5 ( 4 `elem` actual) " 4 present" $
cover 5 ( 5 `elem` actual) " 5 present" $
cover 5 ( 6 `elem` actual) " 6 present" $
cover 5 ( 7 `elem` actual) " 7 present" $
cover 5 ( 8 `elem` actual) " 8 present" $
cover 5 ( 9 `elem` actual) " 9 present" $
cover 5 (10 `elem` actual) "10 present"
True -- Base property, but really, the distribution is the test
I admit that I'm not entirely happy with this, as it doesn't seem to 'scale' to problems where the function image is much larger. I'm open to better alternatives.
I also didn't want to be too specific about the exact distribution of each number. After all, 3 is going to appear more frequently than the others. For that reason, I just picked a small percentage (5%) to indicate that each number should appear not too rarely.
The implementation of rints so far failed this new test in the same manner as the other distribution tests.
Crudely, I changed the implementation to this:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  let l = if 10 < p then 3 else 4
  ns <- shuffle $ [0..2] ++ [4..10]
  includeThree <- uniformM globalStdGen
  if includeThree
    then do
      let ns' = take (l - 1) ns
      shuffle $ 3 : ns'
    else
      return $ take l ns
While I feel that there's room for improvement, it passes all tests and actually produces random numbers:
ghci> rints
[5,2,1]
ghci> rints
[9,2,10]
ghci> rints
[8,1,3]
ghci> rints
[0,9,8]
ghci> rints
[0,10,3,6]
This example used QuickCheck with Haskell, but most of the ideas translate to other languages. QuickCheck's cover function may be an exception to that rule, since I'm not aware that it's been ported to common language implementations, but perhaps I'm just behind the curve.
In situations where something like cover isn't available, you'd have to write a test that loops through enough randomly generated test cases to verify that the distribution is as required. A little more work, but not impossible.
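For illustration (this sketch is mine, not part of the original answer), such a loop-based distribution test might look like the following in Python, where rints stands in for the system under test and the sample size and tolerances are arbitrary choices:

import random

def rints():
    # Stand-in for the system under test; any implementation with the
    # required distribution would do here.
    length = 3 if random.randint(1, 100) > 10 else 4
    include_three = random.random() < 0.5
    others = [n for n in range(11) if n != 3]
    if include_three:
        return [3] + random.sample(others, length - 1)
    return random.sample(others, length)

def test_length_and_three_distribution():
    # Loop through many randomly generated cases and compare the observed
    # frequencies with the required ones, using a generous tolerance so the
    # test doesn't flicker.
    runs = 10_000
    samples = [rints() for _ in range(runs)]
    length_3_ratio = sum(len(xs) == 3 for xs in samples) / runs
    has_3_ratio = sum(3 in xs for xs in samples) / runs
    assert 0.85 <= length_3_ratio <= 0.95  # required: 90%
    assert 0.45 <= has_3_ratio <= 0.55     # required: 50%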
Solution 2:[2]
I would like to try out test-driven development, but the project I am working on involves a lot of randomness
You should be aware that "randomness" hits TDD in a rather awkward spot, so it isn't the most straightforward "try-it-out" project.
There are two concerns - the first is that "randomness" is a very expensive assertion to make:
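Consider the classic joke implementation (a stand-in; the original answer's exact example is assumed here):

def getRandomNumber():
    # chosen by fair dice roll.
    # guaranteed to be random.
    return 4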
How would you reliably distinguish between this implementation and a "real" random number generator that just happens to be emitting a finite sequence of 4s before changing to some other number?
So we get to choose between stable tests that don't actually express all of our constraints and more precise tests that occasionally report incorrect results.
One design approach here is to lean into "testability" - behind the facade of our interface will be an implementation that combines a general purpose source of random bits with a deterministic function that maps a bit sequence to some result.
def randomListOfIntegers():
    seed_bits = generalPurpose.random_bits()
    return deterministicFunction(seed_bits)

def deterministicFunction(seed_bits):
    ???
The claim being that randomListOfIntegers is "so simple that there are obviously no deficiencies", so we can establish its correctness by inspection, and concentrate our effort on the design of deterministicFunction.
Now, we run into a second problem: the mapping of seed_bits to some observable result is arbitrary. Most business domain problems (ex: a payroll system) have a single expected output for any given input, but in random systems you still have some extra degrees of freedom. If you write a function that produces an acceptable answer given any sequence of bits, then my function, which reverses the bits then calls your function, will also produce acceptable answers -- even though my answers and your answers are different.
In effect, if we want a suite of tests that alert when a code change causes a variation in behavior, then we have to invent the specification of the behavior that we want to lock in.
And unless we have a good guess as to which arbitrary behaviors will support a clean implementation, that can be pretty painful.
(Alternatively, we just lean on our pool of "acceptance" tests, which ignore code changes that switch to a different arbitrary behavior -- it's trade offs all the way down).
One of the simpler implementations we might consider is to treat the seed_bits as an index into a sequence of candidate responses:
def deterministicFunction(seed_bits):
    choices = ???
    n = computeIndex(seed_bits, len(choices))
    return choices[n]
This exposes yet another concern: k seed_bits means 2^k degrees of freedom; unless len(choices) happens to be a power of 2 and not bigger than 2^k, there's going to be some bias in choosing. You can make the bias error arbitrarily small by choosing a large enough value for k, but you can't eliminate it with a finite number of bits.
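To make the bias concrete, here is one hypothetical computeIndex (the original leaves it undefined), assuming seed_bits is a string of '0'/'1' characters:

def computeIndex(seed_bits, num_choices):
    # Interpret the bit string as an unsigned integer and reduce it modulo
    # the number of choices.  When 2**len(seed_bits) is not a multiple of
    # num_choices, low indices come up slightly more often than high ones;
    # adding bits shrinks that bias but never removes it.
    return int(seed_bits, 2) % num_choices

With 3 bits and 3 choices, for example, indices 0 and 1 are each produced by three of the eight bit patterns, while index 2 is produced by only two.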
Breaking down the problem further, we can split the work into two elements, one responsible for producing the pool of candidates, another for actually choosing one of them.
def deterministicFunction(seed_bits):
    return choose(seed_bits, weighted_candidates())

def choose(seed_bits, weighted_candidates):
    choices = []
    # note: the order of elements in this enumeration
    # is still arbitrary
    for candidate, weight in weighted_candidates:
        for _ in range(weight):
            choices.append(candidate)
    # technically, this is also still arbitrary
    n = computeIndex(seed_bits, len(choices))
    return choices[n]
At this point, we can decide to use "simplest thing that could possibly work" to implement computeIndex (test first, if you like), and this new weighted_candidates() function is also easy to test, since each test of it is just "count the candidates and make sure that the problem constraints are satisfied by the population as a whole". choose can be tested using much simpler populations as inputs.
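For example (a sketch of mine, assuming the choose above together with the computeIndex sketch from earlier), a trivially small population lets a test pin down the weighting exactly by enumerating every possible seed:

def test_choose_respects_weights():
    # With four weighted slots ("a" once, "b" three times) and every
    # possible 4-bit seed, "b" should be chosen exactly three times as
    # often as "a".
    population = [("a", 1), ("b", 3)]
    results = [choose(format(bits, "04b"), population) for bits in range(16)]
    assert results.count("b") == 3 * results.count("a")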
This kind of an implementation could be unsatisfactory - after all, we're building this data structure of candidates, and then another of choices, only to pick a single one. That may be the best we can do. Often, however, a different implementation is possible.
The problem specification, in effect, defines for us the size of the (weighted) population of responses. In other words, len(choices) is really some constant L.
choices = [ generate(n) for n in range(L)]
n = computeIndex(seed_bits, L)
return choices[n]
which in turn can be simplified to
n = computeIndex(seed_bits, L)
return generate(n)
Which is to say, we don't need to pass around a bunch of data structures if we can calculate which response is in the nth place.
Notice that while generate(n) still has arbitrary behavior, there are definitive assertions we can make about the data structure [generate(n) for n in range(L)].
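For instance (my sketch; it assumes the generate and L named above, and that L is chosen so the 90/10 and 50/50 weights divide it evenly), a test over the whole population might read:

def test_population_satisfies_the_requirements():
    population = [generate(n) for n in range(L)]
    for xs in population:
        assert all(0 <= x <= 10 for x in xs)  # each integer is between 0 and 10
        assert len(set(xs)) == len(xs)        # the same number doesn't appear twice
        assert len(xs) in (3, 4)
    # the weights, not individual draws, carry the distribution requirements
    assert sum(len(xs) == 3 for xs in population) == 9 * L // 10
    assert sum(3 in xs for xs in population) == L // 2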
Refactoring a bit to clean things up, we might have
def randomListOfIntegers():
    seed_bits = generalPurpose.random_bits()
    n = computeIndex(seed_bits, L)
    return generateListOfIntegers(n)
Note that this skeleton hasn't "emerged" from writing out a bunch of tests and refactoring, but instead from thinking about the problem and the choices that we need to consider in order to "control the gap between decision and feedback".
It's probably fair to call this a "spike" - a sandbox exercise that we use to better understand the problem we are trying to solve.
The same number doesn’t appear twice
An awareness of combinatorics is going to help here.
Basic idea: we can compute the set of all possible arrangements of 4 unique elements of the set [0,1,2,3,4,5,6,7,8,9,10], and we can use a technique called squashed ordering to produce a specific subset of them.
Here, we'd probably want to handle the special case of 3 a bit more carefully. The rough skeleton is going to look something like:
def generateListOfIntegers(n):
    other_numbers = [0,1,2,4,5,6,7,8,9,10]
    has3, hasLength3, k = decode(n)
    if has3:
        if hasLength3:
            # 45 distinct candidates
            assert 0 <= k < 45
            return [3] + choose(other_numbers, 2, k)
        else:
            # 120 distinct candidates
            assert 0 <= k < 120
            return [3] + choose(other_numbers, 3, k)
    else:
        if hasLength3:
            # 120 distinct candidates
            assert 0 <= k < 120
            return choose(other_numbers, 3, k)
        else:
            # 210 distinct candidates
            assert 0 <= k < 210
            return choose(other_numbers, 4, k)
Where choose(other_numbers, j, k) returns the kth subset of other_numbers with j total elements, and decode(n) has the logic necessary to ensure that the population weights come out right.
The behavior of choose is arbitrary, but there is a "natural" order to the progression of subsets (i.e., you can "sort" them), so it's reasonable to arbitrarily use the sorted order.
It's probably also worth noticing that choose is very general purpose - the list we pass in could be just about anything, and it really doesn't care what you do with the answer. Compare that with decode, where our definition of the "right" behavior is tightly coupled to its consumption by generateListOfIntegers.
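One possible choose (a sketch of mine; the original deliberately leaves the definition open) leans on the fact that itertools.combinations already enumerates subsets in that sorted order:

from itertools import combinations, islice

def choose(numbers, j, k):
    # The k-th j-element subset of `numbers`, taking the lexicographic order
    # produced by itertools.combinations as the "natural" sort order.
    return list(next(islice(combinations(numbers, j), k, None)))

With ten other_numbers this yields exactly the 45, 120, and 210 candidate counts asserted in the skeleton above (C(10,2), C(10,3), and C(10,4)); for much larger problems you would swap the linear scan for proper combinatorial unranking.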
You may want to review Peter Seibel's Fischer Chess Exercise to see the different approaches people were taking when TDD was new. Warning: the threading is horribly broken now, so you may need to sift through multiple threads to find all the good bits.
Solution 3:[3]
First of all, there is more than one approach to TDD, so there's no single right answer. But here's my take on this:
You mentioned that you don't need to test the exact statistical distribution, but I think that you must. Otherwise, writing the simplest code that satisfies your tests will result in a completely deterministic, non-random solution. (If you look at your requirements without thinking about randomness, you'll find that you can satisfy them using a simple loop and a few if statements.) But apparently, that's not what you really want. Therefore in your case, you must write a test that checks the statistical distribution of your algorithm.
Such tests need to gather many results of the function under test and therefore may be slow, so some people will consider this a bad practice, but IMHO it is the only way to actually test what you really care about.
Assuming that this is not only a theoretical exercise, you may also want to save the results to a file that you can later examine manually (e.g. using Excel), check additional statistical properties of the results, and potentially add or adjust your tests accordingly.
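A minimal sketch of that last step (assuming a rints function as the function under test, as in the loop example further up):

import csv

def save_samples(path="samples.csv", runs=10_000):
    # Dump the raw results so the distribution can be eyeballed later,
    # e.g. with a pivot table in Excel.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["length", "contains_3", "values"])
        for _ in range(runs):
            xs = rints()
            writer.writerow([len(xs), int(3 in xs), " ".join(map(str, xs))])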
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Mark Seemann |
Solution 2 | VoiceOfUnreason |
Solution 3 | Arnon Axelrod |