testing

Code for comparing TDD to Property-Based Testing

marick

20 Apr 2021 • 14 min read

Hillel Wayne gave an example problem. The idea, as I understood it, was for people to solve it via property-based testing (PBT) or test-driven design and see what the differences were. In particular, I'm interested whether I missed any bugs that could have been found with PBT.

I don't know if that will happen, but I'm writing up my process so that it can be compared to Johannes Link's PBT process.

Start

commit

As is my habit, I started with a trivial test and an implementation that returns only a constant value. The test and code are in the same file for now, just for convenience.

  def can_afford?(_budget, _bill) do
    true
  end

  test "total budget always passes" do
    budget = %{total_limit: 50}
    bill = %{cost: 50}

    assert can_afford?(budget, bill)
  end

Note that I reflexively test a boundary where the bill is the same as the total_limit. Note also that I got the structure of the bill wrong: it should be a list of maps, not a map. I noticed that when writing the next test case.

Boundaries

commit

The next test might as well try both sides of the boundary:

  test "total budget boundaries" do
    bill = [%{cost: 50}]

    assert can_afford?(budget(50), bill)
    refute can_afford?(budget(49), bill)
  end

budget is just a function that avoids a little creational clutter:

  def budget(total_limit) do
    %{total_limit: total_limit}
  end

The bill is a list, which means reaching into the toolbox of mapping functions ( for is just syntactic sugar for Enum.map, and Enum.sum is just a simple use of Enum.reduce ):

  def can_afford?(budget, bill) do
    total_charge =
      (for item <- bill, do: item.cost)
      |> Enum.sum
    total_charge <= budget.total_limit
  end

As is my habit, I'm not going to bother testing different sizes of lists. It's hard to mess up use of mapping functions, so a singleton list suffices.

Counts

commit

The irregularity of some bill items having counts and some not is already bugging me, so I'll tackle counts next. Here's a bill with both types of item:

    test "bill items can have counts" do
      bill = [%{cost: 20}, %{cost: 20, count: 2}]
      assert can_afford?(budget(60), bill)
      refute can_afford?(budget(59), bill)
    end

Note that I'm maintaining the boundary cases, as that's a fine way to see if the explicit count is taken into, um, account.

For no wildly compelling reason, I split can_afford? into two responsibilities. The new work is done in lines 3-4:

  def total_charge(bill) do
    item_charges = for item <- bill do
      count = Map.get(item, :count, 1)            ##### 3
      item.cost * count                           ##### 4
    end
      
    Enum.sum(item_charges)
  end

  def can_afford?(budget, bill) do
    total_charge(bill) <= budget.total_limit
  end

Normalization

commit

I started working on categories, but was immediately bugged by another irregularity: some items have categories and some don't. I decided to create consistency in a "normalization" step. For example, every item will have a categories key, whose value may be an empty list.

Instead of forcing every item to default to a count of 1, I decided to split any item with a count of n into n items, for two reasons:

That way, there's only one place in all the code that will ever have to worry about counts.
Without thinking ahead too much, I suspected that would make later code easier. The Hillel doc's "should be as permissive as possible" requirement must mean that sometimes, given an item with count 2, the item's total cost will have to be divided up between two categories. That seems like it will come for free if I have two items.

I started by normalizing a single item into n items, removing the count field. You'll note that I began to indulge in tabular tests, the way I often do. (It looks a bit prettier in the original source because the lines can be longer.)‌

    test "normalizing a bill" do
      expect = fn bill, expected ->
        assert normalize(bill) == expected
      end
      
      %{cost: "c"}  
          |> expect.([%{cost: "c", categories: []}])
      %{cost: "c", categories: ["a"]} 
          |> expect.([%{cost: "c", categories: ["a"]}])
      %{cost: "c", categories: ["a", "b"], count: 2}
          |> expect.([%{cost: "c", categories: ["a", "b"]},
                      %{cost: "c", categories: ["a", "b"]}])
    end

The solution is, again, a straightforward mapping:‌

  def normalize(bill) when is_map(bill) do
    cost = bill.cost
    categories = Map.get(bill, :categories, [])
    count = Map.get(bill, :count, 1)

    Enum.map(1..count, fn _ ->
      %{cost: cost, categories: categories}
    end)
  end

I'm using a typical idiom, converting an item into a list of one or more items, with the idea of "flatmapping" (concatenating) them together:

  def normalize(bills) when is_list(bills) do
    Enum.flat_map(bills, &normalize/1)
  end

Notice that I'm doing (in effect) "polymorphic type dispatch." If the thing being normalized is a map, I assume it's an item. If it's a list, I assume it's a list of items. I won't be surprised if I turn the idea of "item" from a convention into a real datatype, but I'm in no hurry.

That second normalize variant was driven by this test:

    test "normalizing bills" do
      actual = normalize([%{cost: 20, count: 2}])
      assert actual == [%{cost: 20, categories: []},
                        %{cost: 20, categories: []}]
    end

Once again, I'm using a single element list because I trust that Elixir has implemented flat_map correctly.

Cleanup

commit 1 , commit 2

I noticed sloppiness and bad naming. For example, normalize used the names bill and bills when it should have used item and bill.

I decided I preferred the idea of remaining_total_after_bill to total_charge, so I changed the code to implement that. (I think I thought the first abstraction would fit later code better. That turns out to be wrong, so I should have waited until I knew the first abstraction was better.)

I also decided to put the test and product code into different files, since I was starting to have to scroll around to find things.

One category

commit

For the moment, I'll work only on debiting of category limits, without worrying about the enclosing structure with its total_limit and categories fields. So the first step is to show that an existing category limit can be debited. That is, given a category cat with limit 10, debiting 8 from it will leave the limit at 2:‌

  describe "debiting of categories" do
    test "debit of existing category" do
      actual = remaining_category_total_after_bill(%{cat: 10}, :cat, 8)
      assert actual == %{cat: 2}
    end
  end

I chose that function signature because it mimics existing Elixir Kernel functions. So the implementation is a standard Map.update:

  def remaining_category_total_after_bill(
    categories, category, bill_amount) do
  
    Map.update!(categories, category, &(&1 - bill_amount))
  end

Notice that I confused "bill" and "item" again in creating remaining_category_total_after_bill. It really updates one category limit with the cost of one bill item. Argh!

I frequently get names wrong, which is part of the reason I have so many cleanup breaks while coding. That's when I most often notice the bad names. While coding, my brain seems to treat them as arbitrary tokens.

A missing category

commit

The requirements are clear that a category named in an item might not exist in what I'm starting to think of as the "limit holder" (the map from category names to integers). To make this case visually different from the previous one, I use tabular tests again:

  test "debiting of categories" do
    expect = fn [categories, category, amount], expected ->
      actual = remaining_category_total_after_bill(categories, category, amount)
      assert actual == expected
    end

    [%{category: 10}, :category, 8] |> expect.(%{category:  2})
    [%{category: 10}, :MISMATCH, 8] |> expect.(%{category: 10})
  end

‌That requires behaving differently if the limit holder doesn't contain the category. Two test cases, two branches:

  def remaining_category_total_after_bill(
    categories, category, bill_amount) do
     
    case Map.has_key?(categories, category) do
      true ->
        Map.update!(categories, category, &(&1 - bill_amount))
      false ->
        categories
    end
  end

Cleanup, and `Item`

commit 1, commit 2

Time for cleanup. One way to avoid my constant confusion between bills and bill items is to make "item" into a proper structure with its own constructor:

defmodule HillelBudget.Item do
  defstruct [:cost, :categories]

  def item(cost, categories), 
    do: %__MODULE__{cost: cost, categories: categories}

I also moved all the normalization code into Item, because the module is all about items and lists of items.

The constructor makes creation of items in tests a bit less noisy:

                      |> expect.([item("c", ["a", "b"]),
                                  item("c", ["a", "b"])])

... and I think using the real word item instead of a literal map, and a list of HillelBudget.Items instead of a literal list of literal maps will avoid some of my bad naming.

Were you to say I should have created Item earlier, I wouldn't argue.

Normalization, again

commit

There's a certain amount of messiness with the previous test. I'm using "c" there to represent the idea of "some irrelevant cost". (I used "c" instead of "cost" to avoid line wrapping – a bad decision.) In my Midje test framework for Clojure, I gave such "metaconstants" their own notation, but I can't do that in Elixir. What I should do is something like this:

      c = 535 # This represents some random cost.

... and use c from then on. You could certainly argue that I ought to make c a variable that gets assigned different values by a PBT generator. Except I think that signals the wrong thing: that the value of c might matter to the code, whereas I want to signal that it does not.

Nevertheless, that made me realize something. I was following Hillel's examples and using strings for category names, but my previous test used atoms (symbols). In Elixir, atoms are more convenient than strings for things like keys (as well as being easier to type on my keyboard and standing out better with my code colorizer), so I decided to convert categories in the normalization step. For example:

     %{cost: c, categories: ["a"]} 
                             ^^^
                       |> expect.([item(c, [:a     ])])
                                            ^^

(I know such a conversion is not a good idea if you're coding a web server that handles floods of untrustworthy user data. But I'm not.)

Making "limit holder" more concrete

commit

I'd already been thinking of the map from category names to category limits as a "limit holder." That seems an overly-general name: why not use the word "category"? Because I'd noticed something about a budget:

%{total_limit: 5, 
  category_limits: %{ 
    category_a: 1, 
    category_b: 3
}}

The whole looks a bit like a reflection of the part. That is, the whole budget can be seen as a limit holder for the category total_limit, if you sort of squint and ignore category_limits.

So I used the more general name in a vague hope that maybe I could somehow someday cleverly use the same code for the total limit as for the category limits.

So all the category work from now on will happen in module HillelBudget.LimitHolder. I even added documentation!

@moduledoc """
  A limit-holder is a map of key-value pairs where the pairs are
  integers that represent a limit on spending. The limits may be
  negative if spending has exceeded the limit.
  """

Note, however, that I didn't make a LimitHolder named structure. The LimitHolder module just groups conventions for dealing with maps of a certain "shape."

I also renamed remaining_category_total_after_bill to LimitHolder.decrement.

In which I disappoint Bertrand Meyer

commit

Because of the requirement that can_afford? must answer "yes" if there's any possible assignment of costs to categories that keeps them all in budget, we have a classic search problem. My original intention was to just brute-force generate all possible assignments and see if any worked.

But I decided to be at least a little ambitious. If the code discovered that item five of ten in a bill made success impossible down one branch of the search tree, it shouldn't generate the remaining items since they were all guaranteed impossible.

In practical terms, that means: now that I can decrement a limit holder with data from one item, I need to figure out how to integrate decrementing with the processing of a whole bill.

I did a little fiddling around ("spiking"). It seemed more convenient if decrement(limit_holder, category, amount) either returns a decremented limit holder or the magic value :overdrawn to indicate there's no point in going on. (Essentially, this is like the Maybe or Either structure in languages that eschew nulls.)

I'd read Bertrand Meyer's Object-Oriented Software Construction (1988) when it came out, and he made a strong case that you should have separate methods for "do this!" and "what happened?" Even though I'm programming in a functional language, I felt guilty for violating his guideline, but doing that seemed right. Or at least not wrong. So:

  # Forgive me, Bertrand Meyer, but combining the update and 
  # the query seems to make handling missing keys easier.

That change is driven by a tweaked test for decrement:

  test "debiting of values" do
    expect = fn [holder, key, amount], expected ->
      assert LimitHolder.decrement(holder, key, amount) == expected
    end

    [%{key: 10}, :key,      8] |> expect.(%{key:  2})
    [%{key: 10}, :MISMATCH, 8] |> expect.(%{key: 10})

    # Overdrawing
    [%{key: 10}, :key,      10] |> expect.(%{key: 0})
    [%{key: 10}, :key,      11] |> expect.(:overdrawn)
                                           ^^^^^^^^^^
  end

And the code grows a new case:

  def decrement(holder, key, amount) do
    case Map.has_key?(holder, key) do
      true ->
        new_balance = Map.get(holder, key) - amount
        if new_balance < 0 do                   #<<<<#
          :overdrawn                            #<<<<#
        else
          Map.put(holder, key, new_balance)
        end
      false ->
        holder
    end
  end

(I'm not wild about nested cases, but I'll tolerate this one.)

Bad programmer! Bad!

Side note: As I write this blog post, I've just discovered that the last sentence of my initial comment for module LimitHolder was made false by the commit I just described:

  @moduledoc """
  A limit-holder is a map of key-value pairs where the pairs are
  integers that represent a limit on spending. The limits may be
  negative if spending has exceeded the limit.
  """

A LimitHolder may now never be decremented to a negative value. (Note: I asked Hillel if they could ever start with a negative value, and he said I should assume not.)

I'm not as hard-core as some people about shunning comments, but this is an example of their danger: saying nothing is often better than telling a truth that becomes a lie. Yes, I should be scrupulous enough to review comments after making a change. But I'm bad at both writing and reading comments when I'm in "coding mode," for the same reason, I suppose, that I'm bad at choosing names. I need a specific cleanup phase. (But that wasn't enough, because the wrong comment remains in the final commit. I probably never looked at the paragraph again. I need a cleanup checklist.)

Breadth-first

commit

Now I have to handle constructing the search tree. The easy cases are where items have zero or one categories. The test is a little awkward to read because it takes a single bill (original) and applies an item to it.

    test "the easy category cases: 0 or 1" do
      original = [%{category: 10}]
      
      expect = fn item, expected ->
        assert LimitHolder.apply_item(original, item) == expected
      end

      # cases that do not apply
      item(10, [     ]) |> expect.(original)
      item(10, [:miss]) |> expect.(original)

      # one category, and it matches
      item(10, [:category]) |> expect.([%{category: 0}])
      item(11, [:category]) |> expect.([              ])  # overdrawn
    end

I can't show the code that passes the test because I didn't commit it. Instead I wrote tests for the cases where multiple properties caused a split in the search tree:

test "the cases that require splitting" do
  original = [%{cat_a: 10, cat_b: 10}]
  
  expect = fn item, expected ->
    assert LimitHolder.apply_item(original, item) == expected
  end

  # the need to split
  item(10, [:cat_a])         |> expect.([%{cat_a:  0, cat_b: 10}])
  item(10, [:cat_b])         |> expect.([%{cat_a: 10, cat_b:  0}])
  item(10, [:cat_a, :cat_b]) |> expect.([%{cat_a:  0, cat_b: 10},
                                         %{cat_a: 10, cat_b:  0}])

  # overdrawn cases are pruned
  item(11, [:cat_a])         |> expect.([])
  item(11, [:cat_b])         |> expect.([])
  item(11, [:cat_a, :cat_b]) |> expect.([])

Now we see why :overdrawn is useful. We can prune limit holders from the growing tree without having to look inside them to see which, if any, categories are overdrawn:

  def apply_item(holders, item) do
    case item.categories do
      [] ->                           ###### 3
        holders                       ###### 4
      categories ->
        Enum.flat_map(categories, fn category ->
          for holder <- holders, do: decrement(holder, category, item.cost)
        end)
        |> Enum.reject(&(&1 == :overdrawn))    ####### 9
    end
  end

As I look at this now, I don't see why the first case (lines 3-4) is there. I suspect it's a holdover from the simpler code that handled the zero- and one-category cases.

Note again that line 9 is justified by the desire to avoid wasted work. At this point, I decided to keep putting efficiency hacks in. That's in order to give PBT more to chew on, to make it more likely to find bugs. (Ordinarily, I'd want evidence that they were needed before adding them.)

Cleanup

commit

I decided to do some cleanup. But instead of discovering the unnecessary code, I just repeated it in a different form:

  def apply_item(holders, %Item{categories: []}), do: holders

  def apply_item(holders, item) do ...

Sigh.

Plugging categories into `can_afford?`

commit

apply_item is still only called by tests. Next task is to start using it for real. First step is to sketch a new version of can_afford?

  def can_afford?(budget, bill) do
    items = Item.normalize(bill)

    total_ok = the_total_budget_is_ok(budget.total_limit, items)
    categories_ok = true    # <<<<<<<<<<<<

    total_ok && categories_ok
  end

Notice that I've changed remaining_budget_after_bill to total_budget_is_ok, my third version of the same basic concept.

This required no changes to the test because the highlighted line is just a placeholder.

Implementing `surviving_holders`

commit

categories_ok will be true whenever applying each bill item to the categories' limit holders leaves some "alive." So I wanted a function, surviving_holders, that takes some LimitHolders and a bill, does the splitting and pruning called for by the Items in a bill, returning a list of limit holders that represents all possible allocations of costs to categories (all those that don't violate any category limits).

That sounds complicated, but the hard work is already done by apply_item. I knew the function would look something like this:

  def surviving_holders(holders, items) do
    Enum.reduce(items, holders, 
      fn item, acc -> apply_item(acc, item) end)
  end

... and, indeed, that's exactly what it looks like.

Despite knowing the code would be simple, I decided to write a somewhat elaborate test to show the splitting-and-pruning logic. It has three parts. It starts with the boring case of a single-category item where the category (here, cat_a ) matches:

  describe "surviving holders" do 
    test "surviving holders" do
      original = [%{cat_a: 10, cat_b: 10}]
                    ^^^^^^^^^
      
      one = item(5, [:cat_a])
                 ^
      one_step = LimitHolder.surviving_holders(original, [one])
                                                         ^^^^^
      assert one_step == [%{cat_a: 5, cat_b: 10}]
                            ^^^^^^^^

My next step added more lines to that test (so that I could reuse the item named one), and demonstrated splitting. Here, a single LimitHolder spawns two:

      two = item(5, [:cat_a, :cat_b])
      two_step = LimitHolder.surviving_holders(original, [one, two])
      
      assert_good_enough(two_step,
        in_any_order([
          %{cat_a: 0, cat_b: 10}, # 5 subtracted twice from A
            ^^^^^^^^
          %{cat_a: 5, cat_b: 5}   # 5 taken once from A and B
                      ^^^^^^^^
        ]))

(assert_good_enough and in_any_order are from my flow_assertions package.)

Finally, I wanted to demonstrate how holders might be pruned (that is, the handling of :overdrawn). Notice below that the result that would come from subtracting 5 from cat_a three times (yielding -5) is omitted:

      three = item(5, [:cat_a, :cat_b])
      three_step = LimitHolder.surviving_holders(original, [one, two, three])
      
      assert_good_enough(three_step,
        in_any_order([
          %{cat_a: 0, cat_b: 5},
          %{cat_a: 0, cat_b: 5},
          %{cat_a: 5, cat_b: 0},
        ]))
    end
  end

It's reasonable to argue I should have written just enough of a test to force the simple implementation of surviving_holders. But this is really the only test that shows how the whole thing works. As such, it might serve as documentation. It also served as some reassurance that I really understood how everything fits together. (That is, it's what Bill Wake calls confirming tests.)

Optimization, again

commit

I made a couple of speculative optimizations to surviving_holders to see if I might introduce bugs PBT would find. See the commit for details. They just slightly complicate the code.

Connecting `surviving_holders` and `can_afford?`

commit

What remained was to force can_afford? to actually use surviving_holders. Any budget that violated a category limit without violating the total_limit would suffice. I used this:

   test "total budget boundaries" do
      budget = budget(100, a: 5)
      bill = [%{cost: 20, categories: ["a"]}]

      refute can_afford?(budget, bill)
    end

That led to replacing categories_ok = true with some real code:

  def can_afford?(budget, items) do
    items = Item.normalize(items)

    total_ok = 
      the_total_budget_is_ok(budget.total_limit, items)
    categories_ok = 
      all_category_budgets_are_ok(budget.category_limits, items)
      #^^^new^^^new^^^new^^^new^^^new^^^new^^^new^^^new^^^new^^^

    total_ok && categories_ok
  end
  
  def all_category_budgets_are_ok(category_limits, items) do
    not Enum.empty?(LimitHolder.surviving_holders([category_limits], items))
  end

At this point, I consider myself done. I have code I believe works.

One last performance hack

commit

However, the existing code checks all the categories even if the total_limit is exceeded, so that seemed worth fixing:

  def can_afford?(budget, items) do
    items = Item.normalize(items)

    if the_total_budget_is_ok?(budget.total_limit, items), 
    do: all_category_budgets_are_ok?(budget.category_limits, items),
    else: false
  end

Since there's no change to behavior, there's no need for new tests.

Tests as API documentation

commit

As a final flourish, I decided to add "doctests": runnable tests within the can_afford?'s documentation. I used the examples given in Wayne's specification. I admit that I wanted to see them work to get that final bit of confidence in my code.

Even if you rationally know you're justified in believing your process solved the problem, it's always nice to check and see.

As I typed them up, I wanted to expand on the examples a bit. I added a new case, wrote something like "... and this bill will cause can_afford? to be false..." – only to discover it actually returned true.

Rather than walk through the calculations myself, I decided to write an affordability function that would call the same functions as can_afford? but return rich results instead of funneling all the possibilities into true or false.

So affordability would return either this:

{false, "Total limit is exceeded by #{-remaining}."}

... or this:

{false, "There is no way to avoid *some* category's limit being exceeded."}

... or this:

{true, "Total budget remaining: #{remaining}",
       "Possible category budget allocations follow",
        surviving_holders}

... so it would be more clear why what I expected was what not what I got.

This was just a matter of replacing code that had returned true or false with code that returned the above annotated values.

Then I wrote can_afford? in terms of affordability:

  def can_afford?(budget, items) do
    case affordability(budget, items) do
      {false, _} -> false
      {true, _, _, _} -> true
    end
  end

I want to call this out because boolean functions are dangerous. They funnel down a vast number of possibilities into two values. It's very easy for an incorrect intermediate result to produce the correct final result just because of (bad) luck.

For that reason, it's better to test interior functions like affordability and separately test the reduction of detail to true or false than it is to test only through the narrow boolean interface.

I converted my tests to do that. If you look at the complete test suite, almost none of it is "about" can_afford?.

The final version

I did a bit of cleanup and flailing around with documentation. None of it is worth describing. But the final result is here.

Start

Boundaries

Counts

Normalization

Cleanup

One category

A missing category

Cleanup, and Item

Normalization, again

Making "limit holder" more concrete

In which I disappoint Bertrand Meyer

Bad programmer! Bad!

Breadth-first

Cleanup

Plugging categories into can_afford?

Implementing surviving_holders

Optimization, again

Connecting surviving_holders and can_afford?

One last performance hack

Tests as API documentation

The final version

Sign up for more like this.

Cleanup, and `Item`

Plugging categories into `can_afford?`

Implementing `surviving_holders`

Connecting `surviving_holders` and `can_afford?`