Refactoring & Testing: The Scientific Method of Software

Today we explore a challenge that we’ll spend the rest of the course touching on in one way or another: code is a living, breathing document. The software must remain pliable—soft enough to be extended or repaired. Dead letters have a place: hardware which lacks an ability to change can provide a standard for people to work or design against. But this requires treating the bugs as features: the hardware may contradict very real facts about the world, yet we are forced to live with them.1 Keeping software alive raises three questions:

  1. How do we know anything works in the first place?
  2. How do we know this version works the same as the previous version?
  3. How do we safely extend the software without breaking it?

In order to create the new: we need something which can be both safe and principled—a scientific system of work.2

In the previous lesson we reviewed functions as our fundamental unit of abstraction: giving a name to an operation defined purely in terms of its input and its output. In the lesson before that we introduced git as a tool to version and “time travel” to any point in history. Today we will try and answer all three questions by exploring another property of functions: they are amenable to automated testing.

Follow Along with the Instructor

We’ll talk through most of the material from this chapter. Along the way, we continue to demonstrate Unix, VS Code, Git, and Python.

From Manual to Automated Testing

Imagine you are tasked with the following:

Create a Python script: sketching.py. Inside of it, define a function: my_sum(lst: list[int]). This function returns the sum of all integers in the list.

>>> my_sum([])
0
>>> my_sum([1])
1
>>> my_sum([1, 2, 3])
6

The function signature and input-output pairs provide enough information to get started, since \(1 + 2 + 3 = 6\). So maybe you mash the keyboard for a few minutes and come up with something like this:

def my_sum(lst: list[int]) -> int:
    """Return the summation of `lst`, like `sum(lst)`"""
    out = 0
    for x in lst:
        out += x
    return out

Since you’re following the function design recipe from How to Design Programs, you manually tested this function in one of two ways.

(1) You added a main block at the end of sketching.py:

if __name__ == "__main__":
    print(my_sum([]))
    print(my_sum([1]))
    print(my_sum([1, 2, 3]))

… meaning that running the script produces something comparable to:

$ python3 sketching.py
0
1
6

(2) Or perhaps you manually tested the function by importing it from sketching, then interacting with the REPL to see whether the input-output pairs were consistent with the instructions:

>>> from sketching import my_sum
>>> my_sum([])
0
>>> my_sum([1])
1
>>> my_sum([1, 2, 3])
6


For both cases (1) and (2), there was an expectation about the result of the function, which was verified by running the code.

Here’s an idea: let’s automate this testing process as its own script. If we put testing code in a new Python script, test_sketching.py, then every time we run it we will get near-instant feedback on whether our code behaves the way we expect it to.

$ touch test_sketching.py

We can automate our key ideas with the following: the test script refers to a function in sketching.py, and asserts whether the output of each function is equal to an expected result:

from sketching import my_sum

assert my_sum([]) == 0
assert my_sum([1]) == 1
assert my_sum([1, 2, 3]) == 6

If all of our functions work, then it should look like nothing happens (because every assert succeeded):

$ python3 test_sketching.py

… but if we were to add a failing test, a test that we expect to fail:

  from sketching import my_sum

  assert my_sum([]) == 0
  assert my_sum([1]) == 1
  assert my_sum([1, 2, 3]) == 6
+ assert my_sum([2, 2]) == 5

… then the traceback will confirm our expectation that \(2 + 2\) does not in fact equal 5:

$ python3 test_sketching.py
Traceback (most recent call last):
  File "~/hayesall/test_sketching.py", line 6, in <module>
    assert my_sum([2, 2]) == 5
AssertionError

This example demonstrates the key concepts behind unit testing. In unit testing, one isolates parts of the whole source code into discrete units of expected behavior, often at the level of functions. Those units are tested with more code: functions that answer whether other functions are behaving as expected.

Notice we use the term “expected behavior”, and not “correct behavior”. Running unit tests can answer whether outputs match: not whether the unit tests are correct in the first place. “Correctness” is a mathematical fact requiring proof (by induction, by contradiction, etc.), and a finite list of facts does not guarantee correctness when an input space is infinite. But mathematical correctness is rarely the goal: code is designed to model things which occur in the world, and many things in the world have fuzzy edges which lack clear definitions.
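
To make the distinction concrete, consider a deliberately wrong implementation (a hypothetical sketch, not the my_sum we wrote above) that still satisfies all three of our assertions:

# A broken "sum" that ignores the values in the list, yet passes
# every assertion we happened to write down.
def my_sum(lst: list[int]) -> int:
    if len(lst) == 3:
        return 6
    return len(lst)

assert my_sum([]) == 0         # passes
assert my_sum([1]) == 1        # passes, by coincidence
assert my_sum([1, 2, 3]) == 6  # passes, yet my_sum([2]) returns 1

Every assert succeeds, but my_sum([2]) returns 1: the tests only pin down expectations at the points we thought to write down.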

But despite what may seem like surface-level weaknesses: unit testing has shown itself to be sufficient for handling the questions we started with:

  1. How do we know anything works in the first place? - we enumerated our expectations in code: if the tests pass, then it probably works

  2. How do we know this version works the same as the previous version? - we ask whether the tests still pass

  3. How do we safely extend the software without breaking it? - we monitor the state of our tests over time, and avoid making changes that break our tests

(🦉) Build your own testing framework

Draw the owl. Intermediate or advanced students may be interested in drawing the owl here for a transparent view into how testing works—or this can remain opaque, and one can skip directly to unit testing usage in Python.

Many programming languages build unit testing into the standard library—in Python this is the unittest library—or recommend implementing testing with third-party libraries. Before jumping into “How to use unit testing”, we’ll offer a chance to “Build your own testing framework” using only the core language.

(🦉) Setting a goal

There are two stakeholders: (A) people who write tests, and (B) people who run the tests. Group (B) is likely interested in questions like: did everything work? what didn’t work? where is the problem? did a change break the test? The needs of group (A) must be contended with, but the “usability” of a unit testing framework is something we’ll ponder when actually writing the code.

We’ll meet their needs with a text user interface, perhaps like the output shown in the following listing. From this outside view of how the program works, we see that it shows some statistics about how many tests passed, failed, or raised an exception; then it isolates a specific test that failed. How might you implement this?

$ python3 test_sketching.py
3/4 passed
1/4 failed
0/4 raised

failed: my_sum([]) == 1, got: 0

Answer the following questions; most will be discussed in the text itself.

  • Question 1: Starting from the function design recipe, what is the essential data being represented in this problem?
  • Question 2: What is the difference between a passed test, a failed test, and a test which raised an exception? Why might one be interested in each of these?
  • Question 3: Write a function signature to take the data from question 1 and produce output statistics.
  • Question 4: Write a function signature taking output statistics and “visualize” it as a string: the output in the text user interface.
  • Question 5: What information is left out of this listing? Is this interface better or worse than the traceback created when running the assert statements?

(🦉) Problem analysis and data representation

Recalling step 1 in the function design recipe, we showed that the key information when testing was: (1) the function being tested, (2) its input, and (3) the expected output.

We already showed that we could represent the three needs using assert statements, function calls, and an equal check == in a separate test_sketching.py script:

# file: test_sketching.py
from sketching import my_sum

assert my_sum([]) == 0
assert my_sum([1]) == 1
assert my_sum([1, 2, 3]) == 6

But testing our code like this has limitations. What would happen if we added a failing test above all of the other tests?

  from sketching import my_sum

+ assert my_sum([]) == 1
  assert my_sum([]) == 0
  assert my_sum([1]) == 1
  assert my_sum([1, 2, 3]) == 6

Python interprets code from the top of a file to the bottom, so it will stop executing immediately when it encounters a problem. Even if 75% of the tests would have passed, we are left with a binary observation: “it works or it does not work”. A helpful interface into this problem could be to (1) run every test, (2) compute some statistics, and (3) help the tester isolate where a problem is.
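
To see this concretely, running the script with the failing assert at the top produces something comparable to:

$ python3 test_sketching.py
Traceback (most recent call last):
  File "~/hayesall/test_sketching.py", line 3, in <module>
    assert my_sum([]) == 1
AssertionError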

In order to compute statistics and run every function, we need to represent them in a data structure. Since we have a series of functions and their output, let’s start representing the data as a list of tuples list[tuple[...]]. Specifically: the first value in this tuple will be the function (with input) being tested, and the second will be the expected output. Being more precise, the first is a function (a Callable), and the second can be anything (Any type):

from sketching import my_sum

tests = [
    (my_sum([]), 1),
    (my_sum([]), 0),
    (my_sum([1]), 1),
    (my_sum([1, 2, 3]), 6),
]

But just a moment, there’s something subtly wrong with our list of tests. Perhaps Guido van Rossum didn’t read Friedman and Wise3 before inventing Python. When we define a list of tests as a list containing tuples containing function calls, every call is evaluated immediately when the list is created.

>>> from test_sketching import tests
>>> tests
[(0, 1), (0, 0), (1, 1), (6, 6)]

Because of this: we lose information. When we look at the tests list, it’s unknowable what function is being tested and what input resulted in each output. But there’s a fairly straightforward fix for our data structure: separate the name of each function from a tuple containing its arguments and the expected output:

tests = [
    (my_sum, ([],), 1),
    (my_sum, ([],), 0),
    (my_sum, ([1],), 1),
    (my_sum, ([1, 2, 3],), 6),
]

… meaning the tests list now preserves the distinction between a function and its input. A function signature that uses this data will handle tuples containing a Callable, a tuple of arguments of unknown size (tuple[Any, ...]), and an expected output that can be anything (Any), depending on what the function returns.

>>> tests
[(<function my_sum at 0x7f2f1b8a9990>, ([],), 1),
 (<function my_sum at 0x7f2f1b8a9990>, ([],), 0),
 (<function my_sum at 0x7f2f1b8a9990>, ([1],), 1),
 (<function my_sum at 0x7f2f1b8a9990>, ([1, 2, 3],), 6)]

(🦉) Implement the tester

Now that we have a list of tests we can implement a function that takes this list of tests; iterates through them while unpacking the function, its arguments, and its expected value; then performs evaluation to check whether expectations are met:

from typing import Any, Callable
from sys import stderr

# ...

def run_tests(tests: list[tuple[Callable, tuple[Any, ...], Any]]) -> None:
    for (func, args, expect) in tests:
        if (reality := func(*args)) != expect:
            print(f"expected {expect}, got {reality}", file=stderr)


if __name__ == "__main__":
    run_tests(tests)

For now we’ve only printed the case where the expected output was not the same as the actual output:

$ python3 test_sketching.py
expected 1, got 0

This informs us that one of the tests failed, but does not tell us which function nor which arguments caused the failure. We’ll remedy this by creating a string showing what was tried and what the result was:

      for (func, args, expect) in tests:
          if (reality := func(*args)) != expect:
-             print(f"expected {expect}, got {reality}", file=stderr)
+             argstr = ",".join(map(str, args))
+             call = f"{func.__name__}({argstr}) == {expect}"
+             print(f"failed: {call}, got: {reality}", file=stderr)
$ python3 test_sketching.py
failed: my_sum([]) == 1, got: 0

Since everything that does not fail passes, we can visually inspect the results to see that one test failed and the remaining three passed:

      for (func, args, expect) in tests:
          if (reality := func(*args)) != expect:
              # ...
+         else:
+             print("passed", file=stderr)
$ python3 test_sketching.py
failed: my_sum([]) == 1, got: 0
passed
passed
passed

(🦉) Test statistics and output handling

We’re 90% of the way to our goal since we can visually inspect the output to arrive at the solution we wanted. The final step is therefore a matter of output formatting. Instead of printing one output per line, let’s incorporate a new data structure where we can accumulate intermediate information while the for loop runs:

result = {"passed": 0, "failed": 0, "messages": []}

In plain speak, the result we’re interested in is the statistics for how many tests passed and failed, plus a set of messages to inform the user which tests were problematic. With the dictionary initialized at the start of the function, all that remains is to update it inside the loop by incrementing one of the numbers or appending failure messages. Finally, we will compute the total number of tests as the sum of passed tests and failed tests, and show all of the messages:

def run_tests(
    tests: list[tuple[Callable, tuple[Any, ...], Any]]
) -> None:
    result = {"passed": 0, "failed": 0, "messages": []}
    for func, args, expect in tests:
        if (reality := func(*args)) != expect:

            argstr = ",".join(map(str, args))
            call = f"{func.__name__}({argstr}) == {expect}"
            message = f"failed: {call}, got: {reality}"

            result["failed"] += 1
            result["messages"].append(message)
        else:
            result["passed"] += 1

    total = result["passed"] + result["failed"]
    print(f"{result['passed']}/{total} passed", file=stderr)
    print(f"{result['failed']}/{total} failed", file=stderr)
    print("\n".join(result["messages"]), file=stderr)
$ python3 test_sketching.py
3/4 passed
1/4 failed
failed: my_sum([]) == 1, got: 0

(🦉) Exercises

Notice the previous output was not exactly like the goal we set out with. We suggested, but did not handle, the fact that Python raises Exceptions as its primary error handling mechanism. We leave this as an exercise (it’s not so important to get the error handling exactly right; rather, trying to get it right leads to interesting corner cases that we’d like the interested reader to explore).

  $ python3 test_sketching.py
  3/4 passed
  1/4 failed
- 0/4 raised

  failed: my_sum([]) == 1, got: 0

We recommend exploring two directions: error handling while testing, and communicating test results to an end user. A partial sketch of the error-handling direction appears after the exercise list.

  • Exercise 1: As written, run_tests() has multiple responsibilities: running tests and printing results. Design a separate function responsible for output handling. What are that explain() function’s inputs? What should run_tests() return to make this viable?
  • Exercise 2: Read about json encoding and decoding in the Python standard library documentation. Write a function to serialize the result dictionary to a JSON file.
  • Exercise 3: JSON is a common “interchange” format to communicate data between programming languages, particularly on the web—JSON actually stands for “JavaScript Object Notation”. Write an HTML page visualizing contents in the JSON file.
  • Exercise 4: What currently happens in our run_tests() implementation when a function raises an exception? (e.g. if there is a raise ValueError inside of my_sum())?
  • Exercise 5: Adapt the run_tests() implementation to catch exceptions instead of crashing. Represent this with one more test output type: passed, failed, and raised.
  • Exercise 6: What if we expect a function to raise an Exception? How would you adapt the expected outputs to handle this case? How would you communicate those to the user?
  • Exercise 7: What if we don’t expect a function to raise an Exception, but it does? How would you adapt run_tests() to handle this case?
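
As a starting point for Exercises 4 and 5 (a partial sketch only: the corner cases are still yours to explore), one might wrap each function call in a try/except block and count a third outcome:

# Partial sketch: count a "raised" outcome alongside passed/failed.
# (Message formatting for failures is omitted; see run_tests() above.)
result = {"passed": 0, "failed": 0, "raised": 0, "messages": []}
for func, args, expect in tests:
    try:
        reality = func(*args)
    except Exception as err:
        result["raised"] += 1
        result["messages"].append(f"raised: {func.__name__}: {err!r}")
        continue
    if reality != expect:
        result["failed"] += 1
    else:
        result["passed"] += 1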

Python testing with unittest

The Python standard library contains a unittest module. This module contains common fixtures to define groups of test cases, and the necessary function to run tests.

The convention is that testing scripts are prefixed with test_, and contain:

# test_sketching.py
import unittest
from sketching import my_sum


class TestSummation(unittest.TestCase):
    def test_empty_list_is_zero(self):
        self.assertEqual(my_sum([]), 0)


if __name__ == "__main__":
    unittest.main()

Using a class is incidental to this exercise: you don’t really need to understand what a “class” is to infer what this code is doing.4 Nevertheless, we should see a few examples and practice. Let’s remove some of the details to arrive at a listing we can use as a template for testing code:

import unittest
import __________


class __________(unittest.TestCase):
    def __________(self):
        self.assertEqual(__________, __________)


if __name__ == "__main__":
    unittest.main()

The four questions of unit testing

We can fill in the blanks by asking a series of questions.

Question 1: What module are we testing? Our tests are inside the test_sketching.py script, and we want to test sketching.py, so we first need to import the module that is relevant to our goal:

  import unittest
+ import sketching


  class __________(unittest.TestCase):
      def __________(self):
          self.assertEqual(__________, __________)


  if __name__ == "__main__":
      unittest.main()

Question 2: What fact do we want to assert? We previously wrote a series of assert statements, like my_sum([1]) == 1. We can fill in the next two blanks using the input and output from this assertion:

  class __________(unittest.TestCase):
      def __________(self):
+         self.assertEqual(sketching.my_sum([1]), 1)

Question 3: In English, what fact are we asserting? Function names are typically verbs, and we’re writing a function that tests whether the result of calling some function is equal to something else. In this simple example, we “test summing 1 is 1”. Perhaps this seems trivial here, but the benefit of plain-English names pays off when we test more complex behaviors (e.g. “test user interface contains menu”).

  class __________(unittest.TestCase):
+     def test_summing_1_is_1(self):
          self.assertEqual(sketching.my_sum([1]), 1)

Question 4: What group of behaviors are we testing? Usually there is a logical grouping to our tests, called a TestCase. Earlier we had multiple input-output pairs that tested whether our my_sum function met expectations, so collectively we might call this a:

+ class SummationTest(unittest.TestCase):
      def test_summing_1_is_1(self):
          self.assertEqual(sketching.my_sum([1]), 1)

Putting the pieces together: With these four questions answered, we have a complete unit testing script, and can convert our remaining assertions into functions:

import unittest
import sketching


class SummationTest(unittest.TestCase):
    def test_summing_1_is_1(self):
        self.assertEqual(sketching.my_sum([1]), 1)

    def test_summing_empty_is_0(self):
        self.assertEqual(sketching.my_sum([]), 0)

    def test_summing_1_2_3_is_6(self):
        self.assertEqual(sketching.my_sum([1, 2, 3]), 6)


if __name__ == "__main__":
    unittest.main()

Running unit tests and observing behavior

Running this script from the command line now informs us that all of our tests pass:

$ python3 test_sketching.py
...
---------------------
Ran 3 tests in 0.000s

OK

Each period or dot in this context represents the result of running one of our tests. All three unit tests passed. If we repeat the approach from earlier where we intentionally add a failing test, running the tests will inform us of the existence of a failed test (indicated not with a . but with an F), and return a traceback of the problem:

$ python3 test_sketching.py
..F.
=====================
FAIL: test_summing_2_2_is_5 (__main__.SummationTest)
---------------------
Traceback (most recent call last):
  File "~/hayesall/test_sketching.py", line 16, in test_summing_2_2_is_5
    self.assertEqual(sketching.my_sum([2, 2]), 5)
AssertionError: 4 != 5

---------------------
Ran 4 tests in 0.002s

FAILED (failures=1)

Here we expected the test to fail and only included it as a demonstration of what the failure looks like. Normally we’ll avoid intentionally adding failing tests like this; or, if there is some important fact to capture, we might assert a negative case: assertNotEqual(4, 5).
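
If that negative fact were worth keeping, the test method might look like the following (a hypothetical addition, not part of the suite above):

class SummationTest(unittest.TestCase):
    # ...
    def test_summing_2_2_is_not_5(self):
        # keep the negative expectation instead of a failing test
        self.assertNotEqual(sketching.my_sum([2, 2]), 5)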

An evolving world + test-driven development

Imagine that something changed out in the world, and now there’s a new behavior that the my_sum function needs to handle:

>>> my_sum(None)
0

Since we already have the code and the unit tests, we might follow a test-driven development workflow. In this workflow, we rearrange steps from the function design recipe and write a test before we attempt to implement the behavior.

# ...
class SummationTest(unittest.TestCase):
    # ...
    def test_summing_None_is_0(self):
        self.assertEqual(sketching.my_sum(None), 0)
# ...

Now: we only consider the code to be complete when the tests pass again. Running the tests reveals our previous code raises a TypeError instead of returning the expected result:

$ python3 test_sketching.py
..E.
=====================
ERROR: test_summing_None_is_0 (__main__.SummationTest)
---------------------
Traceback (most recent call last):
  File "~/hayesall/test_sketching.py", line 16, in test_summing_None_is_0
    self.assertEqual(sketching.my_sum(None), 0)
  File "~/hayesall/sketching.py", line 4, in my_sum
    for x in lst:
TypeError: 'NoneType' object is not iterable

---------------------
Ran 4 tests in 0.003s

FAILED (errors=1)

The traceback directs us to a problem in line 4, and informs us that None is not iterable, so the lst variable cannot be None by the time we start the for loop:

def my_sum(lst: list[int]) -> int:
    """Return the summation of `lst`, like `sum(lst)`"""
    out = 0
    for x in lst:       # <---- `for x in None`
        out += x
    return out

One fix might be to handle this as a special case with an if-statement: returning 0 immediately if lst is None and thereby never reaching the loop:

def my_sum(lst: list[int] | None) -> int:
    """Return the summation of `lst`, like `sum(lst)`"""
    if lst is None:
        return 0
    out = 0
    for x in lst:
        out += x
    return out

… which brings us back to a passing state where all the tests succeed:

$ python3 test_sketching.py
....
---------------------
Ran 4 tests in 0.001s

OK

Refactoring

Martin Fowler and Kent Beck defined refactoring as either a noun: “a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior”, or as a verb: “to restructure software by applying a series of refactorings without changing its observable behavior.”5

We bring up refactoring because its definition relies on testing, or at a minimum: a means of validating behavior in order to be confident that any changes made do not break backwards compatibility.

In the specific case of my_sum, we might refactor into a recursive solution that passes all our tests using half the number of lines of code:

def my_sum(lst: list[int] | None) -> int:
    """Return the summation of `lst`, like `sum(lst)`"""
    if not lst:
        return 0
    return lst[0] + my_sum(lst[1:])
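
Rerunning the test suite is what gives us confidence that the refactoring preserved the observable behavior. We expect the same passing output as before:

$ python3 test_sketching.py
....
---------------------
Ran 4 tests in 0.001s

OK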

We also bring up refactoring because it emphasizes something about programming that you may not have encountered in an introductory course: programming is hard, but maintenance is harder. As we progress through this course: we will write code, but we will also have to contend with the burden created by the code we wrote earlier.

Documenting, testing, refactoring, and version controlling are each methodologies that evolved in response to the challenges people faced when they tried to maintain code bases with millions of lines. We will see a fraction of these as we progress in this class toward our final project.

Practice: unittest in rock-paper-scissors

We ended our last lesson with questions like: “How do we know if our implementation actually works?”, or “Should we write a function or not?”. But we didn’t have the right intellectual tools to answer these questions. Let’s explore these using our new unit testing framework, starting with the is_valid(raw: str) -> bool predicate.

def is_valid(raw: str) -> bool:
    """Is the raw input a valid choice?"""
    return raw in ("rock", "paper", "scissors")

01 Prepare to work

Start from the same directory where you previously cloned your i211-starter repository.

For example, Alexander would open a new terminal session, change directory into their hayesall-i211-starter directory, then open the folder in VS Code:

$ cd i211su2024/hayesall-i211-starter
$ ls
README.md  rps.py  test_rps.py
$ code .

02 Run the tests

In the same directory as the rps.py script: there should also be a test_rps.py script. Let’s establish a baseline by running the tests.

How do you run the tests?

Possible Solution
$ python3 test_rps.py
.
--------------------
Ran 1 test in 0.000s

OK

03 Validate that rock is a valid input

Our is_valid predicate returned True or False depending on whether a string entered by the user was a valid rock-paper-scissors choice:

>>> import rps
>>> rps.is_valid("rock")
True

… therefore we might automatically test this behavior with a unit test asserting that we believe "rock" should be a valid choice. Edit your test_rps.py to look like this, and run the tests again:

from unittest import TestCase
from unittest import main as unittest_main
import rps


class ValidateHumanInputs(TestCase):
    def test_rock_is_a_valid_input(self):
        self.assertTrue(rps.is_valid("rock"))


if __name__ == "__main__":
    unittest_main()

04 Validate scissors and paper

Let’s write two more tests to handle the "paper" and "scissors" cases. How would you fill in the blanks in the following listing?

# ...
class ValidateHumanInputs(TestCase):
    def test_rock_is_a_valid_input(self):
        self.assertTrue(rps.is_valid("rock"))

    def __________________________(self):
        self.assertTrue(____________________)

    def __________________________(self):
        self.assertTrue(____________________)
# ...
Possible solution
# ...
class ValidateHumanInputs(TestCase):
    def test_rock_is_a_valid_input(self):
        self.assertTrue(rps.is_valid("rock"))

    def test_paper_is_a_valid_input(self):
        self.assertTrue(rps.is_valid("paper"))

    def test_scissors_is_a_valid_input(self):
        self.assertTrue(rps.is_valid("scissors"))
# ...
$ python3 test_rps.py
...
--------------------
Ran 3 tests in 0.000s

OK

05 Stage and commit

We’ve reached a point where we have working, tested code. Since we’ve accomplished something, this is a good time to make a commit.

How do you make a commit?

Possible solution
git add test_rps.py
git commit -m "✅ Add tests for is_valid"

06 How would you test the computer choice?

Our possible RPS solution put the computer behavior into a get_computer_choice() function:

from random import choice

def get_computer_choice() -> str:
    """Choose from (rock, paper, or scissors)"""
    return choice(("rock", "paper", "scissors"))

This function does not take an input, but it does produce an output.

What are the expected behaviors, and how might you test those behaviors are true?

class ValidateComputerBehavior(TestCase):
    def __________________________(self):
        self.assertTrue(____________________)
Possible solution

There are three possible outputs. Even though the behavior is random: we can check that the output is one of the three expected outputs. This parallels a style called property-based testing, where we are not as interested in a specific output, but rather in some attributes or properties of those outputs, such as being one of three choices:

class ValidateComputerBehavior(TestCase):
    def test_computer_choice_in_rps(self):
        self.assertTrue(rps.get_computer_choice() in ("rock", "paper", "scissors"))
Alternate solution

Our computer player behaves randomly, which we may want to be mindful of when we test.

If we only run the get_computer_choice() function once, then each time the tests run, it may effectively be testing different execution paths through our code. This can be problematic, leading to flaky tests: tests which might sometimes work and sometimes fail.

But here, there’s a relatively simple solution: run the function multiple times. Choosing 10 is arbitrary, but we might loop over several iterations, testing that the computer chooses something valid all 10 times:

class ValidateComputerBehavior(TestCase):
    def test_computer_choice_in_rps(self):
        for _ in range(10):
            self.assertIn(
                rps.get_computer_choice(), ("rock", "paper", "scissors")
            )

07 Run the tests and commit

If our tests pass:

$ python3 test_rps.py
....
--------------------
Ran 4 tests in 0.000s

OK

We’ve accomplished something: making it a good time to stage and commit our changes.

How do you commit?
git add test_rps.py
git commit -m "✅ Add tests for computer choice"

08 How would you test that rock beats scissors?

We showed in our possible RPS solution that the core X-beats-Y behavior could be placed in a beats(this, that) function:

def beats(this: str, that: str) -> bool:
    """Does `this` beat `that`?"""
    return (this, that) in (
        ("rock", "scissors"),
        ("paper", "rock"),
        ("scissors", "paper"),
    )

Test this function (a TestCase class with three methods), run those tests, then stage and commit the changes.

Possible solution
class ValidateWinningCombination(TestCase):
    def test_rock_beats_scissors(self):
        self.assertTrue(rps.beats("rock", "scissors"))

    def test_scissors_beats_paper(self):
        self.assertTrue(rps.beats("scissors", "paper"))

    def test_paper_beats_rock(self):
        self.assertTrue(rps.beats("paper", "rock"))

09 Push and Release

We now have an initial implementation with tests for the core behaviors. This is a good time to create a release which we can iterate on later.

We should:

  1. push changes to GitHub
  2. tag a v0.1.0 release of our code
  3. push the release to GitHub.

What git subcommand steps do we need to accomplish this goal?

Possible solution

Push the main branch:

git push

Tag the main branch at v0.1.0:

git tag -a v0.1.0 -m "Initial RPS release with implementation and unit tests"

Push the release to GitHub:

git push origin v0.1.0

Further Reading

  • Martin Fowler and Kent Beck (2018) “Refactoring: Improving the Design of Existing Code”. Second Edition. Addison-Wesley.
  • Gene Kim, Jez Humble, Patrick Debois, and John Willis, “The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations” (2016), IT Revolution. ISBN: 978-1-942788-00-3

Footnotes

1

Was the year 1900 a leap year? Microsoft® Excel® is the poster child of this problem. It mimics the look and feel of tools that came from a long history of tabulating, accounting, bookkeeping, and data management—the spreadsheet. These tools may be as old as writing itself—the oldest human writings that we know about are tax records and accounting records—so surely it’s easy to produce a facsimile of tools which are 12,000 years old? Maybe not. A bug from another spreadsheet program was considered so critical to business operations that almost every version of Excel that has ever existed will incorrectly inform users that 1900 is a leap year.6 Excel and programs based on it are modeled on a world that never existed, and people use it to inform high-stakes decisions: leaving the year 1900 in a kind of Schrödinger box state where it both is and is not a leap year. So the real question is: are people who use the tool knowledgeable enough about these (and similar) shortcomings to prevent catastrophe? Or is there sufficient testing in place to catch the problems when they do occur? Probably not. One does not need to look far before one finds stories on how misuse of Excel is the culprit behind losing critical COVID-19 data.7

2

To be scientific is to follow in a Newtonian or Bayesian tradition: to develop and participate in a system where we form a guess (hypothesis) about the state of the world, change something, measure it, update our beliefs based on the intervention, then integrate those beliefs into what is already known. This is not to be confused with the goings-on of scientists, many of whom are among the least professional, least scientific people out there.8

3

Daniel P. Friedman and David S. Wise, “Cons Should Not Evaluate Its Arguments”. Online: https://legacy.cs.indiana.edu/ftp/techreports/TR44.pdf

4

But you, dear reader of footnotes, are obviously curious to learn more. The clear inspiration for Python’s unittest standard library was Java’s JUnit framework—which many people used around the same time that ideas about unit testing were growing more mainstream. Unit testing grew popular around the time that object-oriented design patterns grew popular—if one wanted to go a step further, one might even conclude that object-oriented programming paradigms caused many of the problems that required unit testing to solve.

5

Martin Fowler (2004-09-01) “Definition of Refactoring”. Online: https://martinfowler.com/bliki/DefinitionOfRefactoring.html

6

Microsoft, “Excel incorrectly assumes that the year 1900 is a leap year”. Microsoft Learn, Microsoft 365 Troubleshooting. accessed: 2024-05-02. Online: https://learn.microsoft.com/en-us/office/troubleshoot/excel/wrongly-assumes-1900-is-leap-year

7

Leo Kelion (2020-10-05), “Excel: Why using Microsoft’s tool caused Covid-19 results to be lost”, British Broadcasting Service (BBC), accessed: 2024-05-02. Online: https://www.bbc.com/news/technology-54423988

8

Richard McElreath, “Science as Amateur Software Development.” https://www.youtube.com/watch?v=zwRdO9_GGhY