Files, Mutable State, and Resources

The most contentious topic in programming is state, or rather how program state should be managed by the programmer(s). The state of a program is the sum total of all a program’s input data and all of the code acting upon that data. When the state is fixed and unchanging, such as when a variable is defined and never updated, we say that the state of the variable is immutable. By contrast, if an object in the language may be updated, we say that its state is mutable and that the programmer mutates the state of objects in the program.

Programming languages are typically organized in terms of the programming paradigm they inherit ideas from. One of the key ways to distinguish one programming paradigm from another is by analyzing how state gets managed in languages based on that paradigm. In a functional paradigm (C211, C311), state is minimized or eliminated entirely—states do not change, but rather are copied. Java is the quintessential object-oriented language (I311), where state is managed through a tree-structured hierarchy of objects and a rule set defined to answer how state changes and who has permission to change it. In a declarative language paradigm, the programmer describes the states, or describes what they want to occur—and the language “figures out” how to apply those changes (e.g. I308, or later this semester when we discuss SQL).

The Python language is multi-paradigm—implementing aspects of object-oriented and functional paradigms. But many of the concepts we’ve encountered (variables, objects, data types, object attributes, methods) come out of an object-oriented programming interpretation of the language: where the language is organized in terms of objects and methods that modify those objects. We’re also running Python on a Unix-like operating system: which is itself a giant blob of mutable disk space. When Python interacts with the operating system—including when a program reads files or writes to files—we’re inherently dealing with state.

Follow Along with the Instructor

We’ll cover some of the early points in this chapter up to the section on pure versus impure functions. Follow along for some highlights, then work through practice problems when you’re ready:

Stateless Programs

Many of the programs we’ve written up to this point have been stateless. In a stateless program: one can perfectly reason about how the program will behave since all behavior is defined inside the program itself. For example, if we create a new file for a Python script:

touch hello.py

… and add:

print("Hello World!")

How likely is it that when we run the program, we see the word "fish" printed to the console? The probability is low; so low that we might as well conclude that it is impossible. But we shouldn’t be surprised when running the program prints "Hello World!":

$ python3 hello.py
Hello World!

We know this because there is not a dependency in our program on some external data: there is no external information entering our program. No matter how many times we run this program, it should always produce exactly the same result. Let’s draw this as a graph:

graph TB
    hello.py

Since stateless programs have no dependencies, using them implies several strengths. They are:

Easy to reason about. When all the facts are available, one can deduce the outputs from the inputs.
Easy to test. Since stateless programs have clear inputs and outputs, it is usually straightforward to define key behaviors and write unit tests for that behavior.

But stateless programs also come with a huge limitation: since everything is defined up front, they cannot react to anything. The only way for new data to enter a stateless program is by modifying the program. Nevertheless, these can be powerful (well-tested and easily understood) “building blocks” from which we can develop more complex programs.
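
As a rough illustration of the repeatability claim, the following sketch runs the one-line program several times and collects the distinct outputs. It assumes a Python interpreter is reachable through sys.executable, and it passes the program inline with -c instead of relying on a hello.py file on disk; the names PROGRAM and run_once are hypothetical.

```python
import subprocess
import sys

# The same one-line program as hello.py, passed inline with -c.
PROGRAM = 'print("Hello World!")'


def run_once() -> str:
    """Run the program in a fresh interpreter and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", PROGRAM],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# A set keeps only distinct values, so one element means every run agreed.
outputs = {run_once() for _ in range(3)}
print(outputs)
```

If the program is stateless in the sense described above, the set should contain a single element no matter how many runs are collected.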

Stateful Programs: Randomness

The first narrow form of statefulness we saw was when we wrote programs that included random behavior, usually using Python’s random standard library:

import random

print(random.choice(("A", "B")))

Contrast this program with the “Hello World” program. When you run these two programs, are you certain about the outcome of one but uncertain about the other?

What makes us certain about the outcome of print, but uncertain about the outcome of random.choice? The answer is that the former was stateless, but this one is stateful (more on why in a bit). First let’s clarify something else: we could be uncertain about the random program because its behavior depends on something that we do not control: how random actually works. We can represent this dependency as an arrow (or edge) in a graph and say that the behavior of random_choice.py has a dependency on something inside of random:

graph TB
    hello.py

    random --> random_choice.py

Now let’s get more precise: Python’s random library is not actually random—its documentation is titled “Generate pseudo-random numbers”. The exact nature of random versus pseudo-random in computing is a story for another time, so for now we will elide the details and focus on this concept: Python has a pseudo-random number generator (PRNG) that produces numbers which are good enough to be used as if they were random.¹ A PRNG is based on a seed value determining how the PRNG generates new numbers. Given a particular seed, behavior is deterministic.

import random

random.seed(54321)

print(random.choice(("A", "B")))

We can therefore think of dependencies as producing a chain of cause and effect. random.seed causes random.choice to behave in a particular way, which causes the whole program to become deterministic.

graph TB
    hello.py

    random.seed --> random.choice --> random_choice.py

This answers the question we started with: random is stateful. But its internal state has a succinct definition: all observable behavior can be controlled using a seed value.² In other words: an integer controls all behavior. If we do not set the seed, then Python will pick a seed for us;³ if we do set the seed, then the program behaves as if it were stateless.
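
One way to see the seed acting as the controlling input is to seed the generator twice with the same value and compare the resulting sequences. This is a minimal sketch; the helper name draw and the count of five choices are assumptions made for illustration.

```python
import random


def draw(seed: int, count: int = 5) -> list[str]:
    """Seed the module-level generator, then make `count` choices."""
    random.seed(seed)
    return [random.choice(("A", "B")) for _ in range(count)]


first = draw(54321)
second = draw(54321)

print(first == second)  # True: the same seed gives the same sequence
```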

There are two takeaways:

Statefulness is sometimes hidden from us. For a random number generator: this hidden state does not matter for most day-to-day programming problems. Other times this can be a source of trouble: the unknown unknowns of programming where the dependencies between state and behavior are invisible.
Statefulness needs an escape hatch. One could use any metaphor: an escape hatch, a lever, or a switch allowing one to turn certain behaviors on or off. In this case, the seed provides an easy way for library users to control outcomes.

These allude to useful ideas when designing programs: keep internal state small and provide a means to debug, inspect, or opt out of it.
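
The same idea can be applied to our own functions. The sketch below is one possible design, not something the random library requires: a hypothetical pick function accepts an optional random.Random instance, so callers who need control (for debugging or tests) can supply a seeded generator, while everyone else gets the default behavior.

```python
import random
from typing import Optional


def pick(options: tuple[str, ...], rng: Optional[random.Random] = None) -> str:
    """Choose one option; callers may pass their own generator to control it."""
    if rng is None:
        rng = random.Random()  # unseeded: Python picks the seed for us
    return rng.choice(options)


# Everyday use: the hidden state stays hidden.
print(pick(("A", "B")))

# Debugging or testing: two generators built from the same seed agree.
a = pick(("A", "B"), random.Random(54321))
b = pick(("A", "B"), random.Random(54321))
print(a == b)  # True
```

Passing the generator as a parameter keeps the state small and visible at the call site, which is the "escape hatch" described above.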

Stateful Programs: Programs Using System Resources

Now we have to acknowledge something: programs run on computers, which in turn have some limited set of resources. One type of resource (or system resource) is a file on a file system.

If one creates a new text file and a new Python script:

touch some-file.txt file_consumer.py

… and uses the open() function to reference the contents of some-file.txt in file_consumer.py:

with open("some-file.txt") as fh:
    print(len(fh.read()))

… then there is a dependency between the content of some-file.txt and the behavior of the Python script. Causing a change in the file will cause the Python script to behave differently:

graph TB
    hello.py

    random.seed --> random --> random_choice.py

    some-file.txt --> file_consumer.py

Since touch creates empty files by default, the first time we run our script we should see that the length of the text file content is zero:

$ python3 file_consumer.py
0

But if we put text inside the text file (e.g. with nano or code):

… then the output of the Python script is different than what it was before.

$ python3 file_consumer.py
6

Invisible Characters, Line Feeds, and Typewriters

We put 54321 inside the file, so why did we see 6 instead of 5?

In Alexander’s case, it’s because an invisible “line feed” character automatically got added at the end of their file:
54321␊
The line feed character in programming contexts is typically written “\n” (backslash n), and represents a vertical break in a string. For example, using Python to print the string: 000\n111 results in the following at the console:
>>> print("000\n111")
000
111
Many text editors automatically add line feed (LF) characters or carriage-return line feed (CRLF) characters, depending on how an operating system interprets the ↵ Enter or ↵ Return key. We’re still dealing with this problem today because the computing pioneers from whom we inherited the universe could not agree on how typewriters should work. Some typewriters had ↵ Enter, which advanced the printing by one line; whereas other typewriters had ↵ Return, which advanced the printing by one line but also returned the printing node (called a carriage) to the left.
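
A short sketch makes the extra character visible. It assumes the file content is exactly the five digits followed by one line feed, as in the example above, and uses repr() to show characters that print() would render invisibly.

```python
content = "54321\n"

print(repr(content))              # '54321\n'
print(len(content))               # 6
print(len(content.rstrip("\n")))  # 5
```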

The behavior of the program cannot be reasoned about simply by reading the code itself: there is something outside of the code which influences how it behaves. Unfortunately this can also be a major source of unexpected behavior. What happens if we delete the file that the program depends on?

rm some-file.txt

We now have to ask: what does it mean to open a file that does not exist? It depends on how the program designer chooses to handle the situation. Recall that cat concatenates file contents to the terminal: which behind-the-scenes means that cat must open a file, read it, then print it. The cat command reports that the file does not exist to STDERR:

$ cat some-file.txt
cat: some-file.txt: No such file or directory

Unix-like systems communicate the success or failure of programs through exit statuses (sometimes called return values, error codes, or exit codes), which are 8-bit unsigned integers (between 0 and 255) that represent success or failure. The convention for command-line programs like cat is that a 0 means success, and a non-zero exit status (> 0) represents that the program was not successful:⁴

0: success
1: error
2: error (typically one which is somehow more serious than 1)
127: command not found
130: terminated with ^ Ctrl + C

One may inspect the exit status of a program by printing a special $? variable, which represents the exit status of the previous command.⁵ Previously, cat reported that the file was not found. Printing the exit code also shows that it returned 1:

$ echo $?
1

Python does something similar. Python (1) reports that the file was not found, (2) reports a traceback informing the program developer where in the program a problem occurred—perhaps with the hope that the developer can fix the problem:

$ python3 file_consumer.py
Traceback (most recent call last):
  File "~/file_consumer.py", line 1, in <module>
     with open("some-file.txt") as fh:
          ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'some-file.txt'

… and (3) returns an exit status of 1:

$ echo $?
1
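
Exit statuses can also be inspected from inside Python. The following sketch starts a child interpreter with subprocess.run and reads its returncode attribute, which plays the same role as $? in the shell. The child program and the file name definitely-missing-file.txt are assumptions chosen so that the file does not exist.

```python
import subprocess
import sys

# Child program: open a file that is assumed not to exist.
CHILD = 'open("definitely-missing-file.txt")'

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,
    text=True,
)

print(result.returncode)               # exit status of the child process
print(result.stderr.splitlines()[-1])  # final line of the child's traceback
```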

Let’s make our program more cat-like: show an error message on STDERR when the file is not found, then exit with a 1 code.

We can accomplish this by checking whether some-file.txt exists and handling the case where it doesn’t, then only reporting its contents when it exists.

Python’s os.path standard library has an isfile function which answers whether a file exists (or not):

+ from os.path import isfile
+
+ if not isfile("some-file.txt"):
+     print("some-file.txt: No such file or directory")
+
  with open("some-file.txt") as fh:
      print(len(fh.read()))

When we implemented rock-paper-scissors, we learned about standard error:

  from os.path import isfile
+ from sys import stderr

  if not isfile("some-file.txt"):
-     print("some-file.txt: No such file or directory")
+     print("some-file.txt: No such file or directory", file=stderr)

  with open("some-file.txt") as fh:
      print(len(fh.read()))

And now we can incorporate the sys.exit() function, which can take an integer as an argument representing an exit status:

  from os.path import isfile
  from sys import stderr
+ from sys import exit

  if not isfile("some-file.txt"):
      print("some-file.txt: No such file or directory", file=stderr)
+     exit(1)

  with open("some-file.txt") as fh:
      print(len(fh.read()))

The result is that we now have a program that can open the file and count the number of characters inside it. Or in the event the file does not exist: it signals the problem to the operating system and to the human user:

from sys import exit, stderr
from os.path import isfile

if not isfile("some-file.txt"):
    print("some-file.txt: No such file or directory", file=stderr)
    exit(1)

with open("some-file.txt") as fh:
    print(len(fh.read()))

This hints at some of the limitations that stateful programs have. If the behavior of a program depends on something like a file on an operating system, then the program itself is:

Hard to reason about. The necessary facts about how the program should behave are defined outside the program.
Harder to test. It may not be possible to enumerate all possible states, or even a representative sample among all possible inputs.
Harder to set up the tests. Testing the program first requires us to set up the files on the operating system: and the files on an operating system usually act like shared, mutable, global state.

But in spite of these challenges, the majority of useful, interesting programs have stateful behavior. They are worth the trouble, but we must incorporate defensive programming to guard against bad data which can cause unexpected results, and we must provide adequate feedback when a problem does occur. When one programs defensively, one must anticipate the set of valid inputs, or constrain them to a set of scenarios that one knows how to handle. If the contract of expected inputs is violated, one should provide feedback to the operating system and its human users.
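
Because files behave like shared, mutable state, a file could in principle disappear between an existence check and the call to open(). One alternative to the isfile approach, sketched here under that assumption, is to attempt the open and handle FileNotFoundError. The function name report_length is hypothetical, and it returns the exit status so the caller can decide whether to pass it to sys.exit().

```python
from sys import stderr


def report_length(path: str) -> int:
    """Print the character count of `path` and return an exit status."""
    try:
        with open(path) as fh:
            print(len(fh.read()))
    except FileNotFoundError:
        # Same feedback as before: a message on STDERR and a non-zero status.
        print(f"{path}: No such file or directory", file=stderr)
        return 1
    return 0


status = report_length("definitely-missing-file.txt")
print(status)  # 1, assuming no file with that name exists
```

Both approaches give the same feedback to the user; this one simply moves the check into the operation that can fail.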

Concept Review: Stateless, Stateful, and Defensive Programming

The three programs so far illustrate three cases.

graph TB
    hello.py

    random.seed --> random --> random_choice.py

    some-file.txt --> file_consumer.py

In stateless programs like hello.py, all data and all behavior is defined up front, making the programs easy to reason about. In stateful programs like file_consumer.py, the behavior of a program depends on something which is external to the program—in order to reason about how the program behaves, one must first reason about what its input data looked like.

Finally, we saw the statefulness of an entire program was not a binary but a continuum. In random_choice.py, we could make the program behave as if it were stateful or stateless by setting a seed. This fact illustrates the concept that we will spend the rest of the class (and possibly the rest of our careers) on: we can write programs which have aspects of both. We must be defensive and work within the bounds of what we know how to handle, and provide feedback for cases that we don’t.

We’ll spend the rest of this chapter answering:

How do we identify where state changes in our program?
How do we build hybrid programs to safely handle external data?

Stateless parts of a program: Pure and Impure Functions

So far we defined statelessness and statefulness of programs as a kind of synonym for predictability: whether all program behavior was internal or relied on some data that was external to the program. This thinking can also be applied within a program.

Examine the following program, and answer the following questions. Where does data enter the program? Of the two functions, which one deals with external state? What are the possible inputs of each function? What are the possible outputs?

def is_valid_boolean(tf: str) -> bool:
    return tf in ("True", "False")

def get_boolean() -> str:
    while True:
        choice = input("Choose True/False > ")
        if is_valid_boolean(choice):
            break
        print("Try again")
    return choice

if __name__ == "__main__":
    choice = get_boolean()
    print(is_valid_boolean(choice))

Here are some of our observations:

Data enters the program from the input() function
The get_boolean() function deals with external program state, because the input() function is called inside of it
The get_boolean() function takes no parameters. Because input() gets called, a user can type anything
The is_valid_boolean() function expects a string, but should always return a boolean
Because get_boolean() uses the is_valid_boolean() function in a while loop, its only possible outputs are the strings "True" or "False"

So despite the fact that a user of this program could write just about anything at the input() prompt, this program was defensive against uncertain inputs. The program will not progress until the user upholds their end of a contract. A user could interrupt the program with ^ Ctrl + C, but stopping the program returns control to the shell: not to some unknown intermediate program state with bad data.⁶

We can use these observations to conclude that the overall behavior of our program requires managing some unknown, external state. However, this uncertainty is managed through functions.

get_boolean() is an impure function. The function takes no arguments, but returns a string. It is impossible to know precisely what string it will output though, because that decision can only be known when the program runs.
is_valid_boolean() is a pure function. The function takes one argument, uses that argument alongside some constant data ("True", "False"), and always returns a boolean.

Notice also that pure and impure functions are good approximations of concepts we previously covered:

Pure functions are stateless;
Impure functions are stateful
Pure functions are easy to test;
Impure functions are hard to test
(unit tests we previously wrote dealt entirely with pure functions)

Pure and impure functions give us a new way to think about stateless and stateful programs. A program as a whole may need to deal with the unknowns of the real world, but we can typically decompose that uncertain behavior into parts which manage that uncertainty. Most functionality should ideally be implemented within pure functions that we can test or reason about ahead of time. As needed: we can wrap stateful behavior behind functions to check for and validate any incoming data.
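
To make the "easy to test" claim concrete, here is a minimal sketch of a test for the pure function. The test name and the specific inputs are assumptions; a test framework is not required, since plain assert statements are enough to show the idea.

```python
def is_valid_boolean(tf: str) -> bool:
    return tf in ("True", "False")


def test_is_valid_boolean() -> None:
    # The two accepted strings.
    assert is_valid_boolean("True")
    assert is_valid_boolean("False")
    # Near misses are rejected: the check is case-sensitive and exact.
    assert not is_valid_boolean("true")
    assert not is_valid_boolean("")
    assert not is_valid_boolean("maybe")


test_is_valid_boolean()
print("all checks passed")
```

No comparable test can be written as directly for get_boolean(), because its result depends on what a person types while the program runs.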

Key Idea: Minimize State and Validate

The main takeaway of this discussion is conceptual: every boundary that a program interacts with is a potential source of uncertainty which can lead to unexpected behavior, bugs, or errors.

A strategy to manage this complexity is to be explicit about where data enters the program, validate our expectations about that data, correct course if possible or terminate the program if it is not, and provide feedback to the user on how to resolve any discrepancies.

Analyze Rock-Paper-Scissors: Stateless or Stateful?

Now that we know about state management, pure functions, and impure functions, review the rock-paper-scissors implementation.

Where does external information enter the program?
Which functions are pure functions?
Which functions are impure functions?

from sys import stderr
from random import choice

def is_valid(raw: str) -> bool:
    return raw in ("rock", "paper", "scissors")

def beats(this: str, that: str) -> bool:
    return (this, that) in (
        ("rock", "scissors"),
        ("paper", "rock"),
        ("scissors", "paper"),
    )

def get_computer_choice() -> str:
    return choice(("rock", "paper", "scissors"))

def get_human_choice() -> str:
    while True:
        if is_valid(human := input("(rock/paper/scissors) >> ")):
            break
        print(f"Unknown {human}, try again", file=stderr)
    return human


def main():
    human = get_human_choice()

    computer = get_computer_choice()
    print(f"Computer chose '{computer}'")

    if human == computer:
        print("It's a tie!")
    elif beats(human, computer):
        print("Human wins!")
    else:
        print("Computer wins!")

if __name__ == "__main__":
    main()

Quick Python Review

Most of these syntax points are covered in the Python Cheatsheet Chapter. The following is a rapid review of syntax and concepts to get you back up-to-speed if it’s been a while.

Files as strings

Python can interact with the file system using the open() built-in function. The open() function accepts a mode, which is commonly either "r" for read (the default when no mode is given) or "w" for write.

In read mode (r) we have access to the .read() method:

with open("file-name-goes-here.txt", "r") as fh:
    data = fh.read()

In write mode (w) we have access to the .write() method:

with open("file-name-you-write-to.txt", "w") as fh:
    fh.write("this will go in the file\n")
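
The two modes can be combined into a small round trip: write a string, then read it back. This sketch uses a temporary directory from the tempfile standard library so that it does not leave files behind; the file name example.txt is an arbitrary choice.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Write mode creates the file (or overwrites it if it already exists).
    with open(path, "w") as fh:
        fh.write("first line\n")

    # Read mode returns the whole file content as one string.
    with open(path, "r") as fh:
        data = fh.read()

print(repr(data))  # 'first line\n'
```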

Python lists and appending

Review Data Structures and Collections in the Cheatsheet Chapter

>>> some_list = []
>>> some_list
[]
>>> some_list.append("1")
>>> some_list
['1']
>>> some_list.append("2")
>>> some_list
['1', '2']

Python strings to lists: split and splitlines

Review str.split and str.splitlines.

The .split method splits a string into a list of strings using a delimiter:

>>> some_string = "A|B|C"
>>> some_string.split("|")
['A', 'B', 'C']

Whereas .splitlines is specifically designed to handle line breaks in files:

>>> file_content = "A\nB\nC\n"
>>> file_content.splitlines()
['A', 'B', 'C']

Notice that .split('\n') is not quite the same as .splitlines():

>>> file_content.splitlines()
['A', 'B', 'C']

>>> file_content.split("\n")
['A', 'B', 'C', '']
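
The difference becomes larger when a string contains carriage-return line feed (CRLF) pairs. The sketch below assumes a string that already contains "\r\n" (for example, one built by hand as shown here); splitlines() treats each pair as a single line break, while split("\n") leaves the carriage returns behind.

```python
crlf_content = "A\r\nB\r\nC\r\n"

print(crlf_content.splitlines())  # ['A', 'B', 'C']
print(crlf_content.split("\n"))   # ['A\r', 'B\r', 'C\r', '']
```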

Practice

Today we’ll implement saving and loading game data. This means we need to answer four questions:

How do we represent game states?
How do we load game states from a file?
How do we parse data in that file into a Python data structure?
How do we save Python data back to a file?

01 How will we represent game history?

Let’s save game data to a text file that keeps track of human and computer choices made during each game.

We could represent this data as a table like the following:

	Human Choice	Computer Choice
Game 1:	rock	paper
Game 2:	paper	paper
Game 3:	scissors	rock
…	…	…
…	…	…

We might choose to simplify this table as a text file like the following:

rock,paper
paper,paper
scissors,rock

Notice:

there are no spaces in this file
data for each game is on its own line
human and computer choices are separated (delimited) by a comma ,

02 Tell “git” to “ignore” the history file

The game-history.txt file is volatile: it will change every time we play the game.

Add the file to your .gitignore, and commit the changes.

03 Load the game history

Write a function that opens the game-history.txt file, reads it, and returns a string.

def load_game_history() -> str:
    ...

Possible solution:

def load_game_history() -> str:
    with open("game-history.txt") as fh:
        return fh.read()

04 Parse game histories

Write a function that turns the raw string representation of game histories into something useful, like a list-of-lists-of-strings list[list[str]].

def parse_game_history(raw: str) -> list[list[str]]:
    ...

Hint: Remember to focus on what data the function consumes (its inputs) and what data the function produces (its output, or return value). If raw is the string:

"rock,rock\nrock,rock\n"

… then the output should be a list-of-lists-of-strings:

[["rock", "rock"], ["rock", "rock"]]

Possible solution:

def parse_game_history(raw: str) -> list[list[str]]:
    choices = []
    for line in raw.splitlines():
        choices.append(line.split(","))
    return choices

Alternate solution:

def parse_game_history(raw: str) -> list[list[str]]:
    return [line.split(",") for line in raw.splitlines()]
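
Both solutions trust that every line holds exactly two comma-separated values. In the spirit of defensive programming, a stricter variant could reject malformed lines instead of passing them along. This is only a sketch: the name parse_game_history_strict and the choice to raise ValueError are assumptions, not part of the exercise.

```python
def parse_game_history_strict(raw: str) -> list[list[str]]:
    """Parse "human,computer" lines, rejecting lines with the wrong shape."""
    games = []
    for number, line in enumerate(raw.splitlines(), start=1):
        fields = line.split(",")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 fields, got {len(fields)}")
        games.append(fields)
    return games


print(parse_game_history_strict("rock,rock\nrock,rock\n"))
# [['rock', 'rock'], ['rock', 'rock']]

try:
    parse_game_history_strict("rock,paper\nscissors\n")
except ValueError as error:
    print(error)  # line 2: expected 2 fields, got 1
```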

05 Save the game history

Let’s write the function to save the history by overwriting the game-history.txt file. This requires opening the file, iterating through each game in the history, and writing each game to the open file.

def save_game_history(history: list[list[str]]) -> None:
    ...

Possible solution:

def save_game_history(history: list[list[str]]) -> None:
    with open("game-history.txt", "w") as fh:
        for game in history:
            human, computer = game
            fh.write(human + "," + computer + "\n")
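
A quick way to gain confidence in saving is a round trip: save a history, load it again, and check that parsing gives back the original list. The sketch below uses hypothetical variants, save_history_to and load_history_from, that take the file path as a parameter so the check can run inside a temporary directory instead of touching the real game-history.txt.

```python
import os
import tempfile


def save_history_to(path: str, history: list[list[str]]) -> None:
    """Write one "human,computer" line per game, overwriting `path`."""
    with open(path, "w") as fh:
        for human, computer in history:
            fh.write(human + "," + computer + "\n")


def load_history_from(path: str) -> str:
    """Return the raw text stored at `path`."""
    with open(path) as fh:
        return fh.read()


history = [["rock", "paper"], ["paper", "paper"], ["scissors", "rock"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "game-history.txt")
    save_history_to(path, history)
    raw = load_history_from(path)

restored = [line.split(",") for line in raw.splitlines()]
print(restored == history)  # True
```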

06 Update the history every time the game is played

Use your load_game_history(), parse_game_history(), and save_game_history() functions to update the game-history.txt file every time you play rock-paper-scissors.

Play RPS a few times. Does the game-history.txt file change each time?

Possible solution:

Here is the basic idea, assuming choices are in human and computer variables:
# human, computer = ...

raw = load_game_history()
history = parse_game_history(raw)

history.append([human, computer])
save_game_history(history)

07 Delete the history file

Restart history from a blank slate:

rm game-history.txt

What happens when you play RPS now?

python3 rps.py

08 Handle the missing file case

Earlier we saw os.path.isfile to answer whether a file exists or not.

from os.path import isfile

if not isfile("game-history.txt"):
    ...

Update your load_game_history() function (and possibly parse_game_history()) to handle the initial case when the file does not exist.

Possible solution:

One idea is to return the empty string when the file does not exist:

from os.path import isfile

def load_game_history() -> str:
    if not isfile("game-history.txt"):
        return ""

    with open("game-history.txt") as fh:
        return fh.read()
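
Returning the empty string works well with the parsing step, because an empty string has no lines to split. The sketch below repeats the list-comprehension version of parse_game_history() from step 04 and shows the first-run case; the example game appended at the end is arbitrary.

```python
def parse_game_history(raw: str) -> list[list[str]]:
    return [line.split(",") for line in raw.splitlines()]


# An empty string stands in for what load_game_history() returns
# when game-history.txt does not exist yet.
history = parse_game_history("")
print(history)  # []

history.append(["rock", "paper"])
print(history)  # [['rock', 'paper']]
```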

09 Tidy up, commit, and push

Commit any remaining changes if you haven’t already and push those changes to GitHub.

Footnotes

¹

The random versus pseudorandom distinction does have one big caveat: security. Most secure computing topics are built around being able to behave randomly, or at a minimum behave in a way that is difficult for an adversary to guess. If an attacker could observe the state of a computer and be certain what would happen next, system integrity could be compromised. System randomness is therefore tiered: programs that need a good enough source of randomness (or which could benefit from seeding for reproducibility) typically use the random module, whereas security-critical programs might use the secrets module for secure numbers.

²

Careful study of Python’s random number generator implementation would show that this explanation is still lacking. The seed value initializes the behavior inside of a Mersenne Twister which itself must maintain several kilobytes of internal state (e.g. see NumPy Mersenne Twister MT19937). Nevertheless, we elide this point since PRNG state is determined from the seed.

³

When a seed is not provided, most random number generators will automatically set the seed from a source such as the operating system’s clock (Python’s random uses randomness provided by the operating system when it is available, and otherwise falls back to the current system time). Nevertheless, the dependency between “when a program is run” and “what result the program produces” is typically considered to be an implementation detail which one should avoid relying on.

⁴

Mendel Cooper (2014), “Advanced Bash-Scripting Guide”, Appendix E. Exit Codes with Special Meanings. Accessed 2024-06-22, Online: https://tldp.org/LDP/abs/html/exitcodes.html

⁵

“Bash Reference Manual”. Chapter 3.4.2, “Special Parameters”. Accessed 2024-06-22, Online: https://www.gnu.org/software/bash/manual/html_node/Special-Parameters.html

⁶

I don’t want to give the impression that SIGINT is magic and magically stops a program. The Python interpreter is designed to always be listening for the SIGINT; if the interpreter receives it in the middle of execution, then the interpreter must clean up and free memory in order to gracefully shut down. But this hints at a challenge: programming languages are complex, so there are cases that Python can handle and those it cannot. If Python crashes or is interrupted during certain operations—including writing to files—unexpected behaviors are possible.

An Introduction to Information Infrastructure II