Structured Data I: CSV

Previously: we represented the data inside of files as strings (str) or as lists of strings (list[str]). This is a reasonable place to start when the data being stored are simple.

Imagine you’re writing a grocery list. You need milk, cheese, and eggs—so perhaps you write each item you need on its own line in a text file:

milk
cheese
eggs

… then our two steps where we first read and then splitlines loads the content into the convenient list[str] format for us to do additional processing on:

>>> with open("grocery-list.txt") as fh:
...     print(fh.read().splitlines())
...
['milk', 'cheese', 'eggs']

For our own personal use: this grocery list and this grocery list structure is enough. Here: structure refers to how the data is represented in the programming language (step 1 in the function design recipe).

But often in our roles as informaticians, scientists, engineers, entrepreneurs, and everyone in between: we must represent data at multiple levels. For example: each item on our grocery list has a “location” or “aisle number” at a grocery store.

Now there are two facts we have to represent: (1) the name of each item, (2) an aisle number. This suggests that we might organize our data into a table, or as tabular data:

name	aisle
milk	24
cheese	23
eggs	19

In tabular data (also: vector data), information is stored in rows and columns. Each row (horizontal) represents one item, and each column (vertical) represents some property, attribute, or fact about that item. In the table above: each row represents a thing with a name and an aisle.

Since each row represents one item: adding another row to a set of tabular data represents adding a new item.

name	aisle
milk	24
cheese	23
eggs	19
chicken noodle soup	6

If one needs to represent additional properties of each item: one does so by adding additional columns. For example, if each row in the table represents an item we need, we might also specify a “quantity” for each item:

name	aisle	quantity
milk	24	1 gallon
cheese	23	1 bag
eggs	19	1 carton
chicken noodle soup	6	2 can

But wait, didn’t we just say that in tabular data, each row should represent a single item? If we have two cans of soup: surely those cans are made up of different molecules. So shouldn’t we have represented our data like this:

name	aisle
milk	24
cheese	23
eggs	19
chicken noodle soup	6
chicken noodle soup	6

We’ll call this the hard problem of representation. From the perspective of the person buying groceries, the previous two tables represent the same information, and the difference lies in how the shopper interprets the data.

Both representations can be correct in different scenarios. For example: what if one were shopping for multiple people? Then perhaps we would choose the second option, but also add a column “for whom” we are getting the item for.

name	aisle	for
milk	24	Bob
cheese	23	Bob
eggs	19	Alice
chicken noodle soup	6	Alice
chicken noodle soup	6	Bob

Our observations so far show that data representation is another design decision: there is rarely a strictly correct or incorrect way to represent data—only tradeoffs. These choices and tradeoffs were anticipated by the function design recipe, and are the subject of study in computer algorithms (C455), data science (I399), and information representation (I308).

In this chapter, we preface some of these problems, but focus on:

How do we represent tabular data on a computer?
How do work with tabular data in Python?
How do we pose questions to tabular data, then implement algorithms to answer them? - i.e. statistics

Follow Along with the Instructor

We’ll highlight some key points, demonstrate their usage, and implement problems together at the end.

Tabular data on computers

Structured data in programming refers to data that is organized in a specific, predictable format. Tabular data (with rows and columns) is an instance of structured data where the total number of rows represent the number of objects, and the number of columns represents the number of properties known about each object. This kind of data is so common across science and business that you may already be familiar with tools like Microsoft® Excel®. Excel stores tabular data in a .xlsx file.

Have you ever tried to open a .xlsx file in a text editor? It looks like a mess:

PK^C^D^T^@^H^H^H^@<CB>T<D9>X^@^@^@^@^@^@^@^@^@^@
^@^^@^X^@^@^@xl/drawings/drawing1.xml<9D><D0>]
n<C2>0^L^G<F0>^S<EC>^NU<DE>iZ^X^SC^T^<D0>N0^N<E0>
%n<91><8F><A3><DC>~<D1>J6i{^A^^m<CB>?<F9><EF><CD>
nt<B6><F<F8>Db^S|#<EA><B2>^R^Ez^U<B4><F1>]#^N<EF>

This is because the .xlsx is a binary file format. In a binary format, the data has been encoded into a special representation, and must be decoded before it is possible to use the data. Therefore: programmers must have specialized knowledge to work with these files, or must rely on specialized code in order to use them.

The alternative to a binary file format is a plain text format. Every file¹ we’ve used so far—text files, Python scripts, HTML documents, CSS rules—all have a plain text encoding that any text editor (nano, Visual Studio Code) or programming language can read or write.

By convention: each row from a tabular data set is stored one a line of a file, and each column is separated (or delimited) by a character between each value. The most common choice for the separator character is a comma, which leads to the comma separated value (CSV) format:

name,aisle
milk,24
cheese,23
eggs,19

Reading CSVs in Python

Python’s standard library includes a csv module for reading and writing CSV files.

As usual with the standard libraries we’ve seen previously, we access csv in a Python script by importing it:

import csv

The knowledge from last chapter is still relevant: we still need to open the file in order to use its contents:

with open("grocery-items.csv", "r") as csvfile:
    ...

But now we will invoke the csv.DictReader to parse this file from its plaintext representation into Python data structures:

with open("grocery-items.csv", "r") as csvfile:
    reader = csv.DictReader(csvfile)
    ...

The DictReader is iterable, meaning we can use a for-loop to get each item out of the object. It assumes that the first row of the file is the header, and each iteration represents a row of data. Iterating and printing each row:

with open("grocery-items.csv", "r") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)

… results in something like this:

$ python3 csv_practice.py
{'name': 'milk', 'aisle': '24'}
{'name': 'cheese', 'aisle': '23'}
{'name': 'eggs', 'aisle': '19'}

Each row is therefore a dictionary, and the value associated with a particular column is therefore a key in that dictionary. The DictReader therefore transforms the content inside a CSV file into a list of dictionaries, where each dictionary maps a string to a string: list[dict[str, str]].

Now that we’ve seen the steps, we can conclude that loading a CSV in Python can be done in two lines of code:

with open("grocery-items.csv", "r") as csvfile:
    data = list(csv.DictReader(csvfile))

Spaces are data too: don’t use them unless you want them

Let’s compare this CSV ✅:
X,Y
0,1
… against this CSV ❌:
X,  Y
0,  1
… when both are read by a csv.DictReader:
import csv

with open("example-file.csv") as csvfile:
    print(list(csv.DictReader(csvfile)))
In the first case, the dictionary keys are X and Y:
[{'X': '0', 'Y': '1'}]
For the second case: whoever made this CSV file put spaces between each header and value. Those spaces are therefore considered to be part of the data:
#           vvvvv  vvvvv
[{'X': '0', '  Y': '  1'}]

Writing CSVs in Python

The read step showed us Python’s list[dict[str, str]] representation for tabular data. The inverse problem occurs when we want to write information from one of Python’s data structures back to a file (sometimes called serialization).

This time, let’s skip to how the final result looks:

import csv

groceries = [
    {'name': 'milk', 'aisle': '24'},
    {'name': 'cheese', 'aisle': '23'},
    {'name': 'eggs', 'aisle': '19'},
]

with open("grocery-items.csv", "w") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=["name", "aisle"])
    writer.writeheader()
    writer.writerows(groceries)

… and make some observations assuming that our goal is to write the data contained in the groceries list-of-dictionaries to a grocery-items.csv file:

a file is opened in write mode
a csv.DictWriter is initialized
that writer takes a fieldnames parameter, which in this case corresponds to ["name", "aisle"] since those are the names of the columns
that writer has a .writeheader() method, which (surprise!) writes the header
the writer has a .writerows method, which takes the list of dictionaries and writes all of them to the file

Summary: read and write functions

Following our key idea from the previous chapter, reading and writing with files represents a boundary between the code and the file system—let’s set the stage for managing state by turning the read and write steps into functions.

Assuming we have a CSV file named grocery-items.csv on our computer:

name,aisle
milk,24
cheese,23
eggs,19

… then a read_groceries() function might be defined like:

import csv

def read_groceries() -> list[dict[str, str]]:
    with open("grocery-items.csv", "r") as csvfile:
        return list(csv.DictReader(csvfile))

… and an opposite write_groceries() should take a list[dict[str, str]] and write the data structure back to a file on the operating system:

def write_groceries(groceries: list[dict[str, str]]) -> None:
    with open("grocery-items.csv", "w") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["name", "aisle"])
        writer.writeheader()
        writer.writerows(groceries)

Using and analyzing tabular data in Python

Now that we have seen a fairly generic way to to read and write tabular data files in Python: everything else we might need to do can be expressed using programming concepts we’re already familiar with.

Adding a new row. Since we’ve represented data as a list of dictionaries, we can add a new row by appending a dictionary to the list.

data = read_groceries()

data.append({"name": "bread", "aisle": "2"})

write_groceries(data)

Getting the aisles. What if we just wanted to know which aisles we needed to visit on our next grocery store visit? We can accomplish that by iterating through each row, pulling out the 'aisle' key, and returning a list.

groceries = read_groceries()

aisles = []
for row in groceries:
    aisles.append(row['aisle'])

# ['24', '23', '19']

Side note: getting aisles with list comprehension

If you’ve familiar with list comprehension (or its discrete math counterpart: set-builder notation), then the idea from these loops can be compressed into:
groceries = read_groceries()

aisles = [row['aisle'] for row in groceries]

Python Review: lists, dictionaries, sorting, stripping

As always, the concepts and goals are more important than anything in a particular language. But again we’ll briefly review some of Python’s functions and methods that will be helpful for this chapter. The Data Structures and Collections portion of the cheat sheet is particularly relevant here.

sort versus sorted

Do you remember the difference between the function sorted() and the method .sort()? We had a vocabulary word in the last chapter to now explain the difference sorted() returns a sorted copy of a data structure, .sort() mutates a list: permanently changing it.

>>> lst = [4, 2, 5, 1, 3]
>>> sorted(lst)
[1, 2, 3, 4, 5]
>>> lst
[4, 2, 5, 1, 3]

… versus:

>>> lst = [4, 2, 5, 1, 3]
>>> lst.sort()
>>> lst
[1, 2, 3, 4, 5]

Sorting footguns: remember your data types

Sorting only makes sense for certain data types: usually integers (int), or strings (str).

For strings, sorting (sort of) means alphabetizing:
>>> sorted(['apple', 'ardvark', 'angel', 'alligator', 'aloe'])
['alligator', 'aloe', 'angel', 'apple', 'ardvark']
For integers, sorting means least-to-greatest:
>>> sorted([5, 3, 4, 1, 2])
[1, 2, 3, 4, 5]
If a list of strings happens to contain strings that look like numbers: the output order is sometimes called alphanumeric, where the correct order is: 1, 111, 2.
>>> sorted(['111', '1', '2'])
['1', '111', '2']
Sorting with mixed data types is ambiguous: causing Python to fail with a TypeError:
>>> sorted([4, 'g', 3.2, 'coyote'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'int'

Dictionaries

Review “Dictionaries” in the cheat sheet. Also recall the methods: .keys(), .values(), and .items(): which return objects containing just the keys, values, and tuples of (key, value) pairs, respectively:

>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.keys()
dict_keys(['apple', 'orange'])

>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.values()
dict_values([5, 3])

>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.items()
dict_items([('apple', 5), ('orange', 3)])

Strings: strip, replace

We previously reviewed str.split and str.splitlines: methods that turned strings into lists of strings. Often there is also some cleanup we need to do within a string itself. Python’s str.strip and str.replace are often useful here. Each return a copy of the string, where either whitespace is removed:

>>> "  A     ".strip()
'A'

… or particular values are replaced:

>>> "A|B|C|D".replace("|", ",")
'A,B,C,D'

Practice

CSV is a useful interchange format for sharing data. Remember that game-history.txt from last time? It’s already a comma-delimited file, so let’s standardize its format. Then we’ll use the CSV to compute statistics, and (maybe) use the data to implement a more-intelligent computer player.

01 From TXT to CSV

Previously our game-history.txt looked like the following, where we had an implicit understanding that the left column represented the history of human choices and the right column represented the history of computer choices:

rock,paper
paper,paper
scissors,rock

Let’s re-write to make this explicit.

Rename game-history.csv to game-history.csv
Add game-history.csv to the .gitignore
Add a header row to the file: human,computer,winner
Fill in values for the winner column with values: (computer, human, or tie)

As an example, the file should now look similar to this:

human,computer,winner
rock,paper,computer
paper,paper,tie
scissors,rock,computer

02 Write a function to determine: human, computer, or tie

The CSV structure we sketched out has a “winner” column. Write a function that returns human, computer, or tie for any pair of inputs:

def decide_winner(human: str, computer: str) -> str:
    ...

Possible solution:

def decide_winner(human: str, computer: str) -> str:
    """Return 'human', 'computer', or 'tie'"""
    if human == computer:
        return "tie"
    elif beats(human, computer):
        return "human"
    return "computer"

03 Rewrite loading and saving using csv

Now that we have a CSV, rewrite the load_game_history and save_game_history functions to use the csv.DictReader and csv.DictWriter, which use lists of dictionaries that map strings to strings list[dict[str, str]]:

def load_game_history() -> list[dict[str, str]]:
    ...

def save_game_history(history: list[dict[str, str]]) -> None:
    ...

Also account for:

We no longer need parse_game_history
Fix the append step when updating game history (we now need to append a dictionary rather than a list)

Possible Solution: loading and saving

Assuming we did import csv, we might write loading and saving as:

def load_game_history() -> list[dict[str, str]]:
    if not isfile("game-history.csv"):
        return []
    with open("game-history.csv", "r") as csvfile:
        return list(csv.DictReader(csvfile))

def save_game_history(history: list[dict[str, str]]) -> None:
    with open("game-history.csv", "w") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["human", "computer", "winner"])
        writer.writeheader()
        writer.writerows(history)

04 Compute the number of wins and ties

Let’s work toward building a “scoreboard” for how many times the human won, the computer won, or a game resulted in a tie. A good first step would be to write a function that takes a list of final states and counts how many times each result occurred:

def sum_game_outcomes(game_outcomes: list[str]) -> dict[str, int]:
    ...

Here we recommend always returning values for each case, and 0 for cases we haven’t seen:

>>> sum_game_outcomes(["human", "computer"])
{'human': 1, 'computer': 1, 'tie': 0}

>>> sum_game_outcomes(["tie", "tie", "tie", "tie"])
{'human': 0, 'computer': 0, 'tie': 4}

Possible Solution:

def sum_game_outcomes(game_outcomes: list[str]) -> dict[str, int]:
    counts = {"human": 0, "computer": 0, "tie": 0}
    for outcome in game_outcomes:
        counts[outcome] += 1
    return counts

04 Print a scoreboard

Now let’s make that dictionary more human-readable. Take a count dictionary as input and produce a scoreboard. When this function is complete, print the scoreboard at the end of each game.

def print_scoreboard(outcomes: dict[str, int]) -> None:
    ...

>>> print_scoreboard(sum_game_outcomes(["human", "computer"]))
human       1
computer    1
tie         0

Possible Solution:

Implement the function:
def print_scoreboard(outcomes: dict[str, int]) -> None:
    human, computer, tie = outcomes.values()
    board = f"human      {human}\ncomputer   {computer}\ntie        {tie}"
    print(board)
… then somewhere in the main() function, select winner keys from each row and print:
winners = []
for row in history:
    winners.append(row["winner"])
print_scoreboard(sum_game_outcomes(winners))
Or, you can rewrite the for loop into a list comprehension:
print_scoreboard(sum_game_outcomes([r["winner"] for r in history]))

Bonus: Compute how often a player chooses each action

Implement a function to estimate how often a player chooses each action: rock, paper, or scissors.

def estimate_distribution(player_choices: list[str]) -> dict[str, float]:
    ...

For example:

>>> estimate_distribution([])
{'rock': 0.0, 'paper': 0.0, 'scissors': 0.0}

>>> estimate_distribution(["rock"])
{'rock': 1.0, 'paper': 0.0, 'scissors': 0.0}

>>> estimate_distribution(["rock", "paper", "scissors"])
{'rock': 0.3333333333333333, 'paper': 0.3333333333333333, 'scissors': 0.3333333333333333}

>>> estimate_distribution(["rock", "rock", "paper", "scissors"])
{'rock': 0.5, 'paper': 0.25, 'scissors': 0.25}

Bonus: Intelligent computer player

Since we’re already recording player data: perhaps could leverage that data to implement a more-interesting computer player. The optimality of randomly choosing an action assumes that one has no knowledge of the opponent. But we now have a CSV of previous games and the choices each player made.

What if our opponent isn’t perfectly random: and chooses “rock” more frequently than the other two?

\(x\)	\(P(human = x)\)
rock	\(0.50\)
paper	\(0.25\)
scissors	\(0.25\)

If you knew your opponent was more likely to choose “rock”, what should you do?

Here’s an idea: we should choose the best response according to how we expect the opponent to behave. Therefore, the computer should not pick uniformly from its choices, but should weight its guess proportional to the odds that a choice will beat the opponent. If the player chooses “rock” 50% of the time, the opponent should choose “paper” 50% of the time.

\(x\)	\(P(human = x)\)	\(P(computer = x)\)
rock	0.5	0.25
paper	0.25	0.5
scissors	0.25	0.25

Rewrite the make_computer_choice() function to take information from game-history.csv into account when choosing the computer move. Hint: random.choices takes a sample population and a list of weights as arguments, and returns a weighted random sample from the population. For example, calling random.choice with weights=[0.01, 0.99] will choose the second option 99% of the time:

>>> import random
>>> random.choices(["A", "B"], weights=[0.01, 0.99], k=1)
['B']

Footnotes

There’s a saying that “everything in Linux is a file”, so we have been working with binary files this whole time, but that fact has mostly been hidden to us.

An Introduction to Information Infrastructure II