Structured Data I: CSV
Previously: we represented the data inside of files as strings (str
) or as lists of strings (list[str]
). This is a reasonable place to start when the data being stored are simple.
Imagine you’re writing a grocery list. You need milk, cheese, and eggs—so perhaps you write each item you need on its own line in a text file:
milk
cheese
eggs
… then our two steps where we first read
and then splitlines
loads the content into the convenient list[str]
format for us to do additional processing on:
>>> with open("grocery-list.txt") as fh:
... print(fh.read().splitlines())
...
['milk', 'cheese', 'eggs']
For our own personal use: this grocery list and this grocery list structure is enough. Here: structure refers to how the data is represented in the programming language (step 1 in the function design recipe).
But often in our roles as informaticians, scientists, engineers, entrepreneurs, and everyone in between: we must represent data at multiple levels. For example: each item on our grocery list has a “location” or “aisle number” at a grocery store.
Now there are two facts we have to represent: (1) the name of each item, (2) an aisle number. This suggests that we might organize our data into a table, or as tabular data:
name | aisle |
---|---|
milk | 24 |
cheese | 23 |
eggs | 19 |
In tabular data (also: vector data), information is stored in rows and columns. Each row (horizontal) represents one item, and each column (vertical) represents some property, attribute, or fact about that item. In the table above: each row represents a thing with a name and an aisle.
Since each row represents one item: adding another row to a set of tabular data represents adding a new item.
name | aisle |
---|---|
milk | 24 |
cheese | 23 |
eggs | 19 |
chicken noodle soup | 6 |
If one needs to represent additional properties of each item: one does so by adding additional columns. For example, if each row in the table represents an item we need, we might also specify a “quantity” for each item:
name | aisle | quantity |
---|---|---|
milk | 24 | 1 gallon |
cheese | 23 | 1 bag |
eggs | 19 | 1 carton |
chicken noodle soup | 6 | 2 can |
But wait, didn’t we just say that in tabular data, each row should represent a single item? If we have two cans of soup: surely those cans are made up of different molecules. So shouldn’t we have represented our data like this:
name | aisle |
---|---|
milk | 24 |
cheese | 23 |
eggs | 19 |
chicken noodle soup | 6 |
chicken noodle soup | 6 |
We’ll call this the hard problem of representation. From the perspective of the person buying groceries, the previous two tables represent the same information, and the difference lies in how the shopper interprets the data.
Both representations can be correct in different scenarios. For example: what if one were shopping for multiple people? Then perhaps we would choose the second option, but also add a column “for whom” we are getting the item for.
name | aisle | for |
---|---|---|
milk | 24 | Bob |
cheese | 23 | Bob |
eggs | 19 | Alice |
chicken noodle soup | 6 | Alice |
chicken noodle soup | 6 | Bob |
Our observations so far show that data representation is another design decision: there is rarely a strictly correct or incorrect way to represent data—only tradeoffs. These choices and tradeoffs were anticipated by the function design recipe, and are the subject of study in computer algorithms (C455), data science (I399), and information representation (I308).
In this chapter, we preface some of these problems, but focus on:
- How do we represent tabular data on a computer?
- How do work with tabular data in Python?
- How do we pose questions to tabular data, then implement algorithms to answer them? - i.e. statistics
Follow Along with the Instructor
We’ll highlight some key points, demonstrate their usage, and implement problems together at the end.
Tabular data on computers
Structured data in programming refers to data that is organized in a specific, predictable format. Tabular data (with rows and columns) is an instance of structured data where the total number of rows represent the number of objects, and the number of columns represents the number of properties known about each object. This kind of data is so common across science and business that you may already be familiar with tools like Microsoft® Excel®. Excel stores tabular data in a .xlsx
file.
Have you ever tried to open a .xlsx
file in a text editor? It looks like a mess:
PK^C^D^T^@^H^H^H^@<CB>T<D9>X^@^@^@^@^@^@^@^@^@^@
^@^^@^X^@^@^@xl/drawings/drawing1.xml<9D><D0>]
n<C2>0^L^G<F0>^S<EC>^NU<DE>iZ^X^SC^T^<D0>N0^N<E0>
%n<91><8F><A3><DC>~<D1>J6i{^A^^m<CB>?<F9><EF><CD>
nt<B6><F<F8>Db^S|#<EA><B2>^R^Ez^U<B4><F1>]#^N<EF>
This is because the .xlsx
is a binary file format. In a binary format, the data has been encoded into a special representation, and must be decoded before it is possible to use the data. Therefore: programmers must have specialized knowledge to work with these files, or must rely on specialized code in order to use them.
The alternative to a binary file format is a plain text format. Every file1 we’ve used so far—text files, Python scripts, HTML documents, CSS rules—all have a plain text encoding that any text editor (nano, Visual Studio Code) or programming language can read or write.
By convention: each row from a tabular data set is stored one a line of a file, and each column is separated (or delimited) by a character between each value. The most common choice for the separator character is a comma, which leads to the comma separated value (CSV) format:
name,aisle
milk,24
cheese,23
eggs,19
Reading CSVs in Python
Python’s standard library includes a csv
module for reading and writing CSV files.
As usual with the standard libraries we’ve seen previously, we access csv
in a Python script by importing it:
import csv
The knowledge from last chapter is still relevant: we still need to open
the file in order to use its contents:
with open("grocery-items.csv", "r") as csvfile:
...
But now we will invoke the csv.DictReader
to parse this file from its plaintext representation into Python data structures:
with open("grocery-items.csv", "r") as csvfile:
reader = csv.DictReader(csvfile)
...
The DictReader
is iterable, meaning we can use a for
-loop to get each item out of the object. It assumes that the first row of the file is the header, and each iteration represents a row of data. Iterating and printing each row:
with open("grocery-items.csv", "r") as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print(row)
… results in something like this:
$ python3 csv_practice.py
{'name': 'milk', 'aisle': '24'}
{'name': 'cheese', 'aisle': '23'}
{'name': 'eggs', 'aisle': '19'}
Each row is therefore a dictionary, and the value associated with a particular column is therefore a key in that dictionary. The DictReader
therefore transforms the content inside a CSV file into a list of dictionaries, where each dictionary maps a string to a string: list[dict[str, str]]
.
Now that we’ve seen the steps, we can conclude that loading a CSV in Python can be done in two lines of code:
with open("grocery-items.csv", "r") as csvfile:
data = list(csv.DictReader(csvfile))
Spaces are data too: don’t use them unless you want them
Let’s compare this CSV ✅:
X,Y 0,1
… against this CSV ❌:
X, Y 0, 1
… when both are read by a
csv.DictReader
:import csv with open("example-file.csv") as csvfile: print(list(csv.DictReader(csvfile)))
In the first case, the dictionary keys are
X
andY
:[{'X': '0', 'Y': '1'}]
For the second case: whoever made this CSV file put spaces between each header and value. Those spaces are therefore considered to be part of the data:
# vvvvv vvvvv [{'X': '0', ' Y': ' 1'}]
Writing CSVs in Python
The read step showed us Python’s list[dict[str, str]]
representation for tabular data. The inverse problem occurs when we want to write information from one of Python’s data structures back to a file (sometimes called serialization).
This time, let’s skip to how the final result looks:
import csv
groceries = [
{'name': 'milk', 'aisle': '24'},
{'name': 'cheese', 'aisle': '23'},
{'name': 'eggs', 'aisle': '19'},
]
with open("grocery-items.csv", "w") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=["name", "aisle"])
writer.writeheader()
writer.writerows(groceries)
… and make some observations assuming that our goal is to write the data contained in the groceries
list-of-dictionaries to a grocery-items.csv
file:
- a file is opened in write mode
- a
csv.DictWriter
is initialized - that writer takes a
fieldnames
parameter, which in this case corresponds to["name", "aisle"]
since those are the names of the columns - that writer has a
.writeheader()
method, which (surprise!) writes the header - the writer has a
.writerows
method, which takes the list of dictionaries and writes all of them to the file
Summary: read and write functions
Following our key idea from the previous chapter, reading and writing with files represents a boundary between the code and the file system—let’s set the stage for managing state by turning the read and write steps into functions.
Assuming we have a CSV file named grocery-items.csv
on our computer:
name,aisle
milk,24
cheese,23
eggs,19
… then a read_groceries()
function might be defined like:
import csv
def read_groceries() -> list[dict[str, str]]:
with open("grocery-items.csv", "r") as csvfile:
return list(csv.DictReader(csvfile))
… and an opposite write_groceries()
should take a list[dict[str, str]]
and write the data structure back to a file on the operating system:
def write_groceries(groceries: list[dict[str, str]]) -> None:
with open("grocery-items.csv", "w") as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=["name", "aisle"])
writer.writeheader()
writer.writerows(groceries)
Using and analyzing tabular data in Python
Now that we have seen a fairly generic way to to read and write tabular data files in Python: everything else we might need to do can be expressed using programming concepts we’re already familiar with.
Adding a new row. Since we’ve represented data as a list of dictionaries, we can add a new row by appending a dictionary to the list.
data = read_groceries()
data.append({"name": "bread", "aisle": "2"})
write_groceries(data)
Getting the aisles. What if we just wanted to know which aisles we needed to visit on our next grocery store visit? We can accomplish that by iterating through each row, pulling out the 'aisle'
key, and returning a list.
groceries = read_groceries()
aisles = []
for row in groceries:
aisles.append(row['aisle'])
# ['24', '23', '19']
Side note: getting aisles with list comprehension
If you’ve familiar with list comprehension (or its discrete math counterpart: set-builder notation), then the idea from these loops can be compressed into:
groceries = read_groceries() aisles = [row['aisle'] for row in groceries]
Python Review: lists, dictionaries, sorting, stripping
As always, the concepts and goals are more important than anything in a particular language. But again we’ll briefly review some of Python’s functions and methods that will be helpful for this chapter. The Data Structures and Collections portion of the cheat sheet is particularly relevant here.
sort versus sorted
Do you remember the difference between the function sorted()
and the method .sort()
? We had a vocabulary word in the last chapter to now explain the difference sorted()
returns a sorted copy of a data structure, .sort()
mutates a list: permanently changing it.
>>> lst = [4, 2, 5, 1, 3]
>>> sorted(lst)
[1, 2, 3, 4, 5]
>>> lst
[4, 2, 5, 1, 3]
… versus:
>>> lst = [4, 2, 5, 1, 3]
>>> lst.sort()
>>> lst
[1, 2, 3, 4, 5]
Sorting footguns: remember your data types
Sorting only makes sense for certain data types: usually integers (
int
), or strings (str
).For strings, sorting (sort of) means alphabetizing:
>>> sorted(['apple', 'ardvark', 'angel', 'alligator', 'aloe']) ['alligator', 'aloe', 'angel', 'apple', 'ardvark']
For integers, sorting means least-to-greatest:
>>> sorted([5, 3, 4, 1, 2]) [1, 2, 3, 4, 5]
If a list of strings happens to contain strings that look like numbers: the output order is sometimes called alphanumeric, where the correct order is:
1
,111
,2
.>>> sorted(['111', '1', '2']) ['1', '111', '2']
Sorting with mixed data types is ambiguous: causing Python to fail with a
TypeError
:>>> sorted([4, 'g', 3.2, 'coyote']) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: '<' not supported between instances of 'str' and 'int'
Dictionaries
Review “Dictionaries” in the cheat sheet. Also recall the methods: .keys()
, .values()
, and .items()
: which return objects containing just the keys, values, and tuples of (key, value)
pairs, respectively:
>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.keys()
dict_keys(['apple', 'orange'])
>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.values()
dict_values([5, 3])
>>> fruit = {"apple": 5, "orange": 3}
>>> fruit.items()
dict_items([('apple', 5), ('orange', 3)])
Strings: strip, replace
We previously reviewed str.split
and str.splitlines
: methods that turned strings into lists of strings. Often there is also some cleanup we need to do within a string itself. Python’s str.strip
and str.replace
are often useful here. Each return a copy of the string, where either whitespace is removed:
>>> " A ".strip()
'A'
… or particular values are replaced:
>>> "A|B|C|D".replace("|", ",")
'A,B,C,D'
Practice
CSV is a useful interchange format for sharing data. Remember that game-history.txt
from last time? It’s already a comma-delimited file, so let’s standardize its format. Then we’ll use the CSV to compute statistics, and (maybe) use the data to implement a more-intelligent computer player.
01 From TXT to CSV
Previously our game-history.txt
looked like the following, where we had an implicit understanding that the left column represented the history of human choices and the right column represented the history of computer choices:
rock,paper
paper,paper
scissors,rock
Let’s re-write to make this explicit.
- Rename
game-history.csv
togame-history.csv
- Add
game-history.csv
to the.gitignore
- Add a header row to the file:
human,computer,winner
- Fill in values for the winner column with values: (
computer
,human
, ortie
)
As an example, the file should now look similar to this:
human,computer,winner
rock,paper,computer
paper,paper,tie
scissors,rock,computer
02 Write a function to determine: human, computer, or tie
The CSV structure we sketched out has a “winner” column. Write a function that returns human
, computer
, or tie
for any pair of inputs:
def decide_winner(human: str, computer: str) -> str:
...
Possible solution:
def decide_winner(human: str, computer: str) -> str:
"""Return 'human', 'computer', or 'tie'"""
if human == computer:
return "tie"
elif beats(human, computer):
return "human"
return "computer"
03 Rewrite loading and saving using csv
Now that we have a CSV, rewrite the load_game_history
and save_game_history
functions to use the csv.DictReader
and csv.DictWriter
, which use lists of dictionaries that map strings to strings list[dict[str, str]]
:
def load_game_history() -> list[dict[str, str]]:
...
def save_game_history(history: list[dict[str, str]]) -> None:
...
Also account for:
- We no longer need
parse_game_history
- Fix the
append
step when updating game history (we now need to append a dictionary rather than a list)
Possible Solution: loading and saving
Assuming we did
import csv
, we might write loading and saving as:def load_game_history() -> list[dict[str, str]]: if not isfile("game-history.csv"): return [] with open("game-history.csv", "r") as csvfile: return list(csv.DictReader(csvfile)) def save_game_history(history: list[dict[str, str]]) -> None: with open("game-history.csv", "w") as csvfile: writer = csv.DictWriter(csvfile, fieldnames=["human", "computer", "winner"]) writer.writeheader() writer.writerows(history)
04 Compute the number of wins and ties
Let’s work toward building a “scoreboard” for how many times the human won, the computer won, or a game resulted in a tie. A good first step would be to write a function that takes a list of final states and counts how many times each result occurred:
def sum_game_outcomes(game_outcomes: list[str]) -> dict[str, int]:
...
Here we recommend always returning values for each case, and 0 for cases we haven’t seen:
>>> sum_game_outcomes(["human", "computer"])
{'human': 1, 'computer': 1, 'tie': 0}
>>> sum_game_outcomes(["tie", "tie", "tie", "tie"])
{'human': 0, 'computer': 0, 'tie': 4}
Possible Solution:
def sum_game_outcomes(game_outcomes: list[str]) -> dict[str, int]: counts = {"human": 0, "computer": 0, "tie": 0} for outcome in game_outcomes: counts[outcome] += 1 return counts
04 Print a scoreboard
Now let’s make that dictionary more human-readable. Take a count dictionary as input and produce a scoreboard. When this function is complete, print the scoreboard at the end of each game.
def print_scoreboard(outcomes: dict[str, int]) -> None:
...
>>> print_scoreboard(sum_game_outcomes(["human", "computer"]))
human 1
computer 1
tie 0
Possible Solution:
Implement the function:
def print_scoreboard(outcomes: dict[str, int]) -> None: human, computer, tie = outcomes.values() board = f"human {human}\ncomputer {computer}\ntie {tie}" print(board)
… then somewhere in the
main()
function, select winner keys from each row and print:winners = [] for row in history: winners.append(row["winner"]) print_scoreboard(sum_game_outcomes(winners))
Or, you can rewrite the for loop into a list comprehension:
print_scoreboard(sum_game_outcomes([r["winner"] for r in history]))
Bonus: Compute how often a player chooses each action
Implement a function to estimate how often a player chooses each action: rock, paper, or scissors.
def estimate_distribution(player_choices: list[str]) -> dict[str, float]:
...
For example:
>>> estimate_distribution([])
{'rock': 0.0, 'paper': 0.0, 'scissors': 0.0}
>>> estimate_distribution(["rock"])
{'rock': 1.0, 'paper': 0.0, 'scissors': 0.0}
>>> estimate_distribution(["rock", "paper", "scissors"])
{'rock': 0.3333333333333333, 'paper': 0.3333333333333333, 'scissors': 0.3333333333333333}
>>> estimate_distribution(["rock", "rock", "paper", "scissors"])
{'rock': 0.5, 'paper': 0.25, 'scissors': 0.25}
Bonus: Intelligent computer player
Since we’re already recording player data: perhaps could leverage that data to implement a more-interesting computer player. The optimality of randomly choosing an action assumes that one has no knowledge of the opponent. But we now have a CSV of previous games and the choices each player made.
What if our opponent isn’t perfectly random: and chooses “rock” more frequently than the other two?
\(x\) | \(P(human = x)\) |
---|---|
rock | \(0.50\) |
paper | \(0.25\) |
scissors | \(0.25\) |
If you knew your opponent was more likely to choose “rock”, what should you do?
Here’s an idea: we should choose the best response according to how we expect the opponent to behave. Therefore, the computer should not pick uniformly from its choices, but should weight its guess proportional to the odds that a choice will beat the opponent. If the player chooses “rock” 50% of the time, the opponent should choose “paper” 50% of the time.
\(x\) | \(P(human = x)\) | \(P(computer = x)\) |
---|---|---|
rock | 0.5 | 0.25 |
paper | 0.25 | 0.5 |
scissors | 0.25 | 0.25 |
Rewrite the make_computer_choice()
function to take information from game-history.csv
into account when choosing the computer move. Hint: random.choices
takes a sample population and a list of weights as arguments, and returns a weighted random sample from the population. For example, calling random.choice
with weights=[0.01, 0.99]
will choose the second option 99% of the time:
>>> import random
>>> random.choices(["A", "B"], weights=[0.01, 0.99], k=1)
['B']
Footnotes
There’s a saying that “everything in Linux is a file”, so we have been working with binary files this whole time, but that fact has mostly been hidden to us.