PyMySQL and database connectors

Previously we introduced an interactive approach to working with a database. We (1) opened a terminal, (2) connected to a database, and (3) typed SQL statements and saw the results in real time.

Notice how similar this interactive database REPL is compared to the way we introduced Python. When we introduced Python, we (1) opened a terminal, (2) started the Python REPL, and (3) typed Python statements and saw the results in real time.

Each of these are excellent tools for rapid prototyping: and they should absolutely be used when one is exploring a new idea, or looking up a result. But the day-to-day tasks of a business analyst or data scientist are rarely achieved by typing one-off SQL statements into a REPL then telling people what they saw. An analyst is far more effective when they: (1) pull data from a database, (2) visualize or extract insights from it, and (3) communicate and develop actionable tactics based on it. What does that sound like? Programming!

Now if only there were some way to interact with a database directly from Python….

Hi summer students! 👋

We’ve linked a slide deck, but most of the information is also in the text below. Choose which format works best for you. The practice towards the end is not in the slides.

Slides

PyMySQL

Our programming tool of choice here is PyMySQL. This is a third-party Python package (similar to Flask and Jinja) which we will use to interact with our database. Rather than typing commands manuall, it helps us automate common tasks like: creating new tables, inserting rows, deleting data, or selecting data.

This makes PyMySQL part of a more-general class of tools called database connectors, which would be inconvenient to implement by ourselves every time. For example, running a select statement interactively in MariaDB will also visualize the data in a text-based table:

> select id, name from flowers limit 2;
+----+-----------+
| id | name      |
+----+-----------+
|  1 | Hyacinth  |
|  2 | Daylilies |
+----+-----------+

Printing a string is helpful when we’re working interactively, but when we’re programming: we typically need that data to be parsed into data structures and types that a programming language is designed around. PyMySQL uses a table’s schema to do this: meaning that integers will be integers, strings will be strings, and dates will be datetime objects (no more stringly-typed data like we had in CSVs!):

>>> curr.execute("select id, name from flowers limit 2")
>>> curr.fetchall()
[{'id': 1, 'name': 'Hyacinth'}, {'id': 2, 'name': 'Daylilies'}]

This is powerful, but this is another great power comes with great responsibility moment. We will treat a database like a really complex file, where everything we do is permanent. If we instruct the database to delete something: then that data will disappear and there is no “Undo” button.¹

Therefore: we will emphasize a specific way of working with databases, where we can easily reproduce a “safe” or “clean” or “base” state for what our application’s data must look like. If something goes wrong, we want an easy way to “reset” our system.

Connecting with PyMySQL

All the login information is still needed, but we’ll pass that information into the PyMySQL connect function. In the same way we specified our username and password at the command line:

$ mysql -h HOST -u USERNAME --password=PASSWORD -D DATABASE

MariaDB [i211u24_ebigalee]> select * from books;

… we will need to provide this same information from Python:

import pymysql

conn = pymysql.connect(
    host=DB_HOST,
    user=DB_USER,
    password=DB_PASSWORD,
    database=DB_DATABASE,
)

Before we start typing in our credentials, we should talk about a security problem. Can you foresee a problem with writing our password in plain text? Let’s pretend to be evil for a moment and discuss the following listing. 😈

import pymysql

def get_connection():
    return pymysql.connect(
        host="127.0.0.1",
        user="ebigalee",
        password="123456",
        database="db.iu.edu",
    )

An evil person 👿 reading our code will be able to steal our credentials (though Erika has other security issues too if she thinks that’s a secure password) and do who-knows-how-much damage with them. How might we fix this? Here’s an idea: what if we could move sensitive information (like passwords) into variables, then “hide” those variables somehow?

import pymysql
from flaskapp.config import DB_PASSWORD

def get_connection():
    return pymysql.connect(
        host="127.0.0.1",
        user="ebigalee",
        password=DB_PASSWORD,
        database="db.iu.edu",
    )

This step is called configuration. Often an application needs extra information to work correctly: but those are details which that we don’t want to “hard-code” into an app. Perhaps this is because we don’t want to repeat these details, or perhaps we want a central location to modify them, or perhaps there are security ramifications if the configuration leaked to that evil person we mentioned.

Here we will will move sensitive configuration variables into a config.py file:

# config.py
DB_HOST = "..."
DB_DATABASE = "..."
DB_USER = "..."
DB_PASSWORD = "..."

… make sure we’ve told git to ignore that file so people cannot read our code on GitHub to find our passwords:²

# .gitignore
config.py

And now that our config.py contains sensitive information, we will only share that file with people we trust. Otherwise, this get_connection() function now abstracts away the details for how we connect to the database:

import pymysql
from flaskapp.config import DB_HOST, DB_DATABASE, DB_USER, DB_PASSWORD

def get_connection():
    return pymysql.connect(
        host=DB_HOST,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_DATABASE,
        cursorclass=pymysql.cursors.DictCursor,
    )

Now that we have a safe workspace, we’re ready to begin using the database from within our application.

Follow Along with the Instructor

Practice with the instructor. Not an exact replacement for the written directions. Read the text above this first before proceeding.

Set up PyMySQL and initialize database.
Apologies, in some spots I put a black box over my login info for Maria DB. If you need help logging in beyond what is in this chapter, see the weekly page in Canvas, as well as the README in your repo.

Create ‘database.py’ and ‘config.py’

Open your i211-lecture repository to follow along:

Let’s create a module to imports pymysql and handle all database interactions—start by creating the files database.py and config.py inside the flaskapp directory. Your file hierarchy should now look like this:

flaskapp
├── __init__.py
├── __main__.py
├── app.py
├── config.py
├── database.py
├── static
├── templates
└── tests

In config.py, start from this template but fill in your MariaDB database credentials (find these in Canvas, look under this week’s to-do):

DB_HOST = "..."
DB_DATABASE = "..."
DB_USER = "..."
DB_PASSWORD = "..."

In database.py add some code to use as a starter:

import csv

import pymysql
from flaskapp.config import DB_HOST, DB_DATABASE, DB_USER, DB_PASSWORD


def get_connection():
    return pymysql.connect(
        host=DB_HOST,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_DATABASE,
        cursorclass=pymysql.cursors.DictCursor,
    )


def initialize_db():
    conn = get_connection()

    # TODO: add "create table" variables here

    with conn.cursor() as curr:
        curr.execute("drop table if exists recipes")
        curr.execute("drop table if exists supplies")
        curr.execute(_recipes)
        curr.execute(_supplies)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    initialize_db()

Using PyMySQL in Five Steps

Our first goal is to complete the function initialize_db() and to do that we need to first understand the steps required when using PyMySQL to execute SQL.

How we put these steps together with the rest of our code might change slightly, but the other steps are boilerplate. This connect-handle-close process is not all that different from opening a a CSV file, accessing data using a cursor to read/write, and closing the file when done.

As you read about each step: find them in your database.py

1. Create a connection to the database

conn = get_connection()

2. Obtain a cursor

curr = conn.cursor()

3. Execute a query

curr.execute("drop table person")

Details in step 3 on what SQL we’re executing will change as needed.

4. Commit the changes

conn.commit()

A commit in a database is like a “save point”. If we change something (read, insert, delete) then we need to save those changes. Reading (select) does not need a commit, because nothing changed.

5. Close the cursor and connection

conn.close()
curr.close()

Practice: Use PyMySQL to Create Data Tables

In the initialize_db() function, we’re going to use Python to create a string containing the SQL query to create a table.

def initialize_db():
    conn = get_connection()

    # TODO: add "create table" variables here

    with conn.cursor() as curr:
        curr.execute("drop table if exists recipes")
        curr.execute("drop table if exists supplies")
        curr.execute(_recipes)
        curr.execute(_supplies)
    conn.commit()
    conn.close()

When we run this function, we:

establish a connection to the database
TODO: define “create table” statements
using the cursor:
- drop any tables that exist (to “reset” our tables)
- create the tables using the SQL we wrote earlier
commit the changes
close the connection

Context management and cursor cleanup

We said earlier that we also need to “close the cursor”, but here we use a context manager (Python’s with statement) to “self-close” the connection.

Compare this to how we introduced files, where we also used with (alongside open()) when reading or writing to files. In that situation: the with context manager also automatically closed a file when we were done with it. Without it: we would have had to explicitly call .close() on the file:
fh = open("some-file.txt")
fh.read()
fh.close()

01 Create a table for supplies in the database

To start, let’s define a variable _supplies that will hold our SQL query. The query will be formatted as a string.

    _supplies = """ put SQL here """

If we make the string using triple quotemarks, this allows us to keep the formatting for the SQL statement in place, such as the line breaks and any indentation.
The _ underscore in front of supplies means this is a variable meant to be used internally in a program, in this case, only within our database.py (if you take a course on object-oriented programming, this will make more sense. For now, just go with it. 😊)

Next we need to write the SQL command:

    _supplies = """
    create table supplies () engine=InnoDB;
    """

Then slowly add the attributes required:

    _supplies = """
    create table supplies (
        supply_name varchar(50),
        description varchar(1000)
    ) engine=InnoDB;
    """

We also need a way to uniquely identify each record in the data—an ID. An integer set to automatically increment and labeled as the “primary key” in our data table will work.

    _supplies = """
    create table supplies (
        id int auto_increment primary key,
        supply_name varchar(50),
        description varchar(1000)
    ) engine=InnoDB;
    """

02 Create a table for recipes in the database

Following the format for _supplies, define a new variable _recipes that with a SQL statement to create a table. The attributes for this table reflect the header row in recipes.csv.

Solution

_recipes = """
    create table recipes (
        id int auto_increment primary key,
        recipe_slug varchar(50),
        recipe_name varchar(50),
        description varchar(1000),
        recipe_image tinytext,
        rating int,
        url tinytext
    ) engine=InnoDB
"""

03 Run the database module

We don’t need to re-create the tables every time we start Flask. But we DO want to be able to run the code inside database.py to re-create the tables if something goes wrong, which is why we added initialize_db() to the main block:

if __name__ == "__main__":
    initialize_db()

Run database.py to run initialize_db and create the tables:

Make sure you are connected to IU’s VPN if you are not on campus (Ivanti)
Activate the virtual environment inside your lecture repository
Run the database module

source venv/bin/activate
python3 -m flaskapp.database

04 Double check your work

From SILO, use the command line to access MariaDB (login details on the placeholders are on Canvas):

mysql -h HOST -u USERNAME --password=PASSWORD -D DATABASE

Check that the tables are indeed created:

SHOW TABLES;
DESC supplies;
DESC recipes;

Quick Review

We demonstrated:

database connectors with PyMySQL
security concerns
configuration management
running SQL queries from within Python
a “drop + create” approach to setting up database tables

These steps are the basis how we will complete our Flask recipe application.

You should be aware that this is a workflow tailored for this class and this application. In a “real world” application using a database, it’s uncommon to just drop every table whenever we need to make a change. Instagram would be useless if every photo was deleted every time a bug got fixed. But “real world” database changes involve many topics that we will mention vocabulary for, but leave undefined: change management, schema migration, provisioning, backward compatibility, forward compatibility. A database or data management course should cover these topics.³^,⁴

Footnotes

Okay, there can be an “Undo” button, but there are costs to maintaining that “Undo” button, and you (the programmer) are responsible for creating the “Undo” button. There are several schools of thought for how to handle the case where we might need to go back to a previous state of the database. Some of these are well-established and fall under the transaction control umbrella: where we begin a special scope (called a “transaction”) which we can abandon should something go wrong (imagine trying to subtract money from one account and add it to another), and we need to return to the data we had before we attempted the transaction. Other approaches are far more niche: such as attempting to version control a database entirely (e.g., Dolt, TerminusDB), but versioning has overheads and these techniques are quite niche when compared to RDBMSs like MariaDB, MySQL, Oracle, or PostgreSQL. Perhaps the harmonic mean of these two approaches is to architect an entire application around a series of events, and event sourcing and event-driven architectures provides an avenue to revisit previous states by “replaying” the sequence of events up to a given point in history.

We don’t condone this: but GitHub is a great tool if you want to figure out people’s passwords. Anyone can create a GitHub account: but most people do not have formal training in git, GitHub, or secret management. This routinely proved itself to be a recipe for disaster: and people frequently push passwords into GitHub without thinking about all the points we’ve made so far. In fact, this is such a huge problem that GitHub announced a “Secret Scanning Patterns” program: if GitHub detects a password or API key for partner sites (e.g. OpenAI) then GitHub notifies the provider immediately which credentials are compromised.

Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. “Database Systems: The Complete Book”, 2nd Edition, 2008, Pearson Education, Inc.. ISBN: 978-0131873254

⁴

Martin Kleppmann. “Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems”, 2017, O’Reilly Media, Inc.. ISBN: 978-1-449-37332-0

An Introduction to Information Infrastructure II