SQLite is a lightweight, serverless relational database engine that stores data in a single file on disk.
It is included in Python via the built-in sqlite3 module, allowing you to create, query, and manage databases
without installing external software. SQLite is ideal for small to medium applications, prototyping, and local data storage.
In This Article You Will Learn How To:
- Create a local SQLite database.
- Define table schemas (DDL).
- Load data from CSV files into the database.
- Run SQL queries to analyze the data.
All files related to this article can be found in my GitHub repository at the following link:
https://github.com/analysislessons/Article_SQLitePractice
The repository contains the following resources:
- Three CSV files containing the data: drivers.csv, traffic_violations.csv, and violations_details.csv.
- Two code files: sqlite_practice.ipynb, which contains the same descriptions as this article, and sqlite_practice_script.py, which supports running multiple queries in a row and is the most convenient way to work through the tasks at the end of the article.
The data consists of three entities:
- Drivers: Information about individuals and their vehicle registration.
- Traffic Violations: A log of violation events linking drivers to specific offenses.
- Violations Details: A reference table for types of offenses, points, and fines.
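If you prefer not to download the repository, you can generate minimal stand-in CSV files with the same headers as the real ones. The rows below are invented purely for illustration; they are not the repository data.

```python
import csv

# Hypothetical sample rows -- invented for illustration, not the repository data.
SAMPLES = {
    "drivers.csv": [
        ["PersonID", "FirstName", "LastName", "RegistrationNumber"],
        ["1", "Ana", "Lima", "AB-123-CD"],
        ["2", "Ben", "Klein", "EF-456-GH"],
    ],
    "traffic_violations.csv": [
        ["EventID", "Date", "PersonID", "OffenseID"],
        ["10", "2024-01-15", "1", "100"],
        ["11", "2024-02-03", "2", "101"],
    ],
    "violations_details.csv": [
        ["IdOffense", "Offense", "Points", "Amount"],
        ["100", "Speeding", "5", "250"],
        ["101", "Illegal parking", "1", "60"],
    ],
}

for filename, rows in SAMPLES.items():
    with open(filename, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Each file starts with a header row, which the loader below skips when inserting data.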
import os
import csv
import sqlite3
# Set database path
DB_NAME = "traffic_violations.db"
if os.path.exists(DB_NAME):
    os.remove(DB_NAME)  # Start fresh for this tutorial
1. Defining the Table Schemas (DDL)
DDL (Data Definition Language) is used to define the structure of our database. We use the CREATE TABLE command to specify column names and data types.
# 1) Drivers table schema
DDL_DRIVERS = '''
CREATE TABLE IF NOT EXISTS drivers (
    PersonID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    RegistrationNumber TEXT NOT NULL
);
'''
# 2) Traffic violations table schema
DDL_TRAFFIC_VIOLATIONS = '''
CREATE TABLE IF NOT EXISTS traffic_violations (
    EventID INTEGER PRIMARY KEY,
    Date TEXT NOT NULL,
    PersonID INTEGER NOT NULL,
    OffenseID INTEGER NOT NULL
);
'''
# 3) Violations details table schema
DDL_VIOLATIONS_DETAILS = '''
CREATE TABLE IF NOT EXISTS violations_details (
    IdOffense INTEGER PRIMARY KEY,
    Offense TEXT NOT NULL,
    Points INTEGER NOT NULL,
    Amount INTEGER NOT NULL
);
'''
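The schemas above link the tables only by convention: matching ID columns. If you want SQLite to actually enforce those relationships, you can declare FOREIGN KEY constraints and enable them per connection (SQLite leaves them off by default). A hedged sketch of what a constrained violations table could look like, using a throwaway in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FK constraints unless this is set
conn.execute("CREATE TABLE drivers (PersonID INTEGER PRIMARY KEY, FirstName TEXT)")
conn.execute("""
CREATE TABLE traffic_violations (
    EventID INTEGER PRIMARY KEY,
    Date TEXT NOT NULL,
    PersonID INTEGER NOT NULL REFERENCES drivers(PersonID),
    OffenseID INTEGER NOT NULL
)
""")
conn.execute("INSERT INTO drivers VALUES (1, 'Ana')")
conn.execute("INSERT INTO traffic_violations VALUES (10, '2024-01-15', 1, 100)")  # OK
try:
    # PersonID 99 does not exist in drivers, so this insert is rejected
    conn.execute("INSERT INTO traffic_violations VALUES (11, '2024-01-16', 99, 100)")
except sqlite3.IntegrityError as e:
    print(f"Rejected: {e}")
conn.close()
```

The article's tables work fine without constraints, but foreign keys catch loading mistakes (such as a violation pointing at a driver who does not exist) early.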
2. Creating the Database and Tables
Connecting to a database in Python follows a standardized approach defined in PEP 249 (Python Database API Specification v2.0). This specification ensures that different database drivers (for SQLite, PostgreSQL, MySQL, etc.) provide a consistent interface for developers.
The three key elements of this process are:
- Connection Object: Established using a connect() function, it represents the link to the database file or server.
- Cursor Object: Created from the connection (conn.cursor()). The cursor is the object used to execute SQL commands and fetch results.
- Commit/Close: For databases that support transactions, changes must be committed to be saved. Finally, connections should be closed to free up system resources.
This standard is why connecting to PostgreSQL looks almost identical to SQLite. Instead of the built-in sqlite3 library, you would typically use a library like psycopg2 (or the newer psycopg), but the workflow of connect -> cursor -> execute -> commit remains exactly the same.
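To see that shared workflow in isolation, here is a minimal end-to-end round trip against a throwaway in-memory database (the table and values are invented for the example):

```python
import sqlite3

# connect -> cursor -> execute -> commit -> close, per PEP 249
conn = sqlite3.connect(":memory:")  # 1. Connection object
cur = conn.cursor()                 # 2. Cursor object
cur.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO demo (name) VALUES (?)", ("Ana",))  # parameterized insert
conn.commit()                       # 3. Persist the change
cur.execute("SELECT name FROM demo")
print(cur.fetchone()[0])            # prints: Ana
conn.close()                        # 4. Release resources
```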
Now, let's connect to our SQLite database file and execute our DDL scripts.
conn = sqlite3.connect(DB_NAME)
cur = conn.cursor()
cur.execute(DDL_DRIVERS)
cur.execute(DDL_TRAFFIC_VIOLATIONS)
cur.execute(DDL_VIOLATIONS_DETAILS)
conn.commit()
print("Database and tables created successfully.")
3. Loading Data from CSV Files
We need a helper function to read our CSV files and insert the data into the corresponding SQLite tables. We will handle basic type conversion (e.g., strings to integers) during this process.
def load_csv_to_table(conn, filepath, table_name, columns):
    """Read a comma-separated file and insert rows into the named table."""
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f'INSERT OR IGNORE INTO {table_name} ({", ".join(columns)}) VALUES ({placeholders})'
    if not os.path.exists(filepath):
        print(f"File not found: {filepath}")
        return
    rows = []
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=",")
        next(reader)  # Skip header row
        for r in reader:
            # Basic data cleaning: handle empty strings and convert types
            vals = [None if v == "" else v for v in r]
            for i, col in enumerate(columns):
                if col.lower() in ("personid", "eventid", "idoffense", "offenseid", "points", "amount"):
                    try:
                        vals[i] = int(vals[i]) if vals[i] is not None else None
                    except ValueError:
                        pass
            rows.append(tuple(vals))
    if rows:
        cur = conn.cursor()
        cur.executemany(insert_sql, rows)
        conn.commit()
        print(f"Loaded {len(rows)} rows into {table_name}.")
Now we execute the data loading for all three tables.
load_csv_to_table(conn, "drivers.csv", "drivers", ["PersonID", "FirstName", "LastName", "RegistrationNumber"])
load_csv_to_table(conn, "traffic_violations.csv", "traffic_violations", ["EventID", "Date", "PersonID", "OffenseID"])
load_csv_to_table(conn, "violations_details.csv", "violations_details", ["IdOffense", "Offense", "Points", "Amount"])
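A useful property of INSERT OR IGNORE is that re-running the loader is idempotent: a second insert with an already-present primary key is silently skipped rather than duplicated or rejected. A quick sketch of that behavior (invented table and rows, not the article's data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE drivers (PersonID INTEGER PRIMARY KEY, FirstName TEXT)")
rows = [(1, "Ana"), (2, "Ben")]
cur.executemany("INSERT OR IGNORE INTO drivers VALUES (?, ?)", rows)
cur.executemany("INSERT OR IGNORE INTO drivers VALUES (?, ?)", rows)  # re-run: duplicates ignored
cur.execute("SELECT COUNT(*) FROM drivers")
print(cur.fetchone()[0])  # prints: 2, not 4
conn.close()
```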
4. Verifying the Data
Let us check how many rows were actually inserted into each table.
for tbl in ("drivers", "traffic_violations", "violations_details"):
    cur.execute(f'SELECT COUNT(*) FROM {tbl}')
    cnt = cur.fetchone()[0]
    print(f"{tbl}: {cnt} rows")
Interactive SQL Console
The following code allows the execution of a single custom query. In the attached sqlite_practice_script.py, there is a while loop that enables running multiple queries, not just one; however, in a .ipynb file, this approach does not always work reliably.
Instructions:
- Type your SQL query in the input box.
- Type exit to stop the loop and move to the next section.
print("\n" + "=" * 40)
print("Type your SQL query and press Enter.")
print("=" * 40)
query_input = input("SQL> ")
try:
    cur.execute(query_input)
    if query_input.strip().lower().startswith("select"):
        rows = cur.fetchall()
        cols = [d[0] for d in cur.description] if cur.description else []
        if cols:
            print(" | ".join(cols))
            print("-" * 40)
        for r in rows:
            print(r)
    else:
        conn.commit()
        print(f"Query executed. Rows affected: {cur.rowcount}")
except Exception as e:
    print(f"Error: {e}")
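One caution about the console approach: it executes raw input, which is fine for an interactive learning tool, but in application code you should never build SQL by string-formatting user values into the query. That invites SQL injection and quoting bugs. When the query shape is fixed and only the values vary, pass them as parameters instead, as in this small sketch (invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE drivers (PersonID INTEGER PRIMARY KEY, LastName TEXT)")
cur.executemany("INSERT INTO drivers VALUES (?, ?)", [(1, "O'Brien"), (2, "Lima")])

last_name = "O'Brien"  # the apostrophe would break naive f-string SQL
cur.execute("SELECT PersonID FROM drivers WHERE LastName = ?", (last_name,))
print(cur.fetchone()[0])  # prints: 1
conn.close()
```

The `?` placeholder lets the driver handle quoting and escaping, so values containing quotes (or malicious SQL) are treated as plain data.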
Practice Tasks
Use the Interactive SQL Console above, either by running sqlite_practice_script.py or sqlite_practice.ipynb, to solve these tasks. Check your answers against the solutions provided below.
Task 1: Basic Retrieval
Goal: List all drivers' first and last names, sorted alphabetically by their last name.
SELECT FirstName, LastName FROM drivers ORDER BY LastName ASC;
Task 2: Filtering Data
Goal: Find all types of offenses that result in 5 or more penalty points.
SELECT Offense, Points FROM violations_details WHERE Points >= 5;
Task 3: Aggregation
Goal: Calculate the total number of traffic violation events recorded in the system.
SELECT COUNT(*) FROM traffic_violations;
Task 4: Simple Join
Goal: List the dates of violations along with the registration numbers of the cars involved.
SELECT v.Date, d.RegistrationNumber
FROM traffic_violations v
JOIN drivers d ON v.PersonID = d.PersonID;
Task 5: The Challenge
Goal: Find the top 5 drivers who have accumulated the highest total number of penalty points.
SELECT d.FirstName, d.LastName, SUM(det.Points) AS TotalPoints
FROM traffic_violations v
JOIN drivers d ON v.PersonID = d.PersonID
JOIN violations_details det ON v.OffenseID = det.IdOffense
GROUP BY d.PersonID
ORDER BY TotalPoints DESC
LIMIT 5;
Task 6: Grouping and Having
Goal: For each type of offense, find the total amount of fines collected, but only for offenses that have occurred more than 5 times.
SELECT det.Offense, COUNT(v.EventID) AS Occurrences, SUM(det.Amount) AS TotalFines
FROM traffic_violations v
JOIN violations_details det ON v.OffenseID = det.IdOffense
GROUP BY det.Offense
HAVING Occurrences > 5;
5. Querying the Database with Pandas
We can also run queries with the pandas library, which handles the cursor and fetch steps internally. This means less code, results displayed as a neatly formatted pandas DataFrame, and a convenient starting point for further analysis.
query = '''
SELECT
v.Date,
d.FirstName,
d.LastName,
det.Offense,
det.Points,
det.Amount
FROM traffic_violations v
JOIN drivers d ON v.PersonID = d.PersonID
JOIN violations_details det ON v.OffenseID = det.IdOffense
LIMIT 10;
'''
import pandas as pd

df = pd.read_sql_query(query, conn)
display(df)  # display() is available in Jupyter; use print(df) in a plain script
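Once the rows are in a DataFrame, further analysis is ordinary pandas, and per the commit/close rule from earlier, the connection should be closed when you are done. A self-contained sketch using an invented mini-dataset (not the repository data), aggregating fines per offense on the pandas side instead of in SQL:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE violations (Offense TEXT, Amount INTEGER)")
conn.executemany(
    "INSERT INTO violations VALUES (?, ?)",
    [("Speeding", 250), ("Speeding", 250), ("Illegal parking", 60)],
)
conn.commit()

df = pd.read_sql_query("SELECT Offense, Amount FROM violations", conn)
totals = df.groupby("Offense")["Amount"].sum()  # aggregate in pandas instead of SQL
print(totals.to_dict())  # total fines per offense type
conn.close()  # free resources once analysis is done
```

Whether to aggregate in SQL (GROUP BY) or in pandas (groupby) is mostly a matter of data size: for large tables, letting the database do the work avoids pulling every row into memory.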