Nikki Everett

12 Jun 2023

Transforming Chicago traffic crash data with Prefect and dbt

I put together a data engineering and analysis project using data I had already started exploring from the Chicago Data Portal: traffic crash data. (Originally, I wanted to see whether there was any correlation between accident severity and particular makes, models, or years of car, specifically newer makes and models with higher profiles and decreased visibility.)

I structured the first version of this project as a series of tasks, with each task in a separate directory holding a handful of Python scripts (setup, test, etc.) and a Makefile to coordinate their execution. To execute all the tasks, I switched into each directory and ran make; I intended to write a higher-level script to automate that bit later. Instead, I decided to refactor the project with Prefect Core and dbt Core to get more experience with industry-standard tools.

Prefect

I’ve enjoyed using Prefect for orchestration: it’s got a lot of nice features without being too configuration-heavy (looking at you, Dagster). It was easy to start small with one Python file for tasks and flows, then add functionality as I needed it (a common.py file for commonly used functions, a deployment.py to define a deployment schedule). I’ve mostly been running Prefect from the command line, but it’s been nice to fire up the server and see all my runs and scheduled deployments.
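As a rough idea, here’s what that deployment.py might look like, assuming Prefect 2.x; the deployment name, cron string, and flows module are placeholders rather than the project’s actual code (the etl flow itself is sketched after the task list below):

```python
from prefect.deployments import Deployment
# Note: this import path has moved between Prefect 2 releases.
from prefect.server.schemas.schedules import CronSchedule

from flows import etl  # hypothetical module containing the flow

deployment = Deployment.build_from_flow(
    flow=etl,
    name="daily-crash-etl",                   # placeholder deployment name
    schedule=CronSchedule(cron="0 6 * * *"),  # run daily at 6am
)

if __name__ == "__main__":
    deployment.apply()  # register the deployment with the Prefect server
```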

I defined the following tasks and flows in Prefect:

  • Download CSVs of crashes, vehicles, and people data from the Chicago Open Data Portal website. (There’s a JSON API, but for the purposes of this project, I found it faster and easier to just download the CSVs.)
  • Load data from the CSVs into raw tables.
  • Kick off a dbt run to build staging tables and the analytics views layered on top of them.
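Put together, the whole pipeline looks roughly like the sketch below. This is illustrative rather than the project’s actual code: the dataset URL, file paths, and function names are placeholders, and it assumes Prefect 2.x with requests installed.

```python
import subprocess
from pathlib import Path

import requests
from prefect import flow, task

# Placeholder: the real project downloads crashes, vehicles, and people CSVs
# from the Chicago Open Data Portal.
DATASETS = {
    "crashes": "https://data.cityofchicago.org/",  # placeholder URL
}


@task(retries=2)
def download_csv(name: str, url: str) -> Path:
    """Download one dataset's CSV into a local data/ directory."""
    dest = Path("data") / f"{name}.csv"
    dest.parent.mkdir(exist_ok=True)
    response = requests.get(url, timeout=600)
    response.raise_for_status()
    dest.write_bytes(response.content)
    return dest


@task
def load_raw_table(csv_path: Path) -> None:
    """Load a CSV into its raw table (details depend on the database)."""
    ...


@task
def run_dbt() -> None:
    """Kick off dbt to build the staging tables and analytics views."""
    subprocess.run(["dbt", "run"], check=True)


@flow
def etl() -> None:
    for name, url in DATASETS.items():
        load_raw_table(download_csv(name, url))
    run_dbt()


if __name__ == "__main__":
    etl()
```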

dbt

I originally intended this project to be an exercise in making an (almost) entirely SQL-based ETL pipeline to sharpen my SQL skills, so dbt felt like a natural fit when it came time to refactor. Things I’ve liked about dbt:

  • It is very easy to define models and their relationships to each other and see how those models change over time.
  • Testing is a breeze, and actually kind of fun.
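For instance, a couple of schema tests take just a few lines of YAML in a model properties file (the model and column names here are placeholders, not this project’s actual models):

```yaml
version: 2

models:
  - name: stg_crashes          # placeholder model name
    columns:
      - name: crash_record_id
        tests:
          - unique
          - not_null
```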

Things that have been a little funky:

  • If you create a YAML config file in the /models directory, it overrides the main YAML config file in the project’s top-level directory. That makes sense, except that the mere presence of a /models config seems to blow away unrelated settings defined in the top-level config. For example: in my top-level config, I’d set the default materialization for /models/base_tables to table, but adding a model-level config file with an unrelated setting (in my case, a source definition) caused those models to be materialized as views. My guess is that the model-level config file pulls in a bunch of default settings at the model level, which then override the project-level settings. Not the biggest deal, since I just set the materialization at the top of those model files, but it was still confusing to debug. (I also assume I can set the materialization for those models in the /models config file, but I’m having trouble getting the syntax right; my best guess is sketched below.)
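For the record, here’s my best guess at that syntax, based on the dbt docs for model properties files; I haven’t verified it in this project, and the model name is a placeholder:

```yaml
# models/base_tables/schema.yml
version: 2

models:
  - name: base_crashes         # placeholder model name
    config:
      materialized: table
```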

Future work

Lots of stuff!

  • Use the JSON API instead of downloading CSVs.
  • Add more tests, for both Prefect tasks/flows and dbt models.
  • Add more analytics models to dbt.