How to set up Airflow 2.0 with PostgreSQL

Setting up Airflow + Postgres was surprisingly more difficult (and less documented) than I imagined. In this post, I document all the steps taken to build and connect the Airflow and Postgres containers, initialize the services, and start the Airflow web user interface.

Goal

In this guide, we will run Airflow and Postgres in Docker containers, connect the two, and get to the login page.

We will use the latest versions as of April 2021:

  • Airflow 2.0.1
  • PostgreSQL 13.2

Prerequisites

  • An Ubuntu VM with Docker installed (you can follow this guide)
  • Working knowledge of Docker
  • A DockerHub account

Step 1. Get docker images

Unless otherwise noted, all commands in this post are run on the VM.

For PostgreSQL, we will use the official docker image:

docker pull postgres:13.2

For Airflow, we will use this Dockerfile:

FROM python:3.8-slim-buster
RUN apt-get update && apt-get install -y vim
RUN \
    pip install apache-airflow[docker,amazon]==2.0.1 --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.0.1/constraints-3.8.txt \
    && pip install psycopg2-binary
ENV AIRFLOW_HOME=/airflow
CMD ["airflow"]

We're also installing a few extras: the psycopg2-binary package is required for Airflow to be able to talk to Postgres, and the two provider packages docker and amazon are optional but needed for my particular project.

Build it like this:

docker build --tag airflow .
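Once the postgres container is up (Step 3), the psycopg2 install can be smoke-tested from Python. This is a sketch: make_dsn is a hypothetical helper, and the host/credentials are the ones used later in this guide.

```python
def make_dsn(user, password, host, port, dbname):
    """Build a libpq-style DSN string for psycopg2.connect()."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

dsn = make_dsn("airflow", "airflow", "postgres", 5432, "airflow")
print(dsn)

# With a server running (and psycopg2-binary installed):
# import psycopg2
# with psycopg2.connect(dsn) as conn:
#     with conn.cursor() as cur:
#         cur.execute("SELECT version()")
#         print(cur.fetchone()[0])
```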

Step 2. Create a docker network

We will have 2 airflow containers that need to connect to a postgres container. How do the airflow containers find the postgres container? By creating a network and connecting the containers to the network. With this setup, the containers can discover each other by name.

Create a network:

docker network create af-pg-net

We can now connect containers to this af-pg-net when starting them:

docker run --network=af-pg-net [...] IMAGE

or, to connect a running container to a network:

docker network connect af-pg-net CONTAINER

To view existing networks:

$ docker network ls
NETWORK ID     NAME        DRIVER    SCOPE
eff5ba9e7c9e   af-pg-net   bridge    local
...


Step 3. Start postgres and create a database for airflow

In this step, we will start the postgres container, and configure it so that:

  • airflow can access postgres (user name and password)
  • airflow has a database where it can create/maintain tables it needs

Step 3a. Start postgres

First, we start the postgres container. Here's the full command.

docker run --rm -d --name postgres -e POSTGRES_PASSWORD=airflow2104 --network af-pg-net -p 5432:5432 -v ~/data:/var/lib/postgresql/data postgres:13.2

Details:

  • --network af-pg-net Connect this container to the af-pg-net network. The airflow containers will also be connected to this same network, so they can find postgres by host name (the name of the container).
  • --name postgres Name of the container. Other containers (on the same network) can find this container by its name postgres.
  • -p 5432:5432 We expose the port postgres listens on inside the container (the RHS) on the same port number on the host (the LHS).

With the 3 options above, other containers can now find this container at postgres:5432 on the af-pg-net network.

We also specify these options:

  • -v ~/data:/var/lib/postgresql/data The RHS (/var/lib/postgresql/data inside the container) is where postgres stores everything -- tables, settings, etc. We can delete/lose the container (which is the point of using Docker!) but we must not lose this folder. We will map it to ~/data on the host, which we can choose to back up by, for example, taking snapshots (not covered in this post).

  • -e POSTGRES_PASSWORD=airflow2104 This is a required environment variable for the postgres image.

    • One way to avoid typing it on the command line is to keep it in a file like ~/.pgpass and read it from there: docker run -e POSTGRES_PASSWORD=$(cat ~/.pgpass) ....
    • If we don't specify it, we get an error message:
      $ docker run --rm -it postgres:13.2
      Error: Database is uninitialized and superuser password is not specified.
             You must specify POSTGRES_PASSWORD to a non-empty value for the
             superuser. For example, "-e POSTGRES_PASSWORD=password" on "docker run".
             You may also use "POSTGRES_HOST_AUTH_METHOD=trust" to allow all
             connections without a password. This is *not* recommended.
             ...
      
  • --rm Automatically remove the container when it exits. The important stuff (the database data) will be persisted on the host with the -v flag.

  • -d Detach immediately and run in background.
  • postgres:13.2 This is the image we are going to run.
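Putting those flags together, the full invocation can be assembled programmatically -- a small Python sketch, handy for keeping the long command in a script (the flag order matches the command above):

```python
# Assemble the docker run invocation from the flags explained above.
flags = [
    "--rm",                    # remove the container on exit
    "-d",                      # run detached, in the background
    "--name", "postgres",      # other containers find it by this name
    "-e", "POSTGRES_PASSWORD=airflow2104",
    "--network", "af-pg-net",  # join the shared network
    "-p", "5432:5432",         # host:container port mapping
    "-v", "~/data:/var/lib/postgresql/data",  # persist the data directory
]
cmd = " ".join(["docker", "run", *flags, "postgres:13.2"])
print(cmd)
```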

Ok, let's actually run it:

$ docker run --rm -d --name postgres -e POSTGRES_PASSWORD=airflow2104 --network af-pg-net -p 5432:5432 -v ~/data:/var/lib/postgresql/data postgres:13.2
a658b72b44349ee3e2570dbf85e3547b1ed2589637d9e758bd00cc7518f126a0

$ docker ps
CONTAINER ID   IMAGE           COMMAND                  CREATED              STATUS              PORTS                    NAMES
a658b72b4434   postgres:13.2   "docker-entrypoint.s…"   About a minute ago   Up About a minute   0.0.0.0:5432->5432/tcp   postgres

It's running!

Let's take a look at the data folder:

$ cd ~/data
-bash: cd: data: Permission denied

$ ls -la ~
...
drwx------ 19 systemd-coredump root   4096 Apr 13 07:48 data
...

As you can see, the owner/group of this folder looks odd. This is because of how the postgres image's Dockerfile is written. If you want to back up this folder, you'll likely need to fix its owner/permissions. This stackoverflow answer may help, but I have not tried it.

Just for fun, let's peek inside with sudo:

$ sudo ls ~/data  # enter your sudo password
PG_VERSION  pg_commit_ts  pg_ident.conf  pg_notify    pg_snapshots  pg_subtrans  pg_wal                postgresql.conf
base        pg_dynshmem   pg_logical     pg_replslot  pg_stat       pg_tblspc    pg_xact               postmaster.opts
global      pg_hba.conf   pg_multixact   pg_serial    pg_stat_tmp   pg_twophase  postgresql.auto.conf  postmaster.pid

Ok, let's try connecting to postgres using the psql client:

$ psql -h localhost -p 5432 -U postgres
Password for user postgres: 
psql (12.6 (Ubuntu 12.6-0ubuntu0.20.04.1), server 13.2 (Debian 13.2-1.pgdg100+1))
WARNING: psql major version 12, server major version 13.
         Some psql features might not work.
Type "help" for help.

postgres=#

Now we're in the psql repl:

postgres=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of 
-----------+------------------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}

postgres=# \dt
Did not find any relations.
postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(3 rows)

postgres=# \q
$

Fun (and potentially scary) fact: you can access this postgres from your laptop by replacing -h localhost with your Linode's IP address, like: -h 123.123.123.123. Make sure to use a strong password!

A quick review of psql commands:

  command               what it does
  \l                    list databases
  \c DBNAME             connect to database
  \c DBNAME USER        connect to database as user
  \c postgres postgres  back to the default database and user (postgres)
  \dt                   list tables and relations in the current database
  \du                   list users (roles)
  \d+ TABLE             show table schema
  \q                    quit the psql REPL

To reset a user's password, log in as an admin user, then issue \password:

$ psql -h localhost -p 5432 -U postgres

postgres=# \password USER
Enter new password: 
Enter it again:

postgres=# \q

Step 3b. Create a database for airflow

Start the psql client:

$ psql -h localhost -p 5432 -U postgres
Password for user postgres: 
...

Once connected, create a user and a database for airflow.

postgres=# CREATE USER airflow PASSWORD 'airflow';
CREATE ROLE

postgres=# CREATE DATABASE airflow;
CREATE DATABASE

postgres=# GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;
GRANT

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 airflow   | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =Tc/postgres         +
           |          |          |            |            | postgres=CTc/postgres+
           |          |          |            |            | airflow=CTc/postgres
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(4 rows)

postgres=# \q
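The same bootstrap can be scripted instead of typed interactively. Here is a hedged sketch using psycopg2 -- the statements are exactly the ones above; note that CREATE DATABASE cannot run inside a transaction, hence autocommit:

```python
# The three statements from the psql session above.
BOOTSTRAP_SQL = [
    "CREATE USER airflow PASSWORD 'airflow'",
    "CREATE DATABASE airflow",
    "GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow",
]

for stmt in BOOTSTRAP_SQL:
    print(stmt + ";")

# Against a running server (requires psycopg2-binary):
# import psycopg2
# conn = psycopg2.connect(host="localhost", port=5432,
#                         user="postgres", password="airflow2104")
# conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
# with conn.cursor() as cur:
#     for stmt in BOOTSTRAP_SQL:
#         cur.execute(stmt)
```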

Step 4. Start Airflow

Step 4a. Initialize $AIRFLOW_HOME

Create an empty folder at ~/airflow. We will mount it into the airflow container at /airflow, which is the default $AIRFLOW_HOME directory.

$ mkdir ~/airflow
$ ls ~/airflow  # empty

Run the airflow container without any command arguments. It will complain, but it populates the /airflow directory with a few files.

$ docker run --rm -it -v ~/airflow:/airflow airflow
usage: airflow [-h] GROUP_OR_COMMAND ...
...
airflow command error: the following arguments are required: GROUP_OR_COMMAND, see help above.

$ ls ~/airflow  # we now have config files!
airflow.cfg  airflow.db  logs  unittests.cfg  webserver_config.py

Step 4b. Modify airflow.cfg

Open ~/airflow/airflow.cfg in an editor (with sudo) and edit it as follows:

$ cat ~/airflow/airflow.cfg
...
executor = LocalExecutor
...
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
...
load_examples = False
...

Notes:

  • We switch to LocalExecutor so DAG runs can be parallelized.

  • About the sql connection string sql_alchemy_conn:

    • The format is: postgresql+psycopg2://<user>:<password>@<host>:<port>/<db>.

    • For <host> we use postgres, which is the name of the postgres container. As long as both the postgres and airflow containers are on the same network (docker run --net af-pg-net [...]), they can find each other by host name (link)

  • Turn off the load_examples flag now: I didn't find the examples particularly useful for learning, they clutter the web UI, and most importantly, you need to stop/start the airflow scheduler for changes to airflow.cfg to take effect -- something you don't want to do while DAGs are running.
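To double-check that the connection string is well-formed, it can be pulled apart with Python's standard library (a small sketch; urlsplit handles the + in the scheme):

```python
from urllib.parse import urlsplit

# The sql_alchemy_conn value from airflow.cfg above.
conn = "postgresql+psycopg2://airflow:airflow@postgres:5432/airflow"

parts = urlsplit(conn)
print(parts.scheme)            # postgresql+psycopg2  (dialect+driver)
print(parts.username)          # airflow
print(parts.hostname)          # postgres  (container name on af-pg-net)
print(parts.port)              # 5432
print(parts.path.lstrip("/"))  # airflow  (database name)
```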

Now we can initialize the airflow db with airflow db init (docs):

$ docker run --rm -it -v ~/airflow:/airflow --net af-pg-net airflow db init
DB: postgresql+psycopg2://airflow:***@postgres:5432/airflow
[2021-04-15 06:33:57,328] {db.py:674} INFO - Creating tables
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
...
...
Initialization done

Let's take a look in postgres to see what airflow created:

$ psql -h localhost -p 5432 -U airflow airflow
Password for user airflow: 
...

airflow=> \dt
                    List of relations
 Schema |             Name              | Type  |  Owner  
--------+-------------------------------+-------+---------
 public | ab_permission                 | table | airflow
 public | ab_permission_view            | table | airflow
 public | ab_permission_view_role       | table | airflow
 public | ab_register_user              | table | airflow
 public | ab_role                       | table | airflow
 public | ab_user                       | table | airflow
 public | ab_user_role                  | table | airflow
 public | ab_view_menu                  | table | airflow
 public | alembic_version               | table | airflow
 public | connection                    | table | airflow
 public | dag                           | table | airflow
 public | dag_code                      | table | airflow
 public | dag_pickle                    | table | airflow
 public | dag_run                       | table | airflow
 public | dag_tag                       | table | airflow
 public | import_error                  | table | airflow
 public | job                           | table | airflow
 public | log                           | table | airflow
 public | rendered_task_instance_fields | table | airflow
 public | sensor_instance               | table | airflow
 public | serialized_dag                | table | airflow
 public | sla_miss                      | table | airflow
 public | slot_pool                     | table | airflow
 public | task_fail                     | table | airflow
 public | task_instance                 | table | airflow
 public | task_reschedule               | table | airflow
 public | variable                      | table | airflow
 public | xcom                          | table | airflow
(28 rows)

airflow=> \q

We also execute the airflow users create command to create the admin account.

$ docker run --rm -it -v ~/airflow:/airflow --net af-pg-net airflow users create -u admin -e your@email.com -f First -l Last -r Admin -p $(cat ~/.afpass)
...
Admin user admin created

Step 4c. Start airflow

Finally, we can start the webserver:

$ docker run -d --rm -it -p 8080:8080 -v ~/airflow:/airflow --name airflow_webserver --net af-pg-net airflow webserver

... and the scheduler:

$ docker run -d --rm -it -v ~/airflow:/airflow -v /var/run/docker.sock:/var/run/docker.sock --name airflow_scheduler --net af-pg-net airflow scheduler

In the command above, we mount /var/run/docker.sock so that the scheduler can talk to the host's Docker daemon -- this is needed if your DAGs use the DockerOperator from the docker provider package we installed in Step 1.

Running containers are:

$ docker ps -a
CONTAINER ID   IMAGE           COMMAND                  CREATED          STATUS          PORTS                    NAMES
bf8aebfbd654   airflow         "airflow scheduler"      3 seconds ago    Up 2 seconds                             airflow_scheduler
39fb37e82b89   airflow         "airflow webserver"      11 seconds ago   Up 10 seconds   0.0.0.0:8080->8080/tcp   airflow_webserver
06d6ad7fd1e0   postgres:13.2   "docker-entrypoint.s…"   24 hours ago     Up 24 hours     0.0.0.0:5432->5432/tcp   postgres

🎉 And finally, access the web interface at <IP>:8080.