Docker Compose

Craft runs as a set of Docker containers orchestrated by a single docker-compose.yml file. The compose configuration uses profiles to switch between development and production modes, a dependency chain to ensure services start in the right order, and Caddy labels for automatic reverse proxy configuration.

Docker Compose profiles divide services into two groups: those that always run, and those that run only in a specific mode.

Always running (no profile):

  • db — PostgreSQL 17 (TimescaleDB HA)
  • pgai-install — one-shot container that installs the pgai extension
  • vectorizer-worker — continuously polls for rows needing embeddings

Dev profile (--profile dev):

  • api-dev — FastAPI with --reload flag, source code volume-mounted
  • web-dev — Astro dev server with HMR, source and public dirs mounted
  • docs-dev — Starlight dev server with source mounted

Prod profile (--profile prod):

  • api-prod — FastAPI with 2 Uvicorn workers, no mounts, non-root user
  • web-prod — Static Astro build served by Node
  • docs-prod — Static Starlight build

The CRAFT_MODE environment variable in .env is a convention for documentation and scripts, but the actual mode selection happens through the --profile flag:

docker compose --profile dev up -d # development
docker compose --profile prod up -d # production

The make dev and make prod targets in the Makefile wrap these commands.
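Profile membership is declared per service in docker-compose.yml. A trimmed sketch using the service names above (the image tag is illustrative, and all other settings are omitted):

services:
  db:
    image: timescale/timescaledb-ha:pg17   # no profile: always runs
  api-dev:
    profiles: ["dev"]    # started only with --profile dev
  api-prod:
    profiles: ["prod"]   # started only with --profile prod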

Why profiles instead of separate compose files


An earlier version used docker-compose.override.yml for dev settings, which is the Docker Compose convention. Profiles are better here because:

  • The shared services (db, pgai-install, vectorizer-worker) are defined once, not duplicated
  • You can see the entire service topology in a single file
  • Profile membership is declared on each service, making it obvious which mode a service belongs to
  • There is no risk of accidentally running production containers with dev overrides applied

Services declare dependencies with health check conditions, which ensures they start in the correct order:

graph TD
    DB["db\n(healthy)"] --> PGAI["pgai-install\n(completed successfully)"]
    PGAI --> VW["vectorizer-worker"]
    DB --> API["api-dev / api-prod"]
    PGAI --> API
    DB --> WEB["web-dev / web-prod"]
    DB --> DOCS["docs-dev / docs-prod"]

The db service has a health check that runs pg_isready every 10 seconds. No other service starts until PostgreSQL is accepting connections.

The pgai-install container depends on db: service_healthy. It runs python -m pgai install, which connects to the database and installs the pgai extension with vectorizer support. This container exits immediately after installation — it has restart: "no" so it does not restart after completing.

The vectorizer-worker depends on both db: service_healthy and pgai-install: service_completed_successfully. It cannot start until the extension it relies on has been installed. Once running, it polls the database every 5 seconds for rows that need embedding.

The API services depend on db: service_healthy and pgai-install: service_completed_successfully. They do not depend on the vectorizer worker — the API functions correctly without embeddings (text search still works), so there is no reason to block API startup waiting for the embedding pipeline.
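These conditions map directly onto depends_on entries. A trimmed sketch using only the service names and conditions described on this page (api-dev and the other dev services follow the same pattern):

services:
  pgai-install:
    restart: "no"   # one-shot: install the extension, then exit
    depends_on:
      db:
        condition: service_healthy
  vectorizer-worker:
    depends_on:
      db:
        condition: service_healthy
      pgai-install:
        condition: service_completed_successfully
  api-prod:
    depends_on:
      db:
        condition: service_healthy
      pgai-install:
        condition: service_completed_successfully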

Each service that needs external access declares Caddy routing rules through Docker labels. The caddy-docker-proxy plugin reads these labels and generates a Caddyfile dynamically.

All services share the CRAFT_DOMAIN (e.g., space.warehack.ing). Caddy routes based on the request path:

API (/api/* and /ws/*):

labels:
  caddy: ${CRAFT_DOMAIN}
  caddy.handle: /api/*
  caddy.handle.0_reverse_proxy: "{{upstreams 8000}}"
  caddy.handle_1: /ws/*
  caddy.handle_1.0_reverse_proxy: "{{upstreams 8000}}"

The /ws/* handler includes additional configuration for WebSocket connections:

  caddy.handle_1.0_reverse_proxy.flush_interval: "-1"
  caddy.handle_1.0_reverse_proxy.transport: http
  caddy.handle_1.0_reverse_proxy.transport.read_timeout: "0"
  caddy.handle_1.0_reverse_proxy.transport.write_timeout: "0"

Setting flush_interval to -1 disables response buffering, which is essential for streaming protocols. The zero-value timeouts prevent Caddy from closing idle WebSocket connections — the tracking WebSocket sends data at 1 Hz, but gaps during computation can exceed Caddy’s default idle timeout.

Docs (/docs/*):

labels:
  caddy: ${CRAFT_DOMAIN}
  caddy.handle_path: /docs/*
  caddy.handle_path.0_reverse_proxy: "{{upstreams 3000}}"

The handle_path directive (rather than handle) strips the /docs prefix before forwarding to the upstream. This is necessary because Astro’s base: '/docs' config adds the prefix to all generated links, but the static build output lives at the root of the dist/ directory — the serve container expects requests at /, not /docs/.

The dev mode docs service also configures streaming parameters for Vite’s HMR WebSocket.

Web (catch-all /*):

labels:
  caddy: ${CRAFT_DOMAIN}
  caddy.reverse_proxy: "{{upstreams 4321}}"

The web service uses reverse_proxy directly (not inside a handle block), so it acts as the catch-all for any path that does not match /api/*, /ws/*, or /docs/*.

The web service also handles legacy domain redirects:

  caddy_1: astrolock.warehack.ing
  caddy_1.redir: "https://${CRAFT_DOMAIN}{uri} permanent"

This issues a 301 redirect from the old domain to the current one, preserving the request path.

Caddy’s handle directive processes blocks in order and stops at the first match. The caddy-docker-proxy plugin combines labels from all containers on the same domain into a single virtual Caddyfile. Because handle /api/*, handle /ws/*, and handle_path /docs/* appear before the catch-all reverse_proxy, specific paths are routed to their respective services, and everything else falls through to the web frontend.

The compose file defines two networks:

internal — An isolated bridge network for service-to-service communication. The database, vectorizer worker, API, web, and docs services are all on this network. The database is only reachable from this network — it has no Caddy labels and no connection to the external network.

caddy — An external network (created outside this compose file) shared with the caddy-docker-proxy instance. Services that need to receive traffic from the internet are attached to both internal and caddy. The Caddy container reads their labels, discovers their IP on the caddy network, and proxies traffic to them.
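A sketch of the corresponding network definitions and attachments; the bridge driver and the exact attachment lists are assumptions consistent with the description above:

networks:
  internal:
    driver: bridge   # service-to-service traffic only
  caddy:
    external: true   # created outside this compose file, shared with caddy-docker-proxy

services:
  db:
    networks: [internal]           # never attached to the caddy network
  api-prod:
    networks: [internal, caddy]    # reaches db internally, receives proxied traffic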

This two-network design means:

  • The database is never directly exposed to the internet
  • The vectorizer worker can reach the database and the GPU endpoint but has no external attack surface
  • The API, web, and docs services can reach the database through internal and receive proxied traffic through caddy
  • The Caddy container does not need access to the database

Two named volumes persist data across container restarts:

pg-data — PostgreSQL data directory, mounted at /home/postgres/pgdata/data inside the TimescaleDB container. This volume holds all database tables, indexes, and WAL files. Destroying this volume (docker compose down -v) deletes all data.

api-data — Skyfield ephemeris and timescale data, mounted at /data inside the API container. This volume caches the DE421 planetary ephemeris (~17 MB) and delta-T files. Without this volume, Skyfield would re-download the ephemeris on every container restart.
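A sketch of how the two volumes are declared and mounted, using the paths given above (api-prod stands in for whichever API service is active):

volumes:
  pg-data:
  api-data:

services:
  db:
    volumes:
      - pg-data:/home/postgres/pgdata/data   # tables, indexes, WAL
  api-prod:
    volumes:
      - api-data:/data   # cached DE421 ephemeris and delta-T files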

In dev mode, additional bind mounts overlay source code into the containers:

| Service  | Bind mount                                             | Purpose                  |
| -------- | ------------------------------------------------------ | ------------------------ |
| api-dev  | ./packages/api/src:/app/src                            | Python source hot-reload |
| api-dev  | ./packages/api/alembic:/app/alembic                    | Migration files          |
| web-dev  | ./packages/web/src:/app/src                            | Astro source HMR         |
| web-dev  | ./packages/web/public:/app/public                      | Static assets            |
| web-dev  | ./packages/web/astro.config.mjs:/app/astro.config.mjs  | Astro config             |
| docs-dev | ./docs/src:/app/src                                    | Documentation source     |
| docs-dev | ./docs/astro.config.mjs:/app/astro.config.mjs          | Docs config              |

These mounts mean that editing source files on the host triggers automatic reloading inside the container — Uvicorn’s --reload for the API, and Vite’s HMR for the web and docs frontends.

The API dev container runs Uvicorn with --reload, which watches the mounted src/ directory for changes and restarts the server. The Astro and Starlight dev containers run their respective dev servers, which use Vite’s HMR (Hot Module Replacement) to push changes to the browser without a full page reload.

Vite’s HMR uses a WebSocket connection from the browser to the dev server. When running behind Caddy, this WebSocket needs to connect to the Caddy domain (not localhost), use WSS (not WS), and route through port 443. The VITE_HMR_HOST environment variable controls this:

  • Set (e.g., VITE_HMR_HOST=space.warehack.ing): Vite configures HMR to connect via wss://space.warehack.ing:443
  • Empty or unset: Vite auto-detects localhost settings, which is correct for local development without Caddy
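A sketch of how such a variable could be consumed in astro.config.mjs; the exact wiring in the repository may differ, and only the VITE_HMR_HOST name and the wss/443 behavior are taken from this page:

// astro.config.mjs (sketch)
import { defineConfig } from 'astro/config';

const hmrHost = process.env.VITE_HMR_HOST;

export default defineConfig({
  vite: {
    server: {
      // When running behind Caddy, point the HMR client at the public domain
      // over WSS on port 443; otherwise let Vite auto-detect localhost.
      hmr: hmrHost
        ? { protocol: 'wss', host: hmrHost, clientPort: 443 }
        : undefined,
    },
  },
});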

The dev mode Caddy labels include extended timeouts and keepalive settings to prevent Caddy from closing the HMR WebSocket during idle periods.

The API containers (both dev and prod) run a Python health check every 30 seconds:

healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 15s

The health check uses Python’s standard library to avoid depending on curl (which is not installed in the slim container image). It hits the /health endpoint directly on localhost, bypassing Caddy. The 15-second start period gives the API time to load the Skyfield ephemeris and start accepting connections.

The database health check uses pg_isready with a 10-second interval, which is the standard PostgreSQL readiness probe.
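In compose form, that probe looks roughly like this sketch; the pg_isready flags, timeout, and retry count are assumptions, and only the command and the 10-second interval come from this page:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]   # flags are illustrative
  interval: 10s
  timeout: 5s
  retries: 5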

| Service           | Restart policy | Reason                                      |
| ----------------- | -------------- | ------------------------------------------- |
| db                | unless-stopped | Database should always be running           |
| pgai-install      | no             | One-shot; should not restart after success  |
| vectorizer-worker | unless-stopped | Should recover from transient failures      |
| API, web, docs    | unless-stopped | Should survive crashes and host reboots     |

The unless-stopped policy means containers restart automatically after crashes or host reboots, but stay stopped if you explicitly stop them with docker compose stop. This is the right default for a long-running service — you want automatic recovery from failures, but you do not want stopped services resurrecting themselves when you are debugging.

The prod API container differs from dev in two ways:

  1. Non-root user: The Dockerfile creates an astrolock user and switches to it before running the application. The data directory ownership is adjusted accordingly.

  2. Multiple workers: Uvicorn runs with --workers 2 instead of --reload. The --reload flag is incompatible with multiple workers and unnecessary in production since the source code is baked into the image.

Both dev and prod pass --proxy-headers and --forwarded-allow-ips "*" to Uvicorn, which tells it to trust the X-Forwarded-For and X-Forwarded-Proto headers from Caddy. Without these flags, the API would see all requests as coming from Caddy’s internal IP rather than the real client.
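Put together, the production invocation looks roughly like the sketch below. The module path app.main:app and the host value are illustrative; port 8000 matches the Caddy upstream and health check, and the remaining flags are the ones described above:

uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000 \
  --workers 2 \
  --proxy-headers \
  --forwarded-allow-ips "*"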