Complete Guide to Docker Healthchecks and Restart Policies

Your Jellyfin container is running. Docker reports it as “Up”. But the web UI returns a blank page and nobody can stream anything. Docker’s default “running” status only tells you the process hasn’t crashed; it says nothing about whether the service actually works.

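Docker healthchecks fix this. They let you define what “healthy” actually means for each container. Combined with restart policies and health-aware dependencies, they form the basis of a self-healing setup where crashed services come back on their own, without you waking up at 3 AM. One caveat up front: on a standalone Docker host, a restart policy reacts to the container exiting, not to an unhealthy status, so an unhealthy container still needs something (you, a watcher script, or an orchestrator) to restart it.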

This guide covers how healthchecks work, how to write good ones, the restart policy options, and how to wire them together in Docker Compose for a resilient self-hosted stack.

Why Default Docker Status Isn’t Enough

By default, Docker tracks one thing: is the main process (PID 1) running? If yes, the container is “running.” If not, it exited.

This misses a huge category of failures:

  • Deadlocked processes — the app is running but frozen, not accepting connections
  • Database connection loss — the app started but can’t reach its database
  • OOM degradation — the container is alive but thrashing on memory, effectively unusable
  • Config errors — the service started but loaded bad config and returns 500s on every request
  • Port conflicts — the process is running but not bound to the expected port

Healthchecks let you probe the actual service behavior, not just process existence.

How Docker Healthchecks Work

A healthcheck is a command that Docker runs inside the container at regular intervals. Based on the exit code, Docker marks the container as one of three states:

State       Meaning
starting    Container just started, still within the start period
healthy     Most recent healthcheck command returned exit code 0
unhealthy   Healthcheck failed on the configured number of consecutive retries

The key parameters:

  • test — The command to run
  • interval — How often to run the check (default: 30s)
  • timeout — How long to wait for the check to complete (default: 30s)
  • retries — How many consecutive failures before marking unhealthy (default: 3)
  • start_period — Grace period after container start where failures don’t count (default: 0s)
  • start_interval — Check interval during the start period (Docker 25+, default: 5s)
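These parameters together decide how long a broken service can go unnoticed. As a rough sketch (an upper bound under stated assumptions, not an exact model of Docker's scheduler): if the service breaks right after a passing check and every later check runs into its timeout, the container turns unhealthy after about retries × (interval + timeout) seconds. The function name detection_window is made up for this example.

```shell
#!/bin/sh
# Upper bound, in seconds, on how long a failure can go unnoticed.
# Assumes the service breaks right after a passing check and every
# following check runs until it hits the timeout.
# Usage: detection_window INTERVAL TIMEOUT RETRIES  (all in seconds)
detection_window() {
  interval=$1 timeout=$2 retries=$3
  echo $(( retries * (interval + timeout) ))
}

# Example values: interval 30s, timeout 10s, retries 3
detection_window 30 10 3   # prints 120
```

With Docker's defaults (30s interval, 30s timeout, 3 retries) the same bound comes out at 180 seconds, which is worth keeping in mind before relying on defaults for a critical service.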

Writing Healthchecks in Docker Compose

Here’s the basic syntax in a docker-compose.yml:

services:
  myapp:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

The test Field Formats

You have three options for defining the test command:

# Option 1: CMD — runs the command directly (preferred)
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]

# Option 2: CMD-SHELL — runs through /bin/sh (supports pipes, redirects)
test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]

# Option 3: Shell string shorthand (equivalent to CMD-SHELL)
test: curl -f http://localhost:8080 || exit 1

Use CMD when the command is simple. Use CMD-SHELL when you need shell features like ||, &&, pipes, or variable expansion.
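The difference is easy to see outside Docker. The sketch below is only an illustration of argument handling in a plain shell (count_args is a helper invented for this example): in exec form every list item is handed to the program as-is, while in shell form the string is parsed first.

```shell
#!/bin/sh
# Exec form (CMD): each list item reaches the program untouched, so a
# "||" in the list would be one more argument, not an operator.
count_args() { echo "$#"; }
count_args curl -f http://localhost:8080 '||' exit 1   # prints 6

# Shell form (CMD-SHELL): the string is parsed by sh, so "||" works.
sh -c 'false || echo "fallback ran"'   # prints: fallback ran
```

This is why putting "||" or "$VAR" into a CMD list fails quietly: the program receives the literal text.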

The curl Problem

Many minimal Docker images don’t include curl. You’ll get a healthcheck failure not because the service is down, but because the check tool doesn’t exist. Alternatives:

# wget (available in Alpine-based images)
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]

# Using /dev/tcp (no HTTP client needed, but the image must include bash; only confirms the port accepts connections)
test: ["CMD-SHELL", "bash -c '</dev/tcp/localhost/8080' || exit 1"]

# For PostgreSQL
test: ["CMD-SHELL", "pg_isready -U postgres"]

# For Redis
test: ["CMD-SHELL", "redis-cli ping | grep -q PONG"]

# For MariaDB/MySQL
test: ["CMD-SHELL", "mariadb-admin ping -h localhost -u root --password=$$MYSQL_ROOT_PASSWORD || exit 1"]

The double $$ in Compose files escapes the $ sign so the variable is expanded inside the container, not by Compose.
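If you reuse one check across several images, a small wrapper script can pick whichever HTTP client is present. This is a minimal sketch, assuming you copy or mount it into the image yourself; the function name http_healthcheck and the idea of shipping it as a script are illustrative, not something the images provide.

```shell
#!/bin/sh
# Probe a URL with whichever HTTP client the image ships.
# Returns 0 (healthy) if the request succeeds, non-zero otherwise.
http_healthcheck() {
  url=$1
  if command -v curl >/dev/null 2>&1; then
    # -f: fail on HTTP errors, -sS: quiet but still print errors
    curl -fsS -o /dev/null "$url"
  elif command -v wget >/dev/null 2>&1; then
    # --spider: check the URL without saving the response
    wget -q --spider "$url"
  else
    echo "healthcheck: neither curl nor wget found" >&2
    return 1
  fi
}

# When saved as a script, finish with a call such as:
#   http_healthcheck "http://localhost:8080/health"
```

The missing-tool case returns a failure with a clear message, so the healthcheck log tells you the check is broken rather than the service.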

Here are healthchecks for services you’re probably running. Treat them as starting points and confirm that each image actually ships the tool the check calls:

Databases

# PostgreSQL
postgres:
  image: postgres:16
  environment:
    POSTGRES_PASSWORD: secretpassword
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U postgres"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s

# MariaDB
mariadb:
  image: mariadb:11
  environment:
    MYSQL_ROOT_PASSWORD: secretpassword
  healthcheck:
    test: ["CMD-SHELL", "mariadb-admin ping -h localhost -u root --password=$$MYSQL_ROOT_PASSWORD"]
    interval: 10s
    timeout: 5s
    retries: 5
    start_period: 30s

# Redis
redis:
  image: redis:7-alpine
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 5s
    retries: 3

Web Applications

# Jellyfin
jellyfin:
  image: jellyfin/jellyfin:latest
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 60s

# Nextcloud
nextcloud:
  image: nextcloud:latest
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost/status.php"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 120s

# Vaultwarden
vaultwarden:
  image: vaultwarden/server:latest
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:80/alive"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 20s

Reverse Proxies

# Nginx Proxy Manager
npm:
  image: jc21/nginx-proxy-manager:latest
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:81/api/ || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 30s

# Caddy
caddy:
  image: caddy:2-alpine
  healthcheck:
    test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:2019/config/"]
    interval: 30s
    timeout: 10s
    retries: 3

Docker Restart Policies Explained

Restart policies tell Docker what to do when a container stops. There are four options:

no (default)

restart: "no"

Docker does nothing when the container exits. You have to start it manually. This is the default — which means if you haven’t set a restart policy, your containers won’t survive a server reboot.

always

restart: always

Docker restarts the container no matter what — whether it exited cleanly (code 0), crashed (non-zero), or the Docker daemon restarted (like after a reboot). The only way to stop it is docker stop, and even then it restarts when the daemon starts again.

Best for: Critical services that must always run (reverse proxy, databases, DNS).

unless-stopped

restart: unless-stopped

Same as always, except Docker won’t restart it after a daemon restart if you manually stopped it with docker stop. This is the sweet spot for most self-hosted services.

Best for: Most services. Survives reboots, but respects manual stops.

on-failure

restart: on-failure:5

Only restarts if the container exits with a non-zero exit code. The optional number limits restart attempts. After 5 failures, Docker gives up.

Best for: Batch jobs, migration scripts, or containers that should run once and exit cleanly.

Which Restart Policy Should You Use?

For self-hosting, the answer is almost always unless-stopped. Here’s the decision tree:

  1. Critical infrastructure (reverse proxy, DNS, database) → always
  2. Normal services (Jellyfin, Nextcloud, Gitea) → unless-stopped
  3. One-shot tasks (database migrations, backup scripts) → on-failure or no
  4. Development/testing → no

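To apply the decision list above to an existing host, it helps to find the containers that have no restart policy at all. A minimal sketch, assuming input lines of the form "name policy" (the function name without_restart_policy is invented here; the docker inspect template in the comment is one way to produce that input):

```shell
#!/bin/sh
# Read "name policy" pairs on stdin and print the containers whose
# restart policy is "no" (or missing), i.e. the ones that stay down
# after a crash or reboot.
without_restart_policy() {
  while read -r name policy; do
    [ -n "$name" ] || continue
    case "$policy" in
      ""|no) echo "${name#/}" ;;   # docker inspect prefixes names with "/"
    esac
  done
}

# Real input would come from Docker, for example:
#   docker ps -aq | xargs docker inspect \
#     --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' | without_restart_policy
printf '%s\n' '/jellyfin unless-stopped' '/backup no' '/caddy always' \
  | without_restart_policy   # prints: backup
```

One-shot containers legitimately show up in this list, so read the output as a prompt to decide, not as a list of mistakes.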
Combining Healthchecks with depends_on

The real power comes from wiring healthchecks into service dependencies. Without healthchecks, depends_on only waits for the container to start — not for the service to be ready:

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secretpassword
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped

  app:
    image: myapp:latest
    depends_on:
      postgres:
        condition: service_healthy
    restart: unless-stopped

The condition: service_healthy directive tells Compose to wait until PostgreSQL’s healthcheck passes before starting the app. Without this, your app would start immediately, try to connect to a database that’s still initializing, and crash.

Common depends_on Conditions

depends_on:
  db:
    condition: service_healthy    # Wait for healthcheck to pass
  cache:
    condition: service_started    # Just wait for container start (default)
  migrations:
    condition: service_completed_successfully  # Wait for exit code 0

The service_completed_successfully condition is useful for one-shot init containers like database migrations that need to finish before the app starts.
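Outside Compose (in a deploy script, for instance) you can get a similar effect to service_healthy by polling the health status yourself. This is a sketch of the idea, not how Compose implements it; the function name wait_healthy and its default timings are assumptions.

```shell
#!/bin/sh
# Poll a container's health status until it is "healthy".
# Returns 0 when healthy, 1 on "unhealthy" or when the deadline passes.
# Usage: wait_healthy CONTAINER [TIMEOUT_SECONDS] [POLL_SECONDS]
wait_healthy() {
  container=$1 deadline=${2:-60} poll=${3:-2} waited=0
  while [ "$waited" -le "$deadline" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    case "$status" in
      healthy)   return 0 ;;
      unhealthy) echo "$container is unhealthy" >&2; return 1 ;;
    esac
    sleep "$poll"
    waited=$(( waited + poll ))
  done
  echo "$container not healthy after ${deadline}s" >&2
  return 1
}

# Example: wait_healthy postgres 90 && ./run-migrations.sh   (script name is hypothetical)
```

A container without a healthcheck never reports "healthy", so the function times out for it; that mirrors the fact that service_healthy only makes sense for services that define a check.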

A Complete Self-Healing Stack Example

Here’s a real-world Compose file that ties everything together — a web app with PostgreSQL, Redis, and proper health/dependency management:

services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    command: redis-server --requirepass ${REDIS_PASSWORD}
    environment:
      # Passed into the container so $$REDIS_PASSWORD in the healthcheck resolves
      REDIS_PASSWORD: ${REDIS_PASSWORD}
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD-SHELL", "redis-cli -a $$REDIS_PASSWORD ping | grep -q PONG"]
      interval: 10s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  app:
    image: myapp:latest
    environment:
      DATABASE_URL: postgres://appuser:${DB_PASSWORD}@postgres:5432/appdb
      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped

volumes:
  postgres_data:
  redis_data:

If PostgreSQL crashes, Docker restarts it. If the app can’t reach the database and its healthcheck starts failing, Docker marks it unhealthy but leaves it running. Combined with a monitoring tool like Uptime Kuma or Beszel, you’ll get alerted before users even notice.
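Restart policies act when a container exits; on a standalone Docker host an unhealthy container keeps running until something restarts it. A minimal sketch of such a watcher, assuming it runs on the host on a schedule such as cron (the function name restart_unhealthy is made up for this example):

```shell
#!/bin/sh
# Restart every container that Docker currently reports as unhealthy.
# Prints the name of each container it restarts.
restart_unhealthy() {
  docker ps --filter health=unhealthy --format '{{.Names}}' |
  while read -r name; do
    [ -n "$name" ] || continue
    echo "restarting $name"
    docker restart "$name" >/dev/null
  done
}

# Example cron-style use: call restart_unhealthy once a minute.
```

Blind restarts can hide a real problem, so it is worth logging each restart and still looking at docker logs when the same container keeps appearing.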

Monitoring Healthcheck Status

Command Line

# Check health status of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"

# Detailed health info for a specific container
docker inspect --format='{{json .State.Health}}' container_name | jq

# Print the output of recent healthcheck runs
docker inspect --format='{{range .State.Health.Log}}{{.Output}}{{end}}' container_name

Docker Events

You can subscribe to health status change events:

# Stream health events in real-time
docker events --filter event=health_status

This is useful for integrating with alerting systems — pipe the output to a script that sends notifications on health_status: unhealthy.
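A small filter is enough to turn that stream into alerts. The sketch below assumes each event has been formatted as "name health_status: state" (the --format template in the comment is an assumption about how you format the stream, and alert_on_unhealthy is a name invented here); replace the echo with your notifier of choice.

```shell
#!/bin/sh
# Read "<name> health_status: <state>" lines on stdin and print an alert
# line for every transition to unhealthy.
alert_on_unhealthy() {
  while read -r name _ state; do
    if [ "$state" = "unhealthy" ]; then
      echo "ALERT: $name is unhealthy"
    fi
  done
}

# Real input would be the event stream, for example:
#   docker events --filter event=health_status \
#     --format '{{.Actor.Attributes.name}} {{.Action}}' | alert_on_unhealthy
printf '%s\n' 'app health_status: unhealthy' 'db health_status: healthy' \
  | alert_on_unhealthy   # prints: ALERT: app is unhealthy
```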

Troubleshooting Common Issues

Healthcheck Passes but Service Is Broken

Your healthcheck might be too shallow. Checking if port 8080 is open doesn’t mean the app is functional. If the service has a /health or /api/status endpoint, use that instead of a simple TCP check.

Container Stuck in “starting” State

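A container stays in “starting” until its first check passes, or until the start period is over and enough checks have failed to mark it unhealthy. If it sits there for a long time, the check is failing while the app boots, or the check itself is broken. If it flips to unhealthy before the app has finished booting, the start_period is too short for your service. Some apps (Nextcloud, large Java apps) need 60–120 seconds to initialize. Increase start_period and check your container’s startup logs with docker logs.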

Healthcheck Can’t Find curl/wget

The image doesn’t include those tools. Options:

  1. Use a built-in check (pg_isready, redis-cli ping)
  2. Use the bash TCP trick: bash -c '</dev/tcp/localhost/PORT'
  3. Build a custom image that adds the tool
  4. Use wget in Alpine-based images (it’s included by default)

Restart Loop (Container Keeps Crashing)

If a container fails, gets restarted, and immediately fails again, Docker applies exponential backoff: it waits longer between each restart attempt (starting at 100 ms, doubling each time, capped at 1 minute). The delay resets once the container has stayed up for at least 10 seconds. Check docker logs to find the root cause instead of waiting for it to eventually stay up.
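To get a feel for how quickly the backoff grows, here is a small worked calculation. It is a sketch of the doubling rule only; the cap is a parameter that defaults to 60000 ms, following the 1-minute maximum delay given in Docker's restart policy documentation, and backoff_schedule is a name invented for this example.

```shell
#!/bin/sh
# Print the delay (in ms) before each of the first N restarts:
# start at 100 ms, double each time, never exceed the cap.
# Usage: backoff_schedule N [CAP_MS]
backoff_schedule() {
  n=$1 cap=${2:-60000} delay=100 i=1
  while [ "$i" -le "$n" ]; do
    echo "$delay"
    delay=$(( delay * 2 ))
    [ "$delay" -gt "$cap" ] && delay=$cap
    i=$(( i + 1 ))
  done
}

backoff_schedule 5   # prints 100 200 400 800 1600, one per line
```

The cap is reached on the eleventh restart, so a crash-looping container settles into roughly one attempt per minute within about two minutes of the first failure.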

depends_on Not Waiting for Healthy

Make sure you’re using the condition syntax:

# This does NOT wait for health
depends_on:
  - postgres

# This DOES wait for health
depends_on:
  postgres:
    condition: service_healthy

The shorthand list syntax only waits for the container to start, not for it to become healthy.

Best Practices

  1. Always set a start_period — especially for apps with slow initialization. This prevents false unhealthy alerts during startup.

  2. Keep healthchecks lightweight — a simple HTTP request or CLI ping is enough. Don’t run database queries or complex scripts that add load.

  3. Use unless-stopped as your default restart policy — it survives reboots but respects manual stops.

  4. Check what tools the image provides — before writing a curl healthcheck, make sure curl exists in the image.

  5. Set appropriate intervals — 10s for databases, 30s for web apps. Too frequent and you add unnecessary load; too infrequent and you miss problems.

  6. Use depends_on: condition: service_healthy for any service that relies on a database or cache — it eliminates race conditions during startup.

  7. Monitor unhealthy events — healthchecks alone don’t alert you. Pair them with Uptime Kuma, Beszel, or a simple Docker events watcher.

Wrapping Up

Docker healthchecks and restart policies are the foundation of a reliable self-hosted setup. Healthchecks tell you when something is actually broken (not just “running”), restart policies handle recovery automatically, and depends_on conditions prevent startup race conditions.

Add healthchecks to every service in your stack. Start with databases: they’re the most common dependency and the easiest to check. Then add them to web services using their built-in health endpoints. Combined with unless-stopped restart policies, you’ll have a stack that recovers from crashes on its own and flags the failures it can’t fix by itself.

Your future self, not getting paged at 3 AM because Postgres silently wedged itself, will thank you.