Maintenance File

Using the Maintenance File for Automated Traffic Draining

This tutorial assumes that VCLGroups are deployed and that the Traffic Router is set up correctly for routing traffic.

This tutorial shows how to use the agent’s maintenance file to drain traffic from Varnish nodes. While the Drain Traffic tutorial covers the CLI, UI, and API approaches, this tutorial focuses on the file-based mechanism, which is useful for automation and scripting.

For a conceptual overview of the maintenance file, see Concepts: Maintenance File.

Prerequisites

A running Varnish Controller setup with at least one agent and one router.
SSH or shell access to the agent host.
The path to the agent’s directory under the base directory (default: /var/lib/varnish-controller/<agent-name>/).

Scenario: Rolling Maintenance Across Multiple Nodes

In this example, we have three Varnish nodes (cache1, cache2, cache3) behind a Traffic Router. We want to perform OS updates on each node without dropping client traffic.

Step 1: Identify the Maintenance File Path

The maintenance file is located at <base-dir>/<agent-name>/maintenance by default. If you changed the maintenance-file configuration parameter, use that path instead.

# For an agent named "cache1" with default base directory:
$ ls /var/lib/varnish-controller/cache1/
agent.uid  traffic_router_health.json

The maintenance file does not exist yet, which means the agent is in normal operating mode.
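For scripting, the path rule above can be captured in a small helper. This is a minimal sketch: the function name maintenance_path and the MAINTENANCE_FILE environment variable are hypothetical conveniences for this example and are not part of the agent. MAINTENANCE_FILE stands in for a custom maintenance-file setting.

```shell
# Hypothetical helper: print the maintenance file path for an agent.
# MAINTENANCE_FILE is an illustrative override for setups where the
# maintenance-file parameter was changed; it is not read by the agent.
maintenance_path() {
    # $1: agent name, $2: base directory (optional)
    if [ -n "${MAINTENANCE_FILE:-}" ]; then
        printf '%s\n' "$MAINTENANCE_FILE"
    else
        printf '%s/%s/maintenance\n' "${2:-/var/lib/varnish-controller}" "$1"
    fi
}
```

Calling maintenance_path cache1 with no override prints the default path used in the rest of this tutorial.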

Step 2: Verify Current Health Status

Check the agent’s health check file to confirm the current status:

$ cat /var/lib/varnish-controller/cache1/traffic_router_health.json
{"version":1,"status":"healthy","mbps":245.3,"max_mbps":1000,"score":0.24,"updated_at":"2026-04-14T10:30:00Z"}

The status is healthy, confirming the node is actively receiving routed traffic.
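In a script, only the status field is usually needed. The sketch below extracts it with sed, assuming the single-line JSON layout shown above. The function name health_status is made up for this example, and a real JSON parser would be more robust if one is installed on the host.

```shell
# Hypothetical helper: print the "status" value from the health file.
# Assumes the compact, single-line JSON layout shown in this tutorial.
health_status() {
    # $1: path to traffic_router_health.json
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

For example, health_status /var/lib/varnish-controller/cache1/traffic_router_health.json prints healthy for the file shown above.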

Step 3: Enable Maintenance Mode

Create an empty file at the maintenance file path:

$ touch /var/lib/varnish-controller/cache1/maintenance

Within one second, the agent detects the file and updates its status. Verify:

$ cat /var/lib/varnish-controller/cache1/traffic_router_health.json
{"version":1,"status":"maintenance","mbps":245.3,"max_mbps":1000,"score":0.24,"updated_at":"2026-04-14T10:30:01Z"}

The status has changed to maintenance. The Traffic Router will no longer route new clients to cache1.
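An automated run should confirm the status change before it continues, rather than assume it. The following sketch polls the health file once per second until the expected status appears or a timeout expires. The function name wait_for_status and the 10-second default are assumptions chosen for illustration.

```shell
# Hypothetical helper: wait until the health file reports a given status.
wait_for_status() {
    # $1: health file, $2: expected status, $3: timeout in seconds (default 10)
    local file="$1" want="$2" tries="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        if grep -Eq "\"status\"[[:space:]]*:[[:space:]]*\"$want\"" "$file" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1  # status did not appear within the timeout
}
```

A script could run touch on the maintenance file and then call wait_for_status with the health file path and maintenance, aborting if the call fails.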

Step 4: Wait for Traffic to Drain

The Varnish node still serves existing clients, especially those with cached DNS records. Monitor traffic until it subsides:

# Sample the cumulative session and request counters; rerun and compare to see whether new traffic is still arriving
$ varnishstat -1 -f MAIN.sess_conn -f MAIN.client_req

For DNS routing, wait at least the TTL duration of the DNS records before proceeding.
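Because these varnishstat counters are cumulative, draining shows up as a shrinking difference between two samples. The sketch below takes two samples of MAIN.client_req and prints the difference. It assumes the usual `varnishstat -1` line layout of name, value, average rate, and description; the function names are hypothetical.

```shell
# Hypothetical helper: read `varnishstat -1` output on stdin and print
# the value column for the counter named in $1.
counter_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Hypothetical helper: print how many client requests arrived during
# an interval of $1 seconds.
requests_during() {
    local before after
    before=$(varnishstat -1 -f MAIN.client_req | counter_value MAIN.client_req)
    sleep "$1"
    after=$(varnishstat -1 -f MAIN.client_req | counter_value MAIN.client_req)
    echo $((after - before))
}
```

Calling requests_during 10 repeatedly gives a rough view of remaining traffic. What counts as low enough depends on the deployment, since health probes and clients with cached DNS records may still reach the node.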

Step 5: Perform Maintenance

With traffic drained, perform your maintenance tasks:

$ sudo apt update && sudo apt upgrade -y
$ sudo systemctl restart varnish

Step 6: Disable Maintenance Mode

Remove the maintenance file to restore the agent to healthy status:

$ rm /var/lib/varnish-controller/cache1/maintenance

Verify the status is restored:

$ cat /var/lib/varnish-controller/cache1/traffic_router_health.json
{"version":1,"status":"healthy","mbps":0,"max_mbps":1000,"score":0,"updated_at":"2026-04-14T10:45:00Z"}

The Traffic Router will begin routing new clients to cache1 again.

Step 7: Repeat for Other Nodes

Repeat steps 1–6 for cache2 and cache3, one at a time, substituting the agent name in each path.
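Steps 3 to 6 can be wrapped in one function that is run on each node in turn. This is a sketch under stated assumptions: the name with_maintenance and the fixed drain wait are illustrative, and the choice to leave the maintenance file in place when the maintenance command fails is a design decision for this example, not documented agent behavior.

```shell
# Hypothetical wrapper: enable maintenance, wait, run a command, and
# disable maintenance only if the command succeeded.
with_maintenance() {
    # $1: maintenance file path, $2: drain wait in seconds, rest: command
    local file="$1" wait="$2" rc=0
    shift 2
    touch "$file" || return 1
    sleep "$wait"
    if "$@"; then
        rm -f "$file"
    else
        rc=$?
        echo "maintenance command failed ($rc); leaving $file in place" >&2
    fi
    return "$rc"
}
```

For example, with_maintenance /var/lib/varnish-controller/cache1/maintenance 60 sh -c 'apt update && apt upgrade -y && systemctl restart varnish' would cover one node when run as root. Keeping the file after a failure means a node that may be broken stays out of rotation until someone inspects it.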

Automation Example: Ansible Playbook

---
- name: Rolling maintenance for Varnish nodes
  hosts: varnish_nodes
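  serial: 1
  become: true  # the update and restart tasks need root, like sudo in Step 5
  vars:
    base_dir: /var/lib/varnish-controller
    drain_wait: 60  # seconds; for DNS routing, use at least the DNS record TTL
    # agent_name is not defined here; set it per host, for example in the inventory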

  tasks:
    - name: Enable maintenance mode
      file:
        path: "{{ base_dir }}/{{ agent_name }}/maintenance"
        state: touch
        mode: "0644"

    - name: Wait for traffic to drain
      pause:
        seconds: "{{ drain_wait }}"

    - name: Perform OS updates
      apt:
        upgrade: dist
        update_cache: yes

    - name: Restart Varnish
      systemd:
        name: varnish
        state: restarted

    - name: Disable maintenance mode
      file:
        path: "{{ base_dir }}/{{ agent_name }}/maintenance"
        state: absent

    - name: Wait before proceeding to next node
      pause:
        seconds: 10

Verifying via the CLI

You can verify the maintenance status from the Controller CLI at any time:

# List agents and their routing status
$ vcli agent list
+----+--------+---------+--------------+
| ID | Name   | State   | StopRouting  |
+----+--------+---------+--------------+
|  1 | cache1 | Running | true         |
|  2 | cache2 | Running | false        |
|  3 | cache3 | Running | false        |
+----+--------+---------+--------------+

After the maintenance file is created on disk, the StopRouting column shows true once the agent has reported the change to brainz.
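A script can read the StopRouting value for one agent from this table. The sketch below assumes the exact column order shown above (ID, Name, State, StopRouting), which may differ between versions. The function name stop_routing_for is hypothetical.

```shell
# Hypothetical helper: read the agent table on stdin and print the
# StopRouting cell for the agent named in $1.
# Assumes the column order shown in this tutorial.
stop_routing_for() {
    awk -F'|' -v name="$1" '
        NF >= 5 {
            # Trim padding from every cell before comparing.
            for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
            if ($3 == name) print $5
        }'
}
```

Piping the output of vcli agent list into stop_routing_for cache1 would print true for the table above. The function prints nothing when the agent is not listed.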

Important Notes

Maintenance mode does not undeploy VCL. The Varnish instance keeps serving requests from clients that reach it directly.
The maintenance file must be an actual file (not a directory or symlink). An empty file is sufficient.
If the maintenance file exists when the agent starts, it immediately reports maintenance status.
Creating the maintenance file via the CLI (vcli agent stop-routing) or via the filesystem (touch maintenance) achieves the same result. Mixing both methods is safe; the agent reconciles the state.
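The file-type requirement in the notes above can be checked before a script relies on maintenance mode. This minimal sketch accepts only a regular file that is not a symlink; the function name is made up for this example.

```shell
# Hypothetical check: succeed only if $1 is a regular file and not a
# symlink. `-f` alone would follow symlinks, so `-L` is tested as well.
is_valid_maintenance_file() {
    [ -f "$1" ] && [ ! -L "$1" ]
}
```

A script might call this right after touch and abort if it fails, for example when a directory or a symlink already exists at that path.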