How to Block AI Bots on Varnish Cache: Complete 2026 Guide

Varnish Cache is a high-performance HTTP accelerator (caching reverse proxy) used by major media publishers, e-commerce platforms, and CDNs. It is configured entirely through VCL (Varnish Configuration Language) — a domain-specific language for HTTP request handling. Bot blocking in Varnish is done in the vcl_recv subroutine, before cache lookup and before any backend hit.

vcl_recv — block bots before cache lookup

vcl_recv is the first subroutine called for every incoming request — it runs before cache lookup, before backend selection, and before any backend connection. This is the correct place to block bots: zero backend load, zero cache pollution.

vcl 4.1;

import std;

sub vcl_recv {
    # Block AI training and scraping bots by User-Agent
    if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
        return(synth(403, "Forbidden"));
    }
}
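VCL regex syntax: The ~ operator performs PCRE regex matching, and the match is unanchored, so a token anywhere in the User-Agent header triggers the rule. The (?i) flag at the start makes the entire pattern case-insensitive, and | separates the alternatives inside the group. Varnish takes the whole list as a single regex; you cannot supply space-separated values as you might in nginx or HAProxy configuration. Note that Google-Extended is a robots.txt control token rather than a User-Agent string that Google's crawlers send, so that alternative is not expected to match real requests; the robots.txt rules are what apply to it.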
Block in vcl_recv, not vcl_hit or vcl_miss. vcl_hit runs only when there is a cache hit — bots on uncached URLs would bypass the check. vcl_recv runs unconditionally for every request.
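
Because the same User-Agent check recurs throughout this guide, one option is to keep it in a named subroutine and call it from vcl_recv. The following is a minimal sketch, not a required layout: the name block_ai_bots is hypothetical, the bot list is shortened for readability, and the backend address is a placeholder so the file loads on its own.

```vcl
vcl 4.1;

# Placeholder backend (assumption) so this file compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical helper name; custom subroutine names must not start with vcl_.
sub block_ai_bots {
    if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|CCBot|Bytespider)") {
        # A return inside a called subroutine ends request handling in the
        # caller's context, so this produces the 403 from vcl_recv.
        return(synth(403, "Forbidden"));
    }
}

sub vcl_recv {
    # Run the bot check before any other request handling.
    call block_ai_bots;
}
```

Keeping the regex in one place means the list only has to be updated once when a new crawler appears.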

Block and log (using std.log)

vcl 4.1;

import std;

sub vcl_recv {
    if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
        std.log("AI bot blocked: " + req.http.User-Agent);
        return(synth(403, "Forbidden"));
    }
}

std.log() writes to the Varnish shared memory log (VSL), readable with varnishlog -g request -q "VCL_Log ~ \"AI bot\"".
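
Logging the full User-Agent can be noisy. As a sketch of one alternative, regsub() can reduce the header to just the matched bot token before it is logged. The shortened bot list and the placeholder backend are assumptions made for illustration.

```vcl
vcl 4.1;

import std;

# Placeholder backend (assumption) so this file compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|CCBot|Bytespider)") {
        # Capture group 1 holds the matched token; \1 substitutes it, so the
        # log line carries the bot name instead of the whole header.
        std.log("AI bot blocked: " +
            regsub(req.http.User-Agent,
                   "(?i)^.*?(GPTBot|ClaudeBot|CCBot|Bytespider).*$", "\1") +
            " from " + client.ip);
        return(synth(403, "Forbidden"));
    }
}
```

The message still begins with "AI bot", so the varnishlog query shown above keeps matching it.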

vcl_synth — custom 403 response

When return(synth(403, "Forbidden")) is called in vcl_recv, Varnish calls vcl_synth to build the synthetic response. Customise it to return a clean response body:

sub vcl_synth {
    if (resp.status == 403) {
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        set resp.http.X-Robots-Tag = "noindex";
        synthetic("Forbidden" + {"
"});
        return(deliver);
    }

    # Default synth handling for other status codes
    return(deliver);
}
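VCL long-string syntax: {" ... "} is VCL's long-string syntax, roughly equivalent to a here-doc. Everything between {" and "} is taken literally, including line breaks and double quotes, so the newline after Forbidden is part of the string. The string ends at the first "} sequence, so the body itself must not contain a double quote immediately followed by a closing brace. Use it when your synthetic body contains special characters or line breaks.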
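
To illustrate the long-string form with a multi-line body, here is a small sketch that returns an HTML page for the 403 case. The markup and wording are placeholders, and the backend definition is an assumption added only so the file loads on its own.

```vcl
vcl 4.1;

# Placeholder backend (assumption) so this file compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_synth {
    if (resp.status == 403) {
        set resp.http.Content-Type = "text/html; charset=utf-8";
        # Double quotes and line breaks are literal inside the long string.
        synthetic({"<!DOCTYPE html>
<html lang="en">
  <head><title>403 Forbidden</title></head>
  <body>
    <h1>Forbidden</h1>
    <p>Automated access to this site is not permitted.</p>
  </body>
</html>
"});
        return(deliver);
    }
}
```

Other status codes fall through to the built-in vcl_synth, which generates Varnish's default error page.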

Return JSON for API consumers

sub vcl_synth {
    if (resp.status == 403) {
        set resp.http.Content-Type = "application/json; charset=utf-8";
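        # Keys are ordered so the JSON never contains a double quote followed
        # by a closing brace, which would end the long string early.
        synthetic({"{"error":"Forbidden","status":403}"});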
        return(deliver);
    }
}

X-Robots-Tag in vcl_backend_response / vcl_deliver

Add X-Robots-Tag to all responses. Two options depending on when you want to set it:

vcl_backend_response — set on backend response (before caching)

sub vcl_backend_response {
    # Add X-Robots-Tag to all backend responses
    # This value is cached alongside the object
    set beresp.http.X-Robots-Tag = "noai, noimageai";
}
Cached with the object: Headers set in vcl_backend_response are stored in Varnish's cache alongside the object. All subsequent cache hits will include the header without another backend request.

vcl_deliver — set on delivery to client (after cache lookup)

sub vcl_deliver {
    # Set X-Robots-Tag on every response sent to the client
    # Use this if you need to set/override regardless of cache state
    set resp.http.X-Robots-Tag = "noai, noimageai";

    # Optional: remove internal headers before delivery
    unset resp.http.X-Varnish;
    unset resp.http.Via;
}

vcl_deliver runs just before sending the response to the client — it can override headers set in vcl_backend_response. Use it when you need unconditional header injection regardless of cache state.
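
One detail worth keeping in mind: responses built in vcl_synth are delivered without passing through vcl_deliver, so a header set only in vcl_deliver does not appear on synthetic 403 or robots.txt responses. A minimal sketch that covers both paths, assuming a placeholder backend:

```vcl
vcl 4.1;

# Placeholder backend (assumption) so this file compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_deliver {
    # Cache hits and misses fetched from the backend.
    set resp.http.X-Robots-Tag = "noai, noimageai";
}

sub vcl_synth {
    # Synthetic responses (synth() calls) bypass vcl_deliver.
    set resp.http.X-Robots-Tag = "noai, noimageai";
    # No return here: the built-in vcl_synth still builds the body.
}
```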

Serving robots.txt from VCL

Serve robots.txt directly from Varnish without a backend hit:

sub vcl_recv {
    # Serve robots.txt directly from Varnish (no backend hit)
    if (req.url == "/robots.txt") {
        return(synth(200, "OK"));
    }

    # ... rest of vcl_recv
}

sub vcl_synth {
    if (resp.status == 200 && req.url == "/robots.txt") {
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        set resp.http.Cache-Control = "public, max-age=86400";
        synthetic({"User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"});
        return(deliver);
    }

    if (resp.status == 403) {
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        synthetic("Forbidden");
        return(deliver);
    }
}

Rate limiting with vsthrottle

The vsthrottle VMOD provides per-key rate limiting. It's available in the varnish-modules package (open source) and bundled with Varnish Enterprise:

vcl 4.1;

import vsthrottle;

sub vcl_recv {
    # Block AI bots by UA first (fastest path)
    if (req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
        return(synth(403, "Forbidden"));
    }

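    # Rate limit: 100 requests per 10 seconds per client
    # Key: client.identity, which defaults to the client IP. Behind a load
    # balancer, set it from a validated forwarded-IP header instead; an unset
    # X-Forwarded-For would put every client in the same bucket.
    if (vsthrottle.is_denied(client.identity, 100, 10s)) {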
        return(synth(429, "Too Many Requests"));
    }
}

Install varnish-modules (Ubuntu/Debian)

apt-get install varnish-modules

Install varnish-modules (from source)

git clone https://github.com/varnish/varnish-modules.git
cd varnish-modules
./bootstrap
./configure
make
make install
vsthrottle key selection: Using client.ip as the key works for direct connections. If Varnish is behind a load balancer, use req.http.X-Forwarded-For — but validate it first to prevent IP spoofing. For production, consider a trusted IP header from your load balancer (e.g. req.http.X-Real-IP).
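
The validation step described above can be sketched as follows: accept the forwarded header only when the TCP peer is a known load balancer, and otherwise fall back to the connection address. The ACL name trusted_proxies and the address 10.0.0.10 are hypothetical, and X-Real-IP is assumed to be the header your load balancer sets.

```vcl
vcl 4.1;

import vsthrottle;

# Placeholder backend (assumption) so this file compiles standalone.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical: the address of your own load balancer.
acl trusted_proxies {
    "10.0.0.10";
}

sub vcl_recv {
    # Trust the forwarded header only when the request really arrived
    # from the load balancer; anyone else could spoof it.
    if (client.ip ~ trusted_proxies && req.http.X-Real-IP) {
        set client.identity = req.http.X-Real-IP;
    }

    # client.identity defaults to client.ip when not set above.
    if (vsthrottle.is_denied(client.identity, 100, 10s)) {
        return(synth(429, "Too Many Requests"));
    }
}
```

Requests that reach Varnish directly are still throttled by their real connection address, so a forged header gains nothing.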

VCL ACL for IP-based exceptions

VCL's acl statement defines IP ranges. Use it to exempt your own crawlers or monitoring services from the bot-blocking rules:

vcl 4.1;

import std;
import vsthrottle;

# Trusted IPs — bypass bot blocking (your own crawlers, monitoring)
acl trusted_crawlers {
    "127.0.0.1";
    "10.0.0.0"/8;
    "192.168.0.0"/16;
    "203.0.113.42";     # your monitoring service IP
}

sub vcl_recv {
    # Skip the bot check for trusted IPs. Using !~ here, rather than an early
    # return(pass), keeps requests from trusted IPs cacheable.
    if (client.ip !~ trusted_crawlers &&
        req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
        std.log("AI bot blocked: " + req.http.User-Agent);
        return(synth(403, "Forbidden"));
    }
}

Full VCL example

vcl 4.1;

import std;
import vsthrottle;

# Backend definition
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .connect_timeout = 5s;
    .first_byte_timeout = 30s;
    .between_bytes_timeout = 10s;
    .probe = {
        .url = "/health";
        .timeout = 2s;
        .interval = 5s;
        .window = 5;
        .threshold = 3;
    }
}

# Trusted IPs — bypass bot blocking
acl trusted_crawlers {
    "127.0.0.1";
    "10.0.0.0"/8;
    "192.168.0.0"/16;
}

sub vcl_recv {
    # Health check passthrough
    if (req.url == "/health") {
        return(pass);
    }

    # Serve robots.txt from Varnish directly
    if (req.url == "/robots.txt") {
        return(synth(800, "robots"));
    }

    # Block AI bots by User-Agent, except from trusted IPs.
    # If Varnish sits behind a load balancer, client.ip is the balancer's
    # address, so keep its range out of trusted_crawlers.
    if (client.ip !~ trusted_crawlers &&
        req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)") {
        std.log("AI bot blocked UA: " + req.http.User-Agent);
        return(synth(403, "Forbidden"));
    }

    # Rate limiting: 200 req / 10s per client
    # (client.identity defaults to the client IP)
    if (client.ip !~ trusted_crawlers &&
        vsthrottle.is_denied(client.identity, 200, 10s)) {
        return(synth(429, "Too Many Requests"));
    }

    # Strip cookies on static assets (allow caching)
    if (req.url ~ "\.(css|js|png|jpg|jpeg|gif|ico|woff2?|svg)$") {
        unset req.http.Cookie;
    }

    # No explicit return: fall through to the built-in vcl_recv, which passes
    # non-GET/HEAD requests and requests carrying cookies or authorization
    # instead of looking them up in the cache.
}

sub vcl_backend_response {
    # Add X-Robots-Tag to all backend responses (cached with object)
    set beresp.http.X-Robots-Tag = "noai, noimageai";

    # Cache static assets for 1 day
    if (bereq.url ~ "\.(css|js|png|jpg|jpeg|gif|ico|woff2?|svg)$") {
        set beresp.ttl = 1d;
        set beresp.http.Cache-Control = "public, max-age=86400";
        unset beresp.http.Set-Cookie;
    }
}

sub vcl_deliver {
    # Ensure X-Robots-Tag is on every delivery (including cache hits)
    if (!resp.http.X-Robots-Tag) {
        set resp.http.X-Robots-Tag = "noai, noimageai";
    }

    # Add cache status header for debugging
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
    } else {
        set resp.http.X-Cache = "MISS";
    }

    # Remove Varnish internals from response
    unset resp.http.X-Varnish;
    unset resp.http.Via;
}

sub vcl_synth {
    # robots.txt (custom status 800)
    if (resp.status == 800) {
        set resp.status = 200;
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        set resp.http.Cache-Control = "public, max-age=86400";
        synthetic({"User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: AhrefsBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"});
        return(deliver);
    }

    # 403 Forbidden
    if (resp.status == 403) {
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        synthetic("Forbidden");
        return(deliver);
    }

    # 429 Too Many Requests
    if (resp.status == 429) {
        set resp.http.Content-Type = "text/plain; charset=utf-8";
        set resp.http.Retry-After = "60";
        synthetic("Too Many Requests");
        return(deliver);
    }

    return(deliver);
}
Custom synth status 800: Using status code 800 for the robots.txt synth keeps it from being confused with any other 200 response handled in vcl_synth. Varnish accepts nonstandard status codes in synth(), and using a code outside the standard 100–599 range is a common pattern for internal routing logic. Set it back to 200 in vcl_synth before delivering.

Docker deployment

docker-compose.yml

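services:
  varnish:
    image: varnish:7.5-alpine
    ports:
      - "80:80"
      - "8443:8443"
    volumes:
      - ./default.vcl:/etc/varnish/default.vcl:ro
    environment:
      # Cache storage size used by the image's default varnishd arguments
      - VARNISH_SIZE=256m
    # No command override: the official image's entrypoint is expected to
    # supply the listen address, VCL path, and malloc storage itself, and
    # repeating -a for port 80 would clash with those defaults.
    depends_on:
      - app

  app:
    image: your-app:latest
    expose:
      - "8080"

# In this setup the VCL backend must point at the app service, not localhost:
#   .host = "app"; .port = "8080";
# For HTTPS: put nginx or caddy in front of varnish for TLS termination
# Varnish does not handle TLS natively in the open-source version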
Varnish and TLS: Open-source Varnish does not terminate TLS. For HTTPS, place a TLS-terminating proxy (nginx, Caddy, HAProxy, or Hitch, Varnish Software's own open-source TLS proxy) in front of Varnish. Varnish Enterprise supports TLS natively. Common pattern: Client → nginx (TLS) → Varnish (cache + bot blocking) → app backend.

Reload VCL without restart

# Load new VCL
varnishadm vcl.load newconfig /etc/varnish/default.vcl

# Activate it
varnishadm vcl.use newconfig

# Verify
varnishadm vcl.list

Inspect blocked requests

# Watch blocked-bot log messages in real time
varnishlog -g request -q 'VCL_Log ~ "AI bot"'

# Count synthetic responses (every synth() call: 403s, 429s, and robots.txt)
varnishstat -1 -f MAIN.s_synth

FAQ

How do I block AI bots by User-Agent in Varnish?

In vcl_recv, use req.http.User-Agent ~ "(?i)(GPTBot|ClaudeBot|...)" then return(synth(403, "Forbidden")). The ~ operator does PCRE regex matching; (?i) makes it case-insensitive.

What is the difference between vcl_recv and vcl_pass in Varnish?

vcl_recv runs for every incoming request before cache lookup — the correct place for bot blocking. vcl_pass runs when a request is explicitly passed to the backend (bypassing cache). Block in vcl_recv so all requests are checked, cached or not.

How do I add X-Robots-Tag in Varnish?

In vcl_backend_response: set beresp.http.X-Robots-Tag = "noai, noimageai" — cached with the object. Or in vcl_deliver: set resp.http.X-Robots-Tag = "noai, noimageai" — applied on every delivery including cache hits, not stored in cache.

Can Varnish serve robots.txt without hitting the backend?

Yes — detect req.url == "/robots.txt" in vcl_recv and call return(synth(800, "robots")). In vcl_synth, set resp.status = 200, set the Content-Type, and use synthetic() with the robots.txt content.

How do I rate-limit bots in Varnish?

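Install the varnish-modules package for the vsthrottle VMOD. In vcl_recv, vsthrottle.is_denied(client.identity, 100, 10s) returns true once the client has exceeded 100 requests in 10 seconds. client.identity defaults to the client IP, so behind a load balancer, key on a validated forwarded-IP header instead. Return synth(429, "Too Many Requests") when the request is denied.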

Should I block bots in vcl_recv or at the backend level?

Always in vcl_recv — it fires before cache lookup and before any backend connection. Blocking here means zero backend load from blocked bots. Backend-level blocking wastes a connection and thread for every blocked request.
