
How to Block AI Bots on HAProxy: Complete 2026 Guide

HAProxy is a high-performance TCP/HTTP load balancer and reverse proxy used in production by GitHub, Reddit, Airbnb, and many high-traffic sites. Unlike nginx or Apache, HAProxy's configuration is ACL-driven — bot blocking is concise and fast. This guide covers ACL-based UA matching, X-Robots-Tag headers, robots.txt serving, rate limiting via stick tables, and logging.

ACL-based UA blocking

HAProxy uses ACLs (Access Control Lists) in the frontend block to match request properties. Block AI bots before the request reaches any backend:

frontend http-in
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/

    # AI bot blocking — substring match, case-insensitive
    acl is_ai_bot req.hdr(User-Agent) -m sub -i \
        GPTBot \
        ClaudeBot \
        anthropic-ai \
        CCBot \
        Google-Extended \
        AhrefsBot \
        Bytespider \
        Amazonbot \
        Diffbot \
        FacebookBot \
        cohere-ai \
        PerplexityBot \
        YouBot

    http-request deny status 403 if is_ai_bot

    default_backend app
Block in the frontend, not the backend. Placing http-request deny in the frontend stops the request before it reaches the backend, saving backend resources (connections, threads, DB queries). Backend rules only fire after HAProxy has selected a server — the request has already consumed resources by then.
Line continuation with backslash: HAProxy config supports line continuation with \. The ACL values above are space-separated — each additional value on the same acl line is OR logic. Alternatively, list each on a separate acl is_ai_bot line (repeated ACL name = OR logic too).

Alternative — one value per line (equivalent)

acl is_ai_bot req.hdr(User-Agent) -m sub -i GPTBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i ClaudeBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i anthropic-ai
acl is_ai_bot req.hdr(User-Agent) -m sub -i CCBot
acl is_ai_bot req.hdr(User-Agent) -m sub -i Google-Extended

http-request deny status 403 if is_ai_bot

Repeated acl lines with the same name are ORed together — same result as the multi-value single line.
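The match semantics are easy to sanity-check offline. This sketch emulates `-m sub -i` (case-insensitive substring, OR across patterns) with grep -i against sample User-Agent strings — the UA values and the shortened bot list are illustrative, not HAProxy itself:

```shell
#!/bin/sh
# Emulate HAProxy's "-m sub -i" match: case-insensitive substring,
# OR logic across all patterns in the list.
bots="GPTBot ClaudeBot anthropic-ai CCBot Google-Extended"

match_ua() {
  ua="$1"
  for bot in $bots; do
    # grep -qi: quiet, case-insensitive substring search
    if printf '%s' "$ua" | grep -qi "$bot"; then
      echo "deny"   # HAProxy would apply: http-request deny status 403
      return 0
    fi
  done
  echo "allow"
}

match_ua "Mozilla/5.0 (compatible; gptbot/1.2; +https://openai.com/gptbot)"  # prints "deny"
match_ua "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"                        # prints "allow"
```

Note the lowercase "gptbot" in the first UA still matches — that is what `-i` buys you in the real ACL.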

ACL match flags: -m sub vs -m reg

HAProxy ACLs support multiple match methods. The two most useful for bot blocking:

Flag      Method            Use case                                  Performance
-m sub    Substring match   Bot name appears anywhere in UA string    Fast — string search
-m reg    Regex match       Complex patterns, anchoring               Slower — regex engine
-m str    Exact match       Exact UA string equality                  Fastest — hash lookup
-m beg    Prefix match      UA starts with pattern                    Fast

-i makes any match case-insensitive. Always use -i for User-Agent matching — bots sometimes vary their capitalisation across versions.

Regex example (more precise but slower)

acl is_ai_bot req.hdr(User-Agent) -m reg -i \
    (GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)

http-request deny status 403 if is_ai_bot
For most deployments, -m sub -i with a list of bot names is the best balance of clarity, performance, and coverage. Use -m reg only if you need anchoring (^/$) or more complex pattern logic.
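For longer lists, the patterns can also live in a separate file loaded with -f, keeping haproxy.cfg stable while the bot list evolves. A sketch (the file path and its contents are assumptions):

```haproxy
# /etc/haproxy/ai-bots.acl contains one substring pattern per line,
# with "#" comments allowed, e.g.:
#   GPTBot
#   ClaudeBot
#   anthropic-ai

frontend http-in
    bind *:80

    # The match flags (-m sub -i) apply to every pattern loaded from the file
    acl is_ai_bot req.hdr(User-Agent) -m sub -i -f /etc/haproxy/ai-bots.acl
    http-request deny status 403 if is_ai_bot

    default_backend app
```

Pattern files are read at startup, and entries can also be manipulated at runtime over the stats socket (add acl / del acl) without a reload.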

X-Robots-Tag response header

Use http-response set-header to add X-Robots-Tag to all responses. Place in the backend block so it applies to responses from your upstream server (not to HAProxy-generated error pages):

backend app
    balance roundrobin
    option forwardfor
    http-response set-header X-Robots-Tag "noai, noimageai"
    server app1 127.0.0.1:3000 check
    server app2 127.0.0.1:3001 check

Apply to all responses including error pages (frontend placement)

frontend http-in
    bind *:80

    # Apply X-Robots-Tag to everything, including HAProxy error pages
    http-response set-header X-Robots-Tag "noai, noimageai"

    default_backend app
Frontend vs backend placement: In the frontend, http-response set-header applies to all responses, including HAProxy-generated 400/403/503 pages. In the backend, it only applies to responses proxied from your upstream server. For SEO headers, backend placement is usually preferable — you don't need X-Robots-Tag on error responses.

Add to existing header (if upstream already sets it)

# Replace the header entirely (preferred — avoid duplicate headers)
http-response set-header X-Robots-Tag "noai, noimageai"

# Or add to existing value
http-response add-header X-Robots-Tag "noai, noimageai"
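If only some upstream responses carry the header and you want HAProxy to fill in the gaps without clobbering existing values, the set can be made conditional. A sketch:

```haproxy
# Set X-Robots-Tag only when the upstream response did not include one
http-response set-header X-Robots-Tag "noai, noimageai" unless { res.hdr(X-Robots-Tag) -m found }
```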

Serving robots.txt from HAProxy

HAProxy can serve a static robots.txt response directly from the frontend, without forwarding the request to the backend:

frontend http-in
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/

    # Serve robots.txt directly from HAProxy
    acl is_robots_txt path /robots.txt
    http-request return status 200 \
        content-type "text/plain" \
        string "User-agent: *\nAllow: /\n\nUser-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n\nUser-agent: anthropic-ai\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n\nUser-agent: Google-Extended\nDisallow: /\n" \
        if is_robots_txt

    default_backend app
http-request return was introduced in HAProxy 2.2. If you're on an older version, use http-request redirect to a static file served by your backend, or upgrade to HAProxy 2.2+.
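On versions older than 2.2, a common workaround is a server-less backend whose 503 errorfile is a hand-written raw HTTP response carrying the robots.txt body — the status line inside the file says 200 OK, so clients never see a 503. A sketch, with assumed file paths:

```haproxy
# Pre-2.2 workaround: with no servers defined, this backend always serves
# its 503 errorfile, which is a complete raw HTTP response
backend robots_txt
    errorfile 503 /etc/haproxy/robots.http

frontend http-in
    bind *:80
    use_backend robots_txt if { path /robots.txt }
    default_backend app
```

Here /etc/haproxy/robots.http is a full raw response starting with the status line (HTTP/1.0 200 OK), followed by headers (Content-Type: text/plain), a blank line, and the robots.txt body.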

Serve robots.txt from a file (HAProxy 2.4+)

# haproxy.cfg — HAProxy 2.4+
frontend http-in
    bind *:80

    acl is_robots_txt path /robots.txt
    http-request return status 200 content-type "text/plain" file /etc/haproxy/robots.txt if is_robots_txt

    default_backend app

Create /etc/haproxy/robots.txt:

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

Rate limiting with stick tables

HAProxy's stick tables provide per-IP rate limiting without any additional software. This complements UA blocking — limiting aggressive crawlers even if they spoof their User-Agent:

backend rate_limit_table
    stick-table type ip size 1m expire 10m store gpc0,http_req_rate(10s)

frontend http-in
    bind *:80

    # Track request rate per IP using the stick table in rate_limit_table backend
    http-request track-sc0 src table rate_limit_table

    # Block IPs making more than 100 requests in 10 seconds
    acl too_many_requests sc_http_req_rate(0) gt 100
    http-request deny status 429 if too_many_requests

    # Block known AI bots by UA (in addition to the rate limit above)
    acl is_ai_bot req.hdr(User-Agent) -m sub -i \
        GPTBot ClaudeBot anthropic-ai CCBot Google-Extended \
        AhrefsBot Bytespider Amazonbot Diffbot PerplexityBot

    http-request deny status 403 if is_ai_bot

    default_backend app

Key stick table options

Option                      Description
type ip                     Key by client IP address
size 1m                     Store up to 1 million entries
expire 10m                  Remove entries after 10 minutes of inactivity
store gpc0                  General Purpose Counter 0 — for manual counters
store http_req_rate(10s)    Track request rate over a 10-second sliding window
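The gpc0 counter stored in the table can turn a transient rate spike into a sticky block: increment it when an IP trips the limit, then keep denying while it is non-zero (the flag clears when the stick-table entry expires). A sketch extending the frontend above:

```haproxy
# Once an IP exceeds the rate, flag it via gpc0 and keep denying it
# until its stick-table entry expires (10 minutes of inactivity here)
http-request sc-inc-gpc0(0) if too_many_requests
acl is_flagged sc_get_gpc0(0) gt 0
http-request deny status 429 if is_flagged
```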

Logging blocked bots

Log blocked bot requests to a separate file for analysis without polluting your main access log:

global
    log /dev/log local0
    log /dev/log local1 notice

defaults
    log     global
    option  httplog
    option  dontlognull

frontend http-in
    bind *:80

    acl is_ai_bot req.hdr(User-Agent) -m sub -i \
        GPTBot ClaudeBot anthropic-ai CCBot Google-Extended \
        AhrefsBot Bytespider Amazonbot Diffbot PerplexityBot

    # Capture User-Agent for logging (first 100 chars)
    http-request capture req.hdr(User-Agent) len 100

    # Raise the log level for blocked bots so syslog can route them to a separate file
    http-request set-log-level warning if is_ai_bot
    http-request deny status 403 if is_ai_bot

    default_backend app

Custom log format showing blocked bot UA

defaults
    log-format "%ci:%cp [%t] %ft %b/%s %Tq/%Tw/%Tc/%Tr/%Tt %ST %B %tsc %ac/%fc/%bc/%sc/%rc %{+Q}r %[capture.req.hdr(0)]"

The %[capture.req.hdr(0)] field outputs the first captured header (the User-Agent in the config above), making it easy to grep blocked bot names from logs.
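As a quick offline check of that workflow, this sketch tallies denied (403) requests per captured UA from a sample log. The log lines and awk field positions are illustrative — with the log-format above, field 6 is the status (%ST) and the captured UA trails the line; real User-Agent strings contain spaces, so $NF only grabs the UA's last token:

```shell
#!/bin/sh
# Tally 403-denied requests by the trailing (captured) User-Agent token.
cat > /tmp/haproxy-sample.log <<'EOF'
203.0.113.7:51234 [10/Jan/2026:12:00:01] http-in app/app1 0/0/1/2/3 403 187 PR-- 1/1/0/0/0 "GET / HTTP/1.1" GPTBot/1.0
203.0.113.8:51235 [10/Jan/2026:12:00:02] http-in app/app1 0/0/1/2/4 200 512 ---- 1/1/0/0/0 "GET / HTTP/1.1" Mozilla/5.0
203.0.113.9:51236 [10/Jan/2026:12:00:03] http-in app/app1 0/0/1/2/3 403 187 PR-- 1/1/0/0/0 "GET /docs HTTP/1.1" ClaudeBot/1.0
EOF

# Keep only 403s, print the trailing UA field, count per bot
awk '$6 == 403 { print $NF }' /tmp/haproxy-sample.log | sort | uniq -c | sort -rn
```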

Full haproxy.cfg example

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    maxconn 50000

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  forwardfor
    option  http-server-close
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 429 /etc/haproxy/errors/429.http
    errorfile 503 /etc/haproxy/errors/503.http

# Stick table for rate limiting (backend with no servers = pure table)
backend rate_limit_table
    stick-table type ip size 1m expire 10m store gpc0,http_req_rate(10s)

frontend http-in
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/example.com.pem alpn h2,http/1.1

    # Redirect HTTP to HTTPS
    http-request redirect scheme https unless { ssl_fc }

    # Capture User-Agent for logging
    http-request capture req.hdr(User-Agent) len 100

    # Rate limiting — track requests per IP
    http-request track-sc0 src table rate_limit_table
    acl too_many_requests sc_http_req_rate(0) gt 100
    http-request deny status 429 if too_many_requests

    # AI bot blocking by User-Agent
    acl is_ai_bot req.hdr(User-Agent) -m sub -i \
        GPTBot \
        ClaudeBot \
        anthropic-ai \
        CCBot \
        Google-Extended \
        AhrefsBot \
        Bytespider \
        Amazonbot \
        Diffbot \
        FacebookBot \
        cohere-ai \
        PerplexityBot \
        YouBot

    http-request set-log-level warning if is_ai_bot
    http-request deny status 403 if is_ai_bot

    # Serve robots.txt directly (HAProxy 2.4+)
    acl is_robots_txt path /robots.txt
    http-request return status 200 content-type "text/plain" \
        file /etc/haproxy/robots.txt if is_robots_txt

    # ACME challenge passthrough (if using Let's Encrypt)
    acl is_acme path_beg /.well-known/acme-challenge/
    use_backend acme_backend if is_acme

    default_backend app

backend app
    balance leastconn
    option httpchk GET /health
    http-response set-header X-Robots-Tag "noai, noimageai"
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    server app1 127.0.0.1:3000 check inter 5s rise 2 fall 3
    server app2 127.0.0.1:3001 check inter 5s rise 2 fall 3

backend acme_backend
    server acme 127.0.0.1:8080

# Stats page (disable in production or restrict to internal IPs)
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    acl local_net src 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
    http-request deny unless local_net

Docker deployment

docker-compose.yml

services:
  haproxy:
    image: haproxy:2.8-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
      - ./certs:/etc/haproxy/certs:ro
      - ./robots.txt:/etc/haproxy/robots.txt:ro
      - haproxy_run:/run/haproxy
    depends_on:
      - app
    restart: unless-stopped

  app:
    image: your-app:latest
    expose:
      - "3000"
    restart: unless-stopped

volumes:
  haproxy_run:
HAProxy version: Use HAProxy 2.8 LTS (or 2.6 LTS) for production. HAProxy 2.4+ is required for http-request return ... file. HAProxy 2.2+ is required for http-request return (inline string). Check your version with haproxy -v. Note that recent official Docker images run HAProxy in the foreground as a non-root user, so when mounting the full example config into the container, drop the chroot, user, group, and daemon lines, and point the backend servers at the Compose service name (server app1 app:3000 check) rather than 127.0.0.1.

Reload config without downtime

# Send SIGUSR2 to the master process for a graceful reload
# (requires master-worker mode, HAProxy 1.8+; the official Docker
# image runs in master-worker mode by default)
docker kill --signal=SIGUSR2 haproxy_container

# Or use the master CLI socket. Note that "reload" is a master CLI
# command, available on the socket started with -S, not on the stats socket
echo "reload" | socat stdio /run/haproxy/master.sock

FAQ

How do I block AI bots by User-Agent in HAProxy?

Define an ACL in the frontend block using req.hdr(User-Agent) -m sub -i for case-insensitive substring matching, then http-request deny status 403 if is_ai_bot. List multiple bot names space-separated on one ACL line (OR logic).

What is the difference between -m sub and -m reg in HAProxy ACLs?

-m sub does substring matching — efficient for simple bot name checks. -m reg uses regular expressions — more flexible but slower. For bot blocking, -m sub is preferred: list each bot name on the same ACL line (space-separated = OR) for fast matching without regex overhead.

How do I add X-Robots-Tag in HAProxy?

Use http-response set-header X-Robots-Tag "noai, noimageai" in the backend block. Backend placement applies only to proxied responses (not HAProxy error pages). Frontend placement applies to all responses including error pages.

Can HAProxy serve robots.txt directly without a backend?

Yes — use http-request return status 200 content-type "text/plain" file /etc/haproxy/robots.txt (HAProxy 2.4+) or an inline string (HAProxy 2.2+). Define an ACL for path /robots.txt and return before forwarding to the backend.

How do I rate-limit AI bots in HAProxy?

Use stick tables with http_req_rate(10s). Define a backend with just a stick-table (no servers), track requests per IP in the frontend with http-request track-sc0 src table rate_limit_table, then deny when sc_http_req_rate(0) gt 100.

Should I block AI bots in the frontend or backend in HAProxy?

Frontend — it stops the request before it reaches the backend, saving backend resources (connections, threads, DB queries). Use http-request deny in the frontend block. X-Robots-Tag headers go in the backend, so they only apply to proxied responses.
