
How to Block AI Bots on Nginx: Complete 2026 Guide

Nginx sits in front of everything — it's the first layer your traffic touches, which makes it the most powerful place to block AI crawlers. Whether you're serving a static site directly or running nginx as a reverse proxy in front of Node, Python, or PHP, the bot-blocking config is the same: a map block in your http {} context, a return 403 before the request reaches your origin, and add_header X-Robots-Tag on all responses.

The map block must go in http {} — not server {}

The most common nginx bot-blocking mistake: placing the map directive inside a server or location block. Nginx will refuse to start with a config error. The map directive belongs in the http {} context, defined once globally, then used inside any number of server blocks.
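To make the placement concrete, here is a minimal sketch of the wrong and the right location (a stripped-down config, not a full working server):

```nginx
# WRONG — map inside server {}. nginx -t fails with an error like:
#   nginx: [emerg] "map" directive is not allowed here
# server {
#     map $http_user_agent $bad_bot { default 0; ~*GPTBot 1; }
# }

# RIGHT — map at http {} level, referenced from any server block
http {
    map $http_user_agent $bad_bot {
        default  0;
        ~*GPTBot 1;
    }

    server {
        location / {
            if ($bad_bot) { return 403; }
        }
    }
}
```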

Methods at a glance

Method                     What it does                             Where it lives
robots.txt location block  Signals bots which paths are off-limits  Webroot / root directive
map + if ($bad_bot)        Hard 403 on known AI User-Agents         http {} then server {}
add_header X-Robots-Tag    noai header on all HTTP responses        server {} or location {}
noai <meta> tag            AI training opt-out per HTML page        HTML files / layout template
limit_req_zone             Rate-limit to slow bot scraping          http {} then location {}
geo block                  IP-range (CIDR) blocking                 http {} context

1. robots.txt — location block

Nginx serves files from the directory set by the root directive (e.g. /var/www/html). Place robots.txt in that directory, then add a dedicated location block so nginx handles it cleanly — no PHP, no upstream, no access log noise.

# nginx server block
server {
    listen 443 ssl;
    server_name example.com;
    root /var/www/html;

    # Exact-match location for robots.txt — fastest evaluation
    location = /robots.txt {
        try_files $uri =404;
        access_log  off;       # don't pollute access logs
        log_not_found off;     # don't log 404 if absent
        expires     1d;
        add_header  Cache-Control "public, max-age=86400";
    }
}

Your robots.txt should explicitly disallow AI training crawlers:

# /var/www/html/robots.txt

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

robots.txt is advisory — compliant bots will respect it, aggressive scrapers will not. Use the map block below for hard enforcement.

2. Hard 403 blocking — map block

The map directive matches $http_user_agent against a list of patterns and sets a variable. Nginx evaluates map lazily — only when the variable is first used — so it adds no overhead for normal requests. The map block must be inside http {}, not inside server {} or location {}.

# /etc/nginx/nginx.conf  (or included .conf in http block)

http {
    # ── AI bot User-Agent map ──────────────────────────────────────────
    # Must be inside http {}, NOT inside server {} or location {}
    map $http_user_agent $bad_bot {
        default          0;       # allow everything by default
        ~*GPTBot         1;
        ~*ChatGPT-User   1;
        ~*ClaudeBot      1;
        ~*Claude-Web     1;
        ~*anthropic-ai   1;
        ~*CCBot          1;
        ~*Google-Extended 1;
        ~*PerplexityBot  1;
        ~*Amazonbot      1;
        ~*Bytespider     1;
        ~*YouBot         1;
        ~*Applebot       1;
        ~*DuckAssistBot  1;
        ~*meta-externalagent 1;
        ~*MistralAI-Spider 1;
        ~*oai-searchbot  1;
    }

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}

Inside each server block, check $bad_bot and return 403. Always exempt /robots.txt so compliant bots can still read your directives:

server {
    listen 443 ssl;
    server_name example.com;
    root /var/www/html;

    # ── robots.txt — exempt from bot blocking ──────────────────────────
    location = /robots.txt {
        try_files $uri =404;
        access_log off;
        log_not_found off;
    }

    # ── Block known AI bots ────────────────────────────────────────────
    # "if" is safe here — we're only returning a status, not using
    # proxy_pass, rewrite, or other directives that interact poorly with if
    location / {
        if ($bad_bot) {
            return 403 "Forbidden";
        }

        # ... your normal config (try_files, proxy_pass, etc.)
        try_files $uri $uri/ /index.html;
    }
}

On using if in nginx

Nginx docs warn against if because it interacts badly with proxy_pass and rewrite. For a pure return 403 with no other directives in the same block, if is safe and correct. For IP-range blocking you can avoid if entirely with plain allow/deny directives, or match CIDR ranges against a geo block.
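For completeness, here is a geo sketch for IP-range blocking. The CIDR ranges below are documentation placeholders (TEST-NET ranges), not real bot networks — substitute the ranges you actually want to block. The if here is the same safe bare-return pattern:

```nginx
# http {} context — map client IPs to a flag
geo $blocked_ip {
    default         0;
    203.0.113.0/24  1;   # placeholder range (TEST-NET-3)
    198.51.100.0/24 1;   # placeholder range (TEST-NET-2)
}

# server {} context — same bare-return pattern as $bad_bot
server {
    listen 443 ssl;
    server_name example.com;

    location / {
        if ($blocked_ip) { return 403; }
        try_files $uri $uri/ /index.html;
    }
}
```

Unlike map on $remote_addr, geo understands CIDR notation natively, so you don't need regex gymnastics to match address ranges.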

3. noai meta tag — static HTML

Nginx does not modify HTML content — it serves files as-is. For a static site, add the noai meta tag directly to every HTML file, or (better) to your base layout template in your SSG of choice (Hugo, Eleventy, Jekyll, Astro).

<!-- In your HTML <head> -->
<meta name="robots" content="noai, noimageai">

<!-- Or combined with other directives: -->
<meta name="robots" content="index, follow, noai, noimageai">

For SSG base layout templates:

<!-- Hugo: layouts/_default/baseof.html -->
<head>
  <meta name="robots" content="{{ with .Params.robots }}{{ . }}{{ else }}noai, noimageai{{ end }}">
</head>

<!-- Eleventy: _includes/base.njk -->
<head>
  <meta name="robots" content="{{ robots | default('noai, noimageai') }}">
</head>

<!-- Jekyll: _layouts/default.html -->
<head>
  <meta name="robots" content="{{ page.robots | default: 'noai, noimageai' }}">
</head>

The HTTP-layer equivalent is X-Robots-Tag (Section 4) — set via nginx add_header, no HTML changes needed.

4. X-Robots-Tag — add_header

X-Robots-Tag is the HTTP-header equivalent of the noai meta tag — useful for non-HTML resources (PDFs, images, API responses) and for sites where you can't easily modify HTML. The always keyword is critical: without it nginx only sends the header on 2xx/3xx responses.

server {
    listen 443 ssl;
    server_name example.com;

    # Add X-Robots-Tag to ALL responses (including 4xx/5xx)
    # "always" is required — without it, header is only sent on 2xx/3xx
    add_header X-Robots-Tag "noai, noimageai" always;

    # For HTML pages only (skip on API/JSON endpoints):
    location ~* \.html$ {
        add_header X-Robots-Tag "noai, noimageai" always;
        try_files $uri =404;
    }
}

add_header inheritance gotcha

In nginx, if a block defines any add_header directive, it replaces (not appends to) all inherited add_header directives from parent blocks. If your location blocks already have add_header directives (e.g. CORS headers), repeat the X-Robots-Tag header in those blocks too, or use ngx_http_headers_more_module (more_set_headers) which appends instead.
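A minimal sketch of the gotcha (the /api/ path and upstream port are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name example.com;

    # Set at server level — inherited by locations WITHOUT their own add_header
    add_header X-Robots-Tag "noai, noimageai" always;

    location /api/ {
        # This block defines its own add_header, which silently DROPS the
        # inherited X-Robots-Tag — so it must be repeated here:
        add_header Access-Control-Allow-Origin "*" always;
        add_header X-Robots-Tag "noai, noimageai" always;
        proxy_pass http://127.0.0.1:3000;
    }

    location / {
        # No add_header here — the server-level X-Robots-Tag is inherited
        try_files $uri $uri/ /index.html;
    }
}
```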

5. Rate limiting — limit_req_zone

Rate limiting catches scrapers that rotate User-Agents or use unknown bot strings. The limit_req_zone directive lives in http {}; the limit_req directive applies it inside location {}.

# In http {} block:

# 10 MB zone keyed by client IP, max 10 requests/second
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;

# Stricter zone for paths that bots love to hammer
limit_req_zone $binary_remote_addr zone=content:10m rate=2r/s;

# Apply in server block:
server {
    listen 443 ssl;

    location / {
        limit_req zone=general burst=20 nodelay;
        limit_req_status 429;  # Return 429 Too Many Requests

        if ($bad_bot) { return 403; }
        try_files $uri $uri/ /index.html;
    }

    # Stricter limit on content-heavy paths
    location /blog {
        limit_req zone=content burst=5 nodelay;
        limit_req_status 429;
        try_files $uri $uri/ =404;
    }
}

burst allows short traffic spikes above the rate; nodelay processes burst requests immediately (vs queuing them). Without limit_req_status, nginx returns 503 — set it to 429 for correct semantics.

6. Reverse proxy setup

When nginx fronts a Node, Python, PHP, or other upstream server, the bot check fires before proxy_pass — blocked requests never reach your origin. This is the most effective architecture for high-traffic sites: nginx handles the rejection at near-zero cost.

server {
    listen 443 ssl;
    server_name example.com;

    # Headers for upstream to identify real client IP
    proxy_set_header Host              $host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;

    add_header X-Robots-Tag "noai, noimageai" always;

    location = /robots.txt {
        root /var/www/html;
        try_files $uri =404;
        access_log off;
    }

    location / {
        # Bot check fires BEFORE proxy_pass — blocked bots never reach origin
        if ($bad_bot) {
            return 403 "Forbidden";
        }

        proxy_pass         http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header   Upgrade    $http_upgrade;
        proxy_set_header   Connection "upgrade";
        proxy_read_timeout 60s;
    }
}

7. Full nginx.conf example

A complete production-ready config combining all techniques above — map block in http {}, 403 blocking, robots.txt, X-Robots-Tag, and rate limiting. Works for both static sites and reverse proxy setups.

# /etc/nginx/nginx.conf

user  nginx;
worker_processes  auto;
error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;
    sendfile      on;
    keepalive_timeout 65;

    # ── AI bot User-Agent map ──────────────────────────────────────────
    # MUST be in http {} — not in server {} or location {}
    map $http_user_agent $bad_bot {
        default              0;
        ~*GPTBot             1;
        ~*ChatGPT-User       1;
        ~*ClaudeBot          1;
        ~*Claude-Web         1;
        ~*anthropic-ai       1;
        ~*CCBot              1;
        ~*Google-Extended    1;
        ~*PerplexityBot      1;
        ~*Amazonbot          1;
        ~*Bytespider         1;
        ~*YouBot             1;
        ~*Applebot           1;
        ~*DuckAssistBot      1;
        ~*meta-externalagent 1;
        ~*MistralAI-Spider   1;
        ~*oai-searchbot      1;
    }

    # ── Rate limiting zones ─────────────────────────────────────────────
    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;

    # ── Redirect HTTP → HTTPS ───────────────────────────────────────────
    server {
        listen 80;
        server_name example.com www.example.com;
        return 301 https://example.com$request_uri;
    }

    # ── Main HTTPS server ───────────────────────────────────────────────
    server {
        listen 443 ssl http2;
        server_name example.com;
        root /var/www/html;

        ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
        ssl_protocols       TLSv1.2 TLSv1.3;

        # X-Robots-Tag on all responses
        add_header X-Robots-Tag "noai, noimageai" always;

        # ── robots.txt ──────────────────────────────────────────────────
        location = /robots.txt {
            try_files $uri =404;
            access_log    off;
            log_not_found off;
            expires       1d;
        }

        # ── All other requests ───────────────────────────────────────────
        location / {
            limit_req zone=general burst=20 nodelay;
            limit_req_status 429;

            # Block known AI bots — fires before proxy_pass / try_files
            if ($bad_bot) {
                return 403 "Forbidden";
            }

            # Static site:
            try_files $uri $uri/ /index.html;

            # Reverse proxy (comment out try_files, uncomment these):
            # proxy_pass         http://127.0.0.1:3000;
            # proxy_http_version 1.1;
            # proxy_set_header   Host              $host;
            # proxy_set_header   X-Real-IP         $remote_addr;
            # proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
            # proxy_set_header   X-Forwarded-Proto $scheme;
        }
    }
}

8. Docker deployment

Mount your nginx config and webroot as volumes, or bake them into the image for immutable deployments. The official nginx:alpine image is the standard choice — ~25 MB.

# Dockerfile — baked config (immutable, good for CI/CD)
FROM nginx:alpine

# Remove default config
RUN rm /etc/nginx/conf.d/default.conf

# Copy your config and webroot
COPY nginx.conf /etc/nginx/nginx.conf
COPY dist/       /var/www/html/

EXPOSE 80 443
CMD ["nginx", "-g", "daemon off;"]

# docker-compose.yml — volume-mounted config (easier to update)
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./dist:/var/www/html:ro
      - ./certs:/etc/letsencrypt:ro
    restart: unless-stopped

Test your config before reloading — always:

# Inside the container:
nginx -t                  # test config syntax
nginx -s reload           # reload without downtime

# From host:
docker exec nginx-container nginx -t
docker exec nginx-container nginx -s reload

9. Ubuntu / Debian setup

On a bare-metal or VPS server, split your config across /etc/nginx/conf.d/ files for clarity. Keep the map block in a dedicated file included from the main nginx.conf.

# Install nginx
sudo apt update && sudo apt install -y nginx

# Create the bot map config (included from nginx.conf http block)
sudo tee /etc/nginx/conf.d/bot-map.conf > /dev/null <<'EOF'
map $http_user_agent $bad_bot {
    default              0;
    ~*GPTBot             1;
    ~*ClaudeBot          1;
    ~*CCBot              1;
    ~*Google-Extended    1;
    ~*PerplexityBot      1;
    ~*Amazonbot          1;
    ~*Bytespider         1;
}
EOF

# Create site config
sudo tee /etc/nginx/sites-available/example.com > /dev/null <<'EOF'
server {
    listen 80;
    server_name example.com;
    root /var/www/html/example.com;
    add_header X-Robots-Tag "noai, noimageai" always;

    location = /robots.txt {
        try_files $uri =404;
        access_log off;
    }

    location / {
        if ($bad_bot) { return 403; }
        try_files $uri $uri/ /index.html;
    }
}
EOF

# Enable site
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled/

# Test and reload
sudo nginx -t && sudo systemctl reload nginx
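
Whichever deployment you use, verify the block from the outside with curl (example.com stands in for your domain; the expected statuses assume the config above):

```shell
# Spoof a blocked bot User-Agent — the map should return 403
curl -sI -A "GPTBot" https://example.com/ | head -n 1

# Normal browser User-Agent — should return 200 with the noai header
curl -sI -A "Mozilla/5.0" https://example.com/ | grep -iE '^HTTP|^x-robots-tag'

# robots.txt must stay reachable even for blocked bots
curl -s -A "GPTBot" https://example.com/robots.txt | head -n 3
```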

Frequently asked questions

Where does the map block go in nginx.conf?

Inside http {} — not inside server {} or location {}. A map directive at the server or location level causes an nginx config error on startup. Keep it in /etc/nginx/conf.d/bot-map.conf (included from the http block) for clean organisation.

Is "if ($bad_bot)" safe to use in nginx?

Yes, when used only to return a status code. Nginx if is dangerous when combined with proxy_pass, rewrite, or set — not for a plain return 403. For IP-based blocking you can avoid if entirely with allow/deny directives, or match CIDR ranges with a geo block.

Does add_header X-Robots-Tag work for error responses?

Only with the always keyword: add_header X-Robots-Tag "noai, noimageai" always. Without always, nginx sends the header only on 2xx and 3xx responses — 4xx and 5xx responses omit it. Also remember the inheritance rule: a child block with any add_header replaces all inherited ones.

How do I block bots on nginx without if?

For IP-based blocking, plain allow/deny directives need no if at all, and a geo block (in http {}) can map CIDR ranges to a variable. For User-Agent-based blocking there is no direct if-free equivalent of a hard 403, though you can map the User-Agent to a limit_req key that is empty for normal clients, so only bots get rate-limited. In practice, the simple if ($bad_bot) { return 403; } pattern is safe for this use case.

Does nginx bot blocking work as a reverse proxy?

Yes — and it's the most effective placement. The if ($bad_bot) { return 403; } check fires before proxy_pass, so blocked bots never reach your Node/Python/PHP upstream. This reduces origin load and protects your app server from bot traffic at near-zero nginx overhead.

How do I add noai meta tags on a static site served by nginx?

Nginx serves HTML files as-is — it doesn't inject content. Add <meta name="robots" content="noai, noimageai"> to the <head> of your HTML files, or to the base layout in your SSG (Hugo, Eleventy, Jekyll). The HTTP-layer alternative is add_header X-Robots-Tag — no HTML edits needed.

Is your site protected from AI bots?

Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.