
How to Block AI Bots on Lighttpd: Complete 2026 Guide

Lighttpd ("lighty") is a lightweight, event-driven web server optimised for high-concurrency, low-memory environments — popular on VPS instances, Raspberry Pi, and embedded Linux. Its module-based config uses conditional blocks that make bot blocking concise and readable.

Required modules

Lighttpd uses a module system. All modules must be listed in server.modules in lighttpd.conf before their directives can be used. The relevant modules for bot blocking are bundled with Lighttpd — no separate installation required:

server.modules = (
    "mod_access",      # url.access-deny — required for UA blocking
    "mod_setenv",      # setenv.add-response-header — required for X-Robots-Tag
    "mod_rewrite",     # url.rewrite-once — optional, for URL manipulation
    "mod_redirect",    # url.redirect — optional
    "mod_accesslog",   # access logging
    "mod_fastcgi",     # if using PHP/Python via FastCGI
    "mod_proxy",       # if using as reverse proxy
)
Order matters: Modules are loaded, and their request handlers run, in the order listed. Keep mod_access early in the list so access checks run before content handlers. If you also use mod_rewrite, list it before mod_access when rewrites should apply before access checks, or after when access checks should fire first.

User-Agent blocking with mod_access

Lighttpd uses conditional blocks ($HTTP["useragent"]) to match request headers. Inside a match, url.access-deny denies the request with a 403.

# lighttpd.conf
# Block AI training and scraping bots by User-Agent
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot" {
    url.access-deny = ("")
}
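Before touching lighttpd.conf, you can dry-run the alternation pattern outside the server: grep -E accepts the same alternation syntax, so a quick pipe shows which User-Agent strings would match. The sample UA strings below are illustrative, not captured traffic:

```shell
# Dry-run the blocklist pattern with grep -E (same alternation syntax
# as the lighttpd conditional). Sample UA strings are invented.
PATTERN='GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider'
printf '%s\n' \
  'Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)' \
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
  | grep -E "$PATTERN"
# Only the GPTBot line is printed; the Googlebot line does not match
```

Note that "Google-Extended" deliberately does not match "Googlebot/2.1", so ordinary Google search crawling is unaffected by this list.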
Regex matching: When Lighttpd is built with PCRE support (the default on most distributions), the =~ operator performs Perl-compatible regex matching, with alternatives separated by |. Matching is case-sensitive by default.

Case-insensitive matching

Lighttpd has no separate case-insensitive operator; instead, start the pattern with the PCRE inline flag (?i):

# (?i) flag makes the whole pattern case-insensitive
$HTTP["useragent"] =~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)" {
    url.access-deny = ("")
}
url.access-deny = (""): The empty string matches all URLs. This denies the entire request with a 403 Forbidden response. You can also deny specific paths: url.access-deny = ("/api/", "/admin/") — but for bot blocking, denying everything is the correct approach.

X-Robots-Tag with mod_setenv

Use mod_setenv to add response headers globally. The directive becomes available once the mod_setenv entry is present in server.modules:

# lighttpd.conf
setenv.add-response-header = (
    "X-Robots-Tag" => "noai, noimageai"
)

Multiple headers

setenv.add-response-header = (
    "X-Robots-Tag"           => "noai, noimageai",
    "X-Content-Type-Options" => "nosniff",
    "X-Frame-Options"        => "SAMEORIGIN",
    "Referrer-Policy"        => "strict-origin-when-cross-origin"
)
setenv.add-response-header vs setenv.set-response-header: add-response-header appends to any existing header with the same name (can create duplicates if the upstream also sets it). set-response-header (Lighttpd 1.4.46+) replaces the existing value. For X-Robots-Tag, prefer set-response-header if your backend might also set it.
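Once the header is configured and Lighttpd reloaded, a curl -sI against your host should show it. The snippet below greps a canned response so the match is reproducible; swap in the live curl line (example.com is a placeholder) for a real check:

```shell
# Canned response standing in for `curl -sI https://example.com/`
resp='HTTP/1.1 200 OK
X-Robots-Tag: noai, noimageai
Content-Type: text/html'
printf '%s\n' "$resp" | grep -i '^x-robots-tag:'
# Live check once deployed:
#   curl -sI https://example.com/ | grep -i '^x-robots-tag:'
```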

robots.txt as a static file

Place robots.txt in your document root (configured by server.document-root in lighttpd.conf). Lighttpd serves all static files from the document root by default — no additional configuration needed.

# lighttpd.conf
server.document-root = "/var/www/html"
# robots.txt goes at: /var/www/html/robots.txt
# /var/www/html/robots.txt
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: YouBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
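Maintaining a dozen near-identical stanzas by hand invites drift. A small shell loop can generate the file from a single bot list; the list, sitemap URL, and output path below are just examples:

```shell
# Generate robots.txt from one authoritative bot list (names as above)
BOTS='GPTBot ClaudeBot anthropic-ai CCBot Google-Extended AhrefsBot Bytespider
Amazonbot Diffbot FacebookBot cohere-ai PerplexityBot YouBot'
{
  printf 'User-agent: *\nAllow: /\n'
  for bot in $BOTS; do
    printf '\nUser-agent: %s\nDisallow: /\n' "$bot"
  done
  printf '\nSitemap: https://example.com/sitemap.xml\n'
} > robots.txt
```

Re-run it whenever the bot list changes, then copy the result into server.document-root.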

Conditional blocks — path-specific rules

Lighttpd's conditional blocks can be nested. Block AI bots only on specific paths (e.g. protect an API or blog while allowing crawling of marketing pages):

Block bots site-wide (recommended)

$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended" {
    url.access-deny = ("")
}

Block bots only under /blog/ and /docs/

$HTTP["url"] =~ "^/(blog|docs)/" {
    $HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended" {
        url.access-deny = ("")
    }
}

Allow a specific bot while blocking others

# Block all AI bots EXCEPT Googlebot
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|AhrefsBot|Bytespider" {
    $HTTP["useragent"] !~ "Googlebot" {
        url.access-deny = ("")
    }
}
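The block-except logic reduces to "matches the blocklist AND does not match the allowlist". It can be rehearsed in shell before editing the config; decide is a hypothetical helper and the UA strings are invented:

```shell
# Emulate the nested conditional: deny only if the UA matches the
# blocklist and does NOT also contain Googlebot
decide() {
  if printf '%s' "$1" | grep -qE 'GPTBot|ClaudeBot|CCBot|Bytespider' \
     && ! printf '%s' "$1" | grep -q 'Googlebot'; then
    echo 403
  else
    echo 200
  fi
}
decide 'Mozilla/5.0 (compatible; GPTBot/1.2)'      # prints 403
decide 'Mozilla/5.0 (compatible; Googlebot/2.1)'   # prints 200
```

Note the third case this logic implies: a UA that claims both GPTBot and Googlebot is allowed through, which is exactly what the !~ exception in the config does.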
Conditional operators:
  • =~ — matches regex (case-sensitive; prefix the pattern with (?i) for case-insensitive)
  • !~ — does not match regex
  • == — exact string match
  • != — not equal
Conditions can be nested.

Rate limiting options

Lighttpd does not have built-in request rate limiting like nginx or HAProxy. Options:

Option 1: Connection limiting (built-in)

# Cap total concurrent connections (server-wide, not per IP)
server.max-connections = 1024

# Per-IP connection cap via mod_evasive (add "mod_evasive" to server.modules)
$HTTP["remoteip"] !~ "^(127\.0\.0\.1|10\.)" {
    evasive.max-conns-per-ip = 20
}

Option 2: iptables rate limiting (OS level)

# Limit each IP to 60 new connections per minute on ports 80/443.
# (-m limit is a global counter; -m hashlimit with --hashlimit-mode srcip
#  tracks each source IP separately.)
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m hashlimit --hashlimit-mode srcip --hashlimit-above 60/min --hashlimit-burst 20 --hashlimit-name http-limit -j DROP
iptables -A INPUT -p tcp --dport 443 -m state --state NEW -m hashlimit --hashlimit-mode srcip --hashlimit-above 60/min --hashlimit-burst 20 --hashlimit-name https-limit -j DROP
# Repeat the same rules with ip6tables for IPv6

Option 3: fail2ban integration

# /etc/fail2ban/filter.d/lighttpd-bot.conf
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD) .* HTTP/.*" 403
ignoreregex =

# /etc/fail2ban/jail.local
[lighttpd-bot]
enabled  = true
port     = http,https
filter   = lighttpd-bot
logpath  = /var/log/lighttpd/access.log
maxretry = 10
findtime = 60
bantime  = 3600
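Before enabling the jail, check that the failregex actually fires on a 403 line in your access-log format. fail2ban ships fail2ban-regex for exactly this; a plain grep approximates it. The sample log line below follows the accesslog.format used in this guide and is invented, not captured:

```shell
# Sample 403 access-log line ("%h %V %u %t \"%r\" %>s %b ..." format, invented)
line='203.0.113.9 example.com - [12/Jan/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 403 345 "-" "GPTBot/1.2"'
printf '%s\n' "$line" | grep -qE '^[0-9a-f.:]+ .* "(GET|POST|HEAD) .* HTTP/.*" 403' && echo match
# The real tool, once the filter file is in place:
#   fail2ban-regex /var/log/lighttpd/access.log /etc/fail2ban/filter.d/lighttpd-bot.conf
```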

Full lighttpd.conf example

# /etc/lighttpd/lighttpd.conf

server.modules = (
    "mod_access",
    "mod_setenv",
    "mod_accesslog",
    "mod_rewrite",
    "mod_redirect",
    "mod_expire",     # required for expire.url below
    "mod_deflate",    # replaces mod_compress (deprecated since 1.4.56)
    "mod_fastcgi",
)

# Basic server config
server.document-root = "/var/www/html"
server.port          = 80
server.bind          = "0.0.0.0"
server.username      = "www-data"
server.groupname     = "www-data"
server.pid-file      = "/run/lighttpd.pid"
server.errorlog      = "/var/log/lighttpd/error.log"

# Access logging
accesslog.filename   = "/var/log/lighttpd/access.log"
accesslog.format     = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

# MIME types
mimetype.assign = (
    ".html"  => "text/html; charset=utf-8",
    ".css"   => "text/css",
    ".js"    => "application/javascript",
    ".json"  => "application/json",
    ".png"   => "image/png",
    ".jpg"   => "image/jpeg",
    ".svg"   => "image/svg+xml",
    ".woff2" => "font/woff2",
    ".txt"   => "text/plain",
    ".xml"   => "application/xml",
)

# Index files
index-file.names = ("index.html", "index.php")

# ── Bot blocking ─────────────────────────────────────────────────────────────

# Block AI training and scraping bots by User-Agent
$HTTP["useragent"] =~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)" {
    url.access-deny = ("")
}

# ── Response headers ─────────────────────────────────────────────────────────

setenv.add-response-header = (
    "X-Robots-Tag"           => "noai, noimageai",
    "X-Content-Type-Options" => "nosniff",
    "X-Frame-Options"        => "SAMEORIGIN",
    "Referrer-Policy"        => "strict-origin-when-cross-origin",
)

# ── Static file caching ───────────────────────────────────────────────────────

$HTTP["url"] =~ "\.(css|js|png|jpg|jpeg|gif|ico|woff2|svg)$" {
    expire.url = ( "" => "access plus 1 months" )
}

# ── HTTPS redirect (if handling SSL offload) ──────────────────────────────────
# Typically done at the load balancer/proxy level
# $HTTP["scheme"] == "http" {
#     url.redirect = ( "^/(.*)" => "https://example.com/$1" )
# }

Test config and reload

# Test config syntax
lighttpd -t -f /etc/lighttpd/lighttpd.conf

# Reload (graceful — no dropped connections)
systemctl reload lighttpd

# Or send HUP signal
kill -HUP $(cat /run/lighttpd.pid)

Docker deployment

docker-compose.yml

services:
  lighttpd:
    image: sebp/lighttpd:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./lighttpd.conf:/etc/lighttpd/lighttpd.conf:ro
      - ./html:/var/www/html:ro
      - ./ssl:/etc/lighttpd/ssl:ro
    restart: unless-stopped

Minimal Dockerfile

FROM debian:bookworm-slim

RUN apt-get update && apt-get install -y lighttpd && rm -rf /var/lib/apt/lists/*

COPY lighttpd.conf /etc/lighttpd/lighttpd.conf
COPY html/ /var/www/html/

EXPOSE 80

CMD ["lighttpd", "-D", "-f", "/etc/lighttpd/lighttpd.conf"]

FAQ

How do I block AI bots by User-Agent in Lighttpd?

Use mod_access with a $HTTP["useragent"] conditional block: $HTTP["useragent"] =~ "GPTBot|ClaudeBot|..." { url.access-deny = ("") }. The =~ operator does regex matching; prefix the pattern with (?i) for a case-insensitive match.

What modules do I need to block AI bots in Lighttpd?

mod_access for url.access-deny and mod_setenv for setenv.add-response-header. Both are bundled with Lighttpd — just add them to server.modules in lighttpd.conf.

How do I add X-Robots-Tag in Lighttpd?

setenv.add-response-header = ("X-Robots-Tag" => "noai, noimageai") after loading mod_setenv. Use set-response-header (1.4.46+) instead of add-response-header if your backend might also set it, to avoid duplicates.

How do I serve robots.txt in Lighttpd?

Place robots.txt in server.document-root. Lighttpd serves static files from the document root automatically — no extra config needed.

Does Lighttpd support rate limiting?

Not built-in for request rate limiting. Options: per-IP connection limiting with mod_evasive (evasive.max-conns-per-ip), OS-level iptables rate limiting, or fail2ban parsing access logs. For advanced rate limiting, put Cloudflare or nginx in front of Lighttpd.

Can I use conditional blocks to block bots on specific paths?

Yes — nest conditions: $HTTP["url"] =~ "^/blog/" { $HTTP["useragent"] =~ "GPTBot" { url.access-deny = ("") } }. Use !~ to invert a match (block everything except a pattern).
