How to Block AI Bots on Lighttpd: Complete 2026 Guide
Lighttpd ("lighty") is a lightweight, event-driven web server optimised for high-concurrency, low-memory environments — popular on VPS instances, Raspberry Pi, and embedded Linux. Its module-based config uses conditional blocks that make bot blocking concise and readable.
Required modules
Lighttpd uses a module system. All modules must be listed in server.modules in lighttpd.conf before their directives can be used. The relevant modules for bot blocking are bundled with Lighttpd — no separate installation required:
server.modules = (
    "mod_access",    # url.access-deny — required for UA blocking
    "mod_setenv",    # setenv.add-response-header — required for X-Robots-Tag
    "mod_rewrite",   # url.rewrite-once — optional, for URL manipulation
    "mod_redirect",  # url.redirect — optional
    "mod_accesslog", # access logging
    "mod_fastcgi",   # if using PHP/Python via FastCGI
    "mod_proxy",     # if using as reverse proxy
)

mod_access should be early in the list — it processes requests before they reach content handlers. If you use mod_rewrite, place it before mod_access if rewrites should happen before access checks, or after if access checks should fire first.

User-Agent blocking with mod_access
Lighttpd uses conditional blocks ($HTTP["useragent"]) to match request headers. Inside a match, url.access-deny denies the request with a 403.
# lighttpd.conf
# Block AI training and scraping bots by User-Agent
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot" {
    url.access-deny = ("")
}

The =~ operator performs PCRE regex matching; alternatives are separated by |. The match is case-sensitive by default — use =~* for case-insensitive matching (Lighttpd 1.4.46+), or add (?i) at the start of the pattern for older versions.

Case-insensitive matching (Lighttpd 1.4.46+)
# =~* operator for case-insensitive regex (Lighttpd 1.4.46+)
$HTTP["useragent"] =~* "gptbot|claudebot|anthropic-ai|ccbot|google-extended|ahrefsbot|bytespider|amazonbot|diffbot|facebookbot|cohere-ai|perplexitybot|youbot" {
    url.access-deny = ("")
}

Case-insensitive matching (older Lighttpd)
# (?i) flag for case-insensitive match (PCRE, older versions)
$HTTP["useragent"] =~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)" {
    url.access-deny = ("")
}

url.access-deny matches URL suffixes, so it can also be used to deny specific endings (e.g. ("~", ".inc") for backup and include files). For bot blocking, denying everything with ("") is the correct approach.

X-Robots-Tag with mod_setenv
Use mod_setenv to add response headers globally. Place this after the mod_setenv entry in server.modules:
# lighttpd.conf
setenv.add-response-header = (
"X-Robots-Tag" => "noai, noimageai"
)Multiple headers
setenv.add-response-header = (
"X-Robots-Tag" => "noai, noimageai",
"X-Content-Type-Options" => "nosniff",
"X-Frame-Options" => "SAMEORIGIN",
"Referrer-Policy" => "strict-origin-when-cross-origin"
)add-response-header appends to any existing header with the same name (can create duplicates if the upstream also sets it). set-response-header (Lighttpd 1.4.46+) replaces the existing value. For X-Robots-Tag, prefer set-response-header if your backend might also set it.robots.txt as a static file
Place robots.txt in your document root (configured by server.document-root in lighttpd.conf). Lighttpd serves all static files from the document root by default — no additional configuration needed.
# lighttpd.conf
server.document-root = "/var/www/html"
# robots.txt goes at: /var/www/html/robots.txt

# /var/www/html/robots.txt
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: YouBot
Disallow: /
Sitemap: https://example.com/sitemap.xml

Conditional blocks — path-specific rules
Lighttpd's conditional blocks can be nested. Block AI bots only on specific paths (e.g. protect an API or blog while allowing crawling of marketing pages):
Block bots site-wide (recommended)
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended" {
    url.access-deny = ("")
}

Block bots only under /blog/ and /docs/
$HTTP["url"] =~ "^/(blog|docs)/" {
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended" {
url.access-deny = ("")
}
}Allow a specific bot while blocking others
# Block all AI bots EXCEPT Googlebot
$HTTP["useragent"] =~ "GPTBot|ClaudeBot|anthropic-ai|CCBot|AhrefsBot|Bytespider" {
$HTTP["useragent"] !~ "Googlebot" {
url.access-deny = ("")
}
}=~— matches regex (case-sensitive)=~*— matches regex (case-insensitive, 1.4.46+)!~— does not match regex==— exact string match!=— not equal
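Before deploying a pattern, you can sanity-check the two matching modes with ordinary PCRE-style regexes. Python's re module stands in for Lighttpd's matcher here (an approximation — verify against the running server), and the sample User-Agent strings are made up for illustration:

```python
import re

# The same alternation used in the lighttpd.conf examples above
pattern = "GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended"

# Case-sensitive, like the plain =~ operator
case_sensitive = re.compile(pattern)
# Case-insensitive, like =~* or a (?i)-prefixed pattern
case_insensitive = re.compile(pattern, re.IGNORECASE)

user_agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; gptbot/1.0)",  # lower-cased variant
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/125.0",
]

for ua in user_agents:
    print(ua,
          "| sensitive:", bool(case_sensitive.search(ua)),
          "| insensitive:", bool(case_insensitive.search(ua)))

# The lower-cased "gptbot" slips past the case-sensitive pattern
# but is caught by the case-insensitive one.
```

This is why =~* (or the (?i) prefix) is the safer default: bots do not always use the exact casing published in their documentation.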
Rate limiting options
Lighttpd does not have built-in request rate limiting like nginx or HAProxy. Options:
Option 1: Connection limiting (built-in)
# Limit total concurrent connections (server-wide)
server.max-connections = 1024
# Per-IP concurrent connection limit via mod_evasive
# (add "mod_evasive" to server.modules)
$HTTP["remoteip"] !~ "^(127\.0\.0\.1|10\.)" {
    evasive.max-conns-per-ip = 20
}

Option 2: iptables rate limiting (OS level)
# Limit each IP to 60 new connections per minute to port 80/443
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -m limit --limit 60/min --limit-burst 20 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -m state --state NEW -j DROP
ip6tables -A INPUT -p tcp --dport 443 -m state --state NEW -m limit --limit 60/min --limit-burst 20 -j ACCEPT

Option 3: fail2ban integration
# /etc/fail2ban/filter.d/lighttpd-bot.conf
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD) .* HTTP/.*" 403
ignoreregex =
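fail2ban substitutes <HOST> with its own host-matching group; you can approximate that locally to confirm the filter matches denied-bot log lines before enabling the jail. In this Python sketch the IPv4 group standing in for <HOST> and the sample log line are both illustrative assumptions (the line follows the accesslog.format used later in this guide):

```python
import re

# The failregex from filter.d/lighttpd-bot.conf, with <HOST>
# replaced by a simple IPv4 capture group for local testing
failregex = r'^(?P<host>\d{1,3}(?:\.\d{1,3}){3}) .* "(GET|POST|HEAD) .* HTTP/.*" 403'

# A hypothetical access-log line for a denied GPTBot request
line = ('203.0.113.7 example.com - [10/May/2026:12:00:00 +0000] '
        '"GET /blog/post HTTP/1.1" 403 345 "-" "GPTBot/1.0"')

m = re.search(failregex, line)
print("matched host:", m.group("host") if m else None)

# A 200 response from a regular browser must NOT match,
# or fail2ban would ban legitimate visitors
ok_line = line.replace(" 403 ", " 200 ")
print("200 line matches:", bool(re.search(failregex, ok_line)))
```

fail2ban ships its own tester for the real filter file: fail2ban-regex /var/log/lighttpd/access.log /etc/fail2ban/filter.d/lighttpd-bot.conf.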
# /etc/fail2ban/jail.local
[lighttpd-bot]
enabled = true
port = http,https
filter = lighttpd-bot
logpath = /var/log/lighttpd/access.log
maxretry = 10
findtime = 60
bantime = 3600

Full lighttpd.conf example
# /etc/lighttpd/lighttpd.conf
server.modules = (
"mod_access",
"mod_setenv",
"mod_accesslog",
"mod_rewrite",
"mod_redirect",
"mod_compress",
"mod_fastcgi",
)
# Basic server config
server.document-root = "/var/www/html"
server.port = 80
server.bind = "0.0.0.0"
server.username = "www-data"
server.groupname = "www-data"
server.pid-file = "/run/lighttpd.pid"
server.errorlog = "/var/log/lighttpd/error.log"
# Access logging
accesslog.filename = "/var/log/lighttpd/access.log"
accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
# MIME types
mimetype.assign = (
    ".html" => "text/html; charset=utf-8",
    ".css" => "text/css",
    ".js" => "application/javascript",
    ".json" => "application/json",
    ".png" => "image/png",
    ".jpg" => "image/jpeg",
    ".svg" => "image/svg+xml",
    ".woff2" => "font/woff2",
    ".txt" => "text/plain",
    ".xml" => "application/xml",
)
# Index files
index-file.names = ("index.html", "index.php")
# ── Bot blocking ─────────────────────────────────────────────────────────────
# Block AI training and scraping bots by User-Agent
$HTTP["useragent"] =~ "(?i)(GPTBot|ClaudeBot|anthropic-ai|CCBot|Google-Extended|AhrefsBot|Bytespider|Amazonbot|Diffbot|FacebookBot|cohere-ai|PerplexityBot|YouBot)" {
    url.access-deny = ("")
}
# ── Response headers ─────────────────────────────────────────────────────────
setenv.add-response-header = (
"X-Robots-Tag" => "noai, noimageai",
"X-Content-Type-Options" => "nosniff",
"X-Frame-Options" => "SAMEORIGIN",
"Referrer-Policy" => "strict-origin-when-cross-origin",
)
# ── Static file caching ───────────────────────────────────────────────────────
$HTTP["url"] =~ ".(css|js|png|jpg|jpeg|gif|ico|woff2|svg)$" {
expire.url = ( "" => "access plus 1 months" )
}
# ── HTTPS redirect (if handling SSL offload) ──────────────────────────────────
# Typically done at the load balancer/proxy level
# $HTTP["scheme"] == "http" {
# url.redirect = ( "^/(.*)" => "https://example.com/$1" )
# }

Test config and reload
# Test config syntax (-t); -tt performs a fuller pre-flight check
lighttpd -t -f /etc/lighttpd/lighttpd.conf
# Apply config changes (lighttpd does not re-read its config on SIGHUP)
systemctl restart lighttpd
# Or graceful restart via signal (Lighttpd 1.4.46+); SIGHUP only re-opens log files
kill -USR1 $(cat /run/lighttpd.pid)

Docker deployment
docker-compose.yml
services:
  lighttpd:
    image: sebp/lighttpd:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./lighttpd.conf:/etc/lighttpd/lighttpd.conf:ro
      - ./html:/var/www/html:ro
      - ./ssl:/etc/lighttpd/ssl:ro
    restart: unless-stopped

Minimal Dockerfile
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y lighttpd && rm -rf /var/lib/apt/lists/*
COPY lighttpd.conf /etc/lighttpd/lighttpd.conf
COPY html/ /var/www/html/
EXPOSE 80
CMD ["lighttpd", "-D", "-f", "/etc/lighttpd/lighttpd.conf"]

FAQ
How do I block AI bots by User-Agent in Lighttpd?
Use mod_access with a $HTTP["useragent"] conditional block: $HTTP["useragent"] =~ "GPTBot|ClaudeBot|..." { url.access-deny = ("") }. The =~ operator does regex matching; =~* is case-insensitive (Lighttpd 1.4.46+).
What modules do I need to block AI bots in Lighttpd?
mod_access for url.access-deny and mod_setenv for setenv.add-response-header. Both are bundled with Lighttpd — just add them to server.modules in lighttpd.conf.
How do I add X-Robots-Tag in Lighttpd?
setenv.add-response-header = ("X-Robots-Tag" => "noai, noimageai") after loading mod_setenv. Use set-response-header (1.4.46+) instead of add-response-header if your backend might also set it, to avoid duplicates.
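The append-vs-replace difference can be sketched with a minimal header map. This is an illustration of the semantics only, not of Lighttpd's implementation; the helper names are invented for the example:

```python
def add_response_header(headers, name, value):
    # add-response-header semantics: append, may produce duplicates
    headers.setdefault(name, []).append(value)

def set_response_header(headers, name, value):
    # set-response-header semantics: replace any existing value
    headers[name] = [value]

# Suppose a PHP backend already emitted X-Robots-Tag: noindex
headers = {"X-Robots-Tag": ["noindex"]}
add_response_header(headers, "X-Robots-Tag", "noai, noimageai")
print(headers["X-Robots-Tag"])  # two values: the response carries duplicates

headers = {"X-Robots-Tag": ["noindex"]}
set_response_header(headers, "X-Robots-Tag", "noai, noimageai")
print(headers["X-Robots-Tag"])  # one value: the backend's header is replaced
```

Note that replacing discards the backend's noindex; if you need both directives, combine them into one value yourself.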
How do I serve robots.txt in Lighttpd?
Place robots.txt in server.document-root. Lighttpd serves static files from the document root automatically — no extra config needed.
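Compliant crawlers read each per-agent group independently, falling back to the * group when no dedicated group matches. Python's standard urllib.robotparser can preview how a given bot will interpret your rules (a local sketch with inline rules, not a fetch from a live server):

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the robots.txt from this guide
robots_txt = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# GPTBot has its own group and is fully disallowed
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
# Agents without a dedicated group fall back to the * rules
print(parser.can_fetch("SomeOtherBot", "https://example.com/blog/post"))  # True
```

Remember that robots.txt is advisory: the mod_access rules above are what actually enforce the block for bots that ignore it.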
Does Lighttpd support rate limiting?
Not built-in for request rate limiting. Options: connection limiting with connection.limit (per-IP, 1.4.46+), OS-level iptables rate limiting, or fail2ban parsing access logs. For advanced rate limiting, put Cloudflare or nginx in front of Lighttpd.
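The iptables rule in Option 2 (--limit 60/min --limit-burst 20) implements a token bucket: a burst allowance of 20 that refills at one token per second. A small Python simulation makes the semantics concrete; this illustrates the rate-limiting model, not netfilter's internals:

```python
class TokenBucket:
    """Token bucket mirroring iptables: --limit 60/min --limit-burst 20."""

    def __init__(self, rate_per_sec=1.0, burst=20):
        self.rate = rate_per_sec    # refill rate: 60/min = 1 token/sec
        self.burst = burst          # bucket capacity (--limit-burst)
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # -j ACCEPT
        return False      # falls through to the DROP rule

bucket = TokenBucket()
# 30 connection attempts within the same second: the first 20 pass
# (the burst), the rest are dropped until tokens refill.
results = [bucket.allow(now=0.0) for _ in range(30)]
print(results.count(True), "accepted,", results.count(False), "dropped")
```

The burst absorbs short spikes from legitimate visitors while sustained scraping settles to the steady 60/min rate.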
Can I use conditional blocks to block bots on specific paths?
Yes — nest conditions: $HTTP["url"] =~ "^/blog/" { $HTTP["useragent"] =~ "GPTBot" { url.access-deny = ("") } }. Use !~ to invert a match (block everything except a pattern).