AI Bot Guides
Practical, copy-paste-ready guides for controlling AI crawlers, optimising for AI search, and protecting your content.
robots.txt for AI Bots: The Complete 2026 Guide
Control GPTBot, ClaudeBot, PerplexityBot, Bytespider, and 51+ crawlers. Ready-to-use configs, per-bot reference table, and the 5 mistakes that break your SEO.
noai & noimageai: Block AI Training with Meta Tags
Opt out of AI training on a per-page basis without touching robots.txt. HTML meta tag and X-Robots-Tag examples for every server stack, plus CMS quick guides.
llms.txt: The Complete Guide for 2026
The emerging standard that tells AI assistants exactly what your site is about. Full spec, copy-paste templates for Next.js, static sites, and WordPress, plus AI adoption table.
How AI Search Engines Decide What to Surface (2026)
What PerplexityBot, OAI-SearchBot, Google-Extended, and Claude look for when choosing which pages to feature in AI answers. The 7 signals that matter, with a full optimisation checklist.
Blocking Bytespider: Why robots.txt Isn't Enough
ByteDance's Bytespider crawler has been documented ignoring robots.txt. Here's how to block it at the server level: nginx, Cloudflare WAF, Vercel, Apache, and Next.js middleware.
AI Readiness Score: What It Measures and How to Improve It
A breakdown of all 6 scoring categories, every check, grade thresholds, and the fastest path from a D to an A — based on the actual Open Shadow scanner methodology.
How to Block Google-Extended: Stop Gemini AI Training
Google-Extended is Google's robots.txt control token for Gemini (formerly Bard) AI training; it is not a separate crawler with its own user agent string. Block it in robots.txt without touching your Search rankings, with verification steps and Next.js config.
How to Block GPTBot: Stop OpenAI Training on Your Site
GPTBot is OpenAI's training crawler for GPT-4 and beyond. Block it in robots.txt in 60 seconds — plus the critical difference between GPTBot, ChatGPT-User, and OAI-SearchBot.
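As a rough illustration of why the three OpenAI tokens need separate rules, the sketch below checks a hypothetical robots.txt with Python's standard `urllib.robotparser`. The robots.txt content, the `check_agents` helper, and the example URL are assumptions made for illustration, and the standard-library parser's matching may differ from how each crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler only,
# leave every other user agent unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def check_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = check_agents(
        ROBOTS_TXT,
        ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
        "https://example.com/article",  # placeholder URL
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumptions only GPTBot is blocked; ChatGPT-User and OAI-SearchBot fall through to the wildcard group, so each needs its own `User-agent` group if you want to block it too.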
How to Block ClaudeBot: Stop Anthropic Training on Your Site
ClaudeBot is Anthropic's training crawler for Claude models. Block both ClaudeBot and anthropic-ai tokens — plus how to request removal of already-crawled content.
How to Block PerplexityBot: Scraping Controversy Explained
PerplexityBot was at the centre of a 2024 robots.txt controversy. Block both PerplexityBot and perplexity-user — and understand the visibility tradeoff before you do.
How to Block meta-externalagent: Stop Meta Training Llama on Your Site
Meta runs two crawlers — one for link previews, one for AI training. Most guides only cover the preview bot. Here's how to block the one that trains Llama without breaking your Facebook shares.
How to Block CCBot: One Rule That Stops 50+ AI Models
Common Crawl's CCBot feeds training data to GPT, Gemini, Llama, Mistral, Falcon, and most open-source LLMs. Blocking it is the highest-leverage AI training opt-out you can make. One robots.txt line.
Bingbot & Microsoft Copilot: Control What Copilot Knows About Your Site
Copilot draws from Bing's index — and so does ChatGPT Search. Blocking Bingbot removes you from both, but also kills your Bing Search traffic. Here's the full pipeline and the right call for your situation.
How to Block ChatGPT-User: Stop Real-Time Browsing on Your Site
ChatGPT-User isn't a crawler — it fires when a user explicitly asks ChatGPT to read a URL. Blocking it stops on-demand page reads. Essential for paywalled publishers. Zero effect on training or search indexing.
How to Block OAI-SearchBot: Control Your ChatGPT Search Presence
OAI-SearchBot indexes your site for ChatGPT Search — it's NOT the training crawler. Blocking it removes you from ChatGPT Search results. Here's the three-bot breakdown and when each block makes sense.
How to Block Applebot-Extended: Stop Apple Intelligence Training
Applebot-Extended is Apple's AI training crawler — separate from the Applebot that powers Siri and Spotlight. Block AI training without losing your Spotlight or App Store presence.
How to Block Diffbot: The AI Data Broker Feeding Llama & Mistral
Diffbot isn't a search engine — it's a commercial data broker that crawls your site and sells structured content to AI companies. One block cuts supply to Meta Llama, Mistral, DiffbotLLM, and more.
How to Block xAI-Bot: Stop Grok from Training on Your Site
xAI-Bot is Elon Musk's crawler for training Grok — embedded inside X (Twitter). It actively targets news and real-time content. Here's how to opt out and what the X/Twitter data pipeline means for publishers.
How to Block MistralBot: Stop Europe's Leading AI Lab from Training on Your Site
MistralBot is Mistral AI's training crawler — the French lab behind Mistral Large, Mixtral, and Le Chat. GDPR and the EU AI Act give publishers extra leverage here. Plus: why blocking CCBot too is the full fix.
How to Block DeepSeekBot: Stop DeepSeek from Training on Your Site
DeepSeekBot crawls your site for DeepSeek's frontier models (V3, R1, and beyond), the models that stunned the AI world in 2025. What makes it different: it operates under Chinese jurisdiction, outside GDPR and US AI regulation. Here's the full opt-out.
How to Block Amazonbot: Amazon's 3-Crawler Ecosystem Explained
Amazon runs three distinct bots: Amazonbot (AI training), Amzn-SearchBot (Rufus AI + Alexa), and Amzn-User (live queries). Most guides miss the difference. Blocking the wrong one kills your Rufus visibility. Here's the full breakdown.
How to Block YouBot: You.com's AI Search Crawler
YouBot indexes your site for You.com's AI assistant answers — it's a search crawler, not a training crawler. Blocking it removes you from You.com results. Here's the tradeoff, the decision matrix, and the robots.txt config.
How to Block AI2Bot: Allen Institute's Two AI Crawlers Explained
The Allen Institute for AI runs two separate web crawlers: AI2Bot (academic research + Semantic Scholar) and Ai2Bot-Dolma (the open-source training dataset powering OLMo). Different purposes — different blocking decisions. Here's the breakdown.
How to Block cohere-ai: Cohere's Undocumented Web Crawler
cohere-ai crawls publisher sites without official documentation explaining what it collects. Operated by Cohere — the enterprise AI lab behind Command R and Embed. Only ~13% of major sites block it. Here's the robots.txt config and the full story.
How to Block DuckAssistBot: DuckDuckGo's AI Answer Crawler
The privacy-first search engine deployed its own AI crawler. DuckAssistBot powers DuckDuckGo AI summaries and Duck.ai — and it's separate from their search indexer. Block AI answers without losing DuckDuckGo search rankings.
How to Block Gemini-Deep-Research: Google's AI Research Crawler
Gemini-Deep-Research reads your entire site to compile AI reports for Gemini Advanced users — it's not the training crawler. Here's the full Google AI bot ecosystem, how Gemini-Deep-Research differs from Google-Extended, and how to block it without killing your SEO.
How to Block Google-NotebookLM: Google's Viral AI Notebook Crawler
NotebookLM went viral by turning any URL into an AI podcast. The crawler behind it reads your pages when users add your site as a source — turning your content into AI audio without a click to your site. Here's how to block it.
How to Block Webz.io & Omgili: The AI Data Broker Behind Three Crawlers
Webz.io operates under three identities — omgili, omgilibot, and webzio-extended — selling your web content to AI companies. One Disallow rule isn't enough. Here's how to block all three and why only webzio-extended needs blocking to stop AI training.
AI Content Protection Tools Compared: Free & Paid (2026)
From a one-line robots.txt edit to enterprise bot management. Honest comparison of every tool available — free and paid — with a decision framework based on your actual risk level. No upsell, just what works.
How to Monitor AI Bot Traffic on Your Site
Most site owners have zero visibility into AI bot traffic — Google Analytics doesn't show it. Learn 5 methods: server log analysis, dedicated bot log files, Next.js middleware, Cloudflare analytics, and real-time monitoring with alerts.
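A minimal sketch of the server log analysis method, assuming access-log lines that include the user agent string. The token list is a small sample of the bots named on this page, and the `count_ai_bot_hits` helper and sample log lines are hypothetical; real logs need the path and format your server actually uses.

```python
from collections import Counter

# Sample of AI crawler tokens to look for (extend as needed).
AI_BOT_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "CCBot"]


def count_ai_bot_hits(log_lines):
    """Count log lines per AI bot token, matching case-insensitively."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each line to at most one bot
    return counts


if __name__ == "__main__":
    # Hypothetical log lines in a combined-log-like format.
    sample = [
        '203.0.113.5 - - [01/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 - - [01/Jan/2026:10:00:02 +0000] "GET /blog HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '198.51.100.7 - - [01/Jan/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"',
        '203.0.113.5 - - [01/Jan/2026:10:00:04 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    ]
    for bot, hits in count_ai_bot_hits(sample).most_common():
        print(f"{bot}: {hits}")
```

Substring matching on the user agent is easy to spoof and misses bots that hide their identity, so treat the counts as a lower bound rather than a full picture.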
How to Block AI Agents: When robots.txt Isn't Enough
AI agents don't crawl — they browse. Firecrawl, browser-use, Playwright MCP, and Stagehand bypass robots.txt entirely. Five defence layers that actually work: headless detection, behavioural analysis, honeypots, rate limiting, and TLS fingerprinting.
TDMRep: The W3C Protocol That Gives Your AI Opt-Out Legal Teeth
robots.txt is a gentleman's agreement. TDMRep is backed by EU law. The W3C's Text and Data Mining Reservation Protocol lets you formally reserve rights over your content — with legal enforcement under the EU AI Act and CDSM Directive.