
How to Opt Out of AI Training: Every Method, Ranked

Six ways to stop AI companies from training on your content — ranked by effectiveness, coverage, and effort. What actually works, what's performative, and what to do first.

Updated April 2026

Before you start: what's actually possible

You cannot retroactively remove your content from deployed AI models. If your content was crawled before today and used in a training run, it exists in the model's weights. There is no technical mechanism to extract specific training examples from a deployed neural network.

You can stop future training. Blocking crawlers today keeps your content out of the next batch of training data, and out of every model version trained after it.

Not all AI companies respect opt-outs equally. Most major labs (OpenAI, Anthropic, Google, Meta) reliably respect robots.txt. Some do not. This guide is honest about the gap.

All 6 Opt-Out Methods, Ranked

1

robots.txt

Coverage: Most major AI labs · Setup: 5 min

The primary, industry-standard opt-out mechanism. Works for OpenAI, Anthropic, Google, Meta, Mistral, Common Crawl, and most responsible AI companies.

2

"noai" meta tags

Coverage: Some AI training pipelines · Setup: 5 min

HTML meta tags that signal opt-out preference. Limited adoption — not as widely respected as robots.txt, but adds a signal for crawlers that check page-level permissions.
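As an illustration, the most common page-level signal looks like this. Note that the noai and noimageai values are an informal community convention with patchy support, not a formal standard:

```html
<!-- Informal opt-out signal; honored only by crawlers that check page-level permissions -->
<meta name="robots" content="noai, noimageai">
```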

3

llms.txt

Coverage: AI agents & LLM tools · Setup: 10 min

A structured file telling AI assistants and LLM tools how to interact with your site. Not for blocking training crawlers, but controls AI agent behaviour and surfaces preferred content.
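A minimal example following the shape of the llmstxt.org proposal: a markdown file served at your site root, with an H1 title, a one-line blockquote summary, and H2 sections of links. The section names and URLs below are placeholders:

```markdown
# Example Site

> One-sentence summary of what this site is about.

## Docs

- [Getting started](https://example.com/docs/start): Overview for new readers

## Optional

- [Changelog](https://example.com/changelog): Release history
```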

4

TDMRep (TDM Reservation Protocol)

Coverage: EU-compliant platforms · Setup: 15 min

The W3C Text and Data Mining Reservation Protocol (TDMRep). Declares your rights reservation in machine-readable form and carries legal weight in the EU under Article 4 of the DSM Directive. Adoption is growing among European AI platforms.
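As a sketch, a TDMRep rights reservation is typically published as JSON at /.well-known/tdmrep.json. Here "tdm-reservation": 1 means rights are reserved, and the policy URL is a placeholder:

```json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```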

5

Company-specific opt-out forms

Retroactive removal only
Coverage: Specific companies only · Setup: 30+ min

Anthropic, OpenAI, and some others offer web forms to request content removal from training data. The only option that addresses already-crawled content — but effectiveness is limited and retroactive removal from deployed models is technically impossible.

6

Server-level user agent blocking

Coverage: Specific crawlers · Setup: 30 min

Block AI crawlers at the HTTP layer — 403 before the request hits your content. Best for crawlers with documented robots.txt non-compliance (e.g., Bytespider). More reliable than robots.txt, but requires maintenance as user agents change.
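A sketch of what this looks like in nginx, assuming the map block sits in the http context and the if is merged into your existing server block. The user agent pattern below is one example; check your access logs for the strings actually hitting your site:

```nginx
# Match AI crawler user agents case-insensitively (place in the http context).
map $http_user_agent $is_ai_trainer {
    default      0;
    ~*bytespider 1;   # documented robots.txt non-compliance
}

server {
    listen 80;
    server_name example.com;

    # Reject before the request reaches any content.
    if ($is_ai_trainer) {
        return 403;
    }
}
```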

The 5-Minute Complete Block

If you want to block all major AI training crawlers right now, add this to your robots.txt:

robots.txt: block all major AI training crawlers
# Common Crawl — feeds 50+ open-source AI models
User-agent: CCBot
Disallow: /

# OpenAI — trains GPT models
User-agent: GPTBot
Disallow: /

# Anthropic — trains Claude models
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /

# Google — trains Gemini (separate from Googlebot search crawler)
User-agent: Google-Extended
Disallow: /

# Meta — trains Llama models
User-agent: meta-externalagent
Disallow: /

# Mistral AI
User-agent: MistralBot
Disallow: /

# ByteDance — ignores robots.txt, but worth adding
User-agent: Bytespider
Disallow: /

# Allen Institute AI
User-agent: AI2Bot
Disallow: /

⚠ This does NOT block search engine crawlers

Googlebot (search), Bingbot (search), and other SEO crawlers are completely separate user agents. Adding these rules has zero effect on your Google or Bing search rankings.

The 8 AI Training Crawlers You Need to Know

These are the highest-impact crawlers to block, based on how many AI models use their data:

User Agent         | Company
CCBot              | Common Crawl
GPTBot             | OpenAI
ClaudeBot          | Anthropic
Google-Extended    | Google
meta-externalagent | Meta
MistralBot         | Mistral AI
Bytespider         | ByteDance
AI2Bot             | Allen Institute

Company-Specific Opt-Out Forms

For content that was already crawled and used in training, a small number of companies offer removal request forms. These are limited: retroactive removal from deployed model weights is not technically possible, but a successful request can keep your content out of future training runs:

Anthropic: Submit URLs or domains for review
Available at privacy.anthropic.com
OpenAI: Content removal request for training data
Available in OpenAI privacy portal
Common Crawl: URL removal from future snapshots
Available at commoncrawl.org

Most AI companies (Google, Meta, Mistral, ByteDance) do not offer public opt-out forms for training data removal. Blocking their crawlers via robots.txt is the primary recourse.

Verify Your Opt-Out Is Working

After making changes, verify your opt-out configuration is correctly formed:

1. Open Shadow free scan

Run a free scan at openshadow.io/check — verifies your robots.txt, meta tags, and overall AI readiness score in one shot.

2. Google Search Console robots.txt report

Confirms that Google can fetch and parse your robots.txt without errors. Access via Search Console → Settings → robots.txt report. (The old interactive robots.txt Tester has been retired, so per-user-agent checks need to be done with a robots.txt parser.)

3. Server logs

Monitor for AI bot user agents in your access logs. After blocking, requests from those user agents should stop (or return 403 if server-level blocked).
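For a quick local check of your robots.txt rules, Python's standard-library parser can be used. This is a sketch: check_robots is a hypothetical helper, and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

# The eight AI training user agents covered in this guide.
AI_BOTS = [
    "CCBot", "GPTBot", "ClaudeBot", "Google-Extended",
    "meta-externalagent", "MistralBot", "Bytespider", "AI2Bot",
]

def check_robots(robots_txt: str, path: str = "/") -> dict:
    """Map each AI training bot to whether robots.txt allows it to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}

# A robots.txt that blocks only two of the eight crawlers:
robots = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

result = check_robots(robots)
# Bots with no matching rule fall back to "allowed" -- a gap worth catching.
missing = [bot for bot, allowed in result.items() if allowed]
```

Running this against your live robots.txt (fetched with any HTTP client) shows at a glance which crawlers in the list are still permitted.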

Frequently Asked Questions

Does blocking AI crawlers affect my Google rankings?

No. AI training crawlers (GPTBot, ClaudeBot, CCBot, etc.) are completely separate from search engine crawlers (Googlebot, Bingbot). Blocking AI training bots has zero effect on your SEO. One important distinction: Google-Extended is Google's AI training agent — blocking it does NOT affect Googlebot or your Google search rankings.

What happens if an AI company ignores my robots.txt?

Most major companies respect robots.txt. Bytespider (ByteDance) has documented cases of ignoring it — for this crawler, server-level blocking via nginx or Cloudflare WAF provides stronger enforcement. You can return 403 for requests matching the Bytespider user agent string.

I blocked everything — why is ChatGPT still discussing my content?

ChatGPT's knowledge comes from training data already collected before your block. Deployed models don't update dynamically — they use a fixed snapshot. Your block prevents your content from appearing in future training runs, not current deployed models. As OpenAI trains GPT-5 and beyond, your blocked pages won't be included.

Should I block AI search crawlers like OAI-SearchBot and PerplexityBot?

That depends on your goals. AI search crawlers (OAI-SearchBot, PerplexityBot) are used for AI search products, not AI model training. Blocking them means your content won't appear in ChatGPT Search or Perplexity results — reducing potential referral traffic. For training data protection specifically, focus on GPTBot, ClaudeBot, CCBot, Google-Extended, and meta-externalagent.

Individual Bot Blocking Guides

Detailed guides for each major AI training crawler: