
How to Opt Out of AI Training: Every Method, Ranked

Six ways to stop AI companies from training on your content — ranked by effectiveness, coverage, and effort. What actually works, what's performative, and what to do first.

Updated April 2026

Before you start: what's actually possible

You cannot retroactively remove your content from deployed AI models. If your content was crawled before today and used in a training run, it exists in the model's weights. There is no technical mechanism to extract specific training examples from a deployed neural network.

You can stop future training. Blocking crawlers today keeps your content out of the next batch of training data, and out of every model version trained after it.

Not all AI companies respect opt-outs equally. Most major labs (OpenAI, Anthropic, Google, Meta) reliably respect robots.txt. Some do not. This guide is honest about the gap.

All 6 Opt-Out Methods, Ranked

1

robots.txt

Coverage: Most major AI labs · Setup: 5 min

The primary, industry-standard opt-out mechanism. Works for OpenAI, Anthropic, Google, Meta, Mistral, Common Crawl, and most responsible AI companies.

2

"noai" meta tags

Coverage: Some AI training pipelines · Setup: 5 min

HTML meta tags that signal opt-out preference. Limited adoption — not as widely respected as robots.txt, but adds a signal for crawlers that check page-level permissions.
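As an illustration, the most common page-level signal looks like this. Note that the noai and noimageai values are an informal community convention with patchy support, not a formal standard:

```html
<!-- Informal opt-out signal; honored only by crawlers that check page-level permissions -->
<meta name="robots" content="noai, noimageai">
```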

3

llms.txt

Coverage: AI agents & LLM tools · Setup: 10 min

A structured file telling AI assistants and LLM tools how to interact with your site. Not for blocking training crawlers, but controls AI agent behaviour and surfaces preferred content.
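A minimal example following the shape of the llmstxt.org proposal: a markdown file served at your site root, with an H1 title, a one-line blockquote summary, and H2 sections of links. The section names and URLs below are placeholders:

```markdown
# Example Site

> One-sentence summary of what this site is about.

## Docs

- [Getting started](https://example.com/docs/start): Overview for new readers

## Optional

- [Changelog](https://example.com/changelog): Release history
```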

4

TDMRep (TDM Reservation Protocol)

Coverage: EU-compliant platforms · Setup: 15 min

The W3C Text and Data Mining Reservation Protocol (TDMRep). Declares your rights reservation in machine-readable form and carries legal weight in the EU under Article 4 of the DSM Directive. Adoption is growing among European AI platforms.
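As a sketch, a TDMRep rights reservation is typically published as JSON at /.well-known/tdmrep.json. Here "tdm-reservation": 1 means rights are reserved, and the policy URL is a placeholder:

```json
[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://example.com/tdm-policy.json"
  }
]
```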

5

Company-specific opt-out forms

Retroactive removal only
Coverage: Specific companies only · Setup: 30+ min

Anthropic, OpenAI, and some others offer web forms to request content removal from training data. The only option that addresses already-crawled content — but effectiveness is limited and retroactive removal from deployed models is technically impossible.

6

Server-level user agent blocking

Coverage: Specific crawlers · Setup: 30 min

Block AI crawlers at the HTTP layer — 403 before the request hits your content. Best for crawlers with documented robots.txt non-compliance (e.g., Bytespider). More reliable than robots.txt, but requires maintenance as user agents change.
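A sketch of what this looks like in nginx, assuming the map block sits in the http context and the if is merged into your existing server block. The user agent pattern below is one example; check your access logs for the strings actually hitting your site:

```nginx
# Match AI crawler user agents case-insensitively (place in the http context).
map $http_user_agent $is_ai_trainer {
    default      0;
    ~*bytespider 1;   # documented robots.txt non-compliance
}

server {
    listen 80;
    server_name example.com;

    # Reject before the request reaches any content.
    if ($is_ai_trainer) {
        return 403;
    }
}
```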

The 5-Minute Complete Block

If you want to block all major AI training crawlers right now, add this to your robots.txt:

robots.txt: block all major AI training crawlers
# Common Crawl — feeds 50+ open-source AI models
User-agent: CCBot
Disallow: /

# OpenAI — trains GPT models
User-agent: GPTBot
Disallow: /

# Anthropic — trains Claude models
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /

# Google — trains Gemini (separate from Googlebot search crawler)
User-agent: Google-Extended
Disallow: /

# Meta — trains Llama models
User-agent: meta-externalagent
Disallow: /

# Mistral AI
User-agent: MistralBot
Disallow: /

# ByteDance — ignores robots.txt, but worth adding
User-agent: Bytespider
Disallow: /

# Allen Institute AI
User-agent: AI2Bot
Disallow: /

⚠ This does NOT block search engine crawlers

Googlebot (search), Bingbot (search), and other SEO crawlers are completely separate user agents. Adding these rules has zero effect on your Google or Bing search rankings.

The 8 AI Training Crawlers You Need to Know

These are the highest-impact crawlers to block, based on how many AI models use their data:

User Agent         | Company
CCBot              | Common Crawl
GPTBot             | OpenAI
ClaudeBot          | Anthropic
Google-Extended    | Google
meta-externalagent | Meta
MistralBot         | Mistral AI
Bytespider         | ByteDance
AI2Bot             | Allen Institute

Company-Specific Opt-Out Forms

For content that was already crawled and used in training, a small number of companies offer removal request forms. These are limited: retroactive removal from deployed model weights is not technically possible, but a successful request can keep your content out of future training runs:

Anthropic: Submit URLs or domains for review
Available at privacy.anthropic.com
OpenAI: Content removal request for training data
Available in OpenAI privacy portal
Common Crawl: URL removal from future snapshots
Available at commoncrawl.org

Most AI companies (Google, Meta, Mistral, ByteDance) do not offer public opt-out forms for training data removal. Blocking their crawlers via robots.txt is the primary recourse.

Verify Your Opt-Out Is Working

After making changes, verify your opt-out configuration is correctly formed:

1. Open Shadow free scan

Run a free scan at openshadow.io/check — verifies your robots.txt, meta tags, and overall AI readiness score in one shot.

2. Google Search Console robots.txt report

Confirms that Google can fetch and parse your robots.txt without errors. Access via Search Console → Settings → robots.txt report. (The old interactive robots.txt Tester has been retired, so per-user-agent checks need to be done with a robots.txt parser.)

3. Server logs

Monitor for AI bot user agents in your access logs. After blocking, requests from those user agents should stop (or return 403 if server-level blocked).
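For a quick local check of your robots.txt rules, Python's standard-library parser can be used. This is a sketch: check_robots is a hypothetical helper, and the bot list mirrors the table above:

```python
from urllib.robotparser import RobotFileParser

# The eight AI training user agents covered in this guide.
AI_BOTS = [
    "CCBot", "GPTBot", "ClaudeBot", "Google-Extended",
    "meta-externalagent", "MistralBot", "Bytespider", "AI2Bot",
]

def check_robots(robots_txt: str, path: str = "/") -> dict:
    """Map each AI training bot to whether robots.txt allows it to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_BOTS}

# A robots.txt that blocks only two of the eight crawlers:
robots = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

result = check_robots(robots)
# Bots with no matching rule fall back to "allowed" -- a gap worth catching.
missing = [bot for bot, allowed in result.items() if allowed]
```

Running this against your live robots.txt (fetched with any HTTP client) shows at a glance which crawlers in the list are still permitted.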

Frequently Asked Questions

Does blocking AI crawlers affect my Google rankings?

No. AI training crawlers (GPTBot, ClaudeBot, CCBot, etc.) are completely separate from search engine crawlers (Googlebot, Bingbot). Blocking AI training bots has zero effect on your SEO. One important distinction: Google-Extended is Google's AI training agent — blocking it does NOT affect Googlebot or your Google search rankings.

What happens if an AI company ignores my robots.txt?

Most major companies respect robots.txt. Bytespider (ByteDance) has documented cases of ignoring it — for this crawler, server-level blocking via nginx or Cloudflare WAF provides stronger enforcement. You can return 403 for requests matching the Bytespider user agent string.

I blocked everything — why is ChatGPT still discussing my content?

ChatGPT's knowledge comes from training data already collected before your block. Deployed models don't update dynamically — they use a fixed snapshot. Your block prevents your content from appearing in future training runs, not current deployed models. As OpenAI trains GPT-5 and beyond, your blocked pages won't be included.

Should I block AI search crawlers like OAI-SearchBot and PerplexityBot?

That depends on your goals. AI search crawlers (OAI-SearchBot, PerplexityBot) are used for AI search products, not AI model training. Blocking them means your content won't appear in ChatGPT Search or Perplexity results — reducing potential referral traffic. For training data protection specifically, focus on GPTBot, ClaudeBot, CCBot, Google-Extended, and meta-externalagent.

Individual Bot Blocking Guides

Detailed guides for each major AI training crawler: