Is AI Using My Website Content?
Almost certainly, yes — if your site has been public for more than a few months without blocking AI bots. Here's how to confirm it, check which bots have visited, and stop future crawls in under 10 minutes.
Updated April 2026
The short answer: almost certainly yes
Common Crawl has been archiving the public web continuously since 2008. Its datasets — petabytes of crawled web content released for free — are the default training data for most major AI models: GPT (OpenAI), Gemini (Google), Llama (Meta), Mistral, Falcon, and hundreds of open-source models.
If your site has been publicly accessible and you haven't blocked CCBot in your robots.txt, your content is almost certainly in Common Crawl's archive — and therefore in the training data of dozens of AI models.
Step 1: Find Out Which AI Bots Have Visited
Option A: Free scan (fastest)
Run Open Shadow's free scan — it checks your robots.txt configuration and tells you which AI bots are currently allowed vs blocked on your site. This shows your current exposure, not historical visits.
Option B: Server log analysis
Your access logs record every visitor — including AI bots. Search for known AI user agents:
```shell
grep -iE "CCBot|GPTBot|ClaudeBot|PerplexityBot|Google-Extended|meta-externalagent|MistralBot|Bytespider" \
  /var/log/nginx/access.log | tail -50
```
Each line in the results is a page request from that AI bot, including the URL it fetched and the timestamp.
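To see which bots hit your site most often, you can aggregate the matches by user agent. A minimal sketch — the sample log below is fabricated so the pipeline runs as-is; in practice, point the `grep` at your real access log instead:

```shell
# Build a small sample log so the pipeline below is runnable as-is.
# Replace /tmp/sample_access.log with your real log path in practice.
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [01/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
5.6.7.8 - - [01/Apr/2026:10:01:00 +0000] "GET /about HTTP/1.1" 200 567 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"
9.9.9.9 - - [01/Apr/2026:10:02:00 +0000] "GET /blog HTTP/1.1" 200 890 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
EOF

# Extract each AI bot match and count occurrences, busiest first.
grep -ioE "CCBot|GPTBot|ClaudeBot|PerplexityBot|Google-Extended|meta-externalagent|MistralBot|Bytespider" \
  /tmp/sample_access.log | sort | uniq -c | sort -rn
```

On the sample data this reports two GPTBot requests and one CCBot request; on a real log, it gives you a per-bot tally of AI crawler activity.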
Option C: Cloudflare Analytics
Cloudflare's Firewall Events log captures bot activity with user agent details. In the Cloudflare dashboard: Security → Firewall → Firewall Events → filter by user agent. Known AI bots are also identified in Cloudflare's Bot Analytics report under "Verified Bots."
The AI Bots That Train on Web Content
These are the user agent strings to look for. If any of them appears in your logs, that AI company has fetched content from your site:
| User Agent | Company |
|---|---|
| CCBot | Common Crawl |
| GPTBot | OpenAI |
| ClaudeBot | Anthropic |
| Google-Extended | Google |
| meta-externalagent | Meta |
| MistralBot | Mistral AI |
| Bytespider | ByteDance |
| AI2Bot | Allen Institute for AI |
| PerplexityBot | Perplexity |
| anthropic-ai | Anthropic |
Step 2: Stop Future Crawls (10 Minutes)
Add this to your robots.txt file (in the root of your domain):
```
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: AI2Bot
Disallow: /
```
✓ Safe for SEO
These rules don't affect Googlebot, Bingbot, or any search engine crawler. Your SEO is completely unaffected.
⚠ Prospective only
This stops future crawls. Content already in AI training datasets remains there — you cannot retroactively remove it.
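You can verify your rules behave as intended before and after deploying them. A quick sketch using Python's standard-library `urllib.robotparser` — the `rules` string here is a shortened stand-in for your real robots.txt; in practice you would point the parser at your live file with `rp.set_url(...)` and `rp.read()`:

```python
from urllib.robotparser import RobotFileParser

# Shortened stand-in for the AI-training blocks in your robots.txt.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Training crawlers are blocked everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/any-page"))  # False
print(rp.can_fetch("CCBot", "https://example.com/"))           # False

# ...while crawlers with no matching rule (e.g. search engines)
# remain allowed, so SEO is untouched.
print(rp.can_fetch("Googlebot", "https://example.com/"))       # True
```

This is also a handy regression check: run it after every robots.txt change to confirm you haven't accidentally blocked a search crawler.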
What About Content Already in AI Models?
If an AI bot crawled your site before you added these blocks, that content may already be in a training dataset. Here's the honest picture:
You cannot "unlearn" content from deployed models. Neural network weights don't store individual training examples in a recoverable way. There's no technical mechanism to surgically remove your content from GPT-4 or Llama 3.
Removal request forms exist but have limited impact. Anthropic (privacy.anthropic.com), OpenAI, and Common Crawl (commoncrawl.org) offer forms to request content removal. These affect future training runs, not deployed models.
Blocking works for future models. AI labs retrain models every 6–18 months. Block now, and your content won't be in GPT-5, Llama 4, Gemini Next, or whatever comes after. The effect compounds over time.
Where to Add robots.txt for Your Platform
Frequently Asked Questions
How do I know if my content is in ChatGPT's knowledge?
You can test this directly: ask ChatGPT to tell you about your website or business. If it returns accurate, specific information about your site's content, your material is likely in its training data. This isn't definitive proof (ChatGPT may also be drawing on search results via ChatGPT-User), but accurate factual recall often indicates training data inclusion.
A competitor's AI product is clearly using my content. What can I do?
First, block their crawler in robots.txt (prevents future use). Then submit a removal request if they offer one. If you believe they violated your terms of service or copyright, document the evidence and consult a lawyer. Several publishers have filed lawsuits against AI companies for unauthorized content use — The New York Times v. OpenAI is the highest-profile example.
Does blocking AI bots mean my content won't appear in AI search results?
It depends on which bots you block. Blocking training crawlers (GPTBot, CCBot, ClaudeBot) prevents your content from being used to train AI models. But AI search products (Perplexity, ChatGPT Search, Google AI Overviews) use separate crawlers (PerplexityBot, OAI-SearchBot, Googlebot). If you want to appear in AI search results, allow those while blocking training crawlers.
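A sketch of that split, assuming you want AI search referrals but not training use. The bot tokens below are the ones named above; vendors occasionally change them, so verify the current names in each company's crawler documentation:

```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```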
I'm a small blog — does this really matter?
It matters if your content has commercial value, if you rely on traffic from search (AI search is cannibalizing some traditional search traffic), or if you write about topics where being used without attribution or credit concerns you. For small, purely hobbyist sites, the practical impact is lower — but the principle of consent applies regardless of site size.
Check your site right now
Run a free scan to see which AI bots your robots.txt currently allows — and get a full AI readiness score.
Scan My Site Free →