How is this different from Google Analytics?

Google Analytics shows you traffic. Shadow shows you traffic, AI bot activity, what AI platforms say about your brand, AND tells you what to do about all of it. It's analytics + AI intelligence + action steps in one tool.

Do I need to install anything?

For basic monitoring (bot detection, AI perception, readiness score) — nope, just enter your URL. For full visitor analytics (clicks, behavior, sessions), add one script tag. One-click integrations for Vercel, Shopify, WordPress, and more.

Will it slow down my site?

No. The script is under 5KB and loads async, so it doesn't block page rendering and should have a negligible effect on page speed or Core Web Vitals. External monitoring has no impact at all, because it watches from the outside.

What AI bots does Shadow detect?

All of them. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, Amazonbot, and dozens more. The Shadow Network means new bots get identified across all users instantly.
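To make the idea concrete, here is a minimal sketch of user-agent based bot detection. It is not Shadow's detection logic, which isn't described here; the function name and token list are assumptions for illustration, and user-agent strings can be spoofed, so treat a match as a hint rather than proof.

```python
from typing import Optional

# Tokens taken from the crawler names listed above, plus CCBot (an addition
# for this sketch). Google-Extended is left out on purpose: it is a robots.txt
# control token, not a user-agent string that shows up in request logs.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider", "Amazonbot", "CCBot")


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the matching AI bot token, or None if no known token is present."""
    lowered = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


if __name__ == "__main__":
    # Hypothetical user-agent value, as it might appear in an access log.
    print(classify_user_agent("CCBot/2.0 (https://commoncrawl.org/faq/)"))
```

A real detector would usually combine this with other signals, such as published IP ranges or reverse DNS, since a plain substring match is easy to fake.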

What do you mean by "actionable steps"?

Shadow doesn't just show you graphs. It says things like: "ChatGPT has your pricing wrong — add structured data to /pricing to fix it" or "Your bounce rate on /features is 68% — here's why and what to change." Specific, do-it-today recommendations.

Can Shadow block bots?

Shadow is a telescope, not a shield. It shows you who's visiting and what AI says about you. It generates block rules and robots.txt configs you can apply — but it doesn't intercept traffic.

Is Shadow GDPR compliant?

Yes. Shadow never collects PII. IP addresses are hashed after classification. No cookies on your visitors. All Shadow Network data is anonymized. GDPR compliant by design.
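For readers wondering what "hashed after classification" can look like in practice, here is one possible sketch. Shadow's actual scheme isn't documented here, so everything below is an assumption: the function name, the choice of HMAC-SHA256, and the placeholder key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret for illustration only; a real deployment would load this
# from a secrets store rather than hard-coding it.
SECRET_KEY = b"replace-with-a-long-random-secret"


def pseudonymize_ip(ip_address: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest of the IP so the raw address need not be stored."""
    return hmac.new(key, ip_address.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # 203.0.113.7 is from a documentation-only address range.
    print(pseudonymize_ip("203.0.113.7"))
```

A keyed hash is used in this sketch because the IPv4 address space is small enough that a plain, unkeyed hash could be reversed by trying every address.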

What is CCBot and who runs it?

CCBot is the web crawler operated by Common Crawl, a US non-profit that has been crawling the open web since 2008. Common Crawl releases its datasets publicly, and many language model developers have used them as a starting point for training data, either directly or through derived corpora such as C4 and RefinedWeb. Publicly documented examples include GPT-3 (OpenAI), T5 (Google), Llama (Meta), Falcon (TII), and many open-source models.

Does blocking CCBot stop my content from appearing in AI models like ChatGPT?

Blocking CCBot prevents your content from entering future Common Crawl snapshots, and therefore from future training runs that use those snapshots. However, it does not remove content already in existing Common Crawl datasets or in models already trained on that data. Any model that is already deployed was trained on data collected before your block. The effect is prospective, not retroactive.

Does Common Crawl respect robots.txt?

Yes. Common Crawl has consistently stated that CCBot respects robots.txt Disallow directives and has documented this behaviour publicly. Unlike Bytespider and some other crawlers, CCBot has a reliable compliance track record. A robots.txt block is sufficient.

If I block CCBot, will it affect my search engine rankings?

No. CCBot is a research and AI training crawler — it is entirely separate from Googlebot, Bingbot, and other search engine crawlers. Blocking CCBot has no effect on your SEO or search engine rankings. Your pages will continue to be indexed normally by Google, Bing, and other search engines.

Common Crawl · Respects robots.txt · 50+ AI Models

How to Block CCBot: One Rule That Stops 50+ AI Models

CCBot is Common Crawl's web crawler, and Common Crawl data has fed the training sets for models such as GPT-3, T5, Llama, Falcon, and many open-source LLMs. Block CCBot once and your content stays out of the future snapshots those training pipelines draw on.

Updated March 2026

The one rule you need

User-agent: CCBot
Disallow: /

Add to robots.txt. Common Crawl reliably respects it. Deploy, done — your content is excluded from the next crawl snapshot.
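Before you deploy, you can sanity-check the rule locally. The sketch below uses Python's standard-library urllib.robotparser to confirm that the rule above disallows CCBot while leaving Googlebot alone. The helper name and the example.com URL are assumptions for illustration, and this only shows how a standards-following parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule from this guide, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/pricing"  # hypothetical URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

To check a live site instead, RobotFileParser also offers set_url() and read(), which fetch the file over the network.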

What Is Common Crawl?

Common Crawl is a US non-profit that has been crawling the open web continuously since 2008. It publishes monthly snapshots — petabytes of raw web data — under an open licence that anyone can download for free.

This free, open dataset became a foundation of modern AI training. Rather than relying only on their own crawlers, many AI companies download and process Common Crawl data: cleaning it, filtering it, and using it as the base of their training corpus. The result: blocking CCBot affects not just Common Crawl itself, but every future model trained on snapshots taken after your block.

The C4 dataset (Colossal Clean Crawled Corpus), derived from Common Crawl, is one of the most widely used training datasets in AI history.

Which AI Models Use Common Crawl Data?

Many major LLMs. The models below are ones whose use of Common Crawl data is described in published papers or dataset documentation. Training data for several newer commercial models (for example GPT-4, Gemini, Llama 2 and 3, and Mistral's models) has not been disclosed in detail, so they are not listed.

Company	Models	Dataset
OpenAI	GPT-3	Filtered Common Crawl, plus WebText2 and other sources
Google	T5	C4 dataset (Common Crawl derived)
Meta	Llama 1	Common Crawl snapshots and C4
TII UAE	Falcon 7B, 40B, 180B	RefinedWeb (Common Crawl)
EleutherAI	GPT-NeoX, Pythia	The Pile (Common Crawl component)
Allen Institute (AI2)	OLMo series	Dolma (Common Crawl)

This list is non-exhaustive. Hundreds of open-source and research models also use Common Crawl-derived data.

What Blocking CCBot Actually Does (and Doesn't Do)

✓ What it stops

• Your pages entering future Common Crawl snapshots
• Future AI models using those snapshots for training
• CCBot crawl traffic on your server

✗ What it doesn't stop

• Already-trained models (GPT-4, Llama 2, etc.)
• Content already in existing CC snapshots
• Other AI crawlers (GPTBot, ClaudeBot — block separately)
• Search engine indexing (Googlebot unaffected)
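Since other AI crawlers need their own rules, it can help to generate one Disallow group per user agent. The sketch below does that and then re-parses the result with urllib.robotparser as a sanity check. The function name and the crawler list are assumptions for illustration; pick the user agents that matter for your site.

```python
from urllib.robotparser import RobotFileParser


def build_block_rules(user_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    return "\n\n".join(groups) + "\n"


def is_blocked(robots_txt: str, user_agent: str) -> bool:
    """Check whether user_agent is barred from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    rules = build_block_rules(["CCBot", "GPTBot", "ClaudeBot"])
    print(rules)
    print("Googlebot blocked:", is_blocked(rules, "Googlebot"))
```

Each crawler gets its own group, so removing one later is a two-line deletion and search engine crawlers are never touched.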

The prospective effect: Blocking CCBot today affects models trained 6–18 months from now, not models that already exist. If your content appeared in Common Crawl snapshots before you added the block, that data is already in the corpus of deployed models. The block cuts off future collection.

Will Blocking CCBot Hurt My SEO?

No. CCBot and search engine crawlers are completely separate systems.

Googlebot, Bingbot, and other search engine crawlers identify themselves with their own user agents and ignore rules addressed to CCBot. Blocking CCBot has zero effect on your Google rankings, Bing rankings, or any search engine indexing. You can safely add the CCBot Disallow rule without any SEO concern.

Frequently Asked Questions

Is there a way to remove my content from existing Common Crawl datasets?

Common Crawl provides a URL removal request process at commoncrawl.org. You can submit specific URLs or domains for removal from future published snapshots. Note: this does not retroactively remove data from snapshots already used to train existing AI models.

Does blocking CCBot affect my site's ranking in AI answers?

Indirectly, over time. If AI models use future Common Crawl data for training, blocking CCBot means your content won't be in those training runs. But this affects future models, not deployed ones. Your site's presence in ChatGPT, Gemini, or Claude answers reflects already-trained data, which is unaffected by your CCBot block today.

Do I need to block CCBot separately for each subdomain?

robots.txt applies only to the domain it's served from. If you have content at blog.example.com and shop.example.com, each needs its own robots.txt with the CCBot Disallow rule. A rule at example.com/robots.txt does not cover subdomains.
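A quick way to audit this is to check each host's robots.txt on its own. The sketch below takes a mapping of host to robots.txt text (fetched however you like) and reports the hosts where CCBot is still allowed. The function names and hostnames are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def robots_url(host: str) -> str:
    """Each host serves its own robots.txt at the root of that host."""
    return f"https://{host}/robots.txt"


def hosts_missing_ccbot_block(robots_by_host: dict[str, str]) -> list[str]:
    """Return the hosts whose robots.txt still lets CCBot fetch the site root."""
    missing = []
    for host, text in robots_by_host.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        if parser.can_fetch("CCBot", f"https://{host}/"):
            missing.append(host)
    return missing


if __name__ == "__main__":
    # Hypothetical robots.txt contents for three hosts.
    samples = {
        "example.com": "User-agent: CCBot\nDisallow: /\n",
        "blog.example.com": "",
        "shop.example.com": "User-agent: *\nAllow: /\n",
    }
    for host in hosts_missing_ccbot_block(samples):
        print("Needs a CCBot rule:", robots_url(host))
```

In this example only example.com carries the rule, so the two subdomains are flagged.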

What's the difference between CCBot and AI2Bot?

CCBot is Common Crawl's general-purpose web crawler, whose data feeds many AI models. AI2Bot is the Allen Institute for AI's crawler, which is used specifically for building AI2's research datasets (like Dolma). Both contribute to AI training data. Blocking CCBot does not block AI2Bot.

