How is this different from Google Analytics?

Google Analytics shows you traffic. Shadow shows you traffic, AI bot activity, what AI platforms say about your brand, AND tells you what to do about all of it. It's analytics + AI intelligence + action steps in one tool.

Do I need to install anything?

For basic monitoring (bot detection, AI perception, readiness score) — nope, just enter your URL. For full visitor analytics (clicks, behavior, sessions), add one script tag. One-click integrations for Vercel, Shopify, WordPress, and more.

Will it slow down my site?

No. The script is under 5KB and loads async. Zero impact on page speed or Core Web Vitals. External monitoring has literally no impact — it watches from the outside.

What AI bots does Shadow detect?

All of them. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bytespider, Amazonbot, and dozens more. The Shadow Network means new bots get identified across all users instantly.

What do you mean by "actionable steps"?

Shadow doesn't just show you graphs. It says things like: "ChatGPT has your pricing wrong — add structured data to /pricing to fix it" or "Your bounce rate on /features is 68% — here's why and what to change." Specific, do-it-today recommendations.

Can Shadow block bots?

Shadow is a telescope, not a shield. It shows you who's visiting and what AI says about you. It generates block rules and robots.txt configs you can apply — but it doesn't intercept traffic.

Is Shadow privacy-friendly?

Yes. Shadow never collects PII. IP addresses are hashed after classification. No cookies are set on your visitors. All Shadow Network data is anonymized. GDPR compliant by design.

Does Diffbot respect robots.txt?

Diffbot officially claims to respect robots.txt Disallow directives. However, because Diffbot is a commercial data extraction service that charges clients per crawl, some site owners have reported continued crawling after adding Diffbot to robots.txt. For high-value content, combining robots.txt with nginx or Cloudflare IP blocking is advisable. Diffbot's IP ranges are published at docs.diffbot.com.

Is Diffbot the same as other AI crawlers like GPTBot or ClaudeBot?

No. GPTBot (OpenAI) and ClaudeBot (Anthropic) are first-party crawlers — OpenAI and Anthropic crawl the web directly to train their own models. Diffbot is a third-party data broker: it crawls the web commercially, structures the data, and sells it to any paying customer — including AI companies. This means one Diffbot block can sever supply to multiple AI training pipelines simultaneously.

Does blocking Diffbot affect my Google rankings?

No. Diffbot is entirely separate from Googlebot and has no relationship with Google Search. Blocking Diffbot in robots.txt has zero effect on Google indexing or rankings. The Disallow directive in robots.txt is user-agent specific — you can block Diffbot while leaving Googlebot completely unrestricted.

What is Diffbot's user agent string?

Diffbot uses the user agent token 'Diffbot' in robots.txt. The full user agent string in HTTP requests is: Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com). For robots.txt, you only need 'User-agent: Diffbot'. For server-level blocking (nginx, Apache, Cloudflare), match the 'Diffbot' token within the User-Agent header, which survives version changes in the full string, or use Diffbot's published IP ranges.

Can I block Diffbot without affecting other scrapers?

Yes. Each robots.txt User-agent block is independent. You can add 'User-agent: Diffbot / Disallow: /' as a separate block without affecting rules for Googlebot, GPTBot, or any other crawler. Diffbot operates its own dedicated IP ranges, so IP-level blocks are also precise and won't catch unrelated traffic.

Data Broker · Claims robots.txt · AI Training

How to Block Diffbot: The AI Data Broker Feeding Llama & Mistral

Diffbot isn't building its own AI product — it's crawling the web to sell structured data to companies that are. Blocking one crawler severs a pipeline feeding multiple AI models.

Updated March 2026

Diffbot Is a Data Broker, Not an AI Lab

Most AI crawlers are operated by the company training the model. Diffbot is different: it's a commercial data extraction company that sells structured web data to anyone who pays — including AI labs.

GPTBot / ClaudeBot: First-party crawlers. OpenAI and Anthropic crawl the web to train their own models.

Diffbot: Third-party data broker. It crawls the web commercially, structures the data, and sells it downstream.

What Does Diffbot Actually Do?

Diffbot crawls the web and uses computer vision and NLP to extract structured data from web pages — articles, products, organizations, people, and discussion threads. This structured data is then sold to enterprise customers via APIs and bulk datasets.

Diffbot's "Knowledge Graph" contains structured data extracted from billions of web pages. This data has been sold to AI companies including Meta (for Llama training datasets) and Mistral AI, as well as enterprise customers in sales intelligence, competitive analysis, and market research.

The user agent string is: Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)

How to Block Diffbot

Add this to your robots.txt:

robots.txt: Block Diffbot

User-agent: Diffbot
Disallow: /
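
Before deploying, it can help to confirm that the rule parses the way you expect. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `is_allowed` helper and the `example.com` URL are illustrative assumptions, not part of any Diffbot tooling. It checks that the `Diffbot` token is blocked while an unrelated crawler such as Googlebot stays unrestricted.

```python
from urllib.robotparser import RobotFileParser

# The same two-line rule shown above.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL, used only for illustration.
url = "https://example.com/pricing"
print("Diffbot allowed:  ", is_allowed(ROBOTS_TXT, "Diffbot", url))
print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that this only shows how a compliant parser reads the file. It says nothing about whether a given crawler actually honors it.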

Because Diffbot is a commercial data broker with a history of aggressive crawling, consider layering server-level blocking:

nginx: Block by user agent

# Block Diffbot at the server level.
# Place inside the relevant server { } block; ~* is a case-insensitive regex match.
if ($http_user_agent ~* "Diffbot") {
    return 403;
}
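
If you cannot edit the web server configuration, the same case-insensitive user-agent check can be approximated at the application layer. This is a rough sketch of a WSGI middleware in Python, not a replacement for the nginx rule; the names `block_diffbot`, `BLOCKED_TOKENS`, `hello_app`, and `protected_app` are hypothetical and chosen for illustration.

```python
# Substrings to look for in the User-Agent header (compared in lowercase).
BLOCKED_TOKENS = ("diffbot",)


def block_diffbot(app):
    """Wrap a WSGI app so requests with a blocked user agent receive a 403."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


protected_app = block_diffbot(hello_app)
```

Blocking at the server or CDN edge is generally cheaper, since the request never reaches your application. The middleware is a fallback for hosts where that layer is out of reach.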

Cloudflare WAF: Block by user agent

# WAF Custom Rule
Field: User Agent
Operator: contains
Value: Diffbot
Action: Block
# Equivalent expression: (http.user_agent contains "Diffbot")

Layer your defenses

Diffbot claims to respect robots.txt, but because it's a commercial service charging clients per crawl, some site owners have reported continued access after robots.txt blocks. For high-value content, use robots.txt + server-level blocking together.

Which AI Models Does Diffbot Feed?

Blocking Diffbot is uniquely high-leverage: one block severs a pipeline feeding multiple AI systems.

🦙

Meta (Llama)

Meta has used Diffbot-sourced data as part of the training datasets for its Llama model family. Blocking Diffbot reduces one input vector for future Llama versions.

🌬️

Mistral AI

Mistral has sourced web data through third-party providers including Diffbot for its model training pipeline.

🤖

DiffbotLLM

Diffbot built its own language model (DiffbotLLM) trained primarily on its crawled web corpus. Your content may power this model directly.

🏢

Enterprise customers

Hundreds of companies use Diffbot's APIs for sales intelligence, competitive analysis, and internal AI products. Your content may appear in tools you've never heard of.

What Blocking Diffbot Does (and Doesn't) Do

What it stops

• Diffbot from extracting structured data from your site
• Your content from entering Diffbot's Knowledge Graph
• Downstream use of newly crawled content by any company buying Diffbot data
• New data flowing to Meta, Mistral, and others via Diffbot

What it doesn't stop

• Content Diffbot has already extracted and sold
• First-party crawlers (GPTBot, ClaudeBot, etc.)
• Other data brokers (Webz.io/Omgili, etc.)
• Googlebot or Bingbot crawling (search rankings are unaffected)

Frequently Asked Questions

Does blocking Diffbot affect my Google or Bing rankings?

No. Diffbot has no relationship with any search engine. Blocking it has zero effect on your organic search visibility.

Is Diffbot the same as Common Crawl?

No. Common Crawl is a nonprofit that publishes free, open web archives. Diffbot is a for-profit company that sells structured, extracted data. They're both data sources for AI training, but they operate independently, so blocking one does nothing about the other. Block both Diffbot and Common Crawl's crawler, CCBot, for broader coverage.

Can I verify Diffbot is respecting my block?

Check your server logs for the Diffbot user agent. Diffbot also publishes its IP ranges at docs.diffbot.com — you can cross-reference server logs against those IPs to verify no requests are getting through.
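
The log check described above can be sketched in a few lines of Python. This assumes access logs in the common "combined" format used by nginx and Apache by default. The CIDR range below is a placeholder from the RFC 5737 documentation space, not a real Diffbot range, so substitute the ranges Diffbot actually publishes. The names `ASSUMED_DIFFBOT_RANGES` and `find_diffbot_hits` are hypothetical.

```python
import ipaddress
import re

# PLACEHOLDER ranges (RFC 5737 documentation addresses), not Diffbot's real ones.
ASSUMED_DIFFBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Combined log format:
#   ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def find_diffbot_hits(lines):
    """Return (ip, status, matched_ua, matched_ip) for each suspected Diffbot request."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        try:
            ip = ipaddress.ip_address(match["ip"])
        except ValueError:
            continue  # first field was a hostname or malformed
        matched_ua = "diffbot" in match["ua"].lower()
        matched_ip = any(ip in net for net in ASSUMED_DIFFBOT_RANGES)
        if matched_ua or matched_ip:
            hits.append((str(ip), int(match["status"]), matched_ua, matched_ip))
    return hits


sample_log = [
    '192.0.2.10 - - [10/Mar/2026:12:00:00 +0000] "GET /pricing HTTP/1.1" 403 153 '
    '"-" "Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)"',
    '198.51.100.7 - - [10/Mar/2026:12:00:01 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 10.0)"',
]
for hit in find_diffbot_hits(sample_log):
    print(hit)
```

A hit with status 403 means the server-level block is working. A hit with status 200 means a request got through. A hit that matches on IP but not on user agent is worth a closer look, since it suggests traffic from a listed range under a different user agent.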

Why do so few sites block Diffbot?

Awareness. GPTBot and ClaudeBot received major media coverage. Diffbot operates quietly as a B2B data company — most site owners don't know it exists. As of early 2026, fewer than 10% of major websites block Diffbot, compared to much higher rates for GPTBot.

Related Guides

How to Block CCBot

Common Crawl — the largest open dataset

How to Block Webz.io/Omgili

Another data broker that sells to AI labs

How to Block MistralBot

Mistral uses Diffbot-sourced data

robots.txt for AI Bots (Complete Guide)

51+ crawlers, full reference table
