How to Block Diffbot: The AI Data Broker Feeding Llama & Mistral

Diffbot isn't building its own AI product — it's crawling the web to sell structured data to companies that are. Blocking one crawler severs a pipeline feeding multiple AI models.

Updated March 2026

Diffbot Is a Data Broker, Not an AI Lab

Most AI crawlers are operated by the company training the model. Diffbot is different: it's a commercial data extraction company that sells structured web data to anyone who pays — including AI labs.

GPTBot / ClaudeBot: first-party crawlers. OpenAI and Anthropic crawl the web to train their own models.
Diffbot: third-party data broker. It crawls the web commercially, structures the data, and sells it downstream.

What Does Diffbot Actually Do?

Diffbot crawls the web and uses computer vision and NLP to extract structured data from web pages — articles, products, organizations, people, and discussion threads. This structured data is then sold to enterprise customers via APIs and bulk datasets.

Diffbot's "Knowledge Graph" contains structured data extracted from billions of web pages. This data has been sold to AI companies including Meta (for Llama training datasets) and Mistral AI, as well as enterprise customers in sales intelligence, competitive analysis, and market research.

The user agent string is: Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)

How to Block Diffbot

Add this to your robots.txt:

robots.txt: Block Diffbot
User-agent: Diffbot
Disallow: /
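
One way to sanity-check the rule before deploying it is to run it through the robots.txt parser in Python's standard library. The sketch below is illustrative only: `is_blocked` and the example.com URL are hypothetical, and the parser is given the product token (`Diffbot`) rather than the full user agent string. It shows how a crawler that follows the robots.txt convention would read the rule; it cannot show what Diffbot itself does.

```python
from urllib.robotparser import RobotFileParser

# The same rule shown above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    url = "https://example.com/articles/some-post"
    print("Diffbot blocked:", is_blocked(ROBOTS_TXT, "Diffbot", url))
    print("Googlebot blocked:", is_blocked(ROBOTS_TXT, "Googlebot", url))
```

Because the rule names only Diffbot, other crawlers such as search engine bots remain unaffected.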

Because Diffbot is a commercial data broker with a history of aggressive crawling, consider layering server-level blocking:

nginx: Block by user agent
# Block Diffbot at the server level (place inside a server block).
# ~* is a case-insensitive regex match on the User-Agent header.
if ($http_user_agent ~* "Diffbot") {
    return 403;
}
Cloudflare WAF: Block by user agent
# WAF Custom Rule
Field: User Agent
Operator: contains
Value: Diffbot
Action: Block
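
If you cannot change nginx or Cloudflare settings, the same user-agent check can be applied inside the application. The following is a minimal sketch assuming a Python WSGI app; `block_diffbot` and `demo_app` are hypothetical names, not part of any framework. Like the rules above, it only catches requests that identify themselves as Diffbot.

```python
def block_diffbot(app):
    """Wrap a WSGI app; answer 403 when the User-Agent header mentions Diffbot."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        # Case-insensitive substring match, mirroring the nginx ~* rule.
        if "diffbot" in user_agent.lower():
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else passes through to the wrapped app untouched.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_diffbot(demo_app)
```

Blocking at the web server or CDN is usually cheaper, since the request never reaches your application; this approach is a fallback for hosts where those layers are out of your control.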

Layer your defenses

Diffbot claims to respect robots.txt, but robots.txt is advisory, and Diffbot is a commercial service that charges clients per crawl. Some site owners have reported continued access after adding a robots.txt block. For high-value content, use robots.txt and server-level blocking together.

Which AI Models Does Diffbot Feed?

Blocking Diffbot is uniquely high-leverage: one block severs a pipeline feeding multiple AI systems.

🦙
Meta (Llama)
Meta has used Diffbot-sourced data as part of the training datasets for its Llama model family. Blocking Diffbot reduces one input vector for future Llama versions.
🌬️
Mistral AI
Mistral has sourced web data through third-party providers including Diffbot for its model training pipeline.
🤖
DiffbotLLM
Diffbot also offers its own language model (DiffbotLLM), which is grounded in the Knowledge Graph built from its crawled web corpus. Your content may power this model directly.
🏢
Enterprise customers
Hundreds of companies use Diffbot's APIs for sales intelligence, competitive analysis, and internal AI products. Your content may appear in tools you've never heard of.

What Blocking Diffbot Does (and Doesn't) Do

What it stops
  • Diffbot from extracting structured data from your site
  • Your content from entering Diffbot's Knowledge Graph
  • Downstream use by any company buying Diffbot data
  • New data flowing to Meta, Mistral, and others via Diffbot
What it doesn't stop
  • Content Diffbot has already extracted and sold
  • First-party crawlers (GPTBot, ClaudeBot, etc.)
  • Other data brokers (Webz.io/Omgili, etc.)
  • Google or Bing search rankings (unaffected)

Frequently Asked Questions

Does blocking Diffbot affect my Google or Bing rankings?

No. Diffbot has no relationship with any search engine. Blocking it has zero effect on your organic search visibility.

Is Diffbot the same as Common Crawl?

No. Common Crawl is a nonprofit that publishes free, open web archives. Diffbot is a for-profit company that sells structured, extracted data. Both are data sources for AI training, but they operate independently. Block both (Common Crawl's crawler identifies itself as CCBot) for comprehensive coverage.

Can I verify Diffbot is respecting my block?

Check your server logs for the Diffbot user agent. Diffbot also publishes its IP ranges at docs.diffbot.com — you can cross-reference server logs against those IPs to verify no requests are getting through.
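
The log check can be scripted. The sketch below assumes your server writes the common "combined" access log format (the nginx and Apache default); `diffbot_hits`, the regular expression, and the two sample log lines are hypothetical and made up for illustration, using reserved documentation IP addresses. It groups Diffbot requests by client IP and HTTP status, so a 200 in the results would suggest a request got through, while a 403 suggests the block is working.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def diffbot_hits(lines):
    """Count requests whose user agent mentions Diffbot, keyed by (ip, status)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "diffbot" in match.group("ua").lower():
            hits[(match.group("ip"), match.group("status"))] += 1
    return hits


# Made-up sample lines; in practice, pass an open access log file instead.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Mar/2026:12:00:00 +0000] "GET /post HTTP/1.1" 403 153 '
    '"-" "Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)"',
    '198.51.100.4 - - [10/Mar/2026:12:00:05 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (ip, status), count in diffbot_hits(SAMPLE_LOG).items():
        print(f"{ip} status={status} requests={count}")
```

A user agent check only finds requests that announce themselves as Diffbot, which is why comparing the client IPs against the published ranges is a useful second step.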

Why do so few sites block Diffbot?

Awareness. GPTBot and ClaudeBot received major media coverage. Diffbot operates quietly as a B2B data company — most site owners don't know it exists. As of early 2026, fewer than 10% of major websites block Diffbot, compared to much higher rates for GPTBot.
