How to Block Diffbot: The AI Data Broker Feeding Llama & Mistral
Diffbot isn't building its own AI product — it's crawling the web to sell structured data to companies that are. Blocking one crawler severs a pipeline feeding multiple AI models.
Updated March 2026
Diffbot Is a Data Broker, Not an AI Lab
Most AI crawlers are operated by the company training the model. Diffbot is different: it's a commercial data extraction company that sells structured web data to anyone who pays — including AI labs.
GPTBot / ClaudeBot: First-party crawlers. OpenAI and Anthropic crawl the web to train their own models.
Diffbot: Third-party data broker. It crawls the web commercially, structures the data, and sells it downstream.
What Does Diffbot Actually Do?
Diffbot crawls the web and uses computer vision and NLP to extract structured data from web pages — articles, products, organizations, people, and discussion threads. This structured data is then sold to enterprise customers via APIs and bulk datasets.
Diffbot's "Knowledge Graph" contains structured data extracted from billions of web pages. This data has been sold to AI companies including Meta (for Llama training datasets) and Mistral AI, as well as enterprise customers in sales intelligence, competitive analysis, and market research.
The user agent string is: Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)
How to Block Diffbot
Add this to your robots.txt:
User-agent: Diffbot
Disallow: /
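Before deploying the rule, it can help to confirm that a standard parser reads it the way you expect. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `is_allowed` helper and the `example.com` URL are illustrative names, not part of any Diffbot tooling. Note that this parser matches on the product token (`Diffbot`), so the check passes the token rather than the full user agent string. It shows how a compliant parser interprets the rule and says nothing about how Diffbot's own crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule from this guide, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /
"""


def is_allowed(robots_txt: str, agent_token: str, url: str) -> bool:
    """Return True if the given robots.txt text lets agent_token fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    # Hypothetical URL, used only for illustration.
    url = "https://example.com/articles/some-post"
    print("Diffbot allowed:", is_allowed(ROBOTS_TXT, "Diffbot", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Because the file contains no `User-agent: *` group, agents other than Diffbot remain allowed under this rule.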
Because Diffbot is a commercial data broker with a history of aggressive crawling, consider layering server-level blocking:
# Block Diffbot at the server level
if ($http_user_agent ~* "Diffbot") {
    return 403;
}

# WAF Custom Rule
Field: User Agent
Operator: contains
Value: Diffbot
Action: Block
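If you cannot edit the web server or WAF configuration, the same case-insensitive user agent match can be applied inside the application. The sketch below assumes a Python WSGI application; `block_diffbot` is a hypothetical wrapper name, and the approach is an illustration of the idea rather than a replacement for the server-level rules above.

```python
import re

# Case-insensitive match on the user agent, mirroring a "contains Diffbot" rule.
BLOCKED_AGENTS = re.compile(r"diffbot", re.IGNORECASE)


def block_diffbot(app):
    """Wrap a WSGI app so requests with a Diffbot user agent get a 403."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_AGENTS.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Like the server-level rules, this only catches requests that identify themselves as Diffbot; a request sent with a different user agent string would pass through.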
Layer your defenses
Diffbot claims to respect robots.txt, but because it's a commercial service charging clients per crawl, some site owners have reported continued access after robots.txt blocks. For high-value content, use robots.txt + server-level blocking together.
Which AI Models Does Diffbot Feed?
Blocking Diffbot is uniquely high-leverage: one block severs a pipeline feeding multiple AI systems.
What Blocking Diffbot Does (and Doesn't) Do
Blocking Diffbot stops:
- Diffbot from extracting structured data from your site
- Your content from entering Diffbot's Knowledge Graph
- Downstream use by any company buying Diffbot data
- New data flowing to Meta, Mistral, and others via Diffbot
Blocking Diffbot does not affect:
- Content Diffbot has already extracted and sold
- First-party crawlers (GPTBot, ClaudeBot, etc.)
- Other data brokers (Webz.io/Omgili, etc.)
- Google or Bing search rankings (unaffected)
Frequently Asked Questions
Does blocking Diffbot affect my Google or Bing rankings?
No. Diffbot has no relationship with any search engine. Blocking it has zero effect on your organic search visibility.
Is Diffbot the same as Common Crawl?
No. Common Crawl is a nonprofit that publishes free, open web archives. Diffbot is a for-profit company that sells structured, extracted data. They're both data sources for AI training, but they operate independently. Block both for comprehensive coverage.
Can I verify Diffbot is respecting my block?
Check your server logs for the Diffbot user agent. Diffbot also publishes its IP ranges at docs.diffbot.com — you can cross-reference server logs against those IPs to verify no requests are getting through.
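The log check described above can be sketched in a few lines of Python. This example assumes your server writes the common "combined" access log format; the sample log lines are invented, and `PLACEHOLDER_RANGES` uses a reserved documentation network, not Diffbot's real IP ranges, which you would need to substitute yourself. The function names are hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Placeholder only: a reserved documentation range, NOT Diffbot's actual IPs.
PLACEHOLDER_RANGES = ["192.0.2.0/24"]

# Invented sample lines for illustration.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Mar/2026:10:00:00 +0000] "GET /post HTTP/1.1" 403 153 "-" '
    '"Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)"',
    '198.51.100.7 - - [01/Mar/2026:10:00:05 +0000] "GET /post HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Diffbot/0.1; +https://www.diffbot.com)"',
    '203.0.113.5 - - [01/Mar/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]


def diffbot_status_counts(lines):
    """Count HTTP status codes for requests whose user agent mentions Diffbot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "diffbot" in match.group("agent").lower():
            counts[match.group("status")] += 1
    return counts


def ip_in_ranges(ip, cidrs):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    print(diffbot_status_counts(SAMPLE_LOG))
    print(ip_in_ranges("192.0.2.10", PLACEHOLDER_RANGES))
```

Any 2xx status in the counts means a request identifying itself as Diffbot was served, so the server-level block is not catching it. The IP check covers the other direction: requests from a published range that arrive under a different user agent.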
Why do so few sites block Diffbot?
Awareness. GPTBot and ClaudeBot received major media coverage. Diffbot operates quietly as a B2B data company — most site owners don't know it exists. As of early 2026, fewer than 10% of major websites block Diffbot, compared to much higher rates for GPTBot.