DeepSeekBot is the training crawler for DeepSeek — the Chinese AI lab whose V3 and R1 models shocked the industry in early 2025 by matching GPT-4 at a fraction of the cost. Here's how to opt out, and why this crawler is different.
DeepSeekBot is the web crawler operated by DeepSeek, the Hangzhou-based AI research lab that became one of the biggest stories in tech in January 2025. DeepSeek-V3 — trained for roughly $6 million — matched the performance of OpenAI's models that cost hundreds of millions to train. DeepSeek-R1 followed shortly after, matching o1's reasoning performance as an open-weight model.
DeepSeekBot crawls publicly available web content to build training datasets for these model families. The crawler has been active since mid-2024, and its crawl volume increased significantly alongside each model release. It targets a broad range of web content — technical documentation, news, forums, code, and general web text — in keeping with DeepSeek's general-purpose model architecture.
What makes DeepSeekBot distinct from other AI training crawlers is jurisdiction. OpenAI, Anthropic, Google, Apple, and Microsoft are US companies. Mistral is French and subject to GDPR. DeepSeek is incorporated in China, outside Western legal frameworks. This doesn't change the robots.txt mechanics, but it does change the legal recourse available to publishers who want to formally enforce opt-out compliance.
The crawler identifies itself with this user-agent string:

```
Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)
```

In robots.txt, use the single user-agent token DeepSeekBot.
A minimal robots.txt block:

```
User-agent: DeepSeekBot
Disallow: /
```
```
# Block DeepSeek's direct crawler
User-agent: DeepSeekBot
Disallow: /

# Block CCBot — Common Crawl feeds DeepSeek and 50+ other AI models
User-agent: CCBot
Disallow: /
```
DeepSeek uses Common Crawl data. Blocking CCBot cuts off that supply line for DeepSeek and dozens of other AI labs at once.
```
# Block all major AI training crawlers
User-agent: DeepSeekBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search engines — unaffected
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```
For Next.js sites, the same rules can be generated with the Metadata API:

```typescript
import { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: 'DeepSeekBot', disallow: ['/'] },
      { userAgent: 'CCBot', disallow: ['/'] },
      { userAgent: 'GPTBot', disallow: ['/'] },
      { userAgent: 'ClaudeBot', disallow: ['/'] },
      { userAgent: 'anthropic-ai', disallow: ['/'] },
      { userAgent: 'Google-Extended', disallow: ['/'] },
      { userAgent: 'MistralBot', disallow: ['/'] },
      { userAgent: 'xAI-Bot', disallow: ['/'] },
      { userAgent: 'Bytespider', disallow: ['/'] },
      { userAgent: 'Googlebot', allow: ['/'] },
      { userAgent: '*', allow: ['/'] },
    ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  };
}
```

Given DeepSeek's Chinese jurisdiction, a server-level block is more defensible than relying solely on robots.txt for publishers with sensitive content or regulatory concerns.
```nginx
# In your server {} block — hard 403 regardless of robots.txt
if ($http_user_agent ~* "DeepSeekBot") {
    return 403;
}
```

This returns HTTP 403 before the request reaches your application. Combine it with robots.txt for layered protection.
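If you can't edit the server configuration (managed hosting, for example), the same user-agent check can live at the application layer instead. A minimal sketch as WSGI middleware — the function and variable names here are illustrative, not from any framework:

```python
# Block listed crawler tokens before the request reaches the app.
# BLOCKED_AGENTS and block_ai_crawlers are illustrative names.
BLOCKED_AGENTS = ("DeepSeekBot",)

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_AGENTS):
            # Matched a blocked crawler: short-circuit with a 403.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Anyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

The edge or server-level block is still preferable when available, since it spares your application the request entirely; this is a fallback for constrained hosting.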
In the Cloudflare dashboard, go to Security → WAF → Custom Rules → Create rule, and enter this expression:

```
(http.user_agent contains "DeepSeekBot")
```

Set the action to Block. This blocks at Cloudflare's edge — your server never sees the request.
DeepSeek is the only major AI training crawler covered in these guides that operates outside both US and EU legal frameworks. Here's what that means practically:
EU publishers cannot use GDPR Article 21 objection rights against DeepSeek the way they might against Google or Mistral. DeepSeek's incorporation in China places it outside the EU's enforcement jurisdiction for most practical purposes.
The EU AI Act's training data transparency requirements apply to AI providers operating in the EU market. DeepSeek is increasingly accessible in Europe but its obligations under the EU AI Act are still being established — enforcement against a Chinese company has practical limits.
Enterprise security policies and government AI guidelines increasingly treat Chinese AI systems differently. If your organization has a policy around Chinese technology vendors, blocking DeepSeekBot is consistent with that posture — even for public-facing content.
For most publishers, DeepSeekBot behaves like any other AI training crawler — it crawls public content, and robots.txt is sufficient to opt out. The jurisdiction consideration primarily matters for publishers with sensitive content, regulatory requirements, or specific data residency policies.
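Before deploying, you can sanity-check that your robots.txt rules actually disallow the token you intend, using Python's standard-library parser. A small sketch — the URL is a placeholder, and in practice you would point the parser at your live `https://yoursite.com/robots.txt`:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules inline for the check; set_url()/read() would fetch
# the live file instead.
rules = """\
User-agent: DeepSeekBot
Disallow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("DeepSeekBot", "https://yoursite.com/article"))    # False
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/article"))   # True: no matching rule
```

Note the second result: a crawler with no matching `User-agent` group is allowed by default, which is why the full block list above matters if you want to opt out of more than one lab's training.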
```shell
# Check nginx access logs for DeepSeekBot
grep "DeepSeekBot" /var/log/nginx/access.log | tail -20

# Confirm it fetched robots.txt (then stopped)
grep "DeepSeekBot" /var/log/nginx/access.log | grep "robots.txt"

# If server-level blocked — confirm 403s
grep "DeepSeekBot" /var/log/nginx/access.log | grep " 403 "
```
Seeing DeepSeekBot fetch /robots.txt followed by no content requests confirms the block is working. If you see it on content pages after the robots.txt block, add nginx or Cloudflare enforcement.
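The grep checks above can be folded into a single summary. A sketch that counts DeepSeekBot requests by HTTP status from nginx combined-format log lines — the sample lines below are fabricated for illustration, and in real use you would read them from `/var/log/nginx/access.log`:

```python
import re
from collections import Counter

# Fabricated sample lines in nginx combined log format.
log_lines = [
    '1.2.3.4 - - [01/Feb/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 '
    '"-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
    '1.2.3.4 - - [01/Feb/2025:10:00:05 +0000] "GET /article HTTP/1.1" 403 0 '
    '"-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
]

# The status code is the three-digit field right after the quoted request.
status_re = re.compile(r'" (\d{3}) ')
counts = Counter(
    status_re.search(line).group(1)
    for line in log_lines
    if "DeepSeekBot" in line and status_re.search(line)
)
print(counts)
```

A healthy robots.txt-only setup shows 200s on /robots.txt and then silence; a server-level block shows 403s on everything else.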