AI Training · DeepSeek (China)

How to Block DeepSeekBot

DeepSeekBot is the training crawler for DeepSeek — the Chinese AI lab whose V3 and R1 models shocked the industry in early 2025 by matching GPT-4 at a fraction of the cost. Here's how to opt out, and why this crawler is different.

⚠ Different jurisdiction
Operates under Chinese law — outside GDPR and US AI regulation. Consider a server-level block for sensitive content.
Respects robots.txt
DeepSeek claims compliance — robots.txt is the standard opt-out.
Block CCBot too
DeepSeek uses Common Crawl datasets — block CCBot for full pipeline coverage.

What Is DeepSeekBot?

DeepSeekBot is the web crawler operated by DeepSeek, the Hangzhou-based AI research lab that became one of the biggest stories in tech in January 2025. DeepSeek-V3 — trained for roughly $6 million — matched the performance of OpenAI's models that cost hundreds of millions to train. DeepSeek-R1 followed shortly after, matching o1's reasoning performance as an open-weight model.

DeepSeekBot crawls publicly available web content to build training datasets for these model families. Active since mid-2024, its crawl volume increased significantly alongside each model release. It targets a broad range of web content — technical documentation, news, forums, code, and general web text — in keeping with DeepSeek's general-purpose model architecture.

What makes DeepSeekBot distinct from other AI training crawlers is jurisdiction. OpenAI, Anthropic, Google, Apple, and Microsoft are US companies. Mistral is French and subject to GDPR. DeepSeek is incorporated in China, outside Western legal frameworks. This doesn't change the robots.txt mechanics, but it does change the legal recourse available to publishers who want to formally enforce opt-out compliance.

DeepSeekBot user agent
Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)

In robots.txt, match it with the single user-agent token DeepSeekBot.
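Every enforcement layer in this guide matches that token as a case-insensitive substring of the User-Agent header. As a minimal sketch, here is that check as a hypothetical helper (`isDeepSeekBot` is illustrative, not part of any official API):

```typescript
// Hypothetical helper: detect DeepSeekBot by case-insensitive substring
// match on the User-Agent header — the same test nginx's `~*` operator
// performs in the server-level block option.
function isDeepSeekBot(userAgent: string): boolean {
  return userAgent.toLowerCase().includes("deepseekbot");
}

const ua =
  "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)";
console.log(isDeepSeekBot(ua)); // true
console.log(isDeepSeekBot("Mozilla/5.0 (compatible; Googlebot/2.1)")); // false
```

A substring match is deliberate: it keeps working if the version number in the UA string changes.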

Option 1: Block via robots.txt

Block entire site (start here)
robots.txt
User-agent: DeepSeekBot
Disallow: /
Block DeepSeekBot + CCBot for full DeepSeek coverage (recommended)
robots.txt
# Block DeepSeek's direct crawler
User-agent: DeepSeekBot
Disallow: /

# Block CCBot — Common Crawl feeds DeepSeek and 50+ other AI models
User-agent: CCBot
Disallow: /

DeepSeek uses Common Crawl data. Blocking CCBot cuts off that supply line to DeepSeek and dozens of other AI labs at once.
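Before deploying a combined file, you can sanity-check that both tokens are fully disallowed. A simplified sketch (the `fullyDisallowed` helper is hypothetical; it ignores Allow overrides and wildcard paths):

```typescript
// Hypothetical checker: given robots.txt text, return the user-agent
// tokens that are fully disallowed (`Disallow: /`). Simplified parser:
// no Allow overrides, no wildcards, comments stripped.
function fullyDisallowed(robotsTxt: string): Set<string> {
  const blocked = new Set<string>();
  let currentAgents: string[] = [];
  let lastWasAgent = false;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim();
    if (!line) continue;
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (field.trim().toLowerCase() === "user-agent") {
      // Consecutive User-agent lines form one group sharing the rules below.
      if (!lastWasAgent) currentAgents = [];
      currentAgents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (field.trim().toLowerCase() === "disallow" && value === "/") {
        for (const agent of currentAgents) blocked.add(agent);
      }
      lastWasAgent = false;
    }
  }
  return blocked;
}

const robots = `
User-agent: DeepSeekBot
Disallow: /

User-agent: CCBot
Disallow: /
`;
const blocked = fullyDisallowed(robots);
console.log(blocked.has("deepseekbot") && blocked.has("ccbot")); // true
```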

Block all major AI training crawlers
robots.txt
# Block all major AI training crawlers
User-agent: DeepSeekBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search engines — unaffected
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Option 2: Next.js App Router

app/robots.ts
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: 'DeepSeekBot', disallow: ['/'] },
      { userAgent: 'CCBot', disallow: ['/'] },
      { userAgent: 'GPTBot', disallow: ['/'] },
      { userAgent: 'ClaudeBot', disallow: ['/'] },
      { userAgent: 'anthropic-ai', disallow: ['/'] },
      { userAgent: 'Google-Extended', disallow: ['/'] },
      { userAgent: 'MistralBot', disallow: ['/'] },
      { userAgent: 'xAI-Bot', disallow: ['/'] },
      { userAgent: 'Bytespider', disallow: ['/'] },
      { userAgent: 'Googlebot', allow: ['/'] },
      { userAgent: '*', allow: ['/'] },
    ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  };
}

Option 3: nginx — Hard 403 Block

Given DeepSeek's Chinese jurisdiction, a server-level block is more defensible than relying solely on robots.txt for publishers with sensitive content or regulatory concerns.

nginx.conf
# In your server {} block — hard 403 regardless of robots.txt
if ($http_user_agent ~* "DeepSeekBot") {
    return 403;
}

Returns HTTP 403 before the request reaches your application. Combine with robots.txt for layered protection.
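If you cannot edit nginx or front your site with a WAF, the same hard block can be approximated at the application layer. A sketch using Node's built-in http module (the `statusFor` helper is hypothetical; adapt the check to your framework's middleware):

```typescript
import * as http from "http";

// Hypothetical application-level equivalent of the nginx rule: decide
// the status before any route handling, based on the User-Agent header.
function statusFor(userAgent: string | undefined): number {
  return userAgent && /deepseekbot/i.test(userAgent) ? 403 : 200;
}

const server = http.createServer((req, res) => {
  res.statusCode = statusFor(req.headers["user-agent"]);
  res.end(res.statusCode === 403 ? "Forbidden" : "OK");
});
// server.listen(8080); // uncomment to serve locally
```

This is a last resort: unlike nginx or Cloudflare, the request still reaches your application before being rejected.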

Option 4: Cloudflare WAF Rule

Cloudflare WAF → Custom Rules → Expression
(http.user_agent contains "DeepSeekBot")

Set the action to Block. Blocks at Cloudflare's edge — your server never sees the request.

Cloudflare Dashboard → Security → WAF → Custom Rules → Create rule

The Jurisdiction Question

DeepSeek is the only major AI training crawler covered in these guides that operates outside both US and EU legal frameworks. Here's what that means practically:

No GDPR obligations

EU publishers cannot use GDPR Article 21 objection rights against DeepSeek the way they might against Google or Mistral. DeepSeek's incorporation in China places it outside the EU's enforcement jurisdiction for most practical purposes.

No EU AI Act compliance

The EU AI Act's training data transparency requirements apply to AI providers operating in the EU market. DeepSeek is increasingly accessible in Europe but its obligations under the EU AI Act are still being established — enforcement against a Chinese company has practical limits.

Enterprise and government risk posture

Enterprise security policies and government AI guidelines increasingly treat Chinese AI systems differently. If your organization has a policy around Chinese technology vendors, blocking DeepSeekBot is consistent with that posture — even for public-facing content.

The practical reality

For most publishers, DeepSeekBot behaves like any other AI training crawler — it crawls public content, and robots.txt is sufficient to opt out. The jurisdiction consideration primarily matters for publishers with sensitive content, regulatory requirements, or specific data residency policies.

Verify Your Block

bash
# Check nginx access logs for DeepSeekBot
grep "DeepSeekBot" /var/log/nginx/access.log | tail -20

# Confirm it fetched robots.txt (then stopped)
grep "DeepSeekBot" /var/log/nginx/access.log | grep "robots.txt"

# If server-level blocked — confirm 403s
grep "DeepSeekBot" /var/log/nginx/access.log | grep " 403 "

Seeing DeepSeekBot fetch /robots.txt followed by no content requests confirms the block is working. If you see it on content pages after the robots.txt block, add nginx or Cloudflare enforcement.
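To automate that triage instead of eyeballing grep output, a small parser can pull the path and status of every DeepSeekBot hit from nginx's combined log format. A sketch (the `deepseekHits` helper is hypothetical; it assumes the default combined format with the UA as the final quoted field):

```typescript
// Hypothetical log triage: from nginx combined-format lines, collect
// the request path and status code for DeepSeekBot hits, so you can
// confirm it stopped at /robots.txt or is receiving 403s.
function deepseekHits(log: string): Array<{ path: string; status: number }> {
  const request = /"(?:GET|HEAD|POST) (\S+) [^"]*" (\d{3}) /;
  const hits: Array<{ path: string; status: number }> = [];
  for (const row of log.split("\n")) {
    if (!/deepseekbot/i.test(row)) continue;
    const m = row.match(request);
    if (m) hits.push({ path: m[1], status: Number(m[2]) });
  }
  return hits;
}

const sample = [
  '1.2.3.4 - - [01/Feb/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
  '1.2.3.4 - - [01/Feb/2025:10:00:05 +0000] "GET /post/1 HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
].join("\n");

console.log(deepseekHits(sample));
// [ { path: '/robots.txt', status: 200 }, { path: '/post/1', status: 403 } ]
```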

Frequently Asked Questions

Does DeepSeekBot respect robots.txt?
DeepSeek claims that DeepSeekBot respects robots.txt. For most publishers, treating it the same as other compliant crawlers is reasonable. For publishers with sensitive content or regulatory concerns, a server-level block via nginx or Cloudflare provides hard enforcement that doesn't depend on DeepSeek's claimed compliance.
What user agent does DeepSeekBot use?
DeepSeekBot's user agent: Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about). In robots.txt, use DeepSeekBot as the token. Active since mid-2024, with increased crawl volume following DeepSeek-V3 and R1 releases in late 2024 / early 2025.
Why did DeepSeek become such a big deal in 2025?
DeepSeek-V3 (released December 2024) and DeepSeek-R1 (January 2025) matched GPT-4 and o1 performance respectively — at dramatically lower training costs. This challenged assumptions about the capital requirements for frontier AI and made DeepSeek the most discussed AI development in years. The models are open-weight, meaning anyone can run them locally.
Will blocking DeepSeekBot affect my search rankings?
No. DeepSeekBot is a training crawler. DeepSeek does not operate a web search product that indexes your site for public queries. Blocking it has zero effect on Google, Bing, or any search ranking.
Is CCBot enough to block DeepSeek, or do I need DeepSeekBot too?
You need both. Blocking CCBot cuts off DeepSeek's access to Common Crawl datasets, which it uses as a significant training data source. Blocking DeepSeekBot stops DeepSeek's direct web crawls. CCBot is the higher-leverage rule — one block stops the data supply to 50+ AI models including DeepSeek — but only the pair gives full coverage.
Should I be worried about DeepSeekBot from a security standpoint?
For public content, DeepSeekBot behaves like any other web crawler — it reads publicly accessible pages. There's no evidence it exploits vulnerabilities or accesses protected resources. The concern is data sovereignty: your public content entering a training pipeline operated by a Chinese company outside Western regulatory frameworks. Whether that matters depends on your content type and organizational policy.

Related Guides
