DeepSeekBot is the training crawler for DeepSeek — the Chinese AI lab whose V3 and R1 models shocked the industry in early 2025 by matching GPT-4 at a fraction of the cost. Here's how to opt out, and why this crawler is different.
DeepSeekBot is the web crawler operated by DeepSeek, the Hangzhou-based AI research lab that became one of the biggest stories in tech in January 2025. DeepSeek-V3 — trained for roughly $6 million — matched the performance of OpenAI's models that cost hundreds of millions to train. DeepSeek-R1 followed shortly after, matching o1's reasoning performance as an open-weight model.
DeepSeekBot crawls publicly available web content to build training datasets for these model families. The crawler has been active since mid-2024, and its crawl volume increased significantly alongside each model release. It targets a broad range of web content — technical documentation, news, forums, code, and general web text — in keeping with DeepSeek's general-purpose model architecture.
What makes DeepSeekBot distinct from other AI training crawlers is jurisdiction. OpenAI, Anthropic, Google, Apple, and Microsoft are US companies. Mistral is French and subject to GDPR. DeepSeek is incorporated in China, outside Western legal frameworks. This doesn't change the robots.txt mechanics, but it does change the legal recourse available to publishers who want to formally enforce opt-out compliance.
The crawler identifies itself with this user-agent string:

```
Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)
```

In robots.txt, use the single user-agent token DeepSeekBot.
A minimal robots.txt block:

```
User-agent: DeepSeekBot
Disallow: /
```
```
# Block DeepSeek's direct crawler
User-agent: DeepSeekBot
Disallow: /

# Block CCBot — Common Crawl feeds DeepSeek and 50+ other AI models
User-agent: CCBot
Disallow: /
```
DeepSeek uses Common Crawl data. Blocking CCBot cuts off that supply line for DeepSeek and dozens of other AI labs at once.
```
# Block all major AI training crawlers
User-agent: DeepSeekBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Search engines — unaffected
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```
For Next.js sites, the same rules can be generated with the Metadata API:

```typescript
import { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: 'DeepSeekBot', disallow: ['/'] },
      { userAgent: 'CCBot', disallow: ['/'] },
      { userAgent: 'GPTBot', disallow: ['/'] },
      { userAgent: 'ClaudeBot', disallow: ['/'] },
      { userAgent: 'anthropic-ai', disallow: ['/'] },
      { userAgent: 'Google-Extended', disallow: ['/'] },
      { userAgent: 'MistralBot', disallow: ['/'] },
      { userAgent: 'xAI-Bot', disallow: ['/'] },
      { userAgent: 'Bytespider', disallow: ['/'] },
      { userAgent: 'Googlebot', allow: ['/'] },
      { userAgent: '*', allow: ['/'] },
    ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  };
}
```

Given DeepSeek's Chinese jurisdiction, a server-level block is more defensible than relying solely on robots.txt for publishers with sensitive content or regulatory concerns.
```nginx
# In your server {} block — hard 403 regardless of robots.txt
if ($http_user_agent ~* "DeepSeekBot") {
    return 403;
}
```

This returns HTTP 403 before the request reaches your application. Combine it with robots.txt for layered protection.
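If you can't edit the server configuration (managed hosting, for example), the same user-agent check can live at the application layer instead. A minimal sketch as WSGI middleware — the function and variable names here are illustrative, not from any framework:

```python
# Block listed crawler tokens before the request reaches the app.
# BLOCKED_AGENTS and block_ai_crawlers are illustrative names.
BLOCKED_AGENTS = ("DeepSeekBot",)

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot in ua for bot in BLOCKED_AGENTS):
            # Matched a blocked crawler: short-circuit with a 403.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Anyone else passes through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

The edge or server-level block is still preferable when available, since it spares your application the request entirely; this is a fallback for constrained hosting.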
In the Cloudflare dashboard, go to Security → WAF → Custom Rules → Create rule, and enter this expression:

```
(http.user_agent contains "DeepSeekBot")
```

Set the action to Block. This blocks at Cloudflare's edge — your server never sees the request.
DeepSeek is the only major AI training crawler covered in these guides that operates outside both US and EU legal frameworks. Here's what that means practically:
EU publishers cannot use GDPR Article 21 objection rights against DeepSeek the way they might against Google or Mistral. DeepSeek's incorporation in China places it outside the EU's enforcement jurisdiction for most practical purposes.
The EU AI Act's training data transparency requirements apply to AI providers operating in the EU market. DeepSeek is increasingly accessible in Europe but its obligations under the EU AI Act are still being established — enforcement against a Chinese company has practical limits.
Enterprise security policies and government AI guidelines increasingly treat Chinese AI systems differently. If your organization has a policy around Chinese technology vendors, blocking DeepSeekBot is consistent with that posture — even for public-facing content.
For most publishers, DeepSeekBot behaves like any other AI training crawler — it crawls public content, and robots.txt is sufficient to opt out. The jurisdiction consideration primarily matters for publishers with sensitive content, regulatory requirements, or specific data residency policies.
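Before deploying, you can sanity-check that your robots.txt rules actually disallow the token you intend, using Python's standard-library parser. A small sketch — the URL is a placeholder, and in practice you would point the parser at your live `https://yoursite.com/robots.txt`:

```python
from urllib.robotparser import RobotFileParser

# Parse the rules inline for the check; set_url()/read() would fetch
# the live file instead.
rules = """\
User-agent: DeepSeekBot
Disallow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("DeepSeekBot", "https://yoursite.com/article"))    # False
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/article"))   # True: no matching rule
```

Note the second result: a crawler with no matching `User-agent` group is allowed by default, which is why the full block list above matters if you want to opt out of more than one lab's training.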
```shell
# Check nginx access logs for DeepSeekBot
grep "DeepSeekBot" /var/log/nginx/access.log | tail -20

# Confirm it fetched robots.txt (then stopped)
grep "DeepSeekBot" /var/log/nginx/access.log | grep "robots.txt"

# If server-level blocked — confirm 403s
grep "DeepSeekBot" /var/log/nginx/access.log | grep " 403 "
```
Seeing DeepSeekBot fetch /robots.txt followed by no content requests confirms the block is working. If you see it on content pages after the robots.txt block, add nginx or Cloudflare enforcement.
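The grep checks above can be folded into a single summary. A sketch that counts DeepSeekBot requests by HTTP status from nginx combined-format log lines — the sample lines below are fabricated for illustration, and in real use you would read them from `/var/log/nginx/access.log`:

```python
import re
from collections import Counter

# Fabricated sample lines in nginx combined log format.
log_lines = [
    '1.2.3.4 - - [01/Feb/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 '
    '"-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
    '1.2.3.4 - - [01/Feb/2025:10:00:05 +0000] "GET /article HTTP/1.1" 403 0 '
    '"-" "Mozilla/5.0 (compatible; DeepSeekBot/1.0; +https://www.deepseek.com/about)"',
]

# The status code is the three-digit field right after the quoted request.
status_re = re.compile(r'" (\d{3}) ')
counts = Counter(
    status_re.search(line).group(1)
    for line in log_lines
    if "DeepSeekBot" in line and status_re.search(line)
)
print(counts)
```

A healthy robots.txt-only setup shows 200s on /robots.txt and then silence; a server-level block shows 403s on everything else.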