About
CCBot is the crawler behind Common Crawl, a nonprofit that maintains a large open repository of web crawl data. Many AI companies use this dataset to train large language models.
Purpose
Open web dataset for AI training and research
User Agent String
CCBot/2.0 (https://commoncrawl.org/faq/)
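To recognize this string in server logs or request headers, a simple pattern match on the product token is usually enough. The following is a minimal sketch, not an official detection method: the function name `ccbot_version` and the regular expression are assumptions made for illustration. Since the User-Agent header is supplied by the client, a match only shows that the request claims to be CCBot.

```python
import re
from typing import Optional

# Hypothetical pattern: matches the "CCBot/<version>" product token anywhere
# in a User-Agent header, case-insensitively.
CCBOT_PATTERN = re.compile(r"\bCCBot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def ccbot_version(user_agent: str) -> Optional[str]:
    """Return the CCBot version claimed in the header, or None if absent."""
    match = CCBOT_PATTERN.search(user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(ccbot_version("CCBot/2.0 (https://commoncrawl.org/faq/)"))
    print(ccbot_version("Mozilla/5.0 (X11; Linux x86_64)"))
```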
How to Control in robots.txt
🚫 Block CCBot
User-agent: CCBot
Disallow: /
✅ Allow CCBot
User-agent: CCBot
Allow: /
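Before deploying either rule set, it can help to check how a standard parser reads it. The sketch below uses Python's built-in `urllib.robotparser` to evaluate the two rule sets above. The helper name `ccbot_allowed` and the `https://example.com/` URL are placeholders chosen for illustration, and the result reflects how Python's parser interprets the rules, which may differ in edge cases from how a given crawler does.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this section, each directive on its own line.
BLOCK_RULES = "User-agent: CCBot\nDisallow: /\n"
ALLOW_RULES = "User-agent: CCBot\nAllow: /\n"


def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch the URL.

    The URL default is a placeholder; substitute a page from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    print("Block rules allow CCBot:", ccbot_allowed(BLOCK_RULES))
    print("Allow rules allow CCBot:", ccbot_allowed(ALLOW_RULES))
```

To test a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the file over the network.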