Allen Institute for AI | Respects robots.txt | Research + Training

How to Block AI2Bot: Allen Institute's Two AI Crawlers Explained

AI2 operates two separate crawlers: AI2Bot for academic research indexing and Ai2Bot-Dolma for building the open-source Dolma training dataset. Because they serve different purposes, each one calls for its own blocking decision.

Updated March 2026

Two Bots, Two Different Decisions

The Allen Institute for AI (AI2) is a nonprofit research lab in Seattle. Unlike commercial AI companies, AI2's work is primarily open-source and academic. Its two crawlers serve different purposes:

AI2Bot: General research crawler. Indexes content for Semantic Scholar and other academic tools.
Ai2Bot-Dolma: Training data crawler. Collects content for the Dolma dataset, used to train OLMo models.

What Does AI2Bot Do?

AI2Bot is the Allen Institute's general web crawler. Its primary purpose is feeding Semantic Scholar — a free, AI-powered academic search engine that indexes over 200 million scholarly papers and their web-based references. If you publish academic or research content, Semantic Scholar may index it via AI2Bot.

AI2Bot also supports other AI2 research initiatives. Unlike commercial crawlers, AI2Bot's output primarily benefits the academic community rather than generating revenue.

What Does Ai2Bot-Dolma Do?

Ai2Bot-Dolma is a specialized crawler built to collect data for the Dolma dataset — a massive open-source pretraining corpus containing approximately 3 trillion tokens. Dolma sources data from web pages (via Common Crawl and direct crawling), code repositories, academic papers, and encyclopedic content.

Dolma was used to train the OLMo (Open Language Model) family — AI2's openly-released, fully-documented language models. Because both the dataset and the models are open-source, your content could influence not just OLMo, but any downstream model fine-tuned from it.

How to Block AI2Bot and Ai2Bot-Dolma

You can block each crawler independently. The first configuration below is the recommended one; the second blocks both crawlers:

robots.txt: Block training only, allow research
# Block AI training data collection
User-agent: Ai2Bot-Dolma
Disallow: /

# Allow academic research indexing (Semantic Scholar)
User-agent: AI2Bot
Allow: /

robots.txt: Block both AI2 crawlers
User-agent: AI2Bot
Disallow: /

User-agent: Ai2Bot-Dolma
Disallow: /
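
One way to sanity-check a configuration before deploying it is to feed the rules to a robots.txt parser and ask what each crawler may fetch. The sketch below runs the "block training only" rules through Python's standard-library urllib.robotparser. The example.com URL and the helper name allowed_agents are illustrative assumptions, and AI2's crawlers may resolve edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The "block training only, allow research" rules.
ROBOTS_TXT = """\
User-agent: Ai2Bot-Dolma
Disallow: /

User-agent: AI2Bot
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: bool} saying whether each agent may fetch `url`.

    `url` is a placeholder (assumption); substitute a page from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    results = allowed_agents(ROBOTS_TXT, ["AI2Bot", "Ai2Bot-Dolma"])
    for agent, ok in results.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Python's parser applies the first group whose token is a substring of the crawler name, so listing the more specific Ai2Bot-Dolma group first keeps the two rules from colliding in this particular parser. Other parsers may pick the matching group differently.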

Note: Case sensitivity matters

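The robots.txt standard (RFC 9309) requires crawlers to match user-agent tokens case-insensitively, but some parsers and third-party tools compare them exactly. To be safe, write AI2Bot (capital A, capital I, digit 2, capital B) and Ai2Bot-Dolma (capital A, lowercase i, digit 2, capital B, hyphen, capital D) exactly as shown.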
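
As one illustration of parser behavior, Python's standard-library urllib.robotparser lowercases both the token in the file and the crawler name it is asked about, so casing makes no difference there. This is a minimal sketch and says nothing definitive about other parsers; the helper name is_blocked and the example.com URL are assumptions.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, agent, url="https://example.com/"):
    """True if `agent` may not fetch `url` under `robots_txt` (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# The same rule written with two different casings of the token.
EXACT = "User-agent: Ai2Bot-Dolma\nDisallow: /\n"
LOWER = "User-agent: ai2bot-dolma\nDisallow: /\n"

if __name__ == "__main__":
    for rules in (EXACT, LOWER):
        # Ask about the crawler under its documented casing and in all caps.
        print(is_blocked(rules, "Ai2Bot-Dolma"), is_blocked(rules, "AI2BOT-DOLMA"))
```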

The Academic Research Nuance

AI2Bot is unusual among AI crawlers because it serves a genuine academic purpose. Blocking it has different implications depending on your content:

🎓
Academic publishers & researchers
Allowing AI2Bot means your papers and research may appear in Semantic Scholar, increasing discoverability and citations. Most researchers benefit from this.
📰
News and media sites
AI2Bot indexing your journalism for academic research tools is relatively low-risk compared to training crawlers. The content typically appears as a citation, not a full reproduction.
🔒
Paywalled content providers
Even academic indexing may surface content summaries. If your paywall is your business model, blocking both crawlers is the conservative choice.

What Blocking Does (and Doesn't) Do

What it stops
  • AI2Bot: Your content appearing in Semantic Scholar
  • Ai2Bot-Dolma: Your content entering the Dolma dataset
  • Future OLMo model training on your content
  • Downstream models built on Dolma/OLMo using your data
What it doesn't stop
  • Content already collected for Dolma
  • Common Crawl data (Dolma's primary source); block CCBot separately
  • Other AI crawlers (GPTBot, ClaudeBot, etc.)
  • Google or Bing rankings (completely unaffected)

Frequently Asked Questions

Is AI2 the same as Allen AI?

Yes. The Allen Institute for AI (AI2) is commonly referred to as "Allen AI." It was founded in 2014 by Paul Allen (co-founder of Microsoft) and is headquartered in Seattle. It's a nonprofit research institute focused on AI research for the common good.

Does blocking Ai2Bot-Dolma actually help if Dolma already has my data?

Blocking prevents future crawls from adding new content. However, the initial Dolma dataset was built primarily from Common Crawl archives, which are publicly available. If your content was in Common Crawl, it may already be in Dolma regardless of your robots.txt for Ai2Bot-Dolma. Block CCBot to prevent future Common Crawl inclusion as well.
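
If you maintain rules for several crawlers, generating the file can be less error-prone than editing it by hand. The following is a minimal sketch, not a definitive tool: build_robots_txt is a hypothetical helper that renders one group per token, here blocking Ai2Bot-Dolma and CCBot while leaving AI2Bot allowed.

```python
def build_robots_txt(blocked, allowed=()):
    """Render one robots.txt group per user-agent token.

    Blocked agents get `Disallow: /`; allowed agents get `Allow: /`.
    Hypothetical helper for illustration only.
    """
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(blocked=["Ai2Bot-Dolma", "CCBot"], allowed=["AI2Bot"]))
```

The output only affects future crawls by crawlers that honor robots.txt; it does not remove content already present in Common Crawl archives or in Dolma.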

Should I treat AI2Bot differently from commercial crawlers?

That's a philosophical decision. AI2 is a nonprofit doing open-source research, which some publishers view differently from commercial AI labs monetizing their content. Others apply a blanket policy: no AI crawling is allowed regardless of the operator's mission. Both positions are valid.

Will blocking affect my SEO?

No. AI2Bot and Ai2Bot-Dolma are completely separate from Google, Bing, and all traditional search engines. Blocking has zero effect on your search rankings.
