How to Block AI Bots on Jekyll & GitHub Pages
Jekyll's clean HTML output and GitHub Pages' open accessibility make them favourite targets for AI training crawlers. Here's how to lock out 25+ AI bots with a static robots.txt, noai meta tags, and Cloudflare WAF — no server access needed.
Jekyll robots.txt is simpler than most CMSs — but there's one gotcha
Unlike WordPress or Magento, Jekyll has no admin panel and no robots.txt generator. You just create a robots.txt file in your project root and commit it. Jekyll copies it to _site/robots.txt unchanged. Don't add YAML front matter (---) to robots.txt — if you do, Jekyll will try to process it as a template, which can break the output. Plain text only.
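A quick pre-commit sanity check can catch the front matter gotcha. This is a sketch: the sample file below stands in for your real robots.txt — run the `head`/`grep` check against your actual file.

```shell
# Create a minimal sample robots.txt (stand-in for your real file).
printf 'User-agent: GPTBot\nDisallow: /\n' > robots.txt

# A safe robots.txt must NOT begin with the YAML front matter
# delimiter "---", or Jekyll will try to process it as a template.
if head -n 1 robots.txt | grep -q '^---'; then
  echo "WARNING: robots.txt starts with front matter"
else
  echo "OK: plain text robots.txt"
fi
```
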
Quick fix — create robots.txt in your Jekyll root
Same folder as _config.yml. Commit and push — GitHub Pages deploys immediately.
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Available Methods
Static robots.txt in project root (Recommended)
Easy · robots.txt (root of Jekyll project, same level as _config.yml)
Jekyll passes any root-level file through to _site/ unchanged. A plain robots.txt with no front matter is copied as-is and served at your domain. Works on all Jekyll setups including GitHub Pages.
No front matter needed — if you add --- front matter, Jekyll may process it. Keep it as plain text.
noai meta tag in base layout
Easy · _layouts/default.html (or your base layout)
Add the noai meta tag inside <head> in your base layout file. Applies to every page that uses that layout. Works with Jekyll themes.
For gem-based themes, copy the theme layout into your own _layouts/ to override it.
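For example, a sketch assuming the minima theme (substitute your theme's gem name; `bundle info --path` prints the gem's install directory):

```shell
# Locate the theme gem (empty if bundler or the gem is missing).
THEME_DIR="$(bundle info --path minima 2>/dev/null || true)"

if [ -n "$THEME_DIR" ] && [ -f "$THEME_DIR/_layouts/default.html" ]; then
  # Copy the gem's layout into the project; Jekyll prefers
  # a local _layouts/ file over the gem's version.
  mkdir -p _layouts
  cp "$THEME_DIR/_layouts/default.html" _layouts/
  echo "Copied; edit _layouts/default.html to add the noai meta tag."
else
  echo "Theme gem not found; check 'bundle info minima'."
fi
```
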
Per-page noai via front matter
Easy · _layouts/default.html + individual page front matter
Use a custom front matter variable (noai: true) combined with a Liquid conditional in your layout. Gives per-page control — protect only specific posts or pages.
Useful if you want to allow AI search indexing for most content but protect certain premium posts.
Cloudflare WAF (custom domains only)
Intermediate · Cloudflare Dashboard → Security → WAF → Custom Rules
Proxy your custom domain through Cloudflare to block AI crawlers at the edge. The only method that stops bots ignoring robots.txt (like Bytespider). Not available for github.io subdomains.
Requires a custom domain — you cannot proxy github.io through Cloudflare.
Method 1: Static robots.txt (Recommended)
Create robots.txt in the root of your Jekyll project — the same directory as _config.yml, index.md, and your _layouts/ folder. Jekyll will copy it verbatim to _site/robots.txt.
1. In your Jekyll project root, create a new file called robots.txt. No front matter, no YAML dashes, just plain text.
2. Add the full AI bot block list:
```
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: DeepSeekBot
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

User-agent: Ai2Bot-Dolma
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: webzio-extended
Disallow: /

User-agent: gemini-deep-research
Disallow: /
```
3. Commit and push:

```shell
git add robots.txt
git commit -m "Block AI training bots via robots.txt"
git push origin main
```
4. GitHub Pages deploys within 1–2 minutes. Verify at https://username.github.io/robots.txt or your custom domain.
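If your _config.yml maintains explicit exclude: or include: lists, make sure robots.txt is not excluded. A sketch (the exclude entries are illustrative, not prescribed by this guide):

```yaml
# _config.yml — make sure robots.txt reaches _site/
include:
  - robots.txt        # force-include alongside any existing entries
exclude:
  - Gemfile           # illustrative entries; keep your own list,
  - Gemfile.lock      # but do NOT list robots.txt here
```
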
If your _config.yml has an exclude: list and robots.txt is on it, Jekyll will skip copying it to _site/. Remove it from the exclude list. Also check whether you have an include: list: if one exists and robots.txt is not in it, Jekyll may ignore the file. Add robots.txt to include explicitly if needed.

Method 2: noai Meta Tag via Layout
Add the noai and noimageai meta tags to every page by editing your base layout file. This tells AI crawlers not to use your content for training, even when they do visit.
1. Open _layouts/default.html (or your base layout; check the layout: key in your pages' front matter to find it).
2. Find the closing </head> tag and add the noai meta tag just before it:

```html
<meta name="robots" content="noai, noimageai">
</head>
```
3. Commit and push. GitHub Pages will rebuild and the tag will appear on every page using that layout.
Using a gem-based theme? If there is no _layouts/ folder in your repo, create one, copy the layout file from the gem into it (e.g. ~/.gem/ruby/VERSION/gems/minima-VERSION/_layouts/default.html), and edit your copy. Jekyll will prefer your local version over the gem's.

Per-page control with front matter:
To protect only specific pages (useful if you want most content indexed by AI search but certain posts private):
In _layouts/default.html, inside <head>:

```html
{% if page.noai %}
<meta name="robots" content="noai, noimageai">
{% endif %}
```

Then in any page's front matter:

```yaml
---
layout: default
title: My Protected Post
noai: true
---
```

Method 3: Cloudflare WAF (Custom Domains)
If your Jekyll site uses a custom domain (not a github.io subdomain), you can proxy it through Cloudflare to block AI bots at the network edge before requests reach GitHub Pages. This is the only method that stops bots that ignore robots.txt (like Bytespider).
You cannot proxy username.github.io through Cloudflare: Cloudflare requires that you control the domain's DNS. For github.io sites, robots.txt and noai meta tags are your only options.

Setup for custom domain on GitHub Pages
1. Add your custom domain to GitHub Pages: repo Settings → Pages → Custom domain.
2. In Cloudflare DNS, add a CNAME record: @ → username.github.io (or the apex A records GitHub provides). Enable the orange proxy cloud ☁️.
3. In Cloudflare: Security → WAF → Custom Rules → Create rule.
4. Set the expression:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Diffbot") or
(http.user_agent contains "meta-externalagent") or
(http.user_agent contains "DeepSeekBot")
```

5. Action: Block. Deploy. The free plan supports basic string-matching rules.
Using Jekyll with GitHub Actions?
If you're using a custom GitHub Actions workflow to build and deploy Jekyll (rather than the default GitHub Pages build), make sure your workflow copies robots.txt to the build output. In most setups this happens automatically since Jekyll passes through root-level files. Verify by checking the generated _site/ directory in your workflow artifacts.
In your GitHub Actions workflow, after jekyll build:

```yaml
# Verify robots.txt is in _site/
- name: Verify robots.txt
  run: |
    if [ -f "_site/robots.txt" ]; then
      echo "robots.txt present in _site/"
      head -5 _site/robots.txt
    else
      echo "WARNING: robots.txt missing from _site/"
      exit 1
    fi
```

Full AI Bot Reference
All 25 AI bots covered by the robots.txt block list above: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Google-Extended, Bytespider, CCBot, PerplexityBot, meta-externalagent, Amazonbot, Applebot-Extended, xAI-Bot, DeepSeekBot, MistralBot, Diffbot, cohere-ai, AI2Bot, Ai2Bot-Dolma, YouBot, DuckAssistBot, omgili, omgilibot, webzio-extended, and gemini-deep-research.
Frequently Asked Questions
Where do I put robots.txt in a Jekyll site?
Put robots.txt in the root of your Jekyll project (the same folder as _config.yml and index.md). Jekyll copies any file in the root that doesn't start with an underscore directly to the _site/ output directory, so a robots.txt in your project root becomes _site/robots.txt and is served at yourdomain.com/robots.txt. Important: do not add robots.txt to your _config.yml exclude: list, or Jekyll will skip copying it to the output entirely. For a plain text robots.txt file, no front matter is needed and Jekyll will pass it through unchanged.
Does GitHub Pages support custom robots.txt?
Yes. GitHub Pages serves any file in your repository's root (or docs/ folder, depending on your Pages source setting). Simply commit a robots.txt file to your repository root and GitHub Pages will serve it at your-username.github.io/robots.txt or your-custom-domain.com/robots.txt. There is no admin panel — just commit the file and push. If you use a custom GitHub Actions workflow, it will build and deploy the change; otherwise the default GitHub Pages build picks it up on the next push.
How do I add a noai meta tag to every Jekyll page?
Edit your base layout file: _layouts/default.html (or whatever your base layout is called — check your pages' layout: front matter). Find the closing </head> tag and add <meta name="robots" content="noai, noimageai"> just before it. This will apply to every page that uses that layout. For Jekyll themes installed as gems, you may need to override the layout by copying it into your own _layouts/ directory first.
Can I use Jekyll front matter to control AI bot access per page?
Yes. You can add a custom front matter variable (e.g. noai: true) to specific pages, then conditionally render the noai meta tag in your layout. In _layouts/default.html, add: {% if page.noai %}<meta name="robots" content="noai, noimageai">{% endif %} inside the <head> block. Then set noai: true in the front matter of any page you want to protect. This gives per-page control without affecting your entire site's AI visibility.
Will blocking AI bots affect GitHub Copilot training?
The relationship between robots.txt and GitHub Copilot training is indirect. Copilot trains primarily on public GitHub repositories (via the GitHub code corpus), not on your website's HTML. Blocking web crawlers like GPTBot or CCBot in robots.txt affects scraping of your deployed website, not your repository code. If you want to opt out of GitHub Copilot training at the repository level, you need to set your repository to private or check GitHub's current opt-out mechanisms in your repository settings.
How do I block AI bots on GitHub Pages without a custom domain?
The approach is the same with or without a custom domain: commit a robots.txt file to your repository root (or docs/ folder if that's your Pages source). The file will be served at username.github.io/robots.txt. Cloudflare WAF requires a custom domain since github.io is not proxied through Cloudflare. For github.io subdomains, robots.txt and noai meta tags are your only options — server-level blocking is not available.
Is your site protected from AI bots?
Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.