How to Block AI Bots on Jekyll & GitHub Pages
Jekyll's clean HTML output and GitHub Pages' open accessibility make them favourite targets for AI training crawlers. Here's how to lock out 25+ AI bots with a static robots.txt, noai meta tags, and Cloudflare WAF — no server access needed.
Jekyll robots.txt is simpler than most CMSs — but there's one gotcha
Unlike WordPress or Magento, Jekyll has no admin panel and no robots.txt generator. You just create a robots.txt file in your project root and commit it. Jekyll copies it to _site/robots.txt unchanged. Don't add YAML front matter (---) to robots.txt — if you do, Jekyll will try to process it as a template, which can break the output. Plain text only.
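A quick pre-commit sanity check can catch the front matter gotcha. This is a sketch: the sample file below stands in for your real robots.txt — run the `head`/`grep` check against your actual file.

```shell
# Create a minimal sample robots.txt (stand-in for your real file).
printf 'User-agent: GPTBot\nDisallow: /\n' > robots.txt

# A safe robots.txt must NOT begin with the YAML front matter
# delimiter "---", or Jekyll will try to process it as a template.
if head -n 1 robots.txt | grep -q '^---'; then
  echo "WARNING: robots.txt starts with front matter"
else
  echo "OK: plain text robots.txt"
fi
```
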
Quick fix — create robots.txt in your Jekyll root
Same folder as _config.yml. Commit and push — GitHub Pages deploys immediately.
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Available Methods
Static robots.txt in project root (Recommended)
Easy · robots.txt (root of Jekyll project, same level as _config.yml)
Jekyll passes any root-level file through to _site/ unchanged. A plain robots.txt with no front matter is copied as-is and served at your domain. Works on all Jekyll setups including GitHub Pages.
No front matter needed — if you add --- front matter, Jekyll may process it. Keep it as plain text.
noai meta tag in base layout
Easy · _layouts/default.html (or your base layout)
Add the noai meta tag inside <head> in your base layout file. Applies to every page that uses that layout. Works with Jekyll themes.
For gem-based themes, copy the theme layout into your own _layouts/ to override it.
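For example, a sketch assuming the minima theme (substitute your theme's gem name; `bundle info --path` prints the gem's install directory):

```shell
# Locate the theme gem (empty if bundler or the gem is missing).
THEME_DIR="$(bundle info --path minima 2>/dev/null || true)"

if [ -n "$THEME_DIR" ] && [ -f "$THEME_DIR/_layouts/default.html" ]; then
  # Copy the gem's layout into the project; Jekyll prefers
  # a local _layouts/ file over the gem's version.
  mkdir -p _layouts
  cp "$THEME_DIR/_layouts/default.html" _layouts/
  echo "Copied; edit _layouts/default.html to add the noai meta tag."
else
  echo "Theme gem not found; check 'bundle info minima'."
fi
```
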
Per-page noai via front matter
Easy · _layouts/default.html + individual page front matter
Use a custom front matter variable (noai: true) combined with a Liquid conditional in your layout. Gives per-page control — protect only specific posts or pages.
Useful if you want to allow AI search indexing for most content but protect certain premium posts.
Cloudflare WAF (custom domains only)
Intermediate · Cloudflare Dashboard → Security → WAF → Custom Rules
Proxy your custom domain through Cloudflare to block AI crawlers at the edge. The only method that stops bots ignoring robots.txt (like Bytespider). Not available for github.io subdomains.
Requires a custom domain — you cannot proxy github.io through Cloudflare.
Method 1: Static robots.txt (Recommended)
Create robots.txt in the root of your Jekyll project — the same directory as _config.yml, index.md, and your _layouts/ folder. Jekyll will copy it verbatim to _site/robots.txt.
1. In your Jekyll project root, create a new file called robots.txt. No front matter, no YAML dashes, just plain text.
2. Add the full AI bot block list:
```
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: xAI-Bot
Disallow: /

User-agent: DeepSeekBot
Disallow: /

User-agent: MistralBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

User-agent: Ai2Bot-Dolma
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: DuckAssistBot
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: webzio-extended
Disallow: /

User-agent: gemini-deep-research
Disallow: /
```
3. Commit and push:

```shell
git add robots.txt
git commit -m "Block AI training bots via robots.txt"
git push origin main
```
4. GitHub Pages deploys within 1–2 minutes. Verify at https://username.github.io/robots.txt or your custom domain.
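If your _config.yml maintains explicit exclude: or include: lists, make sure robots.txt is not excluded. A sketch (the exclude entries are illustrative, not prescribed by this guide):

```yaml
# _config.yml — make sure robots.txt reaches _site/
include:
  - robots.txt        # force-include alongside any existing entries
exclude:
  - Gemfile           # illustrative entries; keep your own list,
  - Gemfile.lock      # but do NOT list robots.txt here
```
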
If your _config.yml has an exclude: list and robots.txt is on it, Jekyll will skip copying it to _site/. Remove it from the exclude list. Also check whether you have an include: list: if one exists and robots.txt is not in it, Jekyll may ignore the file. Add robots.txt to include explicitly if needed.

Method 2: noai Meta Tag via Layout
Add the noai and noimageai meta tags to every page by editing your base layout file. This tells AI crawlers not to use your content for training, even when they do visit.
1. Open _layouts/default.html (or your base layout; check the layout: key in your pages' front matter to find it).
2. Find the closing </head> tag and add the noai meta tag just before it:

```html
<meta name="robots" content="noai, noimageai">
</head>
```
3. Commit and push. GitHub Pages will rebuild and the tag will appear on every page using that layout.
Using a gem-based theme? If there is no _layouts/ folder in your repo, create one, copy the layout file from the gem into it (e.g. ~/.gem/ruby/VERSION/gems/minima-VERSION/_layouts/default.html), and edit your copy. Jekyll will prefer your local version over the gem's.

Per-page control with front matter:
To protect only specific pages (useful if you want most content indexed by AI search but certain posts private):
In _layouts/default.html, inside <head>:

```html
{% if page.noai %}
<meta name="robots" content="noai, noimageai">
{% endif %}
```

Then in any page's front matter:

```yaml
---
layout: default
title: My Protected Post
noai: true
---
```

Method 3: Cloudflare WAF (Custom Domains)
If your Jekyll site uses a custom domain (not a github.io subdomain), you can proxy it through Cloudflare to block AI bots at the network edge before requests reach GitHub Pages. This is the only method that stops bots that ignore robots.txt (like Bytespider).
You cannot proxy username.github.io through Cloudflare: Cloudflare requires that you control the domain's DNS. For github.io sites, robots.txt and noai meta tags are your only options.

Setup for custom domain on GitHub Pages
1. Add your custom domain to GitHub Pages: repo Settings → Pages → Custom domain.
2. In Cloudflare DNS, add a CNAME record: @ → username.github.io (or the apex A records GitHub provides). Enable the orange proxy cloud ☁️.
3. In Cloudflare: Security → WAF → Custom Rules → Create rule.
4. Set the expression:

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Diffbot") or
(http.user_agent contains "meta-externalagent") or
(http.user_agent contains "DeepSeekBot")
```

5. Action: Block. Deploy. The free plan supports basic string-matching rules.
Using Jekyll with GitHub Actions?
If you're using a custom GitHub Actions workflow to build and deploy Jekyll (rather than the default GitHub Pages build), make sure your workflow copies robots.txt to the build output. In most setups this happens automatically since Jekyll passes through root-level files. Verify by checking the generated _site/ directory in your workflow artifacts.
In your GitHub Actions workflow, after jekyll build:

```yaml
# Verify robots.txt is in _site/
- name: Verify robots.txt
  run: |
    if [ -f "_site/robots.txt" ]; then
      echo "robots.txt present in _site/"
      head -5 _site/robots.txt
    else
      echo "WARNING: robots.txt missing from _site/"
      exit 1
    fi
```

Full AI Bot Reference
All 25 AI bots covered by the robots.txt block list above: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Google-Extended, Bytespider, CCBot, PerplexityBot, meta-externalagent, Amazonbot, Applebot-Extended, xAI-Bot, DeepSeekBot, MistralBot, Diffbot, cohere-ai, AI2Bot, Ai2Bot-Dolma, YouBot, DuckAssistBot, omgili, omgilibot, webzio-extended, and gemini-deep-research.
Frequently Asked Questions
Where do I put robots.txt in a Jekyll site?
Put robots.txt in the root of your Jekyll project (the same folder as _config.yml and index.md). Jekyll copies any file in the root that doesn't start with an underscore directly to the _site/ output directory, so a robots.txt in your project root becomes _site/robots.txt and is served at yourdomain.com/robots.txt. Important: do not add robots.txt to your _config.yml exclude: list, or Jekyll will skip copying it to the output entirely. For a plain text robots.txt file, no front matter is needed and Jekyll will pass it through unchanged.
Does GitHub Pages support custom robots.txt?
Yes. GitHub Pages serves any file in your repository's root (or docs/ folder, depending on your Pages source setting). Simply commit a robots.txt file to your repository root and GitHub Pages will serve it at your-username.github.io/robots.txt or your-custom-domain.com/robots.txt. There is no admin panel — just commit the file and push. If you use a custom GitHub Actions workflow, it will build and deploy the change; otherwise the default GitHub Pages build picks it up on the next push.
How do I add a noai meta tag to every Jekyll page?
Edit your base layout file: _layouts/default.html (or whatever your base layout is called — check your pages' layout: front matter). Find the closing </head> tag and add <meta name="robots" content="noai, noimageai"> just before it. This will apply to every page that uses that layout. For Jekyll themes installed as gems, you may need to override the layout by copying it into your own _layouts/ directory first.
Can I use Jekyll front matter to control AI bot access per page?
Yes. You can add a custom front matter variable (e.g. noai: true) to specific pages, then conditionally render the noai meta tag in your layout. In _layouts/default.html, add: {% if page.noai %}<meta name="robots" content="noai, noimageai">{% endif %} inside the <head> block. Then set noai: true in the front matter of any page you want to protect. This gives per-page control without affecting your entire site's AI visibility.
Will blocking AI bots affect GitHub Copilot training?
The relationship between robots.txt and GitHub Copilot training is indirect. Copilot trains primarily on public GitHub repositories (via the GitHub code corpus), not on your website's HTML. Blocking web crawlers like GPTBot or CCBot in robots.txt affects scraping of your deployed website, not your repository code. If you want to opt out of GitHub Copilot training at the repository level, you need to set your repository to private or check GitHub's current opt-out mechanisms in your repository settings.
How do I block AI bots on GitHub Pages without a custom domain?
The approach is the same with or without a custom domain: commit a robots.txt file to your repository root (or docs/ folder if that's your Pages source). The file will be served at username.github.io/robots.txt. Cloudflare WAF requires a custom domain since github.io is not proxied through Cloudflare. For github.io subdomains, robots.txt and noai meta tags are your only options — server-level blocking is not available.
Is your site protected from AI bots?
Run a free scan to check your robots.txt, meta tags, and overall AI readiness score.