How AI Search Engines Decide What to Surface
What Perplexity, ChatGPT Search, Google AI Overviews, and Claude look for when choosing which pages to cite — and exactly how to make sure yours is one of them.
How AI Search Works (vs Traditional SEO)
Traditional search returns a ranked list of links. You click, you land on a page. AI search — Perplexity, ChatGPT Search, Google AI Overviews — returns a synthesised answer with cited sources. The user often never visits your page at all.
This changes everything about what "visibility" means. In traditional SEO, ranking #7 still sends you clicks. In AI search, only cited sources get attributed — and citation doesn't always correlate with traditional ranking.
Traditional SEO vs AI Search (GEO)
The emerging term for AI search optimisation is GEO (Generative Engine Optimisation), sometimes called AEO (Answer Engine Optimisation). It is not a replacement for SEO — it is a layer on top. Sites with strong technical SEO foundations perform better in AI search too. But there are specific AI-only signals that traditional SEO completely ignores.
The 4 Platforms: How Each Works
Each AI search platform has a different architecture. Understanding the difference helps you prioritise what to fix first.
Perplexity AI
Real-time web search + synthesis
Perplexity retrieves pages from the live web at query time, then synthesises an answer across multiple sources. Its documented user agents are PerplexityBot, which crawls pages for its search index, and Perplexity-User, which fetches pages on behalf of a user's query. This means your page must be accessible right now, not just indexed historically.
What it prioritises:
- ✓ PerplexityBot not blocked in robots.txt (non-negotiable)
- ✓ Fast page load (slow pages are skipped under time pressure)
- ✓ Clean, parseable HTML with minimal JavaScript rendering required
- ✓ Specific, citable facts and statistics
- ✓ Clear heading hierarchy (H1 → H2 → H3)
- ✓ Authoritative domain signals (age, backlinks still matter)
ChatGPT Search (OAI-SearchBot)
Real-time search + trained knowledge hybrid
ChatGPT search uses OAI-SearchBot for real-time web retrieval and ChatGPT-User for browsing on behalf of users. It blends live search results with its pre-trained knowledge. This means even if your page isn't crawled in real-time, your brand can still appear — but citation sources are pulled from live results.
What it prioritises:
- ✓ OAI-SearchBot allowed in robots.txt (separate from GPTBot)
- ✓ llms.txt for brand/content context between sessions
- ✓ High-quality backlink profile (uses Bing index signals)
- ✓ Structured data helps parse content type and entities
- ✓ Clear "About" and "Who wrote this" signals for trust
Google AI Overviews
Index-based synthesis (Gemini-powered)
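Google AI Overviews (formerly SGE) are powered by Gemini and draw from Google's existing search index. You cannot be in AI Overviews if you're not indexed by Google first. Google-Extended is often mistaken for the crawler behind this feature, but it is a robots.txt control token rather than a separate crawler: per Google's crawler documentation, it governs whether your content is used for Gemini model training and grounding, and it does not affect inclusion or ranking in Google Search. Eligibility for AI Overviews depends on Googlebot access and the usual indexing and snippet controls (noindex, nosnippet, max-snippet).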
What it prioritises:
- ✓ Googlebot not blocked in robots.txt and the page indexable (Google-Extended controls Gemini training, not Search inclusion)
- ✓ E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness)
- ✓ Structured data (JSON-LD) for content type recognition
- ✓ Core Web Vitals, since Google AI Overviews inherit traditional ranking signals
- ✓ Concise, direct answers near the top of the page
- ✓ FAQPage and HowTo schema especially well-represented in AI Overviews
Claude (Anthropic)
Training-based knowledge + web tools
Claude's base knowledge comes from training data, which ClaudeBot crawls the web to collect. Claude also has web access tools. Unlike Perplexity, Claude doesn't cite sources in every response by default, but when used in agentic workflows or via web tools it follows similar signals to Perplexity. anthropic-ai is an older Anthropic user-agent token; the crawlers Anthropic currently documents are ClaudeBot (training), Claude-User (user-initiated fetches), and Claude-SearchBot (search indexing).
What it prioritises:
- ✓ llms.txt at the domain root, which some AI tools and agents read for site context
- ✓ High-quality, citable factual content in the training corpus
- ✓ ClaudeBot / anthropic-ai allowed if you want future training inclusion
- ✓ Clean text structure, since Claude is highly sensitive to content clarity
The 7 Ranking Signals That Actually Matter
Across all four platforms, these are the signals with the highest leverage. Ranked by impact:
AI Search Bots Not Blocked
● Critical
This is the one. If PerplexityBot, OAI-SearchBot, or Googlebot is blocked in your robots.txt, intentionally or accidentally, you are invisible in that platform's answers. Full stop.
The most common mistake: adding a wildcard block (User-agent: * / Disallow: /) that sweeps up search bots alongside training bots. Check your robots.txt carefully. Blocking GPTBot is fine. Blocking PerplexityBot is not.
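One way to audit for this mistake is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the example.com URL below are made up for illustration, and the bot list is only the search bots named in this guide. The standard-library matcher is simpler than what real crawlers implement, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt showing the wildcard mistake: the author meant to
# block only GPTBot, but the "*" group blocks every other bot as well.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

# Search bots named in this guide (not an exhaustive list).
SEARCH_BOTS = ["PerplexityBot", "OAI-SearchBot", "ChatGPT-User"]


def blocked_bots(robots_txt: str, bots: list[str],
                 url: str = "https://example.com/") -> list[str]:
    """Return the bots that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


print(blocked_bots(ROBOTS_TXT, SEARCH_BOTS))
# → ['PerplexityBot', 'OAI-SearchBot', 'ChatGPT-User']
```

To check a live site, the same parser can load the file over HTTP with `set_url(...)` followed by `read()` instead of `parse(...)`.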
Structured Data (JSON-LD)
● High
JSON-LD schema tells AI models what type of content they're reading. An Article schema says "this is an editorial piece by an author." A FAQPage schema says "these are questions and answers; cite them." A Product schema says "here are specs and pricing."
Without schema, AI models have to infer content type from raw text — and they often get it wrong, which means your page gets cited in the wrong context or not at all.
Priority schemas for AI search: Article, FAQPage, HowTo, Organization, Product, LocalBusiness.
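As an illustration of one of these schemas, here is a small sketch that builds FAQPage JSON-LD with Python's `json` module. The property names follow schema.org's FAQPage, Question, and Answer types; the helper names `faq_jsonld` and `as_script_tag` and the sample question are hypothetical.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage dict from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialise the dict into a <script> tag for the page's <head>."""
    body = json.dumps(data, indent=2)
    # Escape "</" so answer text can never close the script tag early;
    # "<\/" is still valid JSON.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(as_script_tag(faq_jsonld([
    ("What is GEO?", "Generative Engine Optimisation: optimising for AI citations."),
])))
```

The questions and answers in the markup should match the visible page content word for word, otherwise the schema describes something a reader cannot see.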
llms.txt File
● High
llms.txt is a markdown file at your domain root that tells AI assistants what your site is about, what content is valuable, and which pages to prioritise. It's like a site brief you write specifically for AI models.
llms.txt is a proposed convention rather than a formal standard, and support varies by tool: some AI agents and assistants read it during context-building, others ignore it. Where it is read, a well-written llms.txt gives better contextual framing in AI answers, so your brand identity is more consistent across AI responses.
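For a sense of what such a file contains, the sketch below assembles one in the layout described by the llms.txt proposal: an H1 title, a blockquote summary, then H2 sections of annotated links. The function name `build_llms_txt`, the site name, and every URL are placeholders, not a recommended set of pages.

```python
def build_llms_txt(title, summary, sections):
    """Render llms.txt markdown.

    `sections` maps a section heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content for a hypothetical site.
print(build_llms_txt(
    "Example Co",
    "Example Co publishes guides on AI search visibility.",
    {
        "Guides": [
            ("AI search signals", "https://example.com/guides/ai-search",
             "How AI search engines choose sources"),
        ],
        "Tools": [
            ("robots.txt generator", "https://example.com/tools/robots",
             "Builds a robots.txt file"),
        ],
    },
))
```

The output is saved as `/llms.txt` at the domain root, next to robots.txt.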
Content Clarity and Answer Structure
● High
AI search models extract the most citable, confident-sounding claims from your content. Pages that bury their answer in 500 words of preamble are less likely to be cited than pages that put the direct answer first.
Effective structure for AI citation:
- Lead with the direct answer (first 100 words)
- Use clear H2/H3 headings that mirror likely search queries
- Include specific, verifiable facts and data points
- FAQ sections are extremely powerful: they match question-intent queries 1:1
- Short, punchy paragraphs over walls of text
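The heading advice can be checked mechanically. The following sketch uses the standard-library `html.parser` to list places where a page jumps more than one heading level (for example H1 straight to H3). It is a rough heuristic written for this guide, with the hypothetical names `HeadingOutline` and `skipped_levels`, not a measure any platform is known to use.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as <hr> or <head> are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def skipped_levels(html):
    """Return (previous_level, level) pairs where a level was skipped.

    A previous_level of 0 means the page started below H1.
    """
    outline = HeadingOutline()
    outline.feed(html)
    problems = []
    previous = 0
    for level in outline.levels:
        if level > previous + 1:
            problems.append((previous, level))
        previous = level
    return problems


print(skipped_levels("<h1>Guide</h1><h3>Details</h3><h2>FAQ</h2>"))
# → [(1, 3)]
```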
Sitemap.xml
● Medium
A sitemap tells all crawlers, including AI search bots, which pages exist and when they were last updated. Without a sitemap, important deep pages may never be discovered, or may be picked up only in an out-of-date form.
Most modern frameworks auto-generate sitemaps (Next.js has built-in sitemap support; WordPress has Yoast/RankMath). If you don't have one, this is the fastest fix in the list.
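If your stack has no built-in generator, a sitemap is small enough to produce by hand. This sketch uses Python's standard-library `xml.etree.ElementTree` and the namespace from the sitemaps.org protocol; the function name `build_sitemap`, the URLs, and the dates are placeholders.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    `last_modified` is a date string such as "2024-05-01".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    # Write the declaration by hand so the declared encoding is always UTF-8.
    declaration = '<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="unicode")


# Placeholder pages for a hypothetical site.
print(build_sitemap([
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/guides/ai-search", "2024-04-18"),
]))
```

Saving the result as `/sitemap.xml` and listing it in robots.txt with a `Sitemap:` line lets crawlers find it without guessing the path.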
Meta Description and Open Graph
● Medium
Meta descriptions aren't ranking signals in traditional SEO, but in AI search they're content signals. AI models read your meta description as a compressed summary of the page. A vague or missing description makes the model work harder to infer what the page is about.
Open Graph tags (og:title, og:description) provide a second layer of content framing. They're parsed by social crawlers, AI summary tools, and link previewers. A well-written og:description can influence how AI tools describe your page in answers.
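A quick way to see whether a page has these tags is to parse its HTML and report what is missing. The sketch below uses the standard-library `html.parser`; the names `MetaExtractor` and `missing_tags` and the sample HTML are illustrative, and an empty `content` attribute is counted as missing.

```python
from html.parser import HTMLParser

# The tags discussed in this section.
WANTED = ("description", "og:title", "og:description")


class MetaExtractor(HTMLParser):
    """Collect the content of the <meta> tags listed in WANTED."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Standard meta tags use name="..."; Open Graph uses property="...".
        key = attr.get("name") or attr.get("property")
        if key in WANTED and attr.get("content"):
            self.found[key] = attr["content"]


def missing_tags(html):
    """Return the WANTED tags that are absent or empty in `html`."""
    extractor = MetaExtractor()
    extractor.feed(html)
    return [key for key in WANTED if key not in extractor.found]


sample = """
<head>
  <meta name="description" content="How AI search engines pick sources.">
  <meta property="og:title" content="AI Search Signals">
</head>
"""
print(missing_tags(sample))
# → ['og:description']
```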
Domain Authority and Trust Signals
● Medium
AI search isn't immune to authority. Perplexity and ChatGPT Search both weight high-authority domains more heavily when there are multiple sources making the same claim. A claim on a site with 10k backlinks beats the same claim on a new domain, all else equal.
However, authority is less dominant in AI search than in traditional SEO. A new site with excellent content structure, schema, llms.txt, and a clean robots.txt can outpunch established sites that haven't adapted to AI search signals.
Training Bots vs Search Bots: The Critical Difference
This is the single most common and most damaging mistake site owners make. Training bots and search bots are completely separate — but they are often blocked together.
Bot Type Reference
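The safe pattern: Block GPTBot, ClaudeBot, CCBot, and Bytespider (training bots) so your content stays out of AI training corpora. Keep PerplexityBot, OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, and Googlebot allowed so you stay visible in AI search answers. Google-Extended is a training-side control token for Gemini, not a search crawler, so blocking or allowing it is a separate decision that does not change AI Overviews eligibility.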
Use our robots.txt Generator to build exactly this configuration with one click (preset: "Block AI Training Only").
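For readers who prefer to write the file by hand, here is a rough sketch of the same pattern: one Disallow group per training bot, then a wildcard group that allows everything else. The bot lists are taken from this guide and are not exhaustive, `build_robots_txt` is a hypothetical helper, and the final check relies on Python's `urllib.robotparser`, which approximates but does not exactly reproduce how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Bot names as listed in this guide; adjust to your own policy.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]
SEARCH_BOTS = ["PerplexityBot", "OAI-SearchBot", "ChatGPT-User"]


def build_robots_txt(blocked):
    """Block each named bot explicitly; leave every other bot allowed."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    # The wildcard group allows, rather than disallows, so search bots
    # that have no group of their own stay permitted.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(TRAINING_BOTS)
print(robots_txt)

# Sanity check: training bots are refused, search bots are not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for bot in TRAINING_BOTS + SEARCH_BOTS:
    allowed = parser.can_fetch(bot, "https://example.com/any-page")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Naming each blocked bot, instead of blocking `*` and carving out exceptions, means a newly launched search bot is allowed by default rather than silently excluded.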
AI Search Optimisation Checklist
Work through this in order. Items near the top have the highest leverage.
Check your score automatically
The AI Search Visibility Checker scores your site against 7 of these checks in under 10 seconds.
FAQ
Q: How does Perplexity decide which pages to cite?
Perplexity crawls the live web before answering every query. It prioritises pages that are accessible to its bot (not blocked in robots.txt), load quickly, have clear structured content, contain specific factual claims, and have a coherent heading hierarchy. Authority signals (domain age, backlinks) are a secondary factor.
Q: Does having an llms.txt file help with AI search ranking?
It helps with citation framing more than raw ranking. llms.txt gives AI assistants a curated description of your site and its most important pages, which can lead to more accurate, consistent brand representation in AI answers. Support varies by tool: some AI agents and assistants read it, others ignore it.
Q: What is the difference between traditional SEO and GEO (Generative Engine Optimisation)?
Traditional SEO targets blue-link rankings. GEO targets citations inside AI-generated answers. GEO rewards content clarity, schema markup, and AI bot access over keyword density and link quantity. The two are complementary — strong technical SEO helps GEO — but GEO requires specific additional signals.
Q: Will blocking AI training bots hurt my AI search ranking?
No. Training bots (GPTBot, CCBot, Bytespider) and search bots (PerplexityBot, OAI-SearchBot) are completely separate, and Google-Extended controls Gemini training rather than inclusion in Google Search. You can block training crawlers freely. The mistake is accidentally blocking search bots at the same time, which happens when people use overly broad wildcard rules.
Q: How do I check if my site is set up for AI search?
Use the Open Shadow AI Search Visibility Checker (/tools/ai-visibility). It scores 7 key signals in under 10 seconds and gives you specific, prioritised fixes.
Q: How long does it take for changes to affect AI search visibility?
Perplexity is near-real-time — changes can reflect within hours or days. ChatGPT Search re-crawls on its own schedule (typically days to weeks). Google AI Overviews reflect changes when Google re-indexes your page, which for established sites is typically days. Training-based knowledge (Claude base model) only updates with new training runs — typically months.