July 1, 2026 · 14 min read · AI Agency · robots.txt · GEO · AI Crawlers

robots.txt for AI Crawlers: The 2026 Configuration Guide

How to configure your robots.txt file for AI crawlers in 2026. Covers the critical difference between training bots and search bots, which crawlers to allow and block, LLMsTxt directives, and a layered defense strategy for managing GPTBot, ClaudeBot, PerplexityBot, and more.

Shubhamraj Singh Product Manager · Program Manager · Marketing Strategist

robots.txt Is the Only File with Real Enforcement Power

In the growing ecosystem of AI governance files, robots.txt remains the only one with real teeth. ai.txt declares your preferences. llms.txt curates your content. But robots.txt is the file that well-behaved AI crawlers actually check and obey when deciding whether to access your pages.

This distinction matters enormously. A well-configured robots.txt file is the difference between strategic AI visibility and either total invisibility or uncontrolled content consumption. And in 2026, the configuration required is far more nuanced than it was even a year ago. The days of simple “allow all” or “block all” directives are over. The new standard is selective access: allow search and retrieval bots, restrict training bots, and do both with precision.

After configuring robots.txt files for dozens of client websites through our AI Agency practice, I can tell you that most organisations get this wrong. They either block everything (losing AI search visibility) or allow everything (surrendering their content to training datasets). This guide walks through the configuration that actually serves your business interests.

The Critical Distinction: Training Bots vs Search Bots

The most important concept for robots.txt configuration in 2026 is the distinction between two categories of AI crawlers. Understanding this distinction is the foundation of every decision you make in your robots.txt file.

Training Crawlers

Training crawlers visit your website to collect content that will be incorporated into AI training datasets. Your blog posts, documentation, and articles become training data for future model versions. Once your content enters a training dataset, you lose all control over how it is used. The model may generate responses that paraphrase, summarise, or derive from your content without any attribution or link back to your site.

For most businesses, allowing training crawlers provides minimal direct benefit. Your content improves someone else’s AI model, but you receive no traffic, citation, or attribution in return. The exception is organisations that actively want their content embedded into AI models for thought leadership or standard-setting purposes.

The major training crawlers to be aware of:

GPTBot (OpenAI). This is OpenAI’s general-purpose web crawler that collects data for training future GPT models. It is separate from OpenAI’s search-specific crawlers, so blocking GPTBot does not affect your visibility in ChatGPT Search.

ClaudeBot (Anthropic). Anthropic’s training data crawler. Like GPTBot, blocking this does not affect your visibility in Claude’s search or web browsing features.

CCBot (Common Crawl). The Common Crawl project maintains a massive open web archive that is widely used as a training data source by AI companies. Many organisations block CCBot because its dataset is freely available to any AI company for training purposes.

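Google-Extended (Google). Despite the name, this is not a separate crawler. It is a robots.txt product token that Google’s existing crawlers read to decide whether content they fetch may be used for training Gemini and other Google AI models. It sends no user-agent string of its own, so it can only be controlled through robots.txt. Importantly, blocking Google-Extended does not affect your regular Google Search indexing or your visibility in Google AI Overviews.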

Meta-ExternalAgent (Meta). Meta’s crawler for collecting web data used in training Llama and other Meta AI models. Blocking this does not affect any Meta platform functionality.

Search and Retrieval Crawlers

Search crawlers visit your website to index content that will be retrieved and cited in AI-generated answers. When a user asks Perplexity a question and Perplexity cites your website in its answer, a search crawler made that possible. These crawlers are the mechanism through which your content gains visibility in AI search engines.

Blocking search crawlers directly reduces your visibility in AI-powered search results. For most businesses focused on Generative Engine Optimization, blocking these crawlers is counterproductive.

The major search and retrieval crawlers to allow:

OAI-SearchBot (OpenAI). This is OpenAI’s search-specific crawler, separate from GPTBot. It retrieves content for ChatGPT’s search feature. Allowing OAI-SearchBot while blocking GPTBot gives you visibility in ChatGPT Search without contributing to model training.

ChatGPT-User (OpenAI). This crawler is triggered when a ChatGPT user explicitly browses the web during a conversation. It acts on behalf of the user, retrieving specific pages in real time. Blocking this removes your content from ChatGPT’s live browsing capability.

PerplexityBot (Perplexity). Perplexity’s crawler for indexing and retrieving content for their AI search engine. Perplexity is one of the most transparent AI search engines about citation and attribution, making it a high-value channel for AEO strategies.

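Claude-Web (Anthropic). Anthropic’s search and retrieval crawler for Claude’s web browsing feature, separate from ClaudeBot (the training crawler). Anthropic’s published crawler documentation also names Claude-User and Claude-SearchBot for user-initiated fetches and search indexing, so check the current token list before relying on Claude-Web alone.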
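
The two categories above can be kept as a small lookup table so that every later decision (robots.txt rules, WAF rules, log checks) draws on one list. The following is a minimal Python sketch, not an authoritative registry: the token names are the ones used in this guide, and the function names classify and recommended_rule are hypothetical helpers introduced for illustration.

```python
# Crawler tokens as named in this guide. Vendors rename and add bots,
# so treat this table as a snapshot to verify, not an authoritative list.
TRAINING = {"GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Meta-ExternalAgent"}
SEARCH = {"OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-Web"}

_TRAINING_LOWER = {token.lower() for token in TRAINING}
_SEARCH_LOWER = {token.lower() for token in SEARCH}


def classify(token: str) -> str:
    """Return 'training', 'search', or 'unknown' for a robots.txt user-agent token."""
    lowered = token.strip().lower()
    if lowered in _TRAINING_LOWER:
        return "training"
    if lowered in _SEARCH_LOWER:
        return "search"
    return "unknown"


def recommended_rule(token: str) -> str:
    """Rule this guide suggests: block training crawlers, allow everything else."""
    return "Disallow: /" if classify(token) == "training" else "Allow: /"
```

An "unknown" result is the interesting case in practice: it marks a crawler that needs a deliberate decision rather than a default.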

The 2026 robots.txt Configuration

Here is the robots.txt configuration template I recommend for most business websites. This configuration allows search and retrieval crawlers while blocking training crawlers, includes sitemap and LLMs.txt references, and maintains standard search engine access.

# robots.txt - AI Crawler Configuration
# Last updated: 2026-07-01
# Configuration strategy: Allow search/retrieval, block training

# ============================================
# STANDARD SEARCH ENGINES - Allow full access
# ============================================
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Yandex
Allow: /

# ============================================
# AI SEARCH/RETRIEVAL BOTS - Allow (these drive AI search visibility)
# ============================================
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-Web
Allow: /

# ============================================
# AI TRAINING BOTS - Block (these consume content for model training)
# ============================================
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# ============================================
# DEFAULT - Allow all other crawlers
# ============================================
User-agent: *
Allow: /

# ============================================
# SITEMAP AND AI GOVERNANCE REFERENCES
# ============================================
Sitemap: https://yourdomain.com/sitemap.xml
LLMsTxt: https://yourdomain.com/llms.txt
LLMsFullTxt: https://yourdomain.com/llms-full.txt
AI-Policy: https://yourdomain.com/ai.txt
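
One way to sanity-check a template like this before deploying it is to feed it to a parser and ask what each crawler may fetch. The sketch below uses Python's standard-library urllib.robotparser on an abbreviated copy of the template (two bots per category). Keep in mind that each vendor's crawler runs its own parser, so this confirms intent rather than guaranteeing behaviour; "SomeOtherBot" is a made-up name standing in for any crawler without its own group.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the template: two search bots, two training bots, default.
TEMPLATE = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def load(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load(TEMPLATE)
PAGE = "https://yourdomain.com/blog/post"
for bot in ("OAI-SearchBot", "PerplexityBot", "GPTBot", "CCBot", "SomeOtherBot"):
    verdict = "allowed" if parser.can_fetch(bot, PAGE) else "blocked"
    print(f"{bot}: {verdict}")
```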

Let me walk through each section and the reasoning behind it.

Section-by-Section Breakdown

Standard Search Engines

Always explicitly allow standard search engine crawlers. While User-agent: * with Allow: / technically covers these, explicit declarations prevent accidental blocks if you later add restrictive rules. This section is straightforward and rarely needs modification.

AI Search and Retrieval Bots

These are the crawlers that power AI search results and citations. Allowing them is essential for any organisation pursuing GEO or AEO strategies. Each of these crawlers represents a channel where your content can appear in AI-generated answers, complete with citations and attribution.

I recommend allowing all four major search crawlers unless you have a specific business reason to block one. For example, if your content is behind a paywall and you do not want AI search engines surfacing it in free answers, you might block PerplexityBot and OAI-SearchBot. But for most content-driven websites, the visibility benefit outweighs any concerns.

AI Training Bots

Blocking training crawlers is the default recommendation for most businesses. Your content marketing investments should drive traffic and authority for your brand, not improve someone else’s AI model. Blocking these five crawlers covers the majority of active training data collection.

Note that blocking training crawlers does not eliminate the possibility that your content has already been included in training datasets. Models trained on web data collected before you added these blocks may already contain your content. robots.txt is forward-looking: it prevents future crawling, not retroactive use.

The Default Rule: Never Use Disallow: / Under User-agent: *

This is the most critical rule in the entire configuration. Never set Disallow: / under User-agent: *. This blocks every crawler that does not have its own explicit rule, including search engines you have not heard of, accessibility tools, monitoring services, and legitimate bots you want to allow.

The correct default is Allow: / under User-agent: *, which permits access to any crawler not specifically addressed. You then use targeted Disallow rules for specific crawlers you want to block. This approach ensures new, beneficial crawlers can access your site while known training crawlers are blocked.

I have seen multiple client websites with Disallow: / under User-agent: * that were effectively invisible to every AI search engine, including the ones they wanted visibility in. This single misconfiguration can undo months of GEO and content strategy work.
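
This misconfiguration is easy to detect automatically. Below is a minimal Python sketch of such a check, assuming the file follows the usual group layout (one or more User-agent lines followed by rule lines). The function name has_blanket_block is hypothetical, and the check deliberately ignores whether other Allow rules in the same group soften the block.

```python
def has_blanket_block(robots_txt: str) -> bool:
    """True if a group that includes `User-agent: *` contains `Disallow: /`."""
    agents: list[str] = []  # user-agents of the group currently being read
    in_rules = False        # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False
```

Running a check like this in a deployment pipeline catches the problem before it reaches production.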

Sitemap and AI Governance References

Always include a Sitemap reference in your robots.txt file. This helps all crawlers (traditional and AI) discover your content efficiently. The sitemap is the roadmap that ensures crawlers find every page you want indexed.

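The LLMsTxt and LLMsFullTxt directives are newer additions that reference your llms.txt files. They are not part of the Robots Exclusion Protocol (RFC 9309), so a standard parser typically skips them as unrecognised lines; they exist to help AI crawlers that look for them discover your LLM-optimised content guide. Not all AI systems honour these directives yet, but including them is forward-compatible and costs nothing.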

The AI-Policy directive references your ai.txt file, helping AI crawlers discover your usage policy declarations. As the ai.txt specification matures, this cross-reference will become increasingly important.
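
Mechanically, these references are plain "Name: URL" lines that sit outside any user-agent group, so a tool that wants them only has to scan for the field names. The Python sketch below shows that scan. The field names are the ones used in the template in this guide, the function name discovery_links is hypothetical, and whether any given crawler performs a lookup like this is an assumption.

```python
def discovery_links(robots_txt: str) -> dict[str, list[str]]:
    """Collect the URL-valued reference lines used in this guide's template."""
    wanted = {"sitemap", "llmstxt", "llmsfulltxt", "ai-policy"}
    found: dict[str, list[str]] = {}
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if line.startswith("#") or ":" not in line:
            continue  # skip comments and lines without a field name
        field, value = line.split(":", 1)  # split once so the URL's colon survives
        key = field.strip().lower()
        if key in wanted and value.strip():
            found.setdefault(key, []).append(value.strip())
    return found
```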

Selective Path Control: Beyond Blanket Rules

The configuration above uses blanket allow/block rules for entire crawlers. For many websites, this is sufficient. But larger sites with diverse content types may need more granular path-based control.

Protecting Sensitive Content

If your site has sections that should not be exposed to any AI crawler (internal documentation, member-only content, proprietary research), add path-specific blocks:

# Block AI search crawlers from premium content
# (repeat for every AI crawler you otherwise allow)
User-agent: OAI-SearchBot
Disallow: /premium/
Disallow: /members/

User-agent: PerplexityBot
Disallow: /premium/
Disallow: /members/
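
Repeating the same paths for every crawler gets tedious as the list grows. RFC 9309 lets a single group begin with several User-agent lines, so the blocks can be generated from two lists. This is a minimal Python sketch with a hypothetical helper name, path_blocks. It assumes the generated group replaces any existing group for the same crawlers rather than sitting beside it, since not every parser merges duplicate groups the same way.

```python
def path_blocks(crawlers: list[str], paths: list[str]) -> str:
    """Emit one group that applies the same Disallow paths to every listed crawler."""
    lines = [f"User-agent: {name}" for name in crawlers]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


print(path_blocks(["OAI-SearchBot", "PerplexityBot"], ["/premium/", "/members/"]))
```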

Allowing Partial Access for Training

Some organisations may want to allow training crawlers access to specific content sections while blocking the rest. For example, an AI Agency might want its public blog posts included in training data (for brand visibility) while blocking proprietary methodology pages:

User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

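In this configuration, GPTBot and ClaudeBot can crawl /blog/ but are blocked from everything else. Under the Robots Exclusion Protocol (RFC 9309), the most specific rule wins, meaning the one with the longest matching path, so Allow: /blog/ overrides the shorter Disallow: / for any URL under /blog/. Allow takes precedence by itself only when an Allow and a Disallow rule match with equal length.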
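
To make the precedence concrete, here is a minimal Python sketch of how a parser following RFC 9309 would resolve the rules of a single group: the longest matching path wins, and Allow wins a tie. It handles plain path prefixes only (no * or $ wildcards), and the function name is_allowed is a hypothetical helper, not part of any library.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Resolve one group's rules for a URL path.

    rules: (directive, path_prefix) pairs, for example ("allow", "/blog/").
    """
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules do not apply
        is_allow = directive.lower() == "allow"
        # The longer match wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


gptbot_rules = [("allow", "/blog/"), ("disallow", "/")]
print(is_allowed(gptbot_rules, "/blog/my-post"))
print(is_allowed(gptbot_rules, "/methodology/"))
```

Because the decision depends on match length rather than line order, swapping the two rules in the file would not change the outcome for a compliant parser.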

The Layered Defense Strategy

robots.txt is a cooperative protocol. It relies on crawlers voluntarily complying with your directives. Responsible AI companies (OpenAI, Anthropic, Google, Perplexity) generally comply. But not every crawler is responsible, and not every entity identifies its crawlers honestly.

This is why I recommend a layered defense strategy that combines robots.txt with server-level enforcement:

Layer 1: robots.txt (Policy Declaration)

Your robots.txt file serves as the primary policy declaration. It is the universally recognised standard that responsible crawlers check before accessing your site. Configure it with the selective approach described above.

Layer 2: WAF and Edge Rules (Enforcement)

Web Application Firewalls (WAF) and edge rules provide enforcement that robots.txt cannot. Where robots.txt asks crawlers to comply, WAF rules force compliance by blocking requests at the network level.

Configure your WAF or CDN edge rules to:

Block user-agent strings of crawlers you have blocked in robots.txt. This enforces compliance for crawlers that ignore robots.txt.
Rate-limit unknown bot traffic. Legitimate crawlers respect rate limits. Aggressive, unidentified crawlers that fetch hundreds of pages per minute are far more likely to be scrapers or training data collectors than search bots.
Monitor and log bot traffic patterns. Before blocking anything, understand what is crawling your site and how aggressively.

If your site is on Cloudflare, Vercel, or similar platforms, their bot management tools make this enforcement relatively straightforward. The specifics vary by platform, but the principle is the same: robots.txt declares policy, WAF enforces it.
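
Where no WAF is available, the same principle can be approximated in the application itself. The following Python sketch is a WSGI middleware that answers 403 to user agents containing a blocked token. It is an illustration of the idea, not a substitute for edge enforcement: user-agent strings can be spoofed, the class name BlockTrainingBots is hypothetical, and the token list simply mirrors the training crawlers named in this guide. Google-Extended is left out because it is a robots.txt token and never appears as a request user agent.

```python
# Lowercase substrings to look for in the User-Agent header.
BLOCKED_TOKENS = ("gptbot", "claudebot", "ccbot", "meta-externalagent")


def is_blocked(user_agent: str) -> bool:
    """True if the user-agent string contains any blocked crawler token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_TOKENS)


class BlockTrainingBots:
    """WSGI middleware that returns 403 to blocked user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is a one-line change, for example app = BlockTrainingBots(app).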

Layer 3: Monitoring and Alerting

Set up monitoring to detect new or unknown crawlers accessing your site. The AI crawler landscape changes frequently, with new bots appearing as new AI products launch. Without monitoring, a new training crawler could be consuming your content for weeks before you notice.

Key signals to monitor:

New user-agent strings in your access logs
Unusual traffic spikes from bot traffic
High-volume requests from single IP ranges
Crawling patterns that focus on content-heavy pages
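
The first signal, new user-agent strings, is the easiest to automate. Here is a minimal Python sketch that scans access-log lines and counts bot-like user agents that are not on a known list. It assumes the common "combined" log format, where the user agent is the last quoted field. The known list and the bot-detection hint are illustrative choices, and the function name unknown_bots is hypothetical.

```python
import re
from collections import Counter

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"\s*$')
# Crude hint that a user agent identifies itself as automated.
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)
# Lowercase tokens of crawlers already covered by your robots.txt.
KNOWN = (
    "googlebot", "bingbot", "yandex",
    "oai-searchbot", "chatgpt-user", "perplexitybot", "claude-web",
    "gptbot", "claudebot", "ccbot", "meta-externalagent",
)


def unknown_bots(log_lines, known=KNOWN) -> Counter:
    """Count requests per bot-like user agent that matches no known token."""
    counts: Counter = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not in the expected format
        user_agent = match.group(1)
        lowered = user_agent.lower()
        if BOT_HINT.search(user_agent) and not any(token in lowered for token in known):
            counts[user_agent] += 1
    return counts
```

Anything this surfaces is a candidate for the next robots.txt review rather than an automatic block.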

Quarterly Review Process

The AI crawler landscape evolves rapidly. New AI companies launch, existing companies introduce new crawlers, and the distinction between training and search crawlers shifts as products evolve. A robots.txt configuration that is optimal today may be outdated in three months.

I recommend quarterly reviews that cover:

New crawler identification. Check industry resources and AI company announcements for new crawler user-agent strings. Add them to your robots.txt with appropriate allow/block rules.

Effectiveness verification. Review your server logs to confirm that blocked crawlers are actually being stopped. If a blocked crawler is still accessing your site, your WAF layer needs attention.

Policy alignment check. Verify that your robots.txt configuration still aligns with your business strategy. If you have started pursuing AEO for a new AI platform, ensure its crawler is allowed. If a search crawler has evolved into a training crawler, update accordingly.

Cross-file consistency. Ensure your robots.txt, llms.txt, and ai.txt files are consistent. Your robots.txt should not block a crawler that your ai.txt declares search retrieval preferences for.
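
The effectiveness check in this review can be partly automated. The Python sketch below lists requests where a crawler you have blocked still received a successful response, which suggests either that it ignores robots.txt or that the WAF rule is missing. It assumes the combined log format and the training-crawler tokens named in this guide, and the function name leaks is hypothetical. Requests for /robots.txt itself are skipped because a compliant crawler has to read the policy in order to follow it.

```python
import re

# Combined log format: "METHOD path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"\S+ (\S+) [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')
BLOCKED = ("gptbot", "claudebot", "ccbot", "meta-externalagent")


def leaks(log_lines, blocked=BLOCKED) -> list[tuple[str, str]]:
    """Return (user_agent, path) pairs where a blocked crawler got a 2xx response."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not in the expected format
        path, status, user_agent = match.groups()
        if path == "/robots.txt":
            continue  # fetching the policy file is expected behaviour
        lowered = user_agent.lower()
        if status.startswith("2") and any(token in lowered for token in blocked):
            hits.append((user_agent, path))
    return hits
```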

Common Mistakes and How to Avoid Them

Mistake 1: Blocking Everything

The most common mistake is adding Disallow: / under User-agent: * or blocking all AI crawlers without distinguishing between training and search bots. This makes your site invisible to AI search engines, which is increasingly costly as AI search captures a growing share of user queries.

Fix: Use the selective approach. Block training crawlers. Allow search crawlers. Always leave User-agent: * with Allow: /.

Mistake 2: Not Including Sitemap

Omitting the Sitemap directive forces crawlers to discover your content by following links, which is slower and less reliable. Always include your sitemap reference so that allowed crawlers can efficiently find and index your content.

Fix: Add Sitemap: https://yourdomain.com/sitemap.xml at the end of your robots.txt file.

Mistake 3: Forgetting to Update

Setting up robots.txt once and never updating it is almost as bad as not having one. New AI crawlers appear regularly. A crawler that did not exist when you wrote your robots.txt could be consuming your content right now.

Fix: Schedule quarterly reviews. Assign ownership of the robots.txt file to whoever manages your AI Agency relationship or your technical SEO.

Mistake 4: Not Testing

robots.txt syntax errors can have unintended consequences. A misplaced wildcard or a typo in a user-agent string can silently break your configuration, either blocking crawlers you want to allow or allowing crawlers you want to block.

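Fix: Validate your file after every change. Google’s standalone robots.txt tester has been retired in favour of the robots.txt report in Search Console, so pair that report with a parser-based check that tests specific URLs and user-agent strings to confirm the behaviour matches your intent.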
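
Typos in user-agent tokens are a good candidate for an automated check, because a misspelt token is still valid syntax and no validator will flag it. This Python sketch compares each User-agent value against a list of known tokens and suggests the closest one when the spelling is near but not exact. It uses the standard-library difflib module. The known list is the set of tokens named in this guide, and the 0.8 similarity cutoff is an arbitrary starting point.

```python
import difflib

KNOWN_TOKENS = [
    "Googlebot", "Bingbot", "Yandex",
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-Web",
    "GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Meta-ExternalAgent",
]


def suspicious_agents(robots_txt: str, known=KNOWN_TOKENS) -> dict[str, str]:
    """Map each unrecognised User-agent token to the known token it resembles."""
    known_lower = {token.lower(): token for token in known}
    suggestions: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        token = value.strip()
        if token == "*" or token.lower() in known_lower:
            continue  # wildcard or exact (case-insensitive) match
        close = difflib.get_close_matches(
            token.lower(), list(known_lower), n=1, cutoff=0.8
        )
        if close:
            suggestions[token] = known_lower[close[0]]
    return suggestions
```

A token that matches nothing at all is not reported, since it may simply be a crawler the list does not know about yet.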

Mistake 5: Ignoring LLMsTxt Directives

The LLMsTxt and LLMsFullTxt directives are new additions that many sites overlook. These directives help AI crawlers discover your llms.txt files, which provide curated content guides optimised for language model consumption. Omitting them means AI systems may never discover your LLM-optimised content.

Fix: Add both directives to your robots.txt, pointing to the URLs where your llms.txt and llms-full.txt files are hosted.

How AI Agencies Should Manage This for Clients

For AI Agency practitioners, robots.txt configuration for AI crawlers should be a standard service within your GEO offerings. Here is the process I follow:

Initial audit. Review the client’s existing robots.txt for AI-relevant directives. Most clients either have no AI-specific configuration or have overly broad blocks. Document the current state and identify gaps.

Strategy alignment. Discuss with the client which AI platforms matter for their business. A B2B SaaS company might prioritise Perplexity and ChatGPT Search visibility. A consumer brand might focus on Google AI Overviews. Align the robots.txt configuration with these strategic priorities.

Implementation and testing. Deploy the updated robots.txt file, validate it with testing tools, and monitor server logs for the first two weeks to confirm the expected behaviour.

Ongoing management. Include robots.txt review in your quarterly AI governance review alongside llms.txt and ai.txt updates. Report to clients on which AI crawlers are accessing their site and how their content is appearing in AI search results.

Layered defense setup. For clients on platforms that support WAF or edge rules, configure enforcement layers that complement the robots.txt directives. This is particularly important for clients with high-value proprietary content.

The complete AI governance stack (robots.txt for enforcement, llms.txt for content curation, and ai.txt for policy declaration) should be a packaged service that every AI Agency offers. It is a high-value, recurring engagement that positions you as a strategic partner in your client’s AI readiness.
