2026-06-12 · 9 min read

AI Crawlers and Hosting: Can ChatGPT, Claude and Perplexity Reach You?

AI search visibility is the new SEO, but most people never check the boring prerequisite: can the bot actually fetch the page? I'll show you the crawlers that matter, why robots.txt and Cloudflare are two separate gates that don't talk to each other, and how a single edge toggle can lock out the very bots you're trying to court. Then I hand you the exact grep and curl I use to stop guessing.

Jakub Mareš

Author


Everybody wants to "rank in ChatGPT" now. And honestly? It makes sense — when someone asks an assistant for a hosting tip or a how-to and your site gets quoted, that's traffic with real intent landing on your doorstep. But there's a step everyone skips in the breathless LinkedIn posts: before any AI can cite you, a bot has to fetch your page. And from what I see on servers I actually administer, a worrying number of sites are quietly blocking the exact crawlers they're trying to impress.

I've cleaned up this mess more than once. Someone flips on a "security" feature, forgets it exists, and half a year later wonders why they've gone invisible to every AI tool on the planet (the toggle is always two layers above where they're looking). So let's do this the way I'd do it on a real box: what the bots are, what actually blocks them, and how you check instead of pray.

Why should you care if a bot can reach you?

If you run docs, a SaaS product, or any technical site, AI assistants are increasingly the first place your audience lands. Someone asks Claude "how do I configure X," and if your docs are within reach, you get the mention. If they're not, your competitor does. Same game as Google, new referee.

Here's the catch — and there's always a catch — "being reachable" isn't one thing. There are different bots doing different jobs, and blocking one tells you nothing about the others. Treat them as a single switch and you'll get burned.

Three bots, three completely different jobs

This is the part most "GEO" guides wave past, and it's the one that matters most. The AI companies run multiple crawlers, and they are not interchangeable.

Roughly three buckets:

Training crawlers: they collect content that may feed future models. Blocking these has no effect on whether you show up in live answers.
Search/index bots: these build the index behind the "search the web" feature. This is usually the one you want to allow if visibility is the goal.
User-triggered fetchers: these fire when someone pastes your URL or tells the assistant to read a page right now. Block these and the "open this link" feature breaks for your own visitors.

Here's OpenAI's lineup, straight from their bot docs (platform.openai.com/docs/bots):

Bot	Purpose	User-agent token
GPTBot	Crawls content that may be used for model training	`GPTBot`
OAI-SearchBot	Surfaces sites in ChatGPT search results	`OAI-SearchBot`
ChatGPT-User	Fetches a page when a user action triggers it	`ChatGPT-User`

The lesson hits straight away: if you block GPTBot to keep your content out of training (a totally fair call), you have not blocked OAI-SearchBot. OpenAI documents them as separate decisions because they are. The people who paste a generic "block AI" snippet usually nuke training and search in one go — and the search bot was the one carrying the traffic.
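If you want that split in something you can run, here's a minimal sketch that tags a user-agent string with the bucket it belongs to. The three tokens come from the table above; the `BOT_ROLES` name, the `classify_user_agent` helper and the sample user-agent string are my own illustration, not anything OpenAI ships.

```python
# Tokens and purposes taken from the table above. The role labels are
# this article's three buckets, not official vendor terminology.
BOT_ROLES = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-triggered",
}


def classify_user_agent(user_agent):
    """Return (token, role) for the first known token found, else None."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in lowered:
            return token, role
    return None


if __name__ == "__main__":
    # Illustrative string only; real crawlers send longer user-agents.
    print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

The point of keeping the role next to the token is that a "block AI" decision becomes three decisions, one per bucket.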

Anthropic and Perplexity follow the same shape: a training/crawling agent plus a user- or search-triggered fetcher. The commonly referenced token for Anthropic's crawler is ClaudeBot. I'm deliberately not going to recite every Perplexity user-agent string from memory, because these get renamed and added to over time, and a stale string sitting in your robots.txt is worse than no string at all. The honest move: read the vendor's own bot page, lift the current tokens, then check them against your real logs (we'll get there).

robots.txt vs Cloudflare: two gates that never talk

Now the layers. There are two completely separate places where an AI crawler gets told "no," and neither knows the other exists.

robots.txt is the polite gate. It's a text file at the root of your site (/robots.txt) listing which user-agents may crawl which paths. Per Google's own docs, it's a crawling directive, not access control — well-behaved bots read it and comply, but it leans entirely on the bot choosing to obey (Google Search Central). The major AI vendors say their crawlers respect it, and from what I see in logs, the big names do. But make no mistake — it's an honour system.

# Allow ChatGPT search, keep content out of training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
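You can sanity-check a policy like this offline before you deploy it. This sketch leans on Python's standard-library `urllib.robotparser`; the `robots_verdicts` helper is a name I made up, and Python's parser is only one interpretation of the file, so treat its answer as a preview rather than a guarantee of how each vendor's crawler reads it.

```python
from urllib.robotparser import RobotFileParser

# The same policy as above, written with one blank line between groups.
ROBOTS_TXT = """\
# Allow ChatGPT search, keep content out of training
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def robots_verdicts(robots_txt, tokens, path="/"):
    """Map each user-agent token to True (may fetch path) or False."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in tokens}


if __name__ == "__main__":
    tokens = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot"]
    for token, allowed in robots_verdicts(ROBOTS_TXT, tokens).items():
        print(f"{token:15} {'allowed' if allowed else 'disallowed'}")
```

ClaudeBot comes back as allowed here because the file never mentions it and has no `*` group, which is the default in the absence of a matching rule.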

Cloudflare edge controls are the bouncer at the door. If your site sits behind Cloudflare, requests hit Cloudflare's network before they ever reach your origin or read your robots.txt. Cloudflare can block, challenge, or wave a bot through on its own rules — and that happens no matter what your robots.txt says. The bot never gets the chance to be polite, because it gets bounced at the edge.

This is the mismatch that catches people out. Your robots.txt says "welcome, OAI-SearchBot." Your Cloudflare dashboard says "block AI bots." Cloudflare wins. Every single time.

The Cloudflare toggle that quietly overrules you

Cloudflare ships features aimed squarely at AI crawlers, now grouped under AI Crawl Control (Cloudflare docs). It lets you see which AI bots are hitting you and decide whether to allow or block them at the edge. Cloudflare also pushed a "block AI bots" capability out very widely — they announced one-click controls plus auditing, and rolled blocking out even to free-plan customers (Cloudflare blog).

Genuinely useful if you want AI kept out. The trap is that it's so easy to enable that people flip it on for "protection" without ever weighing the visibility cost. And because it runs at the edge, it silently overrides whatever generous permissions you carefully wrote into robots.txt.

So when something's wrong, the order of operations is: check Cloudflare first, then robots.txt. I've watched people rewrite their robots.txt five times chasing a problem that was one dashboard toggle two layers up. Don't be that person (I've been that person; it's how I learned the order).

How do you actually check this?

Enough theory. Here's how I find out what's genuinely happening instead of what I'm hoping happens.

1. Grep your access logs for the bots. This is the source of truth — it shows who actually reached your origin.

grep -Ei "GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot" /var/log/nginx/access.log

If you allowed OAI-SearchBot in robots.txt but it never shows up here, something upstream (hello, Cloudflare) is eating those requests before they land.
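When the raw grep output gets too long to eyeball, a per-bot tally is easier to read. A rough sketch, assuming nginx's default "combined" log format (status code right after the quoted request line); `count_bot_hits` and the sample log path are my own names, and the token list is just the one from the grep above.

```python
import re
from collections import Counter

TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot")
BOT_RE = re.compile("|".join(re.escape(t) for t in TOKENS), re.IGNORECASE)
# In the combined format the status follows the closing quote of the request.
STATUS_RE = re.compile(r'" (\d{3}) ')
CANONICAL = {t.lower(): t for t in TOKENS}


def count_bot_hits(lines):
    """Count log lines per (bot token, HTTP status) pair."""
    hits = Counter()
    for line in lines:
        bot = BOT_RE.search(line)
        if bot is None:
            continue  # not one of the crawlers we are tracking
        status = STATUS_RE.search(line)
        code = status.group(1) if status else "?"
        hits[(CANONICAL[bot.group(0).lower()], code)] += 1
    return hits


if __name__ == "__main__":
    # Path is an assumption; point it at your own access log.
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
        for (token, code), n in sorted(count_bot_hits(log).items()):
            print(f"{token:15} {code} {n}")
```

A bot with only 403s is being refused at your origin. A bot that's missing entirely is either not crawling you or being stopped before your server ever sees it.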

2. Impersonate a bot with curl and watch the response. Send a request wearing an AI crawler's user-agent and see how your stack reacts:

curl -A "OAI-SearchBot" -I https://yourdomain.com/

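A 200 means that user-agent string got through. A 403, a challenge page, or a redirect to an "are you human" check means something is blocking that user-agent, usually a WAF or bot-management rule, not your app. One caveat: this only tests the user-agent string from your own IP. An edge that verifies crawlers by source IP may treat your spoofed request differently from the real bot, so take the result as a strong hint and confirm it against the logs from step 1.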
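If you'd rather loop over several user-agents than retype curl, the same probe is a few lines of standard-library Python. This is a sketch under my own assumptions: `probe` and `classify_status` are made-up helper names, the verdict labels are my shorthand, and `yourdomain.com` is the same placeholder as in the curl line.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so a 3xx is reported instead of hidden."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def probe(url, user_agent, timeout=10):
    """Send a HEAD request with the given user-agent; return (status, headers)."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as err:
        # urllib raises on 3xx (with redirects disabled), 4xx and 5xx.
        return err.code, dict(err.headers.items())


def classify_status(status):
    """Rough reading of a status code, mirroring the rules of thumb above."""
    if 200 <= status < 300:
        return "reachable"
    if 300 <= status < 400:
        return "redirected (check where to)"
    if status == 403:
        return "blocked or challenged"
    return "inspect manually"


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "ChatGPT-User", "ClaudeBot"):
        status, _headers = probe("https://yourdomain.com/", agent)
        print(f"{agent:15} {status} {classify_status(status)}")
```

Running it once per token puts the four answers side by side, which makes "one bot is treated differently" jump out.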

3. Read the response headers. Look at the Server header and any cf-* headers. If you spot cf-ray or Server: cloudflare, you're behind Cloudflare and the edge layer is in play — which means the dashboard, not just your config files, decides who gets in.
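That header check is easy to script. A minimal sketch, assuming you already have the response headers as a plain dict; `behind_cloudflare` is a name I picked, and it looks only for the two signals mentioned above.

```python
def behind_cloudflare(headers):
    """True if the headers carry a cf-ray header or 'Server: cloudflare'."""
    # Header names are case-insensitive, so normalise before looking.
    lowered = {name.lower(): value for name, value in headers.items()}
    server = lowered.get("server", "").strip().lower()
    return "cf-ray" in lowered or server == "cloudflare"


if __name__ == "__main__":
    # Made-up header values, purely for illustration.
    print(behind_cloudflare({"Server": "cloudflare", "CF-RAY": "example-ray-id"}))
    print(behind_cloudflare({"Server": "nginx"}))
```

A False here doesn't prove there is no proxy in front of you, only that these two markers are absent.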

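4. Fetch your own robots.txt and read it the way a bot does: find the group whose user-agent line matches the bot most specifically (falling back to * only if none does), then apply the rule in that group with the longest matching path. Under the Robots Exclusion Protocol (RFC 9309) rule order doesn't decide the outcome, but not every parser follows the spec to the letter, so overlapping rules are still a classic own-goal.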
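For overlapping rules inside one group, RFC 9309 settles the tie by the longest matching path, with Allow winning when an Allow and a Disallow are equally long. Here's a deliberately simplified sketch of that rule: it does plain prefix matching only, ignores the `*` and `$` wildcards, and `is_allowed` plus the sample paths are my own invention.

```python
def is_allowed(rules, path):
    """Apply longest-match to one group's rules.

    rules is a list of ("allow" | "disallow", path_prefix) pairs taken from
    the single group that applies to the bot. No match means allowed.
    """
    best_length = -1
    allowed = True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and kind == "allow"
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = kind == "allow"
    return allowed


if __name__ == "__main__":
    group = [("disallow", "/docs/"), ("allow", "/docs/public/")]
    for path in ("/docs/public/setup", "/docs/internal/keys", "/blog/"):
        print(path, is_allowed(group, path))
```

Because the longest prefix wins, swapping the two rules in `group` gives the same answers, which is exactly why a parser that reads first-match-wins can surprise you.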

Why not just automate the whole correlation?

Let me be straight about what a tool can and can't do for you. The painful part here isn't the knowledge — it's that the answer is scattered across three or four places (robots.txt, Cloudflare, your WAF, your logs) and you have to stitch them together by hand. That's tedious and easy to get wrong, which is exactly the kind of chore worth handing to a machine.

A focused AI crawler checker could hit your site once per known AI user-agent, record the status code and headers for each, parse your robots.txt against those same tokens, and flag the mismatches: "robots.txt allows OAI-SearchBot, but the live request returned 403 — likely an edge block." It can spot Cloudflare from the response headers and point at the edge as the probable culprit. That turns half an hour of curl-and-grep into one result you can act on.
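The mismatch logic itself is small. Here's a hedged sketch of just that step, assuming you already have the robots.txt verdict, the live status code and the response headers for one token; `diagnose` and its message wording are mine and say nothing about how any real checker is implemented.

```python
def diagnose(token, robots_allows, status, headers):
    """Compare what robots.txt says with what the live request returned."""
    header_names = {name.lower() for name in headers}
    edge_seen = "cf-ray" in header_names  # Cloudflare marker from the headers
    live_ok = 200 <= status < 300

    if robots_allows and not live_ok:
        culprit = "likely an edge block" if edge_seen else "check WAF or server rules"
        return (
            f"MISMATCH: robots.txt allows {token}, "
            f"but the live request returned {status} ({culprit})"
        )
    if not robots_allows and live_ok:
        return (
            f"NOTE: robots.txt disallows {token}, but the page returned {status}; "
            "the block relies on the bot obeying"
        )
    return f"OK: robots.txt and the live response agree for {token}"


if __name__ == "__main__":
    # Made-up inputs that reproduce the example in the paragraph above.
    print(diagnose("OAI-SearchBot", True, 403, {"CF-RAY": "example-ray-id"}))
```

Everything interesting lives in the two disagreeing branches; the agreeing cases are the boring ones you hope to see.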

That's what hostingchecker.org is built to surface — no magic AI-visibility juice, just an honest, mechanical answer to "can these specific bots reach this specific site right now."

FAQ

Does blocking GPTBot hurt my ChatGPT search visibility?

No. GPTBot is the training crawler; OAI-SearchBot is the one tied to ChatGPT search results, and they're controlled separately (OpenAI docs). You can keep your content out of training and stay visible in search.

Is robots.txt enough to block AI crawlers?

Only for bots that choose to obey it. robots.txt is a crawling directive on the honour system, not enforced access control (Google). For real enforcement you need an edge or server-level rule.

Can Cloudflare block AI bots even if my robots.txt allows them?

Yes, and it's the most common gotcha I run into. Cloudflare's AI Crawl Control acts at the network edge, before a request ever reaches your origin, so an edge block overrides any permission in your robots.txt (Cloudflare docs).

Why don't I see ClaudeBot or OAI-SearchBot in my logs?

Either nobody's triggered a crawl yet, or something upstream is blocking the request before it hits your origin. Test with curl -A "OAI-SearchBot" -I and compare against a normal browser request to see whether that user-agent is being treated differently.

Should I just allow every AI bot?

Depends on your goals. If you want AI search referrals, allow the search and user-triggered fetchers. Whether you allow training crawlers is a separate, philosophical call — there's no traffic benefit either way.

How often should I re-check this?

Whenever you touch your CDN, WAF, or hosting setup — those are the changes that silently flip access. The bot user-agent names also evolve, so re-verify the current tokens from each vendor's docs rather than trusting an old snippet.

Bottom line: AI visibility starts with a boring, checkable fact — can the bot reach the page. Nail that down before you spend a cent on fancy "GEO optimisation."

Check technical visibility — run your domain through hosting checker and see exactly which AI crawlers your setup lets in.