Ronalds Vilciņš

How to block AI crawler bots

The rise of AI-powered tools is both exciting and potentially disruptive for content creators. While these tools offer powerful options for enhancing content creation, they also raise concerns about content ownership and traffic redirection, and content writers who feel uneasy about AI using their work may want to block it. As AI tools such as Perplexity, Microsoft Copilot, Gemini, and Arc Browser get better at consuming content, this becomes a real problem: they can now answer questions directly by pulling in the most relevant information. AI-powered answers can bypass the blogs they draw from, reducing traffic and engagement and, with them, audience reach and revenue.

Easily manage AI crawlers with the Cloudflare WAF bot category

You can block AI bots with a simple firewall rule while still allowing search engine crawlers to index your site. Cloudflare has improved its bot segmentation, so you can choose specific responses for specific types of bots. To take advantage of these bot categories, log in to the Cloudflare dashboard, go to the WAF tab, create a rule, and choose one of the Verified Bot sub-categories as the Field.

There you need to set the Field to Verified Bot Category, the Operator to equals, and the Value to AI Crawler, with the Action set to Block.

The Expression Preview should look like this:

(cf.verified_bot_category eq "AI Crawler")
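
Cloudflare's expression language also lets you combine the bot category with other fields if you need finer control. As a rough sketch, assuming a hypothetical /feed path you still want AI crawlers to reach, a rule like this would block them everywhere else:

(cf.verified_bot_category eq "AI Crawler") and not (http.request.uri.path contains "/feed")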

Block AI crawler bots using robots.txt file

A robots.txt file is a text document that tells bots, including search engine crawlers, how to navigate and index your website's pages. It lets you grant or restrict access to both desirable and undesirable bots based on the directives it contains.

To block a single bot by its user agent, the syntax is as follows:

User-agent: {BOT-NAME}
Disallow: /

Here is an example of how to block the OpenAI, Google Gemini (formerly Bard), Common Crawl, and Perplexity AI bots:

User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
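
Disallow: / keeps the bot out of the entire site. If you only want to protect part of your site, you can disallow a specific directory instead; the /private-articles/ path below is just a placeholder for whatever section you want to shield:

User-agent: GPTBot
Disallow: /private-articles/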

Block AI crawler bots using IP address

OpenAI publishes a list of the user agents and IP addresses used by its crawlers and fetchers, and you can block those IP ranges with your web server firewall. For example, you can block the 23.98.142.176/28 range using the ufw or iptables command on your web server.

sudo ufw deny proto tcp from 23.98.142.176/28 to any port 80
sudo ufw deny proto tcp from 23.98.142.176/28 to any port 443
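
If you prefer iptables, roughly equivalent rules drop traffic from the same range on ports 80 and 443 (adjust the chain if your setup differs):

sudo iptables -A INPUT -p tcp -s 23.98.142.176/28 --dport 80 -j DROP
sudo iptables -A INPUT -p tcp -s 23.98.142.176/28 --dport 443 -j DROP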

Block AI crawler bots using “noai” and “noimageai” meta tags

Add this code snippet to the <head> section of your website:

<meta name="robots" content="noai, noimageai" />

The “noai” and “noimageai” meta tags are intended for creators to opt out of having their content used for training AI models without their consent.

Please note that these meta tags are not yet an industry standard, and I cannot guarantee that any or all bots will respect them.
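
If you would rather send the same hint with every response, including non-HTML files, many web servers can attach it as an X-Robots-Tag HTTP header instead. A minimal nginx sketch, with the same caveat that bots may not respect these values:

add_header X-Robots-Tag "noai, noimageai";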

There are also WordPress plugins for this. One such plugin is very simple: it adds a line of code, at high priority, to the header of every page on your site that politely asks AI crawlers not to use your content for AI training purposes.

As the owner of your website, you should have the ultimate say in who can crawl it. With just a few minutes of your time, you can assert control and block crawlers like those from OpenAI, Perplexity, and Google from accessing your content. This straightforward action is a powerful means to safeguard your site.