How to stop AI bots from crawling your website

  • 02 / 11 / 2023
  • Alicja Graczyk

In the rapidly changing online landscape, publishers face more and more challenges. It is no longer enough to provide high-quality content and look after your website's SEO, user experience, and security; you also need to be aware of emerging threats. One of them is artificial intelligence, or rather, the scraping of your content to train AI models. Fortunately, OpenAI, the company behind ChatGPT (the most popular language model-based chatbot), provides an opt-out option. So, if you want to know how to prevent AI bots from crawling your website, you've come to the right place!

What is AI training?

AI training is the process of teaching an artificial intelligence system to analyze data correctly and learn from it so that it can carry out various tasks, including making decisions based on the information provided. Successful AI training requires three things:

  • an artificial intelligence model that needs training,
  • suitable data,
  • a powerful computing platform.

Okay, but the question remains: what do AI models train on? For instance, in the case of ChatGPT (GPT-3), the answer is mainly Common Crawl, a web archive that has been collecting data since 2008. Its other training sources include Wikipedia.

Publishers’ AI-related concerns

Digital content creators can benefit from AI tools in various ways, but not all of them are willing to share their unique content for the purpose of training AI models. One worry is that AI-generated text could be too similar to the original content. AI companies state that copying and pasting creators' work is not possible because they don't store the information their models train on. Despite these assurances, a text based solely on AI output may, in some cases, be very similar to the original one. In a worst-case scenario, if the new text becomes more popular, search engine ranking systems may treat the original content as plagiarism! Consequently, your content may rank lower and become less attractive to both users and, if you monetize it, advertisers!


It's worth noting that, according to the White House, several US-based AI companies, including OpenAI, have committed to creating a watermarking system that indicates when content was generated by AI. Still, they have not promised to stop using internet data for training purposes. In fact, a number of authors, including Sarah Silverman, an American comedian and writer, have taken legal action against Meta and OpenAI in a California court for using their books in AI training. As for Europe, the European Parliament recently voted on a draft which, among other things, states that artificial intelligence programs generating content would have to clearly indicate its artificial origin.

Nonetheless, OpenAI has earned praise from many creators for sharing robots.txt rules that can prevent ChatGPT from learning from their websites. All you have to do is follow the steps below!

How to stop GPTBot?

OpenAI released a way for publishers to prevent GPTBot, its web crawler, from reading their content and later using it to generate responses in ChatGPT. According to OpenAI, pages crawled by GPTBot may be used to improve its future models. You can block GPTBot from your entire website or from specific parts of it by adding rules to your site's robots.txt, a plain-text guide for web crawlers that tells them which areas of your website they may access.

To do that, copy the text from the chosen frame below and save it as a plain-text file named robots.txt (or, if your site already has one, add the lines to the existing file). The file then has to be placed in the root directory of your site so that it is served at /robots.txt. The process may differ depending on the server architecture and the solutions applied by your hosting provider, so if you are unsure how to add a robots.txt file, contact your hosting company or check its documentation.

GPTBot prevention – entire website

To prevent GPTBot from accessing your entire site, add the following lines to your site's robots.txt:

User-agent: GPTBot
Disallow: /
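As a quick local sanity check, the rule above can be fed to the robots.txt parser in Python's standard library. This is only a sketch: the helper name `is_blocked`, the second user agent, and the example.com URL are illustrative assumptions, and the parser shows how a compliant crawler would read the rule, not what any particular bot actually does.

```python
from urllib.robotparser import RobotFileParser

# The rule from the frame above.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a site to test against.
    print(is_blocked(BLOCK_ALL, "GPTBot", "https://example.com/blog/post"))
    print(is_blocked(BLOCK_ALL, "Googlebot", "https://example.com/blog/post"))
```

With this rule, the first call should report True and the second False, because the rule names only GPTBot and leaves other crawlers unaffected.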

GPTBot prevention – part of the website

To prevent GPTBot from accessing only a chosen part of your site, add the following rules to your site's robots.txt:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Please note that you should replace the example paths in the “Allow” and “Disallow” lines with the names of your own directories.

To make sure that you have restricted bot access to your website successfully, you can use a robots.txt testing tool such as Logeix.
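If you would rather check the rules offline, a few lines of Python can do roughly the same job. The sketch below assumes the example directory names from the frame above; the function name `access_report` and the sample paths are made up for illustration. It relies on the standard-library parser, whose matching may differ in edge cases from that of real crawlers.

```python
from urllib.robotparser import RobotFileParser

# The partial rules from the frame above, with the example directory names.
PARTIAL_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def access_report(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to True (crawlable) or False (blocked) for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


if __name__ == "__main__":
    # Hypothetical paths; replace them with pages from your own site.
    sample_paths = ["/directory-1/article", "/directory-2/article", "/contact"]
    report = access_report(PARTIAL_RULES, "GPTBot", sample_paths)
    for path, allowed in report.items():
        print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Paths that no rule matches, such as /contact here, stay accessible by default, so only /directory-2/ should come back as blocked.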

Protect your flock!

Just as a shepherd watches over their flock, you should safeguard your creations. The internet is a rapidly changing environment, and it is worth staying up to date with newly emerging threats so that you don't fall into traps you could easily avoid. If you're interested in this kind of content, don't hesitate to visit our blog, where we continuously discuss various issues and news from the world of monetization, digital content creation, and its optimization!

