
Robots.txt template

Summary

A simple robots.txt template. It allowlists legitimate user agents / bots and keeps all other robots out (disallow) by default. The template is easy to maintain and immediately useful for most websites.

Introduction #

The Robots Exclusion Standard, also known as the Robots Exclusion Protocol or simply robots.txt, is a standard used by websites to give instructions to web robots.

This page contains a simple, easy to understand and easy to maintain robots.txt template, which should be immediately useful for most websites.

The template allows legitimate robots (e.g., search engine crawlers) to access your website while keeping unwanted web robots (e.g., scraper bots, people search engines, SEO tools, marketing tools, etc.) away.

How it works #

The robots.txt template on this page follows an easily maintainable allowlist approach:

The template only lists robots that are allowed to access the website and generally excludes all other robots.

Most robots.txt files and examples you can find online work exactly the other way around: they exclude specific robots (User-agents) from visiting the website or parts of it.

Maintaining such a blocklist requires a lot of effort and detailed knowledge of which robots to exclude. A never-ending job.
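
Stripped down to its structure, the allowlist approach looks like this (a minimal sketch with two placeholder crawlers; the full template below lists many more):

User-agent: Googlebot
User-agent: bingbot
# crawling rule(s) for above bots
Disallow:
# disallow all other bots
User-agent: *
Disallow: /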

Download template #

You can download the robots.txt template here:

Download robots.txt template

Reference template #

You can reference the template at https://www.ditig.com/robots.txt from your server to always get the latest updates.

As per RFC 9309, if your server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302, the crawler should follow at least five consecutive redirects, even across authorities (for example, hosts). If a robots.txt file is reached within these redirects, it must be fetched, parsed, and its rules followed in the context of the initial authority.
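
If you want to check that the referenced (and possibly redirected) robots.txt resolves and parses as expected, one option is a quick sketch with Python's standard urllib.robotparser module, which follows HTTP redirects when fetching; the user agents and the example.com URL below are placeholders:

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt; urllib follows HTTP redirects,
# so redirecting your own /robots.txt to the referenced file works.
rp = RobotFileParser()
rp.set_url("https://www.ditig.com/robots.txt")
rp.read()

# Check whether a given user agent may crawl a given path (placeholders).
print(rp.can_fetch("Googlebot", "https://example.com/"))          # allowed by the template
print(rp.can_fetch("SomeRandomScraper", "https://example.com/"))  # falls under User-agent: *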

Copy template #

You can inspect and copy the robots.txt template below:

############################# ROBOTS.TXT ###############################
# Updates and information can be found at:                             #
# https://www.ditig.com/publications/robots-txt-template               #
# This document is licensed with a CC BY-NC-SA 4.0 license.            #
# Last update: 2023-11-28                                              #
########################################################################
# so.com chinese search engine
User-agent: 360Spider
User-agent: 360Spider-Image
User-agent: 360Spider-Video
# google.com landing page quality checks
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
# google.com app resource fetcher
User-agent: AdsBot-Google-Mobile-Apps
# bing ads bot
User-agent: adidxbot
# apple.com search engine
User-agent: Applebot
User-agent: AppleNewsBot
# baidu.com chinese search engine
User-agent: Baiduspider
User-agent: Baiduspider-image
User-agent: Baiduspider-news
User-agent: Baiduspider-video
# bing.com international search engine
User-agent: bingbot
User-agent: BingPreview
# bublup.com suggestion/search engine
User-agent: BublupBot
# commoncrawl.org open repository of web crawl data
User-agent: CCBot
# cliqz.com german in-product search engine
User-agent: Cliqzbot
# coccoc.com vietnamese search engine
User-agent: coccoc
User-agent: coccocbot-image
User-agent: coccocbot-web
# daum.net korean search engine
User-agent: Daumoa
# dazoo.fr french search engine
User-agent: Dazoobot
# deusu.de german search engine
User-agent: DeuSu
# duckduckgo.com international privacy search engine
User-agent: DuckDuckBot
User-agent: DuckDuckGo-Favicons-Bot
# eurip.com european search engine
User-agent: EuripBot
# exploratodo.com latin american search engine
User-agent: Exploratodo
# facebook.com social network
User-agent: facebookcatalog
User-agent: facebookexternalhit
User-agent: Facebot
# feedly.com feed fetcher
User-agent: Feedly
# findx.com european search engine
User-agent: Findxbot
# goo.ne.jp japanese search engine
User-agent: gooblog
# google.com international search engine
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
User-agent: Googlebot-News
User-agent: Googlebot-Video
# so.com chinese search engine
User-agent: HaoSouSpider
# goo.ne.jp japanese search engine
User-agent: ichiro
# istella.it italian search engine
User-agent: istellabot
# jike.com / chinaso.com chinese search engine
User-agent: JikeSpider
# lycos.com & hotbot.com international search engine
User-agent: Lycos
# mail.ru russian search engine
User-agent: Mail.Ru
# google.com adsense bot
User-agent: Mediapartners-Google
# Preview bot for Microsoft products
User-agent: MicrosoftPreview
# mojeek.com search engine
User-agent: MojeekBot
# bing.com international search engine
User-agent: msnbot
User-agent: msnbot-media
# orange.com international search engine
User-agent: OrangeBot
# pinterest.com social network
User-agent: Pinterest
# botje.nl dutch search engine
User-agent: Plukkie
# qwant.com french search engine
User-agent: Qwantify
# rambler.ru russian search engine
User-agent: Rambler
# seznam.cz czech search engine
User-agent: SeznamBot
# soso.com chinese search engine
User-agent: Sosospider
# yahoo.com international search engine
User-agent: Slurp
# sogou.com chinese search engine
User-agent: Sogou blog
User-agent: Sogou inst spider
User-agent: Sogou News Spider
User-agent: Sogou Orion spider
User-agent: Sogou spider2
User-agent: Sogou web spider
# sputnik.ru russian search engine
User-agent: SputnikBot
# twitter.com social media bot
User-agent: Twitterbot
# whatsapp.com preview bot
User-agent: WhatsApp
# yacy.net p2p search software
User-agent: yacybot
# yandex.com russian search engine
User-agent: Yandex
User-agent: YandexMobileBot
# yep.com search engine
User-agent: YepBot
# search.naver.com south korean search engine
User-agent: Yeti
# yioop.com international search engine
User-agent: YioopBot
# yooz.ir iranian search engine
User-agent: yoozBot
# youdao.com chinese search engine
User-agent: YoudaoBot
# crawling rule(s) for above bots
Disallow:
# disallow all other bots
User-agent: *
Disallow: /

Making adjustments #

I strongly recommend keeping the alphabetical ordering when adding user agents. That way, duplicates are spotted quickly and the list stays easy to maintain.

Add allowed / forbidden directories #

Simply add your rules in the section # crawling rule(s) for above bots. For example, if you want to protect the /private/ directory, you can add it like this:

# crawling rule(s) for above bots
Disallow: /private/
Disallow:

Put Disallow: /private/ above the empty Disallow: (which is equivalent to the nonstandard Allow: directive).

Many robots.txt parsers apply the first matching rule, from top to bottom, while RFC 9309-compliant parsers use the most specific match. Keeping the specific rules first is safe in both cases, so do not let your Disallow: /private/ end up below the empty Disallow:.
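
Before deploying, you can sanity-check your adjusted rules locally, for example with Python's standard urllib.robotparser module; the snippet below uses a shortened, hypothetical version of the template:

from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
User-agent: bingbot
# crawling rule(s) for above bots
Disallow: /private/
Disallow:
# disallow all other bots
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/private/secret.html"))  # False: protected directory
print(rp.can_fetch("Googlebot", "/index.html"))           # True: everything else is allowed
print(rp.can_fetch("SomeRandomScraper", "/index.html"))   # False: not on the allowlist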

Blocking bots #

Simply comment out or delete bots that you do not want to access your website. With the allowlist approach, there is no need to block specific bots. Also: while robots.txt is a standard, not all web crawlers respect it. Some well-behaved crawlers follow the directives, but others may not. The ones listed in the template do follow the directives.
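
For example, if you decide that you do not want Common Crawl's bot on your site, comment out (or delete) its lines in the template:

# commoncrawl.org open repository of web crawl data
# User-agent: CCBot

The commented-out bot then falls under the catch-all User-agent: * group and is disallowed from the entire site.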

Warranty and liability #

The author makes absolutely no claims or representations as to warranties regarding the accuracy or completeness of the information provided. Use the template on this website AT YOUR OWN RISK. Make adjustments as needed.

The decision as to which bots ended up on the list of tolerated bots was made by the author, who is very conservative and opinionated when it comes to blocking bots. However, the author's choices should be sufficient for many.


FAQs #

Most common questions and brief, easy-to-understand answers on the topic:

What is a robots.txt file?

A robots.txt file is a UTF-8 encoded text file placed on a website to instruct web robots (like search engine crawlers) which pages or sections of the site should not be crawled or indexed.

Where should the robots.txt file be located?

The robots.txt file must be placed at the root level of the domain, accessible at, e.g., https://example.com/robots.txt or ftp://ftp.example.com/robots.txt.
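
A few illustrative examples, using the placeholder domain example.com:

https://example.com/robots.txt           valid, applies to https://example.com/
https://example.com/docs/robots.txt      not consulted by crawlers
https://blog.example.com/robots.txt      valid, but only for the blog.example.com subdomain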

Can a robots.txt file be located in other places than the domain root?

Yes. If a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302, crawlers should, per the RFC 9309 standard, follow at least five consecutive redirects, even across authorities (for example, hosts). If a robots.txt file is reached within these five consecutive redirects, the robots.txt file must be fetched, parsed, and its rules followed in the context of the initial authority.

Can robots.txt be used to prevent a page from being indexed by search engines?

While robots.txt can instruct crawlers not to crawl specific pages, it does not guarantee that those pages will not be indexed. If a page is linked from other indexed pages, it might still be indexed.

How often do search engines check the robots.txt file?

Search engines typically check the robots.txt file each time they visit a website.


Further readings #

Sources and recommended further resources on the topic:

Author

Jonas Jared Jacek (J15k)

Jonas has worked as a project manager, web designer, and web developer since 2001. On top of that, he is a Linux system administrator with a broad interest in programming, architecture, and design. See: https://www.j15k.com/

License

License: Robots.txt template by Jonas Jared Jacek is licensed under CC BY-NC-SA 4.0.

This license requires that reusers give credit to the creator. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. If others modify or adapt the material, they must license the modified material under identical terms. To give credit, provide a link back to the original source, the author, and the license, for example like this:

<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><a property="dct:title" rel="cc:attributionURL" href="https://www.ditig.com/publications/robots-txt-template">Robots.txt template</a> by <a rel="cc:attributionURL dct:creator" property="cc:attributionName" href="https://www.j15k.com/">Jonas Jared Jacek</a> is licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/" target="_blank" rel="license noopener noreferrer">CC BY-NC-SA 4.0</a>.</p>

For more information see the DITig legal page.


“Learning from conventions will make your site better.”

Jeffrey Veen, American designer and design strategist, The Art & Science of Web Design