Thursday, March 9, 2017

List of All User Agents for Top Search Engines

Here is a working list of user agents for the top search engines. I use this information frequently in my plugins, such as Blackhole for Bad Bots and BBQ Pro, so I figured it would be useful to post it online for the benefit of others. Having the user agents for these popular bots all in one place helps streamline my development process. Each search engine entry includes a regex pattern that matches all of its known user agents.

Search Engines

(In alphabetical order)

AOL.com

Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)

Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.27; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.21; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E)

Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.19; Windows NT 5.1; Trident/4.0; GTB7.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)

Mozilla/4.0 (compatible; MSIE 8.0; AOL 9.7; AOLBuild 4343.19; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E)

Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.7; AOLBuild 4343.19; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E)

Plus many more older versions and variations.
Regex to match all of these user-agent strings: aolbuild

Baidu

Baiduspider (Baidu Web Search)
Baiduspider-image (Baidu Image Search)
Baiduspider-mobile (Baidu Mobile Search)
Baiduspider-video (Baidu Video Search)
Baiduspider-news (Baidu News Search)
Baiduspider-favo (Baidu Bookmark Search)
Baiduspider-cpro (Baidu Union Search)
Baiduspider-ads (Baidu Business Search)

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Baiduspider+(+http://www.baidu.com/search/spider_jp.html)

Baiduspider+(+http://www.baidu.com/search/spider.htm)
Regex to match all of these user-agent strings: baidu

Bingbot/MSN

Mozilla/5.0 (compatible; bingbot/2.0 +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

msnbot/2.0b (+http://search.msn.com/msnbot.htm)

msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

Mozilla/5.0 (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; adidxbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko BingPreview/1.0b
Regex to match all of these user-agent strings: bingbot, bingpreview, msnbot

DuckDuckGo

DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
Regex to match this user-agent string: duckduckgo

Google

Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot/2.1 (+http://www.google.com/bot.html)

Googlebot-News

Googlebot-Image/1.0

Googlebot-Video/1.0

SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

[various mobile device types] (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)

Mediapartners-Google

AdsBot-Google (+http://www.google.com/adsbot.html)
Regex to match all of these user-agent strings: adsbot-google, googlebot, mediapartners-google

Teoma

Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://sp.ask.com/docs/about/tech_crawling.html)

Mozilla/2.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)

Mozilla/2.0 (compatible; Ask Jeeves/Teoma)

Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)
Regex to match all of these user-agent strings: teoma

Yahoo!

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Regex to match this user-agent string: slurp

Yandex

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexAccessibilityBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexDirectDyn/1.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexVideo/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexMedia/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexBlogs/0.99; robot; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexWebmaster/2.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexPagechecker/1.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexImageResizer/2.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YaDirectFetcher/1.0; Dyatel; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexCalendar/1.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexSitelinks; Dyatel; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexMetrika/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexAntivirus/2.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexVertis/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)
Regex to match all of these user-agent strings: yandex
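Taken together, the regex fragments listed above can be combined into a single case-insensitive pattern for bot detection. Here is a minimal Python sketch; the function name is my own, but the fragments are exactly those given for each engine above:

```python
import re

# Regex fragments from the list above, one or more per search engine
SEARCH_ENGINE_PATTERNS = [
    "aolbuild",                                            # AOL.com
    "baidu",                                               # Baidu
    "bingbot", "bingpreview", "msnbot",                    # Bingbot/MSN
    "duckduckgo",                                          # DuckDuckGo
    "adsbot-google", "googlebot", "mediapartners-google",  # Google
    "teoma",                                               # Teoma
    "slurp",                                               # Yahoo!
    "yandex",                                              # Yandex
]

# Compile once; IGNORECASE is needed because live user agents use mixed
# case ("Googlebot", "YandexBot") while the fragments are lowercase.
SEARCH_ENGINE_RE = re.compile("|".join(SEARCH_ENGINE_PATTERNS), re.IGNORECASE)

def is_search_engine(user_agent):
    """Return True if the user-agent string matches a known search bot."""
    return bool(SEARCH_ENGINE_RE.search(user_agent))
```

For example, `is_search_engine("Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)")` returns True, while a regular desktop browser user agent does not match. Keep in mind that user agents can be spoofed, so a match alone does not prove a request really came from the search engine.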


What Is a Robots.txt File


A robots.txt file provides search engines with the information they need to properly crawl and index a website. Search engines such as Google, Bing, and Yahoo all have bots that crawl websites periodically to collect existing and/or new content such as web pages, blog articles, and images. Once these resources are published on the website, it is up to the search engines to determine what gets indexed.
A robots.txt file helps you define what you want search bots to crawl, and therefore index. This is useful for a variety of reasons, including controlling crawl traffic so that crawlers do not overwhelm your server. The robots.txt file, however, should not be used to hide web pages from Google search results.

How to Create a Robots.txt File

Implementing a robots.txt file is quite simple and can be done in just a few steps.
  1. First, create the file itself: a plain text file named “robots.txt”, made with any simple text editor.
  2. Next, define your rules within the robots.txt file. Example use cases are outlined in the next section.
  3. Upload the file to your website’s root directory. Whenever a search engine crawls your site, it will check your robots.txt file first to determine whether any sections of the site should not be crawled.
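The first two steps above can be sketched in a few lines of Python; the rules and the /private/ path are placeholders for your own configuration:

```python
# Steps 1 and 2: create the robots.txt file and define its rules.
# These placeholder rules block all bots from a hypothetical /private/ directory.
rules = "\n".join([
    "User-agent: *",
    "Disallow: /private/",
])

with open("robots.txt", "w") as f:
    f.write(rules + "\n")

# Step 3 happens outside of Python: upload the resulting file to the
# site's root directory so it is reachable at https://example.com/robots.txt
```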

Robots.txt Examples

There are many possibilities when configuring a robots.txt file. Its basic structure is simple and built from a few primary directives: User-agent, Disallow, and Allow. The User-agent line specifies which search engine robots the rules that follow apply to; it can be set to User-agent: * to apply to all robots, or to a specific robot such as User-agent: Googlebot.
The Allow and Disallow directives then provide finer-grained control. The following outlines a few common configurations.

Example #1

This example instructs all search engine robots not to crawl any of the website’s content, by disallowing the root “/” of the site.
User-agent: *
Disallow: /

Example #2

This example achieves the opposite of the previous one: the rules still apply to all user agents, but the Disallow directive is left empty, meaning everything can be crawled.
User-agent: *
Disallow:

Example #3

This example is a little more granular: the rules apply only to Googlebot, telling it not to crawl a specific page, /no-index/your-page.html.
User-agent: Googlebot 
Disallow: /no-index/your-page.html

Example #4

This example uses both the Disallow and Allow directives. The /images directory is disallowed for all search bots; however, by defining Allow: /images/logo.png, we override the Disallow for one particular file, logo.png.
User-agent: *
Disallow: /images
Allow: /images/logo.png

Example #5

The final example is a use case where JS, CSS, and PNG files within the /demo/ folder may be crawled while all other files in that folder may not. The * before the file extension is a wildcard matching any file with that extension; major crawlers such as Googlebot resolve the conflict in favor of the more specific (longer) rule, which is why the Allow lines take precedence over Disallow: /demo/.
User-agent: *
Allow: /demo/*.js
Allow: /demo/*.css
Allow: /demo/*.png
Disallow: /demo/
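Rules like these can be sanity-checked locally with Python’s standard urllib.robotparser module before uploading. One caveat: robotparser implements the original robots.txt standard and does not expand the * wildcards used in Example #5, so the sketch below tests the Example #3 rules instead (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Feed the Example #3 rules directly to the parser as a list of lines
rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /no-index/your-page.html",
])

# Googlebot is blocked from the disallowed page but nothing else;
# other bots are unaffected by this entry.
blocked = rp.can_fetch("Googlebot", "https://example.com/no-index/your-page.html")    # False
allowed = rp.can_fetch("Googlebot", "https://example.com/about.html")                 # True
other = rp.can_fetch("SomeOtherBot", "https://example.com/no-index/your-page.html")   # True
```

For wildcard rules like Example #5, test with Google’s robots.txt tester instead, since it uses the same matching logic as Googlebot.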

Using a Robots.txt File with a CDN

If you’re using a CDN, you may also be able to define rules in the CDN’s own robots.txt file. KeyCDN doesn’t enable the robots.txt file by default, meaning that everything will be crawled. You can, however, enable it within the advanced features section of your zone.
Once the robots.txt file is enabled in KeyCDN, its default is set to:
User-agent: *
Disallow: /
This can be modified by adding custom rules to the custom robots.txt box. After making changes, be sure to save your settings and purge your zone.
For an in-depth guide to CDN SEO read our Indexing Images in SERPs article.

Summary

There are many reasons to have a robots.txt file, including:
  • You don’t want search bots to crawl particular content,
  • Your site isn’t live yet,
  • You want to specify which search bots can crawl your content.
However, a robots.txt file is not always necessary. If you have no need to instruct search bots on how to crawl your website, you simply don’t need one. If you do, adding the file to the site’s root directory, where it is accessible at https://example.com/robots.txt, lets you easily customize how web crawlers scan your site.
When creating a robots.txt file, make sure you are not blocking any resources that the search bots need in order to properly index your content.
To help verify this, Google Search Console provides a robots.txt tester, which you can use to confirm that your robots.txt file is not blocking any important content.
