Knowledge Base Article

Aurora SEO: Regulate content crawling by search engines using robots.txt

When you publish content in the community, search engines use web crawlers (also called web robots) to crawl the newly published pages and gather information from them. After crawling the content, the search engines index these pages so that they can return relevant results for search queries.

It is important to instruct web crawlers to crawl only the relevant pages and to ignore pages that don't need to be crawled. Using the Robots Exclusion Protocol (implemented as a file called robots.txt), you can indicate which resources to include in or exclude from crawling.

When a new community is created, the Khoros platform configures the robots.txt file with the default rules for the community. The default rules contain instructions that are generic to all communities.

Admins and members with the required permissions can view the Default Rules in the Robots.txt Editor (Settings > System > SEO). In the editor, you can also add Custom Rules, which are appended after the default rules.

Note: You cannot edit the default rules.

How does Robots.txt work?

You can find the robots.txt file in the root directory of your community by appending “robots.txt” to the community's root URL (for example, https://site.com/robots.txt). The file lists user agents (web robots), community URLs, and sitemaps, with instructions indicating whether each user agent is allowed or disallowed to crawl the specified URLs.
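As a small illustration of where the file lives, the following Python sketch derives the robots.txt address from any page URL on the same site. The function name robots_url and the sample page URL are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the site that serves page_url."""
    # A leading "/" makes urljoin replace the whole path and drop the query,
    # so the result always points at the site root.
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    # Hypothetical page URL on the example host used in this article.
    print(robots_url("https://site.com/some/board/page?sort=recent"))
```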

When user agents (web crawlers) visit your website, they first read the robots.txt file and then crawl according to the instructions in it. Crawlers that honor the protocol gather information only from the community pages that are allowed and skip the pages that are disallowed.
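The behavior described above can be sketched with Python's standard urllib.robotparser module, which applies robots.txt rules the way a well-behaved crawler would. The rules, the paths, and the “otherbot” agent below are made-up placeholders, not the Khoros default rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the Khoros default rules.
RULES = """\
User-agent: testbot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(rules: str, user_agent: str, url_path: str) -> bool:
    """Return True if user_agent may crawl url_path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    for agent in ("testbot", "otherbot"):
        for path in ("/private/page", "/public/page"):
            print(agent, path, is_allowed(RULES, agent, path))
```

In this sketch, testbot is refused only under /private/, while any other agent falls through to the wildcard group, whose empty Disallow value permits everything.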

Robots.txt syntax

The robots.txt file uses the following widely used keywords to specify instructions:

  • User-agent: The name of the web crawler for which you are providing the instructions. 
    Example:
    User-agent: testbot

    To provide instructions to all user agents at once, enter * (wildcard character).
    Example:
    User-agent: *

  • Disallow: Instructs the user agents not to crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
    Example:
    User-agent: testbot
    Disallow: /www.test1.com

  • Allow: Tells the user agents that they can crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
    Example:
    User-agent: testbot
    Allow: /www.test2.com

  • Sitemap: Indicates the location of any XML sitemaps associated with the URL. The Khoros platform automatically generates sitemaps for each community when it is created and adds them to the robots.txt file.
    Example:
    User-agent: testbot
    Sitemap: https://www.test.com/sitemap.xml

The following is a sample format to allow or disallow a user agent “testbot” to crawl the community pages:

User-agent: testbot
Disallow: /www.test.com
Allow: /www.test1.com
Sitemap: https://www.test.com/sitemap.xml
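To see how the four keywords combine, here is a minimal Python sketch that parses a group in the same shape as the sample and reports the sitemaps and per-path access. The two paths are hypothetical placeholders, and site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Same shape as the sample format; both paths are hypothetical placeholders.
SAMPLE = """\
User-agent: testbot
Disallow: /hypothetical-private/
Allow: /hypothetical-public/
Sitemap: https://www.test.com/sitemap.xml
"""


def summarize(rules: str, user_agent: str, paths: list) -> dict:
    """Return the sitemaps in rules and whether user_agent may crawl each path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {
        # site_maps() returns None when the file lists no sitemaps.
        "sitemaps": parser.site_maps() or [],
        "access": {path: parser.can_fetch(user_agent, path) for path in paths},
    }


if __name__ == "__main__":
    print(summarize(
        SAMPLE,
        "testbot",
        ["/hypothetical-private/page", "/hypothetical-public/page"],
    ))
```

The Sitemap line is collected independently of the user-agent group, so it is reported regardless of which agent the group names.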

Using the Robots.txt Editor

The Robots.txt Editor enables you to add, edit, and remove custom rules in robots.txt. For more about how crawlers handle robots.txt rules, refer to the documentation published by Google and other crawler operators.

Let’s take an example where you want to add a custom rule to disallow a user-agent “testbot” from crawling a member profile page of the community.

To add a custom rule:

  1. Sign in to the community as an Admin.
  2. Go to Settings > System > SEO.
    In the Robots.txt Editor, you can view the Default Rules and Custom Rules sections.
  3. In the Custom Rules section, click Edit.
  4. In the Edit window, enter the instructions and click Save.

    The rule appears in the Custom Rules area of the tab.

You can edit or remove the existing Custom Rules by clicking the Edit option. 
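Before pasting a rule into the Edit window, it can help to draft and check it offline. The following Python sketch builds a rule for the “testbot” example and confirms that it blocks the intended page. The profile path /users/example-member/123 is a hypothetical placeholder (substitute a real profile URL path from your community), and the helper names are made up for this example.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_custom_rule(user_agent: str, disallowed_paths: Iterable[str]) -> str:
    """Build a robots.txt group that disallows the given paths for one agent."""
    lines = [f"User-agent: {user_agent}"]
    for path in disallowed_paths:
        # The article's syntax rule: every path must begin with '/'.
        if not path.startswith("/"):
            raise ValueError(f"path must begin with '/': {path!r}")
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


def blocks(rule_text: str, user_agent: str, url_path: str) -> bool:
    """Return True if rule_text stops user_agent from crawling url_path."""
    parser = RobotFileParser()
    parser.parse(rule_text.splitlines())
    return not parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # Hypothetical profile path; replace with a real one from your community.
    profile_path = "/users/example-member/123"
    rule = build_custom_rule("testbot", [profile_path])
    print(rule)
    print(blocks(rule, "testbot", profile_path))
    print(blocks(rule, "otherbot", profile_path))
```

This check covers only the drafted custom rule in isolation. It does not account for the default rules that precede it in the published file.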

The new custom rules are appended to the robots.txt file located in the root directory.

After you edit the custom rules, you can validate the robots.txt file with the Lighthouse tool. Learn more about robots.txt validation using Lighthouse.
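For a quick local check before running a full audit, a few of the syntax rules from this article can be tested with a short script. The Python sketch below is not Lighthouse and does not reproduce its checks. It is a simplified stand-in that flags unknown keywords, Allow or Disallow lines that appear before any User-agent line, paths that do not begin with ‘/’, and sitemap values that are not absolute URLs. The function name lint_robots is hypothetical.

```python
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap"}


def lint_robots(text: str) -> list:
    """Return a list of (line_number, message) problems found in text."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append((number, f"unknown directive '{name}'"))
        elif name == "user-agent":
            seen_agent = True
            if not value:
                problems.append((number, "User-agent has no value"))
        elif name in ("allow", "disallow"):
            if not seen_agent:
                problems.append((number, f"'{name}' before any User-agent line"))
            # An empty value is accepted; otherwise the path must begin with '/'.
            if value and not value.startswith("/"):
                problems.append((number, "path must begin with '/'"))
        elif not value.startswith(("http://", "https://")):
            problems.append((number, "sitemap must be an absolute URL"))
    return problems


if __name__ == "__main__":
    draft = "User-agent: testbot\nDisallow: www.test1.com\n"
    for number, message in lint_robots(draft):
        print(f"line {number}: {message}")
```

An empty result means only that none of these four checks failed, so the file should still be validated with Lighthouse afterward.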

Note: The Audit log records the changes that members make to the robots.txt file.

Updated 8 months ago
Version 5.0

5 Comments

  • Hi everybody, I would like to edit the robots.txt, but I cannot find it in my admin settings. Do I need a special permission to get access to it?

    Thx in advance

  • tyw  Thx. I wasn't aware that it's for Aurora because I was searching for that topic and reached this article by clicking on the search result without further investigation.

  • Hello, I have read through the article, but I'm still not 100% confident about making a custom rule.

    We have found that Google is returning results for posts that we have archived, and we don't want this to happen.

    Is it possible to disallow certain boards through this feature?
    Is 'googlebot' all that is needed for Google searches, and would I then add the / URL for the specific boards I want removed from search?

    Thanks in advance for the help

  • MGourlay 

    To stop pages belonging to certain boards from being crawled, you can create a custom rule that disallows pages from those boards. For example, to stop pages of a forum board from being crawled, use "Disallow: /discussions/<board-name>/*".

    You don't necessarily have to mention a user agent. If you don't mention one, "User-agent: *" is applied by default.