Aurora SEO: Regulate content crawling by search engines using robots.txt
When you publish content in the community, search engines (using web robots, also called web crawlers) crawl the newly published pages to discover and gather information from them. After crawling the content, the search engines index these pages so they can return relevant results for search queries.
It is important to instruct web crawlers to crawl only the relevant pages and to ignore the pages that don't require crawling. Using the Robots Exclusion Protocol (a file called robots.txt), you can indicate which resources should be included in or excluded from crawling.
When a new community is created, the Khoros platform configures its robots.txt file with a set of default rules. These default rules contain instructions that are generic to all communities.
Admins and members with the required permissions can view the Default Rules in the Robots.txt Editor (in the Settings > System > SEO area). In the editor, you can also add Custom Rules, which are appended after the default rules.
Note: You cannot edit the default rules.
How does robots.txt work?
You can find the robots.txt file in the root directory of your community by appending "/robots.txt" to the community URL (for example, https://site.com/robots.txt). The file lists user agents (web robots), community URLs, and sitemaps, with instructions indicating whether the user agents are allowed or disallowed to crawl the specified URLs.
When user agents (web crawlers) visit your website, they first read the robots.txt file and then proceed with crawling based on the instructions in the file. The user agents gather information only from the community pages they are allowed to crawl and are blocked from the pages that are disallowed.
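For example, here is a minimal Python sketch (standard library only) that fetches and prints a community's robots.txt file, the same file crawlers read before crawling; the host site.com is the placeholder used above and should be replaced with your community's domain:
from urllib.request import urlopen

# "site.com" is a placeholder host; substitute your community's domain
with urlopen("https://site.com/robots.txt") as response:
    print(response.read().decode("utf-8"))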
Robots.txt syntax
The robots.txt file uses these keywords to specify instructions:
- User-agent: The name of the web crawler for which you are providing the instructions.
Example:
User-agent: testbot
To provide instructions to all user agents at once, enter * (wildcard character).
Example:
User-agent: *
- Disallow: Command to indicate that the user agents must not crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
Example:
User-agent: testbot
Disallow: /www.test1.com
- Allow: Command to indicate that the user agents can crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
Example:
User-agent: testbot
Allow: /www.test2.com
- Sitemap: Indicates the location of any XML sitemaps associated with the URL. The Khoros platform automatically generates sitemaps for each community when it is created and adds them to the robots.txt file.
Example:
User-agent: testbot
Sitemap: https://www.test.com/sitemap.xml
The following sample shows the format for allowing or disallowing a user agent "testbot" to crawl community pages:
User-agent: testbot
Disallow: /www.test.com
Allow: /www.test1.com
Sitemap: https://www.test.com/sitemap.xml
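To see how a crawler would interpret these sample rules, here is a minimal sketch using Python's standard urllib.robotparser module; the user agent, paths, and host are the sample values from above, not real community URLs:
from urllib.robotparser import RobotFileParser

# Parse the sample rules shown above (a real crawler would fetch them
# from https://site.com/robots.txt before crawling any page)
rules = [
    "User-agent: testbot",
    "Disallow: /www.test.com",
    "Allow: /www.test1.com",
    "Sitemap: https://www.test.com/sitemap.xml",
]
parser = RobotFileParser()
parser.parse(rules)

# testbot is blocked from the disallowed path but may crawl the allowed one
print(parser.can_fetch("testbot", "https://site.com/www.test.com"))   # False
print(parser.can_fetch("testbot", "https://site.com/www.test1.com"))  # True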
Using the Robots.txt Editor
The Robots.txt Editor enables you to add, edit, and remove custom rules in robots.txt. For more information on how rules in robots.txt are handled, refer to the documentation provided by Google and other crawlers.
Let’s take an example where you want to add a custom rule to disallow a user-agent “testbot” from crawling a member profile page of the community.
To add a custom rule:
- Sign in to the community as an Admin.
- Go to Settings > System > SEO.
In the Robots.txt Editor, you can view the Default Rules and Custom Rules sections.
- In the Custom Rules section, click Edit.
- In the Edit window, enter the instructions and click Save.
The rule appears in the Custom Rules area of the tab.
You can edit or remove the existing Custom Rules by clicking the Edit option.
The new custom rules are appended to the robots.txt file located in the root directory.
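For example, if the member profile pages of your community were served under a path such as /members/profile (a purely illustrative path; use your community's actual profile URL path), the appended custom rule would look like this:
# Hypothetical path for member profile pages; replace with your community's actual path
User-agent: testbot
Disallow: /members/profile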
After you edit the custom rules, you can validate the robots.txt file with the Lighthouse tool. Learn more about robots.txt validation using Lighthouse.
Note: The Audit log records the member actions performed on the robots.txt file.