Aurora SEO: Regulate content crawling by search engines using robots.txt
When you publish content in the community, search engines (using web robots, also called web crawlers) crawl the newly published pages to discover and gather information from them. After crawling the content, the search engines index these pages so they can return relevant results for search queries.
It is important to instruct web crawlers to crawl only the relevant pages and to ignore the pages that don't require crawling. Using the Robots Exclusion Protocol (a file called robots.txt), you can indicate which resources should be included in or excluded from crawling.
When a new community is created, the Khoros platform configures its robots.txt file with a set of default rules. These default rules contain instructions that are generic to all communities.
Admins and members with the required permissions can view the Default Rules in the Robots.txt Editor (in the Settings > System > SEO area). In the editor, you can also add Custom Rules, which are appended after the default rules.
Note: You cannot edit the default rules.
How does robots.txt work?
You can find the robots.txt file in the root directory of your community by appending "/robots.txt" to the community URL (for example, https://site.com/robots.txt). The file lists user agents (web robots), community URLs, and sitemaps, with instructions indicating whether the user agents are allowed or disallowed to crawl the specified URLs.
When user agents (web crawlers) visit your website, they first read the robots.txt file and then proceed with crawling based on the instructions in the file. The user agents gather information only from the community pages they are allowed to crawl and are blocked from the pages that are disallowed.
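For example, here is a minimal Python sketch (standard library only) that fetches and prints a community's robots.txt file, the same file crawlers read before crawling; the host site.com is the placeholder used above and should be replaced with your community's domain:
from urllib.request import urlopen

# "site.com" is a placeholder host; substitute your community's domain
with urlopen("https://site.com/robots.txt") as response:
    print(response.read().decode("utf-8"))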
Robots.txt syntax
The robots.txt file uses these keywords to specify instructions:
- User-agent: The name of the web crawler for which you are providing the instructions.
Example:
User-agent: testbot
To provide instructions to all user agents at once, enter * (wildcard character).
Example:
User-agent: *
- Disallow: Command to indicate that the user agents must not crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
Example:
User-agent: testbot
Disallow: /www.test1.com
- Allow: Command to indicate that the user agents can crawl the specified URL. Note that the URL must begin with ‘/’ (forward slash character).
Example:
User-agent: testbot
Allow: /www.test2.com
- Sitemap: Indicates the location of any XML sitemaps associated with the URL. The Khoros platform automatically generates sitemaps for each community when it is created and adds them to the robots.txt file.
Example:
User-agent: testbot
Sitemap: https://www.test.com/sitemap.xml
The following sample shows the format for allowing or disallowing a user agent "testbot" to crawl community pages:
User-agent: testbot
Disallow: /www.test.com
Allow: /www.test1.com
Sitemap: https://www.test.com/sitemap.xml
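To see how a crawler would interpret these sample rules, here is a minimal sketch using Python's standard urllib.robotparser module; the user agent, paths, and host are the sample values from above, not real community URLs:
from urllib.robotparser import RobotFileParser

# Parse the sample rules shown above (a real crawler would fetch them
# from https://site.com/robots.txt before crawling any page)
rules = [
    "User-agent: testbot",
    "Disallow: /www.test.com",
    "Allow: /www.test1.com",
    "Sitemap: https://www.test.com/sitemap.xml",
]
parser = RobotFileParser()
parser.parse(rules)

# testbot is blocked from the disallowed path but may crawl the allowed one
print(parser.can_fetch("testbot", "https://site.com/www.test.com"))   # False
print(parser.can_fetch("testbot", "https://site.com/www.test1.com"))  # True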
Using the Robots.txt Editor
The Robots.txt Editor enables you to add, edit, and remove custom rules in robots.txt. For more information on how rules in robots.txt are handled, refer to the documentation provided by Google and other crawlers.
Let’s take an example where you want to add a custom rule to disallow a user-agent “testbot” from crawling a member profile page of the community.
To add a custom rule:
- Sign in to the community as an Admin.
- Go to Settings > System > SEO.
In the Robots.txt Editor, you can view the Default Rules and Custom Rules sections.
- In the Custom Rules section, click Edit.
- In the Edit window, enter the instructions and click Save.
The rule appears in the Custom Rules area of the tab.
You can edit or remove the existing Custom Rules by clicking the Edit option.
The new custom rules are appended to the robots.txt file located in the root directory.
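For example, if the member profile pages of your community were served under a path such as /members/profile (a purely illustrative path; use your community's actual profile URL path), the appended custom rule would look like this:
# Hypothetical path for member profile pages; replace with your community's actual path
User-agent: testbot
Disallow: /members/profile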
After you edit the custom rules, you can validate the robots.txt file with the Lighthouse tool. Learn more about robots.txt validation using Lighthouse.
Note: The Audit log records the member actions performed on the robots.txt file.