Not applicable

Internal search indexing too many pages of the community

We have a problem with our internal search engine (powered by Adobe): it is indexing over half a million pages of our community and has hit its limit of indexable pages. How can we limit the number of pages it indexes? Is there a page identifier in the URL that we can use for exclusion?

Honored Contributor

Hi,

If you think about user profiles and all the topics you may have, it's no surprise that you have millions of results. That's actually why, for me, using a third-party search solution for the community isn't very viable.

 

How does Adobe pull the content to index? Does it crawl your site, or use an API pull? I've never used the Adobe solution, so I'm not sure what level of filtering you have, but you should be able to limit certain pages. For example:

 

User profile pages will all have:

/t5/user/viewprofilepage/user-id/?????

 

The bulk of your results, I suspect, will be topics, which contain /td-p/ in the URL. However, there isn't any way to filter out old content, as the date isn't part of the URL. If you're trying to limit or control what's indexed, you may have to split your content into different boards and have certain boards excluded from search whilst others are included; see the sketch below.
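
If the Adobe crawler honours robots.txt, the two patterns above could be written as exclusion rules along these lines. This is a sketch only: the profile rule is a plain path prefix, but the /td-p/ rule needs wildcard matching, which not every crawler supports.

User-agent: *
Disallow: /t5/user/viewprofilepage/
Disallow: /*/td-p/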

 

I don't have a third-party search, but I do something similar for organic search. I have an archive board which is blocked by robots.txt, and old content gets pushed into that board. It's also hidden, so it doesn't appear in the site's local search, and we actively remove it from Search Console, or restore content if it's getting large traffic whilst in the archive.

 

 

Stephen

Check out some of the stuff I've built using the platform:
Community | Ideation | Blog | Product Hubs | Check & Report | Service Status

My latest Ideas: Vanity URL Manager | @mention Roles | LSW Password Policy

Not applicable

Hi Stephen, thanks very much for the comprehensive and quick response. I think the approach of separating current and old content and blocking out the latter may be the right one. I understand from your reply that you use the robots.txt file to block content indexing, right? Is this something you do in Studio? And do you have a global robots.txt file, or local files for each board?

Hi,

Yeah, you can update the robots.txt file in Studio, within the "ADVANCED" tab, or via a support ticket if you prefer. It's global for the site, so you'd just add a rule to block a given board, such as:

Disallow: /t5/Archive-Store/
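
For context, a complete minimal robots.txt built around that rule might look like this. The board name is just the example above; each extra board you want blocked gets its own Disallow line.

User-agent: *
Disallow: /t5/Archive-Store/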

Stephen

Check out some of the stuff I've built using the platform:
Community | Ideation | Blog | Product Hubs | Check & Report | Service Status

My latest Ideas: Vanity URL Manager | @mention Roles | LSW Password Policy

Khoros Alumni (Retired)

The default robots.txt file already excludes profile pages and other similar user-facing pages intended for browsing, like tag and label pages. Ideally, your Adobe indexer should adhere to those robots.txt rules. That's probably the first thing I would check.
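
One quick way to check is to run a few sample URLs through your community's live robots.txt. Here's a minimal sketch using Python's standard urllib.robotparser; the hostname and URLs are placeholders, and note the stdlib parser does plain prefix matching, so wildcard rules may not be evaluated the way Google-style crawlers would handle them.

from urllib.robotparser import RobotFileParser

# Hypothetical community hostname; substitute your own domain.
ROBOTS_URL = "https://community.example.com/robots.txt"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

# Sample URLs matching the patterns discussed above (all placeholders).
urls = [
    "https://community.example.com/t5/user/viewprofilepage/user-id/123",
    "https://community.example.com/t5/Some-Board/example-topic/td-p/456",
    "https://community.example.com/t5/Archive-Store/old-topic/td-p/789",
]

for url in urls:
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(f"{verdict}: {url}")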

 

There's an article that explains how you can review and edit your community's robots.txt with a view to excluding certain areas from search: https://community.lithium.com/t5/Search-tools/Exclude-content-from-search-using-robots-txt/ta-p/6770...


Khoros Best Practice until August 2019. Onwards posting as Claudius.
Learn how to master Khoros. Learn Best Practice in the Community Documentation
If you appreciate my efforts, please give me a kudo ↓
Accept as solution to help others find it faster.
Not applicable

Excellent stuff, I'll give it a try. Thanks very much for your help.

Not applicable

Hi ClaudiusH, thanks for the link to the article. This is useful information that I will pass on to the Adobe team.
