We have a problem with our internal search engine (powered by Adobe), which is indexing over half a million pages of our community and is reaching its limit of indexable pages. How can we limit the number of indexable pages? Is there a page identifier in the URL that we can use for exclusion?
Hi,
If you think about user profiles and all the topics you may have, it's not a surprise that you have millions of results. That's actually why, for me, using a third-party search solution for the community isn't very viable.
How does Adobe pull the content to index: does it crawl your site or use an API pull? I've never used the Adobe solution, so I'm not sure what level of filtering you have, but you should be able to exclude certain pages. For example:
User profile pages will all have:
/t5/user/viewprofilepage/user-id/?????
The bulk of your results, I suspect, will be topics and contain /td-p/; however, there isn't any way to filter out old content, as the date parameter isn't in the URL. If you're trying to limit or control what's indexed, you may have to split your content into different boards and have certain boards excluded from search whilst others are included.
I don't have a third-party search, but I do something similar for organic search: I have an archive board which is blocked by robots.txt, and old content gets pushed into that board. It's also hidden so it doesn't appear in the site's local search, and we actively remove it from Search Console, or restore content if it's getting large traffic whilst in the archive.
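For illustration, a minimal robots.txt sketch along those lines might look like the following. The viewprofilepage path is the standard community URL pattern mentioned above; the /t5/Archive/ board path is a hypothetical example, and the * wildcard in the td-p rule is an extension honoured by Google-style crawlers but not part of the original robots.txt spec, so check whether Adobe's indexer supports it:

User-agent: *
# Profile pages (already excluded in the default community robots.txt)
Disallow: /t5/user/viewprofilepage/
# Topic pages, if the indexer supports * wildcards
Disallow: /*/td-p/
# Hypothetical archive board for old content
Disallow: /t5/Archive/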
Stephen
Check out some of the stuff I've built using the platform:
Community | Ideation | Blog | Product Hubs | Check & Report | Service Status
My latest Ideas: Vanity URL Manager | @mention Roles | LSW Password Policy
Hi Stephen, thanks very much for the comprehensive and quick response. I think the approach of separating current and old content and blocking out the latter may be the right one. I understand from your reply that you use the robots.txt file to block content indexing, right? Is this something you do in Studio? Do you have a global robots.txt file, or do you have local files for each board?
The default robots.txt file already excludes profile pages and other similar user-facing pages intended for browsing, like tag and label pages. Ideally your Adobe indexer should adhere to those robots.txt rules. That's probably the first thing I would check.
There's an article that explains how you can review and edit your community's robots.txt with a view to excluding certain areas from search: https://community.lithium.com/t5/Search-tools/Exclude-content-from-search-using-robots-txt/ta-p/6770...
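If you want to verify what the current rules actually block before handing this over, here's a quick sanity check using Python's standard library; the hostname community.example.com and the test paths are placeholders. Note that Python's parser implements the original prefix-matching spec, so wildcard rules that Google-style crawlers honour may not be evaluated the same way:

import urllib.robotparser

# Placeholder hostname; substitute your community's domain.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://community.example.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

# Check a profile page and a topic page against the rules.
for path in (
    "/t5/user/viewprofilepage/user-id/123",
    "/t5/Some-Board/Some-Topic/td-p/456",
):
    url = "https://community.example.com" + path
    print(path, "allowed" if rp.can_fetch("*", url) else "blocked")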
Excellent stuff, will give it a try. Thanks very much for your help.
Hi ClaudiusH, thanks for the link to the article, this is useful information that I will pass to the Adobe team.