Search crawl of Forums taking a long time
Hi, we're trying to do a search crawl and index our Forums content using the Google Search Appliance and also the SharePoint search crawler.
We noticed that the crawls were taking a long time (over 30 hours) and that an incremental crawl takes as long as a full crawl. An incremental crawl should only re-index forum posts that have been modified since the last crawl, so something isn't working properly.
We've narrowed this down to the fact that the 'Last-Modified' field is missing from each Forum post's response header, so the crawler doesn't know whether the page has been modified or not.
Here's an example response header from one of our forum posts:
And here's one from another website that does include the Last-Modified field.
This means that the search crawler has to download the whole page and re-index it every time, which causes resource issues and the 30-hour crawl.
Incidentally, this is also the field that a browser uses to determine whether the page needs to be downloaded again, or if it can use the copy held in its own cache, so it really should be there.
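To illustrate the behaviour described above, here's a rough Python sketch of the decision a crawler has to make for each page. It is only an illustration, not the actual logic of the Google Search Appliance or the SharePoint crawler; the function name needs_reindex and its parameters are made up for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_reindex(headers, last_crawled):
    """Return True if the page must be downloaded and re-indexed.

    headers: dict of response header names to values (hypothetical input).
    last_crawled: timezone-aware datetime of the previous crawl.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("last-modified")
    if raw is None:
        # No Last-Modified header: the crawler cannot tell whether the
        # page changed, so it has to fetch and re-index it every time.
        return True
    try:
        modified = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # An unparseable date is treated the same as a missing header.
        return True
    if modified.tzinfo is None:
        # Assume UTC when the header carries no usable timezone.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_crawled
```

With an empty headers dict (which is effectively what our forum pages return for this field) the function always says "re-index", which matches the incremental crawl taking as long as a full one.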
Do you know if the 'Last-Modified' date can be switched on in the response header at all?
Is this a known bug in the forums?
Cheers.