Forum Discussion

aheywoo
Adept
11 years ago

Search crawl of Forums taking a long time

Hi, we're trying to do a search crawl and index our Forums content using the Google Search Appliance and also the SharePoint search crawler.

We noticed that the crawls take a long time (over 30 hours) and that an incremental crawl takes as long as a full crawl. An incremental crawl should only re-index forum posts that were modified since the previous crawl, so it is not working properly.

We've narrowed this down to the fact that the 'Last-Modified' field is missing from each Forum post's response header, so the crawler doesn't know if the page has been modified or not.

Here's an example response header from one of our forum posts:

 

[Image: Lithium2.png]

 

And here's one from another website that does include the Last-Modified field.

 

[Image: Lithium1.png]

 

This means that the search crawler has to download the whole page and re-index it every time, which causes resource issues and the 30 hour crawl.

Incidentally, this is also the field that a browser uses to determine if the page needs to be downloaded again, or if it can use the one held in its own cache, so it really should be there.

Do you know if the 'Last-Modified' date can be switched on in the response header at all?
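
To make the mechanism concrete, here is a minimal sketch of the revalidation step a crawler or browser cache typically performs. The function names (`revalidation_headers`, `needs_reindex`) are made up for illustration and are not taken from the Google Search Appliance, SharePoint, or the forum platform. The client echoes the stored `Last-Modified` value back as `If-Modified-Since`; if the earlier response never carried `Last-Modified`, there is nothing to echo and every visit is a full download.

```python
from typing import Dict


def revalidation_headers(previous_response_headers: Dict[str, str]) -> Dict[str, str]:
    """Build the conditional-request headers for the next visit to a page.

    Hypothetical helper: it takes the headers stored from the previous
    response and returns the extra headers to send with the next GET.
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in previous_response_headers.items()}
    last_modified = lowered.get("last-modified")
    if last_modified is None:
        # No Last-Modified was ever sent: the client can only do a plain GET.
        return {}
    return {"If-Modified-Since": last_modified}


def needs_reindex(status_code: int) -> bool:
    """304 Not Modified means the stored copy is still good; anything else is re-processed."""
    return status_code != 304
```

With the forum's current responses, `revalidation_headers` would return an empty dict for every post, which matches the behaviour described above: each incremental crawl degrades into a full crawl.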

Is this a known bug in the forums?

Cheers.

  • We opened a support case for this and got this back:

     

    “...it has been determined that the platform does not offer this feature and there is not a clear workaround.”

     

    So how do we get this onto the radar for a future release?  It should be an easy fix and will benefit everyone using the forums, not just search crawlers.

     

    In case it helps: if I were writing it in .NET (the only code I know) and the pages were being rendered dynamically, I would use HttpResponse.AddHeader to add the required header.

    I don't know what the Lithium forums are written in so can't help you there.  Anyone...?

     

     

    • PaoloT
      Lithium Alumni (Retired)

      Hi aheywoo 

       

      Have you considered posting this idea in the Customer Ideas Exchange? That way it can get more visibility with product management and gather votes from the community.

       

      I am not part of the engineering department, so I cannot really comment on the implementation steps. However, based on my personal experience, it is likely to be more involved than a one-line change, taking into account that you need to calculate the right date to insert into the header for each page served, and so on.

       

      Thanks for your suggestion!

      • aheywoo
        Adept
        Thanks Paolo, I'll post it there.

        And yes, you're right. It certainly wouldn't be one line, but it still shouldn't be a hard fix.
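
For anyone picking this up later, the server-side change discussed in this thread could look roughly like the sketch below. It is written in Python rather than .NET, and it is not how the Lithium platform works internally: `thread_last_modified` and `respond` are hypothetical names, and the rule "a thread page was last modified when its newest post was created or edited" is an assumption about what "the right date" would be. A real page may also change for other reasons, such as a new skin or navigation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Iterable, Optional, Tuple


def thread_last_modified(post_timestamps: Iterable[datetime]) -> datetime:
    """Assumed rule: the page changed when its most recently edited post changed.

    Expects timezone-aware datetimes and at least one post.
    """
    latest = max(post_timestamps)
    # HTTP dates have one-second resolution and are expressed in GMT.
    return latest.astimezone(timezone.utc).replace(microsecond=0)


def respond(post_timestamps: Iterable[datetime],
            if_modified_since: Optional[str]) -> Tuple[int, Dict[str, str]]:
    """Return (status code, headers) for a GET of a thread page."""
    last_modified = thread_last_modified(post_timestamps)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore it and send the full page
        # Only compare when the client's date carries a timezone.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # nothing new: the body can be skipped

    return 200, headers
```

This also illustrates Paolo's point: the header itself is one line, but choosing the timestamp and handling the `If-Modified-Since` comparison is where the actual work sits.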