We recently discovered a limitation in Lithium's search capability in Japanese. We escalated this to support, and unfortunately it turns out to be a bug and limitation in the existing Elastic Search platform.
Are other platform users seeing similar issues in Japanese, Simplified Chinese, Korean or Traditional Chinese?
Here are the details below from our engagement with Lithium support.
Does Lithium give you any control over search engine behavior or search algorithms used by language?
I noticed that Dojo’s search is not Japanese-friendly. It uses logic that casts a very large net to find search terms.
For example, if I search “timezone” in English, I get any results containing that exact word.
If I search “タイムゾーン” (timezone) in Japanese, I get a whole slew of unrelated results.
Lithium's behavior for Japanese is basically to break the search term down into illogically small parts and return the results it thinks are most related.
As a result, it is doing things like returning every result with the letter “t” or letters “zo”.
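To illustrate what I think is happening, here is a rough sketch. I don't know which analyzer Lithium actually uses, so the two-character fragment matching below is an assumption, and the sample titles are made up. It only shows how matching on small fragments casts a much wider net than matching the whole term.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def fragment_match(query, doc, n=2):
    """True if the document shares at least one n-gram with the query."""
    return bool(char_ngrams(query, n) & char_ngrams(doc, n))


def exact_match(query, doc):
    """True only if the whole query appears in the document."""
    return query in doc


# Hypothetical post titles, invented for this illustration.
docs = [
    "タイムゾーンの設定方法",    # contains the full term
    "ゾーン別の売上レポート",    # shares only the fragments "ゾー" and "ーン"
    "リアルタイム更新について",  # shares only the fragments "タイ" and "イム"
    "ダッシュボードの共有",      # shares nothing with the query
]

query = "タイムゾーン"
fragment_hits = [d for d in docs if fragment_match(query, d)]
exact_hits = [d for d in docs if exact_match(query, d)]

print(len(fragment_hits), "fragment hits;", len(exact_hits), "exact hit")
```

With these sample titles, fragment matching returns three of the four posts, while whole-term matching returns only the one that is actually about time zones.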
Pretty sure it's trying to be helpful, but we need to change this for Japanese if possible. Any options?
Here is a link in case it's helpful. I noticed search terms are bold, so even without understanding Japanese you can see the fragmented results.
I think we have run into a limitation of Search on our platform. Your community is now using ElasticSearch, which is the latest and greatest search technology on the market, but it appears that it has challenges with Asian character sets, especially when there is no space between words.
We have filed a bug ticket for this issue; however, I am not sure if Engineering will be able to do anything here. I will take over communication on the case and let you know of any updates from Engineering.
Thanks and hope everyone is doing well!
The Japanese language pack has been out for some time and this is the first I have heard of such issues. The testing we are doing replicates local customer behavior, so we would expect the search results to be returned in a usable manner.
I don't recall seeing this type of issue when I was managing Lithium communities with Chinese nodes. I did a quick test in HP's Chinese forums and the results are relevant:
I wonder if this is limited to how Japanese Kana scripts are analyzed by the Elastic tokenizer plugin Lithium is using. It looks like your search example produces results with characters far from each other (even though quotes are used) - indicating that search has no idea how to tokenize your query, that is, how to break it into words. However, if you search for a Japanese term with Kanji, the results are fairly relevant: https://dojo.domo.com/t5/forums/searchpage/tab/message?q=時間+
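If anyone wants to test the Kana-versus-Kanji theory more systematically, one option is to group a list of test queries by script and compare result quality per group. This is only a sketch: `script_of` and `scripts_in` are helper names I made up, and the classification relies on Unicode character names from Python's standard `unicodedata` module (half-width Katakana would fall into "other" here).

```python
import unicodedata


def script_of(ch):
    """Classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def scripts_in(query):
    """Return the set of scripts used in a query, ignoring whitespace."""
    return {script_of(ch) for ch in query if not ch.isspace()}


# The two Japanese queries mentioned in this thread, plus the English one.
queries = ["タイムゾーン", "時間", "timezone"]
by_query = {q: scripts_in(q) for q in queries}

for q, scripts in by_query.items():
    print(q, sorted(scripts))
```

Running the same searches for Kana-only, Kanji-only and mixed queries should show whether the fragmented results are specific to Kana.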
This article has some interesting info on challenges related to Elastic tokenization and various Asian languages:
Thanks Dani. I am interested.
The Lithium Community Search was performing poorly in the HP Chinese Community. I raised a ticket with Lithium in June 2016.
For example, the top keyword is "1510". Users who search for "1510" most likely own an HP Deskjet 1510. But the keyword "1510" returns no results in the whole community. Searching for "Deskjet1510", "DJ1510" or "Deskjet 1510" returns a lot of results.
* There are a lot of posts and post titles that include "1510". Getting no search results for "1510" looks like an issue to me.
* The search results for "Deskjet1510" and "DJ1510" are good. All the posts include "Deskjet1510" or "DJ1510". They are what the users want.
* I searched for "Deskjet 1510", but many results that only include "Deskjet" are ranked ahead of the results that include "Deskjet 1510", even though the sort order is Best Match. This does not make sense. See the attached screenshot.
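To make the last point concrete, here is a minimal sketch of the ordering I would expect from Best Match. The `score` function is a toy formula I invented for illustration, not how Lithium or Elastic Search actually ranks results, and the titles are made up.

```python
def score(query, title):
    """Toy relevance score: one point per query term found in the title,
    plus a bonus when the full phrase appears."""
    q = query.lower()
    t = title.lower()
    terms = q.split()
    matched = sum(1 for term in terms if term in t)
    phrase_bonus = len(terms) if q in t else 0
    return matched + phrase_bonus


# Hypothetical post titles.
titles = [
    "Deskjet 2130 will not print",
    "Deskjet 1510 driver for Windows 10",
    "Ink cartridge for 1510",
]

ranked = sorted(titles, key=lambda t: score("Deskjet 1510", t), reverse=True)

for title in ranked:
    print(score("Deskjet 1510", title), title)
```

Under this toy scoring, the title containing the full phrase "Deskjet 1510" comes first, ahead of titles that match only "Deskjet" or only "1510".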
Lithium suggested that I wait for the new ElasticSearch upgrade in July. Then magic happened. All the above issues were fixed. The unanswered search % in the Chinese community decreased from 30% to 4%. For now, I don't see any outstanding search issues in the Chinese and Korean communities. But I will keep monitoring user search behavior and do more testing.
Lithium's response says "it appears that it has challenges with Asian character sets, especially when there is no space between words." As far as I know, Chinese has no spaces between words, but Korean does.
Thanks for posting this. We have multiple boards that use double byte characters. However, we had Elastic Search disabled due to a bug in which searching by email does not work. I'll have to keep an eye on the progress of this issue if we ever make the switch to Elastic Search again.