Frequent Advisor

Issues with Elastic Search in double byte languages

Greetings,

 

We recently discovered a limitation in Lithium's Japanese search capability. We escalated it to support, and unfortunately it turns out to be a bug and a limitation in the existing Elastic Search platform.

 

Are other platform users seeing similar issues in Japanese, Simplified Chinese, Korean or Traditional Chinese?

 

Here are the details below from our engagement with Lithium support.

 

Question:

 

Does Lithium give you any control over search engine behavior or search algorithms used by language?

I noticed that Dojo’s search is not Japanese-friendly. It uses logic that casts a very large net to find search terms.
For example, if I search “timezone” in English, I get any results containing that exact word.

If I search “タイムゾーン” (timezone) in Japanese, I get a whole slew of unrelated results.
Lithium's behavior for Japanese is basically to break the search term down into illogically small parts and return the results it thinks are most related.
As a result, it is doing things like returning every result with the letter “t” or letters “zo”.

Pretty sure it's trying to be helpful, but we need to change this for Japanese if possible. Any options?

 

Here is a link in case it's helpful. I noticed search terms are bold, so even without understanding Japanese you can see the fragmented results.
https://dojo.domo.com/t5/forums/searchpage/tab/message?q=%22%E3%82%BF%E3%82%A4%E3%83%A0%E3%82%BE%E3%...

Lithium's response:

 

I think we have run into a limitation of Search on our platform. Your community is now using ElasticSearch, which is the latest and greatest search technology on the market, but it appears that it has challenges with Asian character sets, especially when there is no space between words.

We have filed a bug ticket for this issue; however, I am not sure if Engineering will be able to do anything here. I will take over communication on the case and let you know of any updates from Engineering.

 

@Wendy_S @Chao @debbie @thayerg you may be interested in this recent discovery. Let's raise this collectively to Lithium engineering to put some focus on getting this fixed.

 

Thanks and hope everyone is doing well!

Regards,
Dani

 

 

Lithium Alumni (Retired)

Re: Issues with Elastic Search in double byte languages

There could be other reasons why search in Japanese returns poor results on the Domo community: the community software does some additional processing on the search query before it reaches the search engine. We're actively working to resolve these issues to improve the search experience in the community.
Frequent Advisor

Re: Issues with Elastic Search in double byte languages

Thanks Brian,

 

The Japanese language pack has been out for some time and this is the first I have heard of such issues. The testing we are doing replicates local customer behavior, so we would expect the search results to be returned in a usable manner.

 

Regards,

Dani

Esteemed Contributor

Re: Issues with Elastic Search in double byte languages

I don't recall seeing this type of issue when I was managing Lithium communities with Chinese nodes. I did a quick test in HP's Chinese forums and the results are relevant:

 

http://h30471.www3.hp.com/t5/forums/searchpage/tab/message?filter=location&q=固态硬盘+&location=forum-bo...

 

I wonder if this is limited to how Japanese kana scripts are analyzed by the Elastic tokenizer plugin Lithium is using. Your search example produces results in which the matched characters are far from each other (even though quotes are used), which suggests that search has no idea how to tokenize your query, that is, how to break it into words. However, if you search for a Japanese term written in kanji, the results are fairly relevant: https://dojo.domo.com/t5/forums/searchpage/tab/message?q=時間+
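To make the tokenization point concrete, here is a minimal Python sketch. It is not Lithium's or Elasticsearch's actual logic; the sample documents and the helper names `char_ngrams`, `match_any_unigram` and `match_all_bigrams` are made up for illustration. It compares matching on any single character of the query (roughly the fragmented behavior described in the original question) with requiring every overlapping two-character piece of the query to appear.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of text (whitespace removed)."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def match_any_unigram(query, doc):
    """Loose matching: the document matches if it shares any single character."""
    return any(ch in doc for ch in char_ngrams(query, 1))


def match_all_bigrams(query, doc):
    """Stricter matching: every overlapping bigram of the query must appear."""
    return all(bg in doc for bg in char_ngrams(query, 2))


# Hypothetical documents, invented for this example.
docs = [
    "タイムゾーンの設定方法",  # contains the full search term
    "ゾンビのイラスト",        # shares only a few characters with it
    "データのインポート",      # shares only a few characters with it
]
query = "タイムゾーン"

loose = [d for d in docs if match_any_unigram(query, d)]
strict = [d for d in docs if match_all_bigrams(query, d)]

print("any unigram:", loose)
print("all bigrams:", strict)
```

In this toy example the loose matcher returns all three documents, while the bigram matcher returns only the one that actually contains タイムゾーン. Requiring all bigrams is still only an approximation of phrase matching, but it shows why the size of the pieces a query is broken into matters so much for languages written without spaces.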

 

This article has some interesting info on challenges related to Elastic tokenization and various Asian languages:

https://gibrown.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/
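
For reference, Elasticsearch lets whoever owns an index choose an analyzer per field. The following is only a sketch, assuming you controlled the index yourself, which nothing in this thread suggests Lithium customers do. The built-in `cjk` analyzer indexes overlapping character bigrams, and the `kuromoji` analyzer from the optional analysis-kuromoji plugin performs dictionary-based Japanese word segmentation. The field names `body_cjk` and `body_ja` are hypothetical, the mapping uses the newer typeless format (older Elasticsearch versions nest it under a type name), and the thread does not say which analyzer Lithium actually uses.

```python
import json

# Hypothetical field names; only the analyzer names ("cjk", "kuromoji")
# are real Elasticsearch identifiers. "kuromoji" requires the
# analysis-kuromoji plugin to be installed on the cluster.
index_body = {
    "mappings": {
        "properties": {
            # Built-in analyzer: overlapping character bigrams for CJK text.
            "body_cjk": {"type": "text", "analyzer": "cjk"},
            # Plugin analyzer: dictionary-based Japanese segmentation.
            "body_ja": {"type": "text", "analyzer": "kuromoji"},
        }
    }
}

# This JSON would be the request body when creating an index.
print(json.dumps(index_body, indent=2, ensure_ascii=False))
```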

 

 

Frequent Advisor

Re: Issues with Elastic Search in double byte languages

Thanks @ac,

 

This is very good information which I will share with the Japan team.

 

Regards,

Dani

Advisor

Re: Issues with Elastic Search in double byte languages

Thanks Dani. I am interested. 

 

Lithium community search was performing poorly in the HP Chinese community. I raised a ticket with Lithium in June 2016.

---------------------------------------------------------------------------

For example, the top search term is "1510". Users who search for "1510" most likely own an HP Deskjet 1510, but the keyword "1510" returns no results anywhere in the community. Searching for "Deskjet1510", "DJ1510" or "Deskjet 1510" returns a lot of results.

* There are a lot of posts and post titles that include "1510", so getting no search results for "1510" looks like an issue to me.

* The search results for "Deskjet1510" and "DJ1510" are good. All the returned posts include "Deskjet1510" or "DJ1510", which is what the users want.

* I searched for "Deskjet 1510", but many results that include only "Deskjet" are ranked ahead of the results that include "Deskjet 1510". The sort order is Best Match, so this does not make sense. See the attached screenshot.

---------------------------------------------------------------------------

 

Lithium suggested that I wait for the new ElasticSearch upgrade in July. Then magic happened. All of the above issues were fixed. The percentage of unanswered searches in the Chinese community decreased from 30% to 4%. For now, I don't see any outstanding search issues in the Chinese and Korean communities, but I will keep monitoring user search behavior and do more testing.

 

Lithium's response says "it appears that it has challenges with Asian character sets, especially when there is no space between words." As far as I know, Chinese has no spaces between words, but Korean does.
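
A quick way to see the difference is to split sample text on whitespace, as in this small Python sketch. The sample phrases are made up for illustration, and a real search engine does far more than `str.split()`; the point is only that whitespace gives a tokenizer usable word boundaries in Korean but not in Chinese or Japanese.

```python
# Hypothetical sample phrases, one per language.
samples = {
    "Chinese": "固态硬盘无法识别",
    "Korean": "시간대 설정 방법",
    "Japanese": "タイムゾーンの設定方法",
}

# Count how many tokens a plain whitespace split produces.
token_counts = {lang: len(text.split()) for lang, text in samples.items()}

for lang, text in samples.items():
    print(lang, text.split())
```

The Korean phrase splits into three tokens, while the Chinese and Japanese phrases each stay as one unbroken string that needs language-aware segmentation.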

 

Chao

 

Frequent Advisor

Re: Issues with Elastic Search in double byte languages

Thank you @Chao for sharing your insights. I am glad to hear there has been great improvement in Simplified Chinese, and I am hopeful that the same will happen soon for Japanese.

 

@thayerg it would be great to get your perspective on how Japanese is faring in the HPE Community.

 

Regards,

Dani

Honored Contributor

Re: Issues with Elastic Search in double byte languages

Thanks for posting this. We have multiple boards that use double byte characters. However, we had Elastic Search disabled due to a bug in which searching by email does not work. I'll keep an eye on the progress of this issue in case we ever make the switch to Elastic Search again.

----------------------------------

Lili McDonald
Sr. Digital Business Manager - Community
@ National Instruments
Honored Contributor

Re: Issues with Elastic Search in double byte languages

Thanks for sharing Dani. I am going to follow these conversations. I see Chao is already on it as well 🙂
Learning from others and helping where I can!
Community Passionista!