This article is also posted on the VASILY DEVELOPERS BLOG with the same content. Please take a look at the other articles there if you like.

Hello, this is Shiozaki, one of the back-end engineers. Until now, iQON's full-text search index used only morphological analysis, but I recently improved the search by using Ngram as well. As a result, the number of hits in the search results has increased, while the increase in search noise has been kept to a minor level.

In this article, I will introduce the advantages of using Ngram together with morphological analysis and how to do so with Apache Solr.

I can't find the information I want

In the first place, what does it mean to "search but not find the information you want"? Let's break that state down into the following two states.

Too little of the information I want

The first state is "the search results contain little of the information you want". For example, when you search for "Tokyo" on a travel information site, the DB holds thousands of relevant records, but only a few of them appear in the search results.

[Figure: スクリーンショット 2017-02-15 20.26.09.png]

Too much information I don't want

The second state is "the search results contain a lot of information you don't want". When you search for "Tokyo", the search results include information on other areas such as Kyoto and Osaka in addition to the information on Tokyo.

[Figure: スクリーンショット 2017-02-15 20.26.46.png]

In practice, both states occur at the same time

In many cases, when you cannot find what you really want, the above two states occur at the same time. In other words, not all the information that the user wants is returned in the search results, and information that the user does not want is returned as well.

This can be drawn as the following Venn diagram. All the information stored in the DB is classified along two axes: whether it was returned as a search result, and whether it is information the user wanted.

[Figure: Venn diagram (スクリーンショット 2017-02-15 18.26.25.png)]

Let's name each set for the explanation that follows. Information that the user wants and that is returned in the search results is a "correct result". Information that is returned in the search results but that the user does not want is "search noise". Information that the user wants but that is not included in the search results is a "search omission".

Incidentally, in the field of information retrieval these are called "True Positive", "False Positive", and "False Negative" respectively.

In this figure, the state with many search omissions corresponds to the first state explained above, and the state with a lot of search noise corresponds to the second state. Improving search results means reducing both search noise and search omissions so that the search results match the information the user wants.

However, there is a trade-off between the two. Often, improving one makes the other worse.
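
As a rough illustration, the three sets can be written with plain set operations. The sketch below uses made-up document IDs, and it also computes precision and recall, the standard information-retrieval measures that fall as search noise and search omissions grow. These measures are not used elsewhere in this article and appear here only to make the trade-off concrete.

```python
# Hypothetical document IDs; the two sets are illustrative only.
wanted = {1, 2, 3, 4, 5, 6}   # information the user wants
returned = {4, 5, 6, 7, 8}    # information returned by the search

correct = wanted & returned   # True Positive: wanted and returned
noise = returned - wanted     # False Positive: returned but not wanted
omission = wanted - returned  # False Negative: wanted but not returned

# Precision drops as search noise grows; recall drops as omissions grow.
precision = len(correct) / len(returned)
recall = len(correct) / len(wanted)

print(sorted(correct), sorted(noise), sorted(omission))
print(precision, recall)
```

Returning every document in the DB would push recall to 1.0 while flooding the results with noise, which is the trade-off described above.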

Types of search indexes and their advantages / disadvantages

This section describes the index processing performed in the full-text search engine and the word decomposition performed before that. Full-text search engines generate an index called an inverted index to speed up the search process. The inverted index is an associative array whose key is a word in a sentence and whose value is the array of documents in which the word appears.
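
A minimal sketch of such an inverted index is shown below. It assumes English text that can be split on whitespace, and the sample documents and the function name `build_inverted_index` are hypothetical. A real full-text search engine stores much more than this, such as positions and term frequencies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents keyed by document ID.
docs = {
    1: "Tokyo travel guide",
    2: "Kyoto travel guide",
    3: "Tokyo art museum",
}
index = build_inverted_index(docs)
print(index["tokyo"])   # IDs of documents containing "tokyo"
print(index["travel"])  # IDs of documents containing "travel"
```

A lookup is a single dictionary access, which is why the engine does not need to scan every document at query time.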

This is explained in detail in the first half of the TECH BLOG article published the other day, so please have a look if you like. http://tech.vasily.jp/entry/solr6-neologd

To generate an inverted index, the document must be decomposed into words. For languages such as Japanese, in which words are not separated by spaces, there are two methods of word decomposition: Ngram and morphological analysis.

I will explain the characteristics of each.

Ngram

Ngram is a method of mechanically splitting text into units of N characters and treating each unit as a word. N is the number of characters per unit; N = 2 is called bigram and N = 3 is called trigram. From here on, N = 2 (bigram) is used for the explanation.

Bigram splits the document into overlapping two-character units and treats each unit as a word. For example, the document "東京都美術館" (Tokyo Metropolitan Art Museum) is broken down into the following five words.

"東京" "京都" "都美" "美術" "術館"

When searching, the search query is split into words in the same way, and the words are combined with AND. For example, the search query "美術館" (art museum) becomes the search "美術 AND 術館".

The advantage of Ngram is that it guarantees there are no search omissions for any substring of N characters or more. On the other hand, the drawback of Ngram is that "東京都美術館" is also hit by the search query "京都" (Kyoto).

Advantages: Guaranteed partial match search
Disadvantages: High search noise
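
The behavior described in this section can be reproduced with a few lines of code. The sketch below uses the Japanese document "東京都美術館" (Tokyo Metropolitan Art Museum); the function names are hypothetical, and the AND match ignores token positions, which is a simplification of what a real search engine does.

```python
def bigrams(text):
    """Split text into overlapping 2-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_match(query, document):
    """AND search: every bigram of the query must appear in the document."""
    query_tokens = bigrams(query)
    doc_tokens = set(bigrams(document))
    # A query shorter than 2 characters yields no bigrams and never matches.
    return bool(query_tokens) and all(t in doc_tokens for t in query_tokens)


doc = "東京都美術館"  # Tokyo Metropolitan Art Museum
print(bigrams(doc))
print(bigram_match("美術館", doc))  # "art museum": a wanted hit
print(bigram_match("京都", doc))    # "Kyoto": also a hit, i.e. search noise
print(bigram_match("大阪", doc))    # "Osaka": no hit
```

The query "京都" matches only because the characters 京 and 都 happen to be adjacent across the word boundary between 東京 (Tokyo) and 都 (Metropolis).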

Morphological analysis

In word division by morphological analysis, text is divided into grammatically meaningful units using a dictionary prepared in advance. The advantage is therefore that the problems seen with Ngram are unlikely to occur.

On the other hand, a dictionary must be prepared in advance, and if the quality of the dictionary is low, search quality deteriorates. For example, if "外国人参政権" (foreigners' right to vote) is broken down into "外国" (foreign country), "人参" (ginseng), and "政権" (government), the document will not be hit by the search query "参政権" (suffrage).
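
To make the dependence on the dictionary concrete, here is a toy sketch using the Japanese phrase "外国人参政権" (foreigners' right to vote). It is a greedy longest-match segmenter, not real morphological analysis: analyzers such as kuromoji pick a segmentation using word and connection costs. The two dictionaries and the function name `tokenize` are hypothetical.

```python
def tokenize(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become 1-char tokens."""
    max_len = max(len(word) for word in dictionary)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                pos += size
                break
    return tokens


text = "外国人参政権"  # "foreigners' right to vote"
poor_dictionary = {"外国", "人参", "政権"}
better_dictionary = poor_dictionary | {"外国人", "参政権"}

print(tokenize(text, poor_dictionary))    # wrong split: 参政権 is not indexed
print(tokenize(text, better_dictionary))  # 参政権 becomes a searchable word
```

With the poor dictionary, the word "参政権" (suffrage) never enters the index, so a search for it misses this document.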

In addition, kuromoji, the morphological analyzer that ships as standard with Solr and Elasticsearch, uses the IPA dictionary, which is weak on proper nouns from specific domains. As a result, proper nouns that should be a single word are often broken down into multiple words. Example: "ロクシタン" (L'Occitane) → "ロク" "シタン"

On the other hand, treating a proper noun as a single word causes a different problem. For example, if you index the proper noun "Kansai International Airport" as a single word, documents containing "Kansai International Airport" will not be hit by search queries such as "International Airport" or "Airport". To solve this problem, kuromoji has a function that further divides the word "Kansai International Airport" and indexes a total of 4 words: the original word plus the 3 words "Kansai", "International", and "Airport". However, this behavior also depends on the dictionary and the morphological analyzer, and there is no guarantee that partial match search will be possible.

Advantages: Low search noise
Disadvantages: Many search omissions, depending on the quality of the dictionary

Combined use of morphological analysis and Ngram

The advantages and disadvantages of Ngram and morphological analysis are summarized in the table below in terms of the number of search noises and the number of search omissions.

	Search noise	Search omission
Ngram	Many	Few
Morphological analysis	Few	Many

This table also shows that there is a trade-off between reducing search noise and reducing search omissions.

However, by using the two together, both search omissions and the effective amount of search noise can be reduced compared with using either one alone.

The specific method for using them together is as follows.

**Index generation process**

Generate an Ngram index for the document
Generate an index by morphological analysis for the document

**Search process**

Search the Ngram index and the morphological-analysis index at the same time
Merge the search results with weights so that results from the morphological-analysis index are likely to rank at the top

[Figure: スクリーンショット 2017-02-15 20.49.32.png]

Merging the two search results reduces search omissions. However, simply merging them increases search noise. The key to this combined processing is to merge the results so that hits from the morphological-analysis index are likely to appear at the top. Users read search results from the top, so if the information they want is there, they may be satisfied and stop looking at the later results. In such cases, the search noise can be considered not to have effectively increased.
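
The weighted merge can be sketched as follows. This is a simplified model, not how Solr computes scores: each index contributes a fixed weight when it hits, and results are sorted by the total. The weights 100 and 50, the sample documents, and the hand-written word lists standing in for morphological analysis output are all assumptions made for illustration.

```python
def bigrams(text):
    """Split text into overlapping 2-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hybrid_search(query, docs, word_weight=100, ngram_weight=50):
    """Score each document against both indexes and sort by the weighted sum."""
    query_grams = bigrams(query)
    scored = []
    for doc_id, doc in docs.items():
        score = 0
        if query in doc["words"]:  # hit in the morphological-analysis index
            score += word_weight
        doc_grams = set(bigrams(doc["text"]))
        if query_grams and all(g in doc_grams for g in query_grams):  # Ngram hit
            score += ngram_weight
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document ID.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# "words" is a hand-written stand-in for the output of morphological analysis.
docs = {
    "a": {"text": "東京都美術館", "words": ["東京", "都", "美術館"]},
    "b": {"text": "京都の寺", "words": ["京都", "の", "寺"]},
    "c": {"text": "大阪城", "words": ["大阪", "城"]},
}
print(hybrid_search("京都", docs))
```

For the query "京都" (Kyoto), document "b" hits both indexes and ranks first, while document "a" hits only the Ngram index and is pushed below it instead of being dropped.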

How to make a query with Solr that uses morphological analysis and Ngram together

This section explains how to write the Solr configuration file and the query.

I confirmed this behavior on Solr 6.2.1, but since I don't use any version-specific features, I think it will work on other versions of Solr as well.

Index generation method

First is the index generation part. Write the following three pieces of information to managed-schema.

Field type definition

Define text_ja_ngram, the fieldType for Ngram index generation, and text_ja, the fieldType for index generation by morphological analysis.

<fieldType name="text_ja_ngram" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc"/>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Field definition

Then define fields that use these fieldTypes.

<field name="search_ngram" type="text_ja_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
<field name="search" type="text_ja" multiValued="true" indexed="true" required="false" stored="true"/>

Copying data from other fields

Finally, use copyField to copy the fields you want to search into the two fields above.

<copyField source="title" dest="search_ngram" />
<copyField source="title" dest="search" />
<copyField source="description" dest="search_ngram" />
<copyField source="description" dest="search" />
...

This completes the configuration of the index generation part. Let's use the Analysis screen of the Solr admin UI to confirm that these are working properly.

Index by Ngram
Index by morphological analysis

[Figure: スクリーンショット 2017-02-13 20.30.23.png]

How to issue a query

Next is the query to issue against these fields.

Use the eDisMax query to search across multiple fields. You can search using both fields by querying with the following parameters.

q =
defType=edismax
qf=search^100+search_ngram^50

The 100 and 50 specified in qf are the weights of the respective fields. The higher the weight, the more likely documents that hit that field are to appear near the top. Tuning these values requires changing them while checking the search results.
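
As a rough sketch, the request URL for these parameters could be assembled as follows. The base URL, the core name `mycore`, and the function name `build_select_url` are assumptions; only the parameter names q, defType, and qf and the field names come from this article. The "+" in the qf value above is the URL-encoded form of a space.

```python
from urllib.parse import urlencode


def build_select_url(base_url, user_query, word_boost=100, ngram_boost=50):
    """Build an eDisMax select URL that searches both fields with boosts."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # qf takes a space-separated list of field^boost pairs.
        "qf": f"search^{word_boost} search_ngram^{ngram_boost}",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical Solr location and core name.
url = build_select_url("http://localhost:8983/solr/mycore", "JIMMY")
print(url)
```

Keeping the boosts as function arguments makes it easier to try different values while comparing the search results.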

Summary

By using morphological analysis and Ngram together, we were able to reduce search omissions while suppressing the increase in search noise. As a result, iQON can now search for products by an abbreviated brand name. For example, the search query "JIMMY" now hits "JIMMY CHOO" products.

[JAVA] How to realize hybrid search using morphological analysis and Ngram with Solr