I tried to make a castle search API with Elasticsearch + Sudachi + Go + echo

Having learned Elasticsearch and Go, I built a search API. This post summarizes how to use the result, how it is configured, and the problems I ran into along the way.

Execution environment: https://github.com/takenoko-gohan/castle-search-api-environment
Search API: https://github.com/takenoko-gohan/castle-search-api

Environment

The environment is built with docker and docker-compose.

git clone https://github.com/takenoko-gohan/castle-search-api-environment.git
cd castle-search-api-environment
docker-compose build --no-cache
docker-compose up -d
# Wait a little after Elasticsearch has started before running this
sh es/script/es_init.sh 

How to Use

To use the search API, send a request in the following form. The query parameter "keyword" takes the search keyword, and the query parameter "prefecture" takes the prefecture to narrow the results down to. The following command searches for castles whose prefecture is "Fukushima Prefecture" and that match the keyword "Tsuruga Castle".

curl -XGET "http://localhost:8080/search?keyword=Tsuruga Castle&prefecture=Fukushima Prefecture"
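
When the request is issued from a program rather than typed by hand, the parameter values should be URL-encoded, since they can contain spaces or non-ASCII characters. A minimal sketch using only the standard library follows; the helper name buildSearchURL is an assumption for illustration, while the path and parameter names come from the command above.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL assembles a request URL for the search API.
// Empty parameters are omitted. (Hypothetical helper.)
func buildSearchURL(base, keyword, prefecture string) string {
	params := url.Values{}
	if keyword != "" {
		params.Set("keyword", keyword)
	}
	if prefecture != "" {
		params.Set("prefecture", prefecture)
	}
	// Encode escapes the values and sorts the parameters by key.
	return base + "/search?" + params.Encode()
}

func main() {
	fmt.Println(buildSearchURL("http://localhost:8080", "Tsuruga Castle", "Fukushima Prefecture"))
	// prints http://localhost:8080/search?keyword=Tsuruga+Castle&prefecture=Fukushima+Prefecture
}
```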

Configuration

Elasticsearch

Index settings

The index is configured as follows. At both index time and search time, the analyzer tokenizes with Sudachi, re-splits the tokens in search mode, removes tokens whose part of speech is particle, auxiliary verb, punctuation mark, or comma, and converts each token to its normalized form (SudachiNormalizedFormAttribute).

index_settings.json
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C",
            "discard_punctuation": true,
            "resources_path": "/usr/share/elasticsearch/config/sudachi",
            "settings_path": "/usr/share/elasticsearch/config/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_searchfilter",
              "my_posfilter",
              "sudachi_normalizedform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          },
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "Particle",
              "Auxiliary verb",
              "Auxiliary symbol,Punctuation",
              "Auxiliary symbol,Comma"
            ]
          }
        }
      }
    }
  }
}

Index mapping

The index mapping is as follows.

| field | type | Remarks |
|---|---|---|
| name | text | Castle name |
| prefecture | keyword | Prefecture |
| rulers | text | Castle owners |
| description | text | Castle overview |

index_mappings.json
{
  "properties": {
    "name": {"type" : "text", "analyzer": "sudachi_analyzer"},
    "prefecture": {"type": "keyword"},
    "rulers": {"type": "text", "analyzer": "sudachi_analyzer"},
    "description": {"type": "text", "analyzer": "sudachi_analyzer"}
  }
}
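
How es_init.sh applies these two files is not shown in this post. As one possible approach, assuming Elasticsearch 7 or later (typeless mappings), the two documents could be merged into a single body for the create index API, which accepts "settings" and "mappings" together. The function name createIndexBody is hypothetical, and the inputs in main are abbreviated versions of the files above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createIndexBody merges the contents of index_settings.json (top-level
// "settings") and index_mappings.json (top-level "properties") into one
// create-index request body. (Hypothetical helper.)
func createIndexBody(settingsFile, mappingsFile []byte) ([]byte, error) {
	var settings map[string]json.RawMessage
	if err := json.Unmarshal(settingsFile, &settings); err != nil {
		return nil, fmt.Errorf("settings: %w", err)
	}
	var mappings map[string]json.RawMessage
	if err := json.Unmarshal(mappingsFile, &mappings); err != nil {
		return nil, fmt.Errorf("mappings: %w", err)
	}
	body := map[string]interface{}{
		"settings": settings["settings"], // unwrap the outer "settings" key
		"mappings": mappings,             // {"properties": {...}} is used as is
	}
	return json.Marshal(body)
}

func main() {
	settings := []byte(`{"settings":{"index":{"number_of_shards":1,"number_of_replicas":0}}}`)
	mappings := []byte(`{"properties":{"prefecture":{"type":"keyword"}}}`)

	body, err := createIndexBody(settings, mappings)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(body))
}
```

The resulting body would be sent with PUT to the index URL, for example http://localhost:9200/castle, the index name used by the _analyze commands later in this post.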

Documents

The index holds data created from Wikipedia's "[Category: Japan's Top 100 Castles](https://ja.wikipedia.org/wiki/Category:%E6%97%A5%E6%9C%AC100%E5%90%8D%E5%9F%8E)".

Search API

The search API is built with the Go framework "echo" and the Elasticsearch client "go-elasticsearch". It builds an Elasticsearch query from the request parameters, executes it, and returns each field of the matching documents to the client as is.

For the query parameter "keyword", the score is weighted in the order name > rulers > description using boost. For the query parameter "prefecture", an exact-match search is run against the field "prefecture".

Create query
package search

func createQuery(q *Query) map[string]interface{} {
	query := map[string]interface{}{}
	if q.Keyword != "" && q.Prefecture != "" {
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"must": []map[string]interface{}{
						{
							"bool": map[string]interface{}{
								"should": []map[string]interface{}{
									{
										"match": map[string]interface{}{
											"name": map[string]interface{}{
												"query": q.Keyword,
												"boost": 3,
											},
										},
									},
									{
										"match": map[string]interface{}{
											"rulers": map[string]interface{}{
												"query": q.Keyword,
												"boost": 2,
											},
										},
									},
									{
										"match": map[string]interface{}{
											"description": map[string]interface{}{
												"query": q.Keyword,
												"boost": 1,
											},
										},
									},
								},
								"minimum_should_match": 1,
							},
						},
						{
							"bool": map[string]interface{}{
								"must": []map[string]interface{}{
									{
										"term": map[string]interface{}{
											"prefecture": q.Prefecture,
										},
									},
								},
							},
						},
					},
				},
			},
		}
	} else if q.Keyword != "" && q.Prefecture == "" {
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"should": []map[string]interface{}{
						{
							"match": map[string]interface{}{
								"name": map[string]interface{}{
									"query": q.Keyword,
									"boost": 3,
								},
							},
						},
						{
							"match": map[string]interface{}{
								"rulers": map[string]interface{}{
									"query": q.Keyword,
									"boost": 2,
								},
							},
						},
						{
							"match": map[string]interface{}{
								"description": map[string]interface{}{
									"query": q.Keyword,
									"boost": 1,
								},
							},
						},
					},
					"minimum_should_match": 1,
				},
			},
		}
	} else if q.Keyword == "" && q.Prefecture != "" {
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"must": []map[string]interface{}{
						{
							"term": map[string]interface{}{
								"prefecture": q.Prefecture,
							},
						},
					},
				},
			},
		}
	}

	return query
}
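
The three branches above repeat the same match clauses. As an alternative sketch (not the repository's code), the query can be assembled from one helper per clause. Putting "should" with "minimum_should_match": 1 and the "must" term clause into a single bool query selects the same documents as the nested version: at least one keyword clause and the prefecture must match. The Query struct is assumed from the fields used above, and buildQuery and boostedMatch are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Query holds the two request parameters (assumed shape).
type Query struct {
	Keyword    string
	Prefecture string
}

// boostedMatch builds one match clause with a boost.
func boostedMatch(field, keyword string, boost int) map[string]interface{} {
	return map[string]interface{}{
		"match": map[string]interface{}{
			field: map[string]interface{}{"query": keyword, "boost": boost},
		},
	}
}

// buildQuery returns the request body, or an empty map when both
// parameters are empty, like the original function.
func buildQuery(q Query) map[string]interface{} {
	boolQuery := map[string]interface{}{}
	if q.Keyword != "" {
		boolQuery["should"] = []map[string]interface{}{
			boostedMatch("name", q.Keyword, 3),
			boostedMatch("rulers", q.Keyword, 2),
			boostedMatch("description", q.Keyword, 1),
		}
		boolQuery["minimum_should_match"] = 1
	}
	if q.Prefecture != "" {
		boolQuery["must"] = []map[string]interface{}{
			{"term": map[string]interface{}{"prefecture": q.Prefecture}},
		}
	}
	if len(boolQuery) == 0 {
		return map[string]interface{}{}
	}
	return map[string]interface{}{
		"query": map[string]interface{}{"bool": boolQuery},
	}
}

func main() {
	q := Query{Keyword: "Wakamatsu Castle", Prefecture: "Fukushima Prefecture"}
	out, err := json.MarshalIndent(buildQuery(q), "", "  ")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(out))
}
```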

Where I got stuck

When I first checked the behavior after getting the API working, I received the following response.

curl -XGET "http://localhost:8080/search?keyword=Wakamatsu Castle&prefecture=Fukushima Prefecture"
{
    "message": "The search was successful.",
    "Results": [
        {
            "name": "Wakamatsu Castle",
            "prefecture": "Fukushima Prefecture",
            "rulers": [
                "Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
            ],
            "description": "Wakamatsu Castle is located in Otemachi, Aizuwakamatsu City, Fukushima Prefecture.-It is a Japanese castle that was in 1. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In the history of literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated by the name of Wakamatsu Castle Ruins."
        },
        {
            "name": "Nihonmatsu Castle",
            "prefecture": "Fukushima Prefecture",
            "rulers": [
                "Mr. Kato",
                "Mr. Niwa",
                "Mr. Gamo",
                "Mr. Nihonmatsu",
                "Mr. Uesugi",
                "Date"
            ],
            "description": "Nihonmatsu Castle is a Japanese castle (Hirayama Castle) located in Kakunai, Nihonmatsu City, Fukushima Prefecture. One of Japan's Top 100 Castles. Also known as Kasumigajo / Shirohata Castle. On July 26, 2007, it was designated as a national historic site as the site of Nihonmatsu Castle. It has been selected as one of Japan's Top 100 Cherry Blossom Spots as \"Kasumigajo Park\"."
        },
        {
            "name": "Shirakawa Komine Castle",
            "prefecture": "Fukushima Prefecture",
            "rulers": [
                "Mr. Matsudaira",
                "Mr. Niwa",
                "Mr. Yuki Shirakawa",
                "Mr. Gamo",
                "Mr. Abe_(Tokugawa Fudai)"
            ],
            "description": "Shirakawa Komine Castle is a Japanese castle located in Shirakawa City, Fukushima Prefecture (Shirakawa, Shirakawa District, Mutsu Province). Also called simply Shirakawa Castle or Komine Castle. It is designated as a national historic site. In addition, it is counted as one of Japan's Top 100 Castles."
        }
    ]
}

I expected only Wakamatsu Castle to be hit, but other castles in Fukushima Prefecture were hit as well. Checking the analysis with the following command showed that "Wakamatsu Castle" is split into "Wakamatsu / castle". The other castles were apparently hit through the "castle" token produced at search time.

curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "text": "Wakamatsu Castle",
  "analyzer": "sudachi_analyzer"
}'
{
  "tokens" : [
    {
      "token" : "Wakamatsu",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "castle",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}
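
The effect can be reproduced with a toy model. A match query uses the OR operator by default, so a document is a hit as soon as one query token also occurs among its own tokens. In the sketch below, the token lists for the other two castles are assumptions that follow the same splitting pattern, and matchesAny is a hypothetical helper that ignores scoring.

```go
package main

import "fmt"

// matchesAny reports whether at least one query token occurs among the
// document tokens, a simplified view of a match query with operator OR.
func matchesAny(queryTokens, docTokens []string) bool {
	seen := make(map[string]bool, len(docTokens))
	for _, t := range docTokens {
		seen[t] = true
	}
	for _, t := range queryTokens {
		if seen[t] {
			return true
		}
	}
	return false
}

func main() {
	// Tokens produced for the keyword "Wakamatsu Castle".
	query := []string{"Wakamatsu", "castle"}

	// Assumed tokens of the "name" field of each document.
	docs := []struct {
		name   string
		tokens []string
	}{
		{"Wakamatsu Castle", []string{"Wakamatsu", "castle"}},
		{"Nihonmatsu Castle", []string{"Nihonmatsu", "castle"}},
		{"Shirakawa Komine Castle", []string{"Shirakawa", "Komine", "castle"}},
	}

	for _, d := range docs {
		fmt.Printf("%s: hit=%v\n", d.name, matchesAny(query, d.tokens))
	}
}
```

All three documents share the token "castle" with the query, so all three are reported as hits.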

Therefore, following an article on user dictionaries (see the references at the end), I created a CSV file in the following format listing the name of each castle, built a user dictionary from it, and registered the dictionary with the analyzer.

Wakamatsu Castle,4786,4786,5000,Wakamatsu Castle,noun,proper noun,General,*,*,*,Wakamatsujo,Wakamatsu Castle,*,*,*,*,*

After registering the user dictionary, I checked the analysis result again, and this time the text was kept as the single proper noun "Wakamatsu Castle".

curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "text": "Wakamatsu Castle",
  "analyzer": "sudachi_analyzer"
}'
{
  "tokens" : [
    {
      "token" : "Wakamatsu Castle",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

When I searched with the search API again, only Wakamatsu Castle was hit, as expected.

curl -XGET "http://localhost:8080/search?keyword=Wakamatsu Castle&prefecture=Fukushima Prefecture"
{
    "message": "The search was successful.",
    "Results": [
        {
            "name": "Wakamatsu Castle",
            "prefecture": "Fukushima Prefecture",
            "rulers": [
                "Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
            ],
            "description": "Wakamatsu Castle is located in Otemachi, Aizuwakamatsu City, Fukushima Prefecture.-It is a Japanese castle that was in 1. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In the history of literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated by the name of Wakamatsu Castle Ruins."
        }
    ]
}

But another problem came up. Searching with keyword "Wakamatsu" and prefecture "Fukushima Prefecture" now returned no hits. With the user dictionary registered, "Wakamatsu Castle" is no longer split into "Wakamatsu / castle", so the keyword "Wakamatsu" apparently no longer matches it.

curl -XGET "http://localhost:8080/search?keyword=Wakamatsu&prefecture=Fukushima Prefecture"
{
    "message": "The search was successful.",
    "Results": null
}

According to the Sudachi user dictionary documentation (see the references at the end), the 16th column of the CSV file can hold the information for splitting an entry into A units. I therefore changed the CSV file as follows so that, in search mode, an entry is output both as the C unit and as its A units. (Only Wakamatsu Castle, Nihonmatsu Castle, and Shirakawa Komine Castle ...)

Wakamatsu Castle,4786,4786,5000,Wakamatsu Castle,noun,proper noun,General,*,*,*,Wakamatsujo,Wakamatsu Castle,*,C,650091/368637,*,*
Nihonmatsu Castle,4786,4786,5000,Nihonmatsu Castle,noun,proper noun,General,*,*,*,Nihonmatsujo,Nihonmatsu Castle,*,C,281483/368637,*,*
Shirakawa Komine Castle,4786,4786,5000,Shirakawa Komine Castle,noun,proper noun,General,*,*,*,Shirakawa Kominejo,Shirakawa Komine Castle,*,C,584799/394859/368637,*,*

curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "text": "Wakamatsu Castle",
  "analyzer": "sudachi_analyzer"
}'
{
  "tokens" : [
    {
      "token" : "Wakamatsu Castle",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "Wakamatsu",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "castle",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    }
  ]
}

Now, searching with keyword "Wakamatsu" and prefecture "Fukushima Prefecture" also hits Wakamatsu Castle. Getting a search to behave the way you intend is hard.

curl  -XGET "http://localhost:8080/search?keyword=Wakamatsu Castle&prefecture=Fukushima Prefecture"
{
    "message": "The search was successful.",
    "Results": [
        {
            "name": "Wakamatsu Castle",
            "prefecture": "Fukushima Prefecture",
            "rulers": [
                "Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
            ],
            "description": "Wakamatsu Castle is located in Otemachi, Aizuwakamatsu City, Fukushima Prefecture.-It is a Japanese castle that was in 1. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In the history of literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated by the name of Wakamatsu Castle Ruins."
        }
    ]
}

References

Hands-on to create a user dictionary with Elasticsearch + Sudachi + Docker
How to create a Sudachi user dictionary
elasticsearch-sudachi README
go-elasticsearch README
