After learning Elasticsearch and Go, I built a search API. This post summarizes how to use it, how it is configured, and the problems I ran into along the way.
Execution environment: https://github.com/takenoko-gohan/castle-search-api-environment
Search API: https://github.com/takenoko-gohan/castle-search-api
The environment is built with docker and docker-compose.
git clone https://github.com/takenoko-gohan/castle-search-api-environment.git
cd castle-search-api-environment
docker-compose build --no-cache
docker-compose up -d
# Wait a little after Elasticsearch has started, then run the init script
sh es/script/es_init.sh
To use the search API, send a request in the following form. The query parameter "keyword" holds the search keyword, and the query parameter "prefecture" holds the prefecture used to narrow down the results. The following command searches for castles whose prefecture is "Fukushima Prefecture" and that match the keyword "Tsuruga Castle".
curl -XGET "http://localhost:8080/search?keyword=Tsuruga%20Castle&prefecture=Fukushima%20Prefecture"
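Parameter values that contain spaces or Japanese characters have to be percent-encoded before they go into the URL. The following is a minimal Go sketch of a client-side helper that does this with the standard library. `buildSearchURL` is a hypothetical name and is not part of the repository.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildSearchURL returns the search endpoint URL with both query
// parameters encoded. Empty parameters are left out.
// (Hypothetical helper for illustration, not taken from the repository.)
func buildSearchURL(base, keyword, prefecture string) string {
	params := url.Values{}
	if keyword != "" {
		params.Set("keyword", keyword)
	}
	if prefecture != "" {
		params.Set("prefecture", prefecture)
	}
	if len(params) == 0 {
		return base
	}
	// Encode sorts the keys and escapes the values (a space becomes "+").
	return base + "?" + params.Encode()
}

func main() {
	fmt.Println(buildSearchURL("http://localhost:8080/search", "Tsuruga Castle", "Fukushima Prefecture"))
}
```

`url.Values.Encode` uses form-style encoding, which the query parsers in Go's `net/url` decode back to the original values.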
Elasticsearch
The index settings are as follows. At both index and search time, the analyzer splits text into tokens in search mode, removes tokens whose part of speech is a particle, an auxiliary verb, a period, or a comma, and converts each remaining token to its normalized form (SudachiNormalizedFormAttribute).
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C",
            "discard_punctuation": true,
            "resources_path": "/usr/share/elasticsearch/config/sudachi",
            "settings_path": "/usr/share/elasticsearch/config/sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [
              "my_searchfilter",
              "my_posfilter",
              "sudachi_normalizedform"
            ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter": {
          "my_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          },
          "my_posfilter": {
            "type": "sudachi_part_of_speech",
            "stoptags": [
              "助詞",
              "助動詞",
              "補助記号,句点",
              "補助記号,読点"
            ]
          }
        }
      }
    }
  }
}
The index mapping is as follows.
| field | type | Remarks |
|---|---|---|
| name | text | Castle name |
| prefecture | keyword | Prefecture |
| rulers | text | Castle owners |
| description | text | Castle overview |
{
  "properties": {
    "name": {"type": "text", "analyzer": "sudachi_analyzer"},
    "prefecture": {"type": "keyword"},
    "rulers": {"type": "text", "analyzer": "sudachi_analyzer"},
    "description": {"type": "text", "analyzer": "sudachi_analyzer"}
  }
}
The search index is populated with data created from Wikipedia's "[Category: Japan's Top 100 Castles](https://ja.wikipedia.org/wiki/Category:%E6%97%A5%E6%9C%AC100%E5%90%8D%E5%9F%8E)".
I built the search API with the Go web framework "echo" and the Elasticsearch client "go-elasticsearch". The API is simple: it builds an Elasticsearch query from the parameters it receives, executes it, and returns each field of the matching documents to the client as is.
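As a rough illustration of the "return each field of the matching documents as is" step, here is a minimal sketch that pulls the `_source` of every hit out of a raw Elasticsearch search response using only the standard library. The `Castle` and `searchResponse` types and the `extractCastles` function are hypothetical names; the repository's actual types may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Castle mirrors the four fields of the index mapping.
// (Assumed shape; rulers is modelled as a list because the API
// responses in this article show it as an array.)
type Castle struct {
	Name        string   `json:"name"`
	Prefecture  string   `json:"prefecture"`
	Rulers      []string `json:"rulers"`
	Description string   `json:"description"`
}

// searchResponse covers only the part of the Elasticsearch search
// response needed here: hits.hits[]._source.
type searchResponse struct {
	Hits struct {
		Hits []struct {
			Source Castle `json:"_source"`
		} `json:"hits"`
	} `json:"hits"`
}

// extractCastles decodes a raw search response body and returns the
// _source of every hit, in the order Elasticsearch returned them.
func extractCastles(body []byte) ([]Castle, error) {
	var res searchResponse
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, err
	}
	castles := make([]Castle, 0, len(res.Hits.Hits))
	for _, h := range res.Hits.Hits {
		castles = append(castles, h.Source)
	}
	return castles, nil
}

func main() {
	// Hand-written sample body, shortened for illustration.
	body := []byte(`{"hits":{"hits":[{"_source":{"name":"Wakamatsu Castle","prefecture":"Fukushima Prefecture","rulers":["Mr. Gamo"],"description":"sample"}}]}}`)
	castles, err := extractCastles(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(castles[0].Name, castles[0].Prefecture)
}
```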
For searches using the query parameter "keyword", boost weights the score in the order "name > rulers > description". For searches using the query parameter "prefecture", an exact-match (term) search is run against the "prefecture" field.
package search

// createQuery builds the Elasticsearch query body from the request parameters.
// Query (not shown here) carries the Keyword and Prefecture parameters.
func createQuery(q *Query) map[string]interface{} {
	query := map[string]interface{}{}
	if q.Keyword != "" && q.Prefecture != "" {
		// Both parameters: the keyword must match at least one field
		// AND the prefecture must match exactly.
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"must": []map[string]interface{}{
						{
							"bool": map[string]interface{}{
								"should": []map[string]interface{}{
									{
										"match": map[string]interface{}{
											"name": map[string]interface{}{
												"query": q.Keyword,
												"boost": 3,
											},
										},
									},
									{
										"match": map[string]interface{}{
											"rulers": map[string]interface{}{
												"query": q.Keyword,
												"boost": 2,
											},
										},
									},
									{
										"match": map[string]interface{}{
											"description": map[string]interface{}{
												"query": q.Keyword,
												"boost": 1,
											},
										},
									},
								},
								"minimum_should_match": 1,
							},
						},
						{
							"bool": map[string]interface{}{
								"must": []map[string]interface{}{
									{
										"term": map[string]interface{}{
											"prefecture": q.Prefecture,
										},
									},
								},
							},
						},
					},
				},
			},
		}
	} else if q.Keyword != "" && q.Prefecture == "" {
		// Keyword only: match against name, rulers and description with boosts.
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"should": []map[string]interface{}{
						{
							"match": map[string]interface{}{
								"name": map[string]interface{}{
									"query": q.Keyword,
									"boost": 3,
								},
							},
						},
						{
							"match": map[string]interface{}{
								"rulers": map[string]interface{}{
									"query": q.Keyword,
									"boost": 2,
								},
							},
						},
						{
							"match": map[string]interface{}{
								"description": map[string]interface{}{
									"query": q.Keyword,
									"boost": 1,
								},
							},
						},
					},
					"minimum_should_match": 1,
				},
			},
		}
	} else if q.Keyword == "" && q.Prefecture != "" {
		// Prefecture only: exact match on the keyword field.
		query = map[string]interface{}{
			"query": map[string]interface{}{
				"bool": map[string]interface{}{
					"must": []map[string]interface{}{
						{
							"term": map[string]interface{}{
								"prefecture": q.Prefecture,
							},
						},
					},
				},
			},
		}
	}
	// With neither parameter set, the empty map is returned.
	return query
}
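The three branches above repeat the same boosted match clauses. As a possible alternative, not the repository's code, the sketch below builds an equivalent bool query from two small helpers. `obj`, `keywordClauses` and `buildQuery` are hypothetical names, and the bool query is flattened to a single level (`should` plus `must`) instead of nesting, which is assumed to match the same documents.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// obj is shorthand for a JSON object.
type obj = map[string]interface{}

// keywordClauses returns the boosted match clauses for name, rulers
// and description (boosts 3, 2, 1 as in the article).
func keywordClauses(keyword string) []obj {
	fields := []struct {
		name  string
		boost int
	}{{"name", 3}, {"rulers", 2}, {"description", 1}}
	clauses := make([]obj, 0, len(fields))
	for _, f := range fields {
		clauses = append(clauses, obj{
			"match": obj{f.name: obj{"query": keyword, "boost": f.boost}},
		})
	}
	return clauses
}

// buildQuery assembles the query body. With neither parameter set it
// returns an empty object, like createQuery does.
func buildQuery(keyword, prefecture string) obj {
	b := obj{}
	if keyword != "" {
		b["should"] = keywordClauses(keyword)
		b["minimum_should_match"] = 1
	}
	if prefecture != "" {
		b["must"] = []obj{{"term": obj{"prefecture": prefecture}}}
	}
	if len(b) == 0 {
		return obj{}
	}
	return obj{"query": obj{"bool": b}}
}

func main() {
	out, err := json.MarshalIndent(buildQuery("Wakamatsu Castle", "Fukushima Prefecture"), "", "  ")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(out))
}
```

Because `minimum_should_match` is set explicitly, the keyword clauses stay mandatory even when a `must` clause is present in the same bool query.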
When I tried out the first working version, I got the following response.
curl -XGET "http://localhost:8080/search?keyword=Wakamatsu%20Castle&prefecture=Fukushima%20Prefecture"
{
"message": "The search was successful.",
"Results": [
{
"name": "Wakamatsu Castle",
"prefecture": "Fukushima Prefecture",
"rulers": [
"Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
],
"description": "Wakamatsu Castle was a Japanese castle located at 1-1 Otemachi, Aizuwakamatsu City, Fukushima Prefecture. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In historical literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated under the name Wakamatsu Castle Ruins."
},
{
"name": "Nihonmatsu Castle",
"prefecture": "Fukushima Prefecture",
"rulers": [
"Mr. Kato",
"Mr. Niwa",
"Mr. Gamo",
"Mr. Nihonmatsu",
"Mr. Uesugi",
"Date"
],
"description": "Nihonmatsu Castle is a Japanese castle (Hirayama Castle) located in Kakunai, Nihonmatsu City, Fukushima Prefecture. One of Japan's Top 100 Castles. Also known as Kasumigajo / Shirohata Castle. On July 26, 2007, it was designated as a national historic site as the site of Nihonmatsu Castle. It has been selected as one of Japan's Top 100 Cherry Blossom Spots as \"Kasumigajo Park\"."
},
{
"name": "Shirakawa Komine Castle",
"prefecture": "Fukushima Prefecture",
"rulers": [
"Mr. Matsudaira",
"Mr. Niwa",
"Mr. Yuki Shirakawa",
"Mr. Gamo",
"Mr. Abe_(Tokugawa Fudai)"
],
"description": "Shirakawa Komine Castle is a Japanese castle located in Shirakawa City, Fukushima Prefecture (Shirakawa, Shirakawa District, Mutsu Province). Also called simply Shirakawa Castle or Komine Castle. It is designated as a national historic site. In addition, it is counted as one of Japan's Top 100 Castles."
}
]
}
I expected only Wakamatsu Castle to be returned, but other castles in Fukushima Prefecture were returned as well. When I checked with the following command how the keyword is analyzed, "Wakamatsu Castle" turned out to be split into "Wakamatsu / castle". The other castles were apparently matched by the "castle" token produced at search time. (The offsets in the output refer to the original Japanese text 若松城.)
curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
"text": "Wakamatsu Castle",
"analyzer": "sudachi_analyzer"
}'
{
"tokens" : [
{
"token" : "Wakamatsu",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "castle",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
}
]
}
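The effect can be reproduced without Elasticsearch. A match query uses the OR operator by default, so a document matches as soon as one query token appears in the field. The sketch below simulates that rule on hand-written token lists. The token lists for the other two castles are assumptions made for illustration, and `matchesAny` is a hypothetical helper.

```go
package main

import "fmt"

// matchesAny reports whether at least one query token occurs among the
// document tokens, which is how a match query behaves with the default
// OR operator.
func matchesAny(queryTokens, docTokens []string) bool {
	seen := make(map[string]bool, len(docTokens))
	for _, t := range docTokens {
		seen[t] = true
	}
	for _, t := range queryTokens {
		if seen[t] {
			return true
		}
	}
	return false
}

func main() {
	// Token lists written by hand to mirror the _analyze output above.
	// The lists for Nihonmatsu and Shirakawa Komine are assumed.
	names := map[string][]string{
		"Wakamatsu Castle":        {"Wakamatsu", "castle"},
		"Nihonmatsu Castle":       {"Nihonmatsu", "castle"},
		"Shirakawa Komine Castle": {"Shirakawa", "Komine", "castle"},
	}
	query := []string{"Wakamatsu", "castle"}
	for name, tokens := range names {
		fmt.Println(name, matchesAny(query, tokens))
	}
}
```

Every castle name contains the "castle" token, so all three match the query even though only one of them contains "Wakamatsu".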
Therefore, following an article on Sudachi user dictionaries (listed at the end of this post), I created a CSV file in the format below and registered a user dictionary containing the name of each castle with the analyzer.
若松城,4786,4786,5000,若松城,名詞,固有名詞,一般,*,*,*,ワカマツジョウ,若松城,*,*,*,*,*
After registering the user dictionary, I checked the analysis result again. This time the text was kept as the single proper noun "Wakamatsu Castle".
curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
"text": "Wakamatsu Castle",
"analyzer": "sudachi_analyzer"
}'
{
"tokens" : [
{
"token" : "Wakamatsu Castle",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
}
]
}
When I searched with the search API again, only Wakamatsu Castle was returned, as expected.
curl -XGET "http://localhost:8080/search?keyword=Wakamatsu%20Castle&prefecture=Fukushima%20Prefecture"
{
"message": "The search was successful.",
"Results": [
{
"name": "Wakamatsu Castle",
"prefecture": "Fukushima Prefecture",
"rulers": [
"Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
],
"description": "Wakamatsu Castle was a Japanese castle located at 1-1 Otemachi, Aizuwakamatsu City, Fukushima Prefecture. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In historical literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated under the name Wakamatsu Castle Ruins."
}
]
}
But another problem arose. When I searched with the keyword "Wakamatsu" and the prefecture "Fukushima Prefecture", there were no hits. With the user dictionary registered, "Wakamatsu Castle" is no longer split into "Wakamatsu / castle", so the keyword "Wakamatsu" apparently no longer matches it.
curl -XGET "http://localhost:8080/search?keyword=Wakamatsu&prefecture=Fukushima%20Prefecture"
{
"message": "The search was successful.",
"Results": null
}
According to the documentation on creating Sudachi user dictionaries (listed at the end of this post), the 16th column of the CSV file can hold the information for splitting an entry into A units. I therefore modified the CSV file as follows so that, in search mode, an entry is emitted both as the C unit and as its A units. (Only for Wakamatsu Castle, Nihonmatsu Castle, and Shirakawa Komine Castle ...)
若松城,4786,4786,5000,若松城,名詞,固有名詞,一般,*,*,*,ワカマツジョウ,若松城,*,C,650091/368637,*,*
二本松城,4786,4786,5000,二本松城,名詞,固有名詞,一般,*,*,*,ニホンマツジョウ,二本松城,*,C,281483/368637,*,*
白河小峰城,4786,4786,5000,白河小峰城,名詞,固有名詞,一般,*,*,*,シラカワコミネジョウ,白河小峰城,*,C,584799/394859/368637,*,*
curl -XGET 'http://localhost:9200/castle/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
"text": "Wakamatsu Castle",
"analyzer": "sudachi_analyzer"
}'
{
"tokens" : [
{
"token" : "Wakamatsu Castle",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0,
"positionLength" : 2
},
{
"token" : "Wakamatsu",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "castle",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
}
]
}
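In this output the full name and its first part share position 0, and the full name spans two positions (`positionLength` 2). A small sketch that decodes such an `_analyze` response and groups the tokens by position makes the overlap easier to see. The type and function names are hypothetical, and only the fields shown in the response above are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// analyzeToken models the token fields that appear in the _analyze
// output above. positionLength is absent for ordinary tokens.
type analyzeToken struct {
	Token          string `json:"token"`
	StartOffset    int    `json:"start_offset"`
	EndOffset      int    `json:"end_offset"`
	Position       int    `json:"position"`
	PositionLength int    `json:"positionLength"`
}

type analyzeResponse struct {
	Tokens []analyzeToken `json:"tokens"`
}

// tokensByPosition groups the token texts by their position, keeping
// the order in which they appear in the response.
func tokensByPosition(body []byte) (map[int][]string, error) {
	var res analyzeResponse
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, err
	}
	byPos := make(map[int][]string)
	for _, t := range res.Tokens {
		byPos[t.Position] = append(byPos[t.Position], t.Token)
	}
	return byPos, nil
}

func main() {
	// Shortened copy of the response shown above.
	body := []byte(`{"tokens":[
		{"token":"Wakamatsu Castle","start_offset":0,"end_offset":3,"position":0,"positionLength":2},
		{"token":"Wakamatsu","start_offset":0,"end_offset":2,"position":0},
		{"token":"castle","start_offset":2,"end_offset":3,"position":1}]}`)
	byPos, err := tokensByPosition(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("position 0:", byPos[0])
	fmt.Println("position 1:", byPos[1])
}
```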
Now Wakamatsu Castle is returned even when searching with the keyword "Wakamatsu" and the prefecture "Fukushima Prefecture". Getting a search to behave the way you want is hard.
curl -XGET "http://localhost:8080/search?keyword=Wakamatsu&prefecture=Fukushima%20Prefecture"
{
"message": "The search was successful.",
"Results": [
{
"name": "Wakamatsu Castle",
"prefecture": "Fukushima Prefecture",
"rulers": [
"Mr. Gamo, Mr. Uesugi, Mr. Kato, Mr. Hoshina, Aizu Matsudaira family"
],
"description": "Wakamatsu Castle was a Japanese castle located at 1-1 Otemachi, Aizuwakamatsu City, Fukushima Prefecture. Locally, it is generally called Tsurugajo, and outside of the local area, it is often called Aizuwakamatsu Castle. In historical literature, it is sometimes referred to as Kurokawa Castle or Aizu Castle. As a national historic site, it is designated under the name Wakamatsu Castle Ruins."
}
]
}
References
- Hands-on to create a user dictionary with Elasticsearch + Sudachi + Docker
- How to create a Sudachi user dictionary
- elasticsearch-sudachi README
- go-elasticsearch README