Recently I started using Elasticsearch at work. As a memo, I'll write down what I learned, from data input to scoring.
Writing everything at once would get quite long, so I'll split it into two parts. The first part covers installation through data input; the second part will cover search and scoring.
For now, install what's needed. The development environment is CentOS 7.
Elasticsearch requires at least Java 8. Specifically as of this writing, it is recommended that you use the Oracle JDK version 1.8.0_73.
First, install Java 8, referring to here. If you already have an older version such as Java 1.7, the link above also explains how to switch the Java VM. Alternatively, just remove Java 7 and install Java 8:
$ sudo yum remove -y java-1.7.0-openjdk
$ sudo yum install -y java-1.8.0-openjdk-devel
$ sudo yum install -y java-1.8.0-openjdk-debuginfo --enablerepo=*debug*
Confirm the version:
$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
Older 2.x versions can be installed with yum. Since Elasticsearch 5.0 is what I use here, install Elasticsearch 5.0. (It seems 6 was released recently.) The installation should follow the steps in the Elasticsearch Docs, but startup did not work for me. (; ∀ ;)
So install from the RPM repository as an alternative. (At the moment, 6 cannot be installed this way.)
# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-5.x]
name=Elasticsearch repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
# yum install elasticsearch
# systemctl enable elasticsearch
# systemctl start elasticsearch
Test whether it started:
# curl localhost:9200
{
  "name" : "3Y-W_H1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "cYenb8Q8S22EHcxJPL7k2Q",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}
It started successfully.
Next, install Kibana, referring to the Elastic docs.
# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/kibana.repo
[kibana-5.x]
name=Kibana repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
# yum install kibana
# systemctl enable kibana
# systemctl start kibana
Set up the connection.
# vim /etc/kibana/kibana.yml
network.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
# systemctl restart kibana
Open it in a browser:
http://192.168.216.128:5601
It connected without problems. ((´∀ `))
Let's also install the plugin for Japanese analysis.
# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-kuromoji
# systemctl restart elasticsearch
# curl -X GET 'http://localhost:9200/_nodes/plugins?pretty'
…
"plugins" : [
{
"name" : "analysis-kuromoji",
"version" : "5.0.0",
"description" : "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
"classname" : "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin"
}
],
…
There is also a hosted version, Elastic Cloud (https://cloud.elastic.co/), with a 14-day free trial. Elastic runs it on AWS servers rather than your own. Operating everything from a browser GUI is convenient, but it is inconvenient if you cannot get into the server's back end (CUI).
First, since I want to use kuromoji, I'll take in Twitter data and analyze it in Japanese. Second, since I'd like to use Kibana's map, I'll input earthquake information.
I'm not very familiar with collecting tweets, so I'll proceed by referring to the contents of here.
# python -V
Python 3.5.2 :: Anaconda 4.1.1 (64-bit)
First, install the required packages.
# pip install twitter
# pip install elasticsearch
Like SQL, Elasticsearch requires you to decide the data structure in advance before inputting data. Elasticsearch has an index → type → id structure. When inputting data, index and type must be specified; id is optional, and if omitted, a UUID-like unique id is generated automatically.
| SQL | Mongo | Elastic |
|---|---|---|
| DB | DB | index |
| table | collection | type |
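For example, with the elasticsearch-py client installed above, inputting a document looks like this. A minimal sketch: `twi_index`/`twi_type` match the mapping created below, and the document body is a dummy.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

doc = {'created_at': '2016-11-11T20:00:00',
       'text': 'サンプルツイート',
       'track': 'test'}

# index and type must be given; without an id, Elasticsearch generates one
res = es.index(index='twi_index', doc_type='twi_type', body=doc)
print(res['_id'])  # the auto-generated document id

# you can also choose the id yourself
es.index(index='twi_index', doc_type='twi_type', id='tweet-1', body=doc)
```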
You can create the mapping with curl, but I find Kibana's Dev Tools handier.
PUT /twi_index
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji": {
            "type": "kuromoji_tokenizer",
            "mode": "search"
          }
        },
        "analyzer": {
          "japanese": {
            "type": "custom",
            "tokenizer": "kuromoji",
            "filter": ["pos_filter"]
          }
        },
        "filter": {
          "pos_filter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": ["接続詞","助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-接続助詞","助詞-係助詞","助詞-副助詞","助詞-間投助詞","助詞-並立助詞","助詞-終助詞","助詞-副助詞／並立助詞／終助詞","助詞-連体化","助詞-副詞化","助詞-特殊","助動詞","記号","記号-一般","記号-読点","記号-句点","記号-空白","記号-括弧開","記号-括弧閉","その他-間投","フィラー","非言語音"]
          }
        }
      }
    }
  },
  "mappings": {
    "twi_type": {
      "properties": {
        "created_at": {
          "type": "date"
        },
        "text": {
          "type": "text",
          "analyzer": "japanese",
          "fielddata": true
        },
        "track": {
          "type": "keyword"
        }
      }
    }
  }
}
Define the `analyzer` in `settings`. Since the `text` field, which stores the tweet body, needs Japanese analysis, specify the custom `analyzer` for it. We don't want `track` analyzed, so its type is `keyword`.
For how to configure kuromoji, refer to here. Here we use the `kuromoji_part_of_speech` filter to exclude specific parts of speech (particles, case particles, symbols, and so on).
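To check that the filter really drops particles, you can run the `_analyze` API against the index. A minimal sketch with the Python client; the sample sentence is arbitrary.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# analyze a sample sentence with the custom "japanese" analyzer defined above
res = es.indices.analyze(index='twi_index',
                         body={'analyzer': 'japanese',
                               'text': '紅の豚がテレビで放送された'})
print([t['token'] for t in res['tokens']])
# particles such as 「が」 and 「で」 should be removed by pos_filter
```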
Almost all of [yoppe's script](http://qiita.com/yoppe/items/3e61fd567ae1d4c40a96#%E3%83%84%E3%82%A4%E3%83%BC%E3%83%88%E3%81%AE%E5%8F%8E%E9%9B%86) is used as-is. I put my slightly modified script on GitHub; the core of it looks roughly like the sketch below.
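This is a sketch, not the actual script: the OAuth credentials and the track keyword are placeholders.

```python
from datetime import datetime
from elasticsearch import Elasticsearch
from twitter import OAuth, TwitterStream

# placeholders -- use your own Twitter application keys
auth = OAuth('ACCESS_TOKEN', 'ACCESS_SECRET', 'CONSUMER_KEY', 'CONSUMER_SECRET')
es = Elasticsearch('http://localhost:9200')

track = '紅の豚'  # keyword to follow
stream = TwitterStream(auth=auth)
for tweet in stream.statuses.filter(track=track, language='ja'):
    if 'text' not in tweet:  # skip keep-alives and delete notices
        continue
    # Twitter's created_at ("Fri Nov 11 12:00:00 +0000 2016") does not match
    # the mapping's default date format, so convert it first
    created = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    es.index(index='twi_index', doc_type='twi_type',
             body={'created_at': created,
                   'text': tweet['text'],
                   'track': track})
```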
After a while, let's check the data in Kibana. Load the index into Kibana.
Select Discover → track, then Visualize → pie chart → twi_index.
On this day (2016/11/11), "Porco Rosso" (紅の豚) was broadcast on TV, so it became a hot topic on Twitter. (´∀ `)
By the way, you can check the indices you have created with:
GET _aliases
・ About field data types
- In 2.x, text used the `string` type, but from 5.x it is split into `text` and `keyword`. Keyword fields are only searchable by their exact value. If you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should rather use a keyword field.
- `text` is used when analysis is required; `keyword` is used when an exact match is required when searching.
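The difference shows up at search time. A minimal sketch, assuming some tweets are already indexed: a `match` query on the analyzed `text` field hits word-level matches, while a `term` query on the `keyword` field `track` only hits the exact stored value.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# text: the query string is analyzed, so individual words can match
res = es.search(index='twi_index',
                body={'query': {'match': {'text': '紅の豚'}}})
print(res['hits']['total'])

# keyword: matches only documents whose track is exactly this value
res = es.search(index='twi_index',
                body={'query': {'term': {'track': '紅の豚'}}})
print(res['hits']['total'])
```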
Finally, since I used kuromoji, let's check its effect.
Without kuromoji:
With kuromoji:
It's subtle, but it is a little better. Well, tweets contain many proper nouns, so it is hard to analyze them well without defining a user dictionary.
The data source is the JSON API provided by P2P Earthquake Information.
PUT /earthquakes
{
  "mappings": {
    "earthquake": {
      "properties": {
        "time": {
          "type": "date",
          "format": "yyyy/MM/dd HH:mm:ssZ"
        },
        "place": {
          "type": "text"
        },
        "location": {
          "type": "geo_point"
        },
        "magnitude": {
          "type": "float"
        },
        "depth": {
          "type": "float"
        }
      }
    }
  }
}
I put the script on GitHub. It's actually code I wrote half a year ago; it works, but some parts are a bit odd. I'll fix it when I have time.
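For reference, the indexing part has roughly this shape. This is a sketch, not the actual script: the API URL and the field names of the response are hypothetical placeholders (check the real P2P Earthquake Information JSON format), and it assumes the `requests` package.

```python
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# hypothetical endpoint and field names -- adapt to the real API response
URL = 'https://api.example.com/earthquakes'
for quake in requests.get(URL).json():
    es.index(index='earthquakes', doc_type='earthquake',
             body={'time': quake['time'],  # must match "yyyy/MM/dd HH:mm:ssZ"
                   'place': quake['place'],
                   'location': {'lat': quake['lat'], 'lon': quake['lon']},  # geo_point
                   'magnitude': quake['magnitude'],
                   'depth': quake['depth']})
```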
Check it out in Kibana.
This is the end of the first part. For the second part, on search and scoring, I'm still thinking about what kind of data to analyze.