This article is for Day 24 of the Elastic Stack Advent Calendar 2016.
- I am an engineer at a company called Acroquest Technology.
- As a student, I worked on natural language processing and information retrieval.
- The goal is a system that returns answers to questions written in Japanese.
- There are many sources in the world that could serve as knowledge sources. Ideally, the source should be flexibly selectable, and the information should be continuously updated.
- Elasticsearch, which scales easily, seems like a good fit once the knowledge source grows large. (Of course, I have no intention of building a large cluster by myself.)
- This will be a series of articles. In this one, I write about what I tried as preliminary preparation.
First of all, the definition of "question answering" is fuzzy, and the difficulty varies greatly depending on the type of question. So as a first step, I will focus on the "true/false judgment" task, which seems to be the simplest: given a sentence that states a specific fact, such as "Tokugawa Ieyasu opened the Edo Shogunate," the system judges whether it is true or false.
If this works as intended, such questions should be answerable, at least in theory.
This time, I created some sample data and loaded it into Elasticsearch. For now, each document holds both the raw text and its tokenized form. (The token list makes it easy to inspect keywords visually, and I kept the raw text because I expect to want to parse it again later.)
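As a sketch of what that indexing step might look like: the index name `test` and the `words` field match the search example later in this article, while the `text` field name and the hand-split English tokens are my own assumptions (a real pipeline would use a Japanese tokenizer such as kuromoji or MeCab).

```python
# Sample documents: raw text plus a tokenized word list.
# The tokens are pre-split by hand here, standing in for a real
# Japanese tokenizer. Field names "text"/"words" and index "test"
# are assumptions consistent with the query later in the article.
docs = [
    {"text": "Tokugawa Ieyasu opened the Edo Shogunate",
     "words": ["Tokugawa", "Ieyasu", "opened", "the", "Edo", "Shogunate"]},
    {"text": "Japan is an island country",
     "words": ["Japan", "is", "an", "island", "country"]},
]

def index_docs(es, docs, index="test"):
    """Index each document. `es` is an elasticsearch.Elasticsearch
    client, e.g. Elasticsearch(["http://USER:PASSWORD@localhost:9200"])."""
    for i, doc in enumerate(docs):
        es.index(index=index, id=i, body=doc)
```

With a running cluster you would just call `index_docs(es, docs)` after creating the client.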
The data flow is: data source → Python → Elasticsearch → Python → output.
It is not directly related to what I want to do this time, but if you load the tokens as an array, exploring them with Graph is fun.
I thought Graph would not matter this time, but it turned out to be rather important. Looking at the graph, you can see at a glance that "19" and "century" appear as separate tokens, and that a mysterious token "ka" is extracted. (What is "ka"...?) This is not wrong as tokenization, but I would rather have "nth century" come out as a single unit, so the word segmentation needs improvement. I will review the dictionary separately.
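Until the tokenizer dictionary is fixed, one possible stopgap (purely my own sketch, not part of the original pipeline) is to merge a numeric token with a following "century" token in post-processing:

```python
def merge_century_tokens(tokens):
    """Merge a numeric token with a following 'century' token,
    e.g. ['19', 'century'] -> ['19 century'].
    Illustrative stopgap until the tokenizer dictionary
    handles such compounds properly."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit() and i + 1 < len(tokens) and tokens[i + 1] == "century":
            merged.append(tokens[i] + " century")
            i += 2  # consume both halves of the compound
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

A proper fix belongs in the analyzer's user dictionary, but a post-processing pass like this is easy to experiment with from the Python side.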
For now, let's confirm that we can search with a keyword from the Python side.
```python
from elasticsearch import Elasticsearch
import json

es = Elasticsearch(["http://USER:PASSWORD@localhost:9200"])

# Exact-match query on the keyword sub-field of "words".
request_body = {"size": 10, "query": {"term": {"words.keyword": "Japan"}}}

with open("search_result.json", "w") as output:
    json.dump(es.search(index="test", body=request_body), output,
              ensure_ascii=False, indent=4, sort_keys=True,
              separators=(",", ": "))
```
Written like this, documents containing the word "Japan" are retrieved, and the result is written to search_result.json.
As for the true/false judgment itself, it should then be a matter of analyzing the question sentence against the text of the retrieved documents. For this article, I will stop at this preparation stage. Please look forward to the continuation.
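As a very rough sketch of what that judgment could look like (this scoring scheme and its threshold are my own illustration, not something the article has implemented): tokenize the claim, count how many of its tokens appear in a retrieved document, and accept the claim if the overlap is high enough.

```python
def judge_claim(claim_tokens, doc_tokens, threshold=0.6):
    """Naive true/false judgment: accept the claim if enough of its
    tokens appear in the retrieved document's token list.
    The 0.6 threshold is an arbitrary illustrative value."""
    if not claim_tokens:
        return False
    doc_set = set(doc_tokens)
    overlap = sum(1 for t in claim_tokens if t in doc_set)
    return overlap / len(claim_tokens) >= threshold

# Tokens for the claim "Tokugawa Ieyasu opened the Edo Shogunate"
claim = ["Tokugawa", "Ieyasu", "opened", "Edo", "Shogunate"]
# Tokens from a (hypothetical) retrieved document
doc = ["Tokugawa", "Ieyasu", "opened", "the", "Edo", "Shogunate", "in", "1603"]
```

Simple token overlap will of course fail on negation and paraphrase, which is exactly why keeping the raw text for deeper parsing later seems like a good call.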
So, for now, the preparation for a question answering system is done. (It may be nothing but preparation...)
In the next article, I want to build something that can properly answer questions.