This article is for Day 24 of the Elastic Stack Advent Calendar 2016.
- I am an engineer at a company called Acroquest Technology.
- As a student, I worked on natural language processing and information retrieval.
- The goal is a system that returns answers to questions written in Japanese.
- There are many sources in the world that could serve as knowledge sources. Ideally, the source should be flexibly selectable, and the information should be continuously updated.
- Elasticsearch, which scales easily, seems like a good fit once the knowledge source grows large. (Of course, I have no intention of building a large cluster by myself.)
- This will be a series of articles. In this one, I write about what I tried as preliminary preparation.
First of all, the definition of "question answering" is fuzzy, and the difficulty varies greatly depending on the type of question. So as a first step, I will focus on the "true/false judgment" task, which seems to be the simplest: given a sentence that states a specific fact, such as "Tokugawa Ieyasu opened the Edo Shogunate," the system judges whether it is true or false.
If this works as intended, such questions should be answerable, at least in theory.
This time, I created some sample data and loaded it into Elasticsearch. For now, each document holds both the raw text and its tokenized form. (The token list makes it easy to inspect keywords visually, and I kept the raw text because I expect to want to parse it again later.)
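As a sketch of what that indexing step might look like: the index name `test` and the `words` field match the search example later in this article, while the `text` field name and the hand-split English tokens are my own assumptions (a real pipeline would use a Japanese tokenizer such as kuromoji or MeCab).

```python
# Sample documents: raw text plus a tokenized word list.
# The tokens are pre-split by hand here, standing in for a real
# Japanese tokenizer. Field names "text"/"words" and index "test"
# are assumptions consistent with the query later in the article.
docs = [
    {"text": "Tokugawa Ieyasu opened the Edo Shogunate",
     "words": ["Tokugawa", "Ieyasu", "opened", "the", "Edo", "Shogunate"]},
    {"text": "Japan is an island country",
     "words": ["Japan", "is", "an", "island", "country"]},
]

def index_docs(es, docs, index="test"):
    """Index each document. `es` is an elasticsearch.Elasticsearch
    client, e.g. Elasticsearch(["http://USER:PASSWORD@localhost:9200"])."""
    for i, doc in enumerate(docs):
        es.index(index=index, id=i, body=doc)
```

With a running cluster you would just call `index_docs(es, docs)` after creating the client.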
The data flow is: data source → Python → Elasticsearch → Python → output.
It is not directly related to what I want to do this time, but if you load the tokens as an array, exploring them with Graph is fun.
I thought Graph would not matter this time, but it turned out to be rather important. Looking at the graph, you can see at a glance that "19" and "century" appear as separate tokens, and that a mysterious token "ka" is extracted. (What is "ka"...?) This is not wrong as tokenization, but I would rather have "nth century" come out as a single unit, so the word segmentation needs improvement. I will review the dictionary separately.
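Until the tokenizer dictionary is fixed, one possible stopgap (purely my own sketch, not part of the original pipeline) is to merge a numeric token with a following "century" token in post-processing:

```python
def merge_century_tokens(tokens):
    """Merge a numeric token with a following 'century' token,
    e.g. ['19', 'century'] -> ['19 century'].
    Illustrative stopgap until the tokenizer dictionary
    handles such compounds properly."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit() and i + 1 < len(tokens) and tokens[i + 1] == "century":
            merged.append(tokens[i] + " century")
            i += 2  # consume both halves of the compound
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

A proper fix belongs in the analyzer's user dictionary, but a post-processing pass like this is easy to experiment with from the Python side.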
For now, let's confirm that we can search with a keyword from the Python side.
```python
from elasticsearch import Elasticsearch
import json

es = Elasticsearch(["http://USER:PASSWORD@localhost:9200"])

# Exact-match query on the keyword sub-field of "words".
request_body = {"size": 10, "query": {"term": {"words.keyword": "Japan"}}}

with open("search_result.json", "w") as output:
    json.dump(es.search(index="test", body=request_body), output,
              ensure_ascii=False, indent=4, sort_keys=True,
              separators=(",", ": "))
```
Written like this, documents containing the word "Japan" are retrieved, and the result is written to search_result.json.
As for the true/false judgment itself, it should then be a matter of analyzing the question sentence against the text of the retrieved documents. For this article, I will stop at this preparation stage. Please look forward to the continuation.
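As a very rough sketch of what that judgment could look like (this scoring scheme and its threshold are my own illustration, not something the article has implemented): tokenize the claim, count how many of its tokens appear in a retrieved document, and accept the claim if the overlap is high enough.

```python
def judge_claim(claim_tokens, doc_tokens, threshold=0.6):
    """Naive true/false judgment: accept the claim if enough of its
    tokens appear in the retrieved document's token list.
    The 0.6 threshold is an arbitrary illustrative value."""
    if not claim_tokens:
        return False
    doc_set = set(doc_tokens)
    overlap = sum(1 for t in claim_tokens if t in doc_set)
    return overlap / len(claim_tokens) >= threshold

# Tokens for the claim "Tokugawa Ieyasu opened the Edo Shogunate"
claim = ["Tokugawa", "Ieyasu", "opened", "Edo", "Shogunate"]
# Tokens from a (hypothetical) retrieved document
doc = ["Tokugawa", "Ieyasu", "opened", "the", "Edo", "Shogunate", "in", "1603"]
```

Simple token overlap will of course fail on negation and paraphrase, which is exactly why keeping the raw text for deeper parsing later seems like a good call.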
So, for now, the preparation for a question answering system is done. (It may be nothing but preparation...)
In the next article, I want to build something that can properly answer questions.