Hello, this is Day 8 of the ABEJA Advent Calendar.
An interesting service was announced at re:Invent 2019 the other day: the enterprise search service [Amazon Kendra](https://aws.amazon.com/jp/about-aws/whats-new/2019/12/announcing-amazon-kendra-reinventing-enterprise-search-with-machine-learning/). We use various communication tools in the company, but most of what flows through them is flow information that never gets organized, so knowledge gaps arise easily. This is exactly the kind of task Kendra was built for. When it was announced, I thought, "I have to try this!" So I built a Slack command that can search our documents with natural-language queries. Below, I will walk through how I built it while trying Kendra out.
Kendra is in preview as of December 8, 2019, so the services that can be used as data sources are limited to the following.
So I came up with a plan: collect in-house documents in S3 and enable natural-language search over them with Kendra. There is one catch, though. As I suspected, Kendra does not support Japanese. I work around this by combining it with Amazon Translate, AWS's translation service. By running both the documents going into S3 and the incoming queries through Translate, the interface can be in any language while everything is processed internally in English. (I really hope Kendra supports Japanese soon.) For the interface I chose Slack, since it is our everyday communication tool and slash commands are easy to implement.
In short, the architecture looks like this.
This covers the upper box of the system diagram. Basically, all it does is download the specified document, translate it, and upload it to S3. However, Translate limits the amount of text it accepts per request, so the document is translated line by line. As a result, some lines may come out as slightly odd English, but since this is a proof of concept I will not worry about it. Also, since I wanted the Slack command to return a link to the original document, the original URL is attached as metadata on the S3 object.
```python
import io
import os

import boto3


def translate_and_upload(url, title, content):
    client = boto3.client('translate')
    bucket = boto3.resource('s3').Bucket('ynaka-kendra')
    # Translate content to English line by line to stay within
    # Translate's per-request size limit.
    translated = ''
    for line in content.splitlines():
        if len(line) > 0:
            response = client.translate_text(
                Text=line,
                SourceLanguageCode='ja',
                TargetLanguageCode='en'
            )
            translated += (response['TranslatedText'] + os.linesep)
        else:
            translated += os.linesep
    # Upload to S3 with the original URL attached as object metadata.
    file_obj = io.BytesIO(translated.encode('utf-8'))
    bucket.upload_fileobj(file_obj, title, ExtraArgs={'Metadata': {'url': url}})
```
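Line-by-line translation can still fail when a single line exceeds Translate's per-request size limit (the `TranslateText` API documents a limit of 10,000 bytes per request; check the current quotas). As a sketch, a hypothetical helper (`chunk_text` is my name, not part of any SDK) that splits an overlong line into byte-bounded chunks might look like this:

```python
def chunk_text(text, max_bytes=9000):
    """Split text into pieces whose UTF-8 size stays under max_bytes.

    max_bytes is kept below Translate's per-request limit to leave headroom.
    Splitting is per character, which also works for Japanese text that
    has no spaces to break on.
    """
    chunks = []
    current = ''
    for ch in text:
        if len((current + ch).encode('utf-8')) > max_bytes and current:
            chunks.append(current)
            current = ch
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `translate_text` separately and the results concatenated, at the cost of possibly splitting a sentence across requests.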
This covers the lower box of the system diagram. This is where Kendra, today's protagonist, appears.
First, we will build the Kendra index that this search is based on. You can basically build it by following Getting Started in the console.
First, generate the index. Creating a new IAM role for Kendra takes about 30 seconds, and generating the index itself takes about 30 minutes, so settle in for a long coffee break. Next comes data source selection. This time, choose S3 and specify the path to be indexed. You can also configure how often to sync, but since this is a one-off experiment I chose "Run on demand". After the settings review screen and another slightly longer coffee break, the build is complete. The documents are already in S3, so I start syncing with Kendra. Left as-is, however, the sync fails with an IAM permissions error, so I add permissions for Kendra itself and for S3 to the role. With that, the sync finally succeeds.
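The same setup can also be scripted with boto3 instead of the console. The index name, bucket, and role ARN below are placeholders, but `create_index`, `create_data_source`, and `start_data_source_sync_job` are the actual Kendra SDK operations; a rough sketch:

```python
def s3_source_configuration(bucket_name):
    # Configuration shape for an S3-backed Kendra data source.
    return {'S3Configuration': {'BucketName': bucket_name}}


if __name__ == '__main__':
    import boto3

    kendra = boto3.client('kendra')
    # Placeholder role ARN: the role needs the same Kendra and S3
    # permissions as the one created in the console flow above.
    role_arn = 'arn:aws:iam::123456789012:role/KendraIndexRole'
    index = kendra.create_index(Name='in-house-docs', RoleArn=role_arn)
    # Index creation is asynchronous; in practice, wait until the index
    # status becomes ACTIVE before attaching a data source.
    source = kendra.create_data_source(
        IndexId=index['Id'],
        Name='in-house-docs-s3',
        Type='S3',
        Configuration=s3_source_configuration('ynaka-kendra'),
        RoleArn=role_arn,
    )
    # Equivalent to pressing "Sync now" in the console.
    kendra.start_data_source_sync_job(Id=source['Id'], IndexId=index['Id'])
```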
Slack integration is implemented with Slash Commands, using the common API Gateway + Lambda configuration. Don't forget to add Kendra permissions to the role that runs the Lambda. Also note that Kendra is only available in the latest boto3.
```python
import base64
import json
import logging
import os
import urllib.parse

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': str(err) if err else json.dumps({
            'response_type': 'in_channel',
            'text': res
        }),
        'headers': {
            'Content-Type': 'application/json',
        },
    }


def handler(event, context):
    # Prepare clients.
    translate = boto3.client('translate')
    kendra = boto3.client('kendra')
    s3 = boto3.resource('s3')
    # Parse the Slack slash-command request.
    body = event['body']
    params = urllib.parse.parse_qs(base64.b64decode(body).decode('utf-8'))
    token = params['token'][0]
    if token != os.environ['VERIFY_TOKEN']:
        logger.error(f'Request token ({token}) does not match expected.')
        return respond(Exception('Invalid request token.'))
    if 'text' not in params:
        logger.error('The text should be included in the command.')
        return respond(Exception('Need text.'))
    query = params['text'][0]
    # Translate the query to English.
    response = translate.translate_text(
        Text=query,
        SourceLanguageCode='ja',
        TargetLanguageCode='en'
    )
    query = response['TranslatedText']
    # Find the most relevant documents.
    response = kendra.query(IndexId=os.environ['INDEX_ID'], QueryText=query)
    # Build the response message from the top three hits, resolving each
    # hit's original URL from the S3 object metadata set at upload time.
    message = ''
    for i, item in enumerate(response['ResultItems'][:3]):
        title = item['DocumentTitle']['Text']
        excerpt = item['DocumentExcerpt']['Text']
        s3_uri = item['DocumentId'].replace('s3://', '').split('/')
        url = s3.Object(s3_uri[0], '/'.join(s3_uri[1:])).metadata['url']
        message_i = f'{i + 1}. *{title}*{os.linesep}url: {url}{os.linesep}```{excerpt}```{os.linesep}{os.linesep}'
        message += message_i
    # Return the result.
    return respond(None, message)
```
The Kendra SDK requires an `IndexId`, so set it in an environment variable. Three points are worth noting:

- Kendra only supports English, so the query is translated as well.
- Since there were situations where a document I searched for was one that others also wanted to find, results are posted with `in_channel` so everyone can see them.
- Since we have many English-speaking members, the English excerpt of each translated document hit by the search is also displayed.
You can now search with natural-language queries like this using Kendra + Translate. For the question "What is ABEJA's skill map?", the first hit is the skill map we are organizing, and the second is the recently published Technology Stack article (/tech-stack-201911). Part of each document is blurred in the screenshot, but you can get a feel for how it works.
By the way, you can of course also search in English. Note that Kendra may return hits at several different places within the same document, so depending on what you want the command to do, you may need to decide how to handle such duplicates.
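One simple option is to keep only the highest-ranked hit per `DocumentId` before building the message. A minimal sketch (the `dedupe_results` helper name is mine, not part of the Kendra SDK):

```python
def dedupe_results(result_items, limit=3):
    # Keep only the first (highest-ranked) hit per document, since
    # Kendra result items are already ordered by relevance.
    seen = set()
    unique = []
    for item in result_items:
        doc_id = item['DocumentId']
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(item)
        if len(unique) >= limit:
            break
    return unique
```

In the handler, `response['ResultItems'][:3]` could then be replaced by `dedupe_results(response['ResultItems'])` to avoid showing the same document twice.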
Also, the reaction within the company has been good, which makes me very happy.
I tried Amazon Kendra, announced at re:Invent 2019, right away and built an in-house document search Slack command. Although the waiting times were long and a mysterious error took a while to resolve, Kendra is generally easy to use, and being able to search with natural-language queries makes it a very good service. Due to time and cost constraints I only indexed some of our in-house documents this time, but I can imagine a world where all kinds of information is collected and searchable from Slack, where people learn from seeing each other's results, and where search history visualizes how information flows and gives it structure; the possibilities are exciting. I hope Japanese support comes soon. By the way, note that Kendra may return an Internal Server Error if you feed it an unexpected document, such as one in Japanese. Even so, I got up the courage to release it despite the occasional error.
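One crude guard against feeding Kendra such unexpected documents is to check that a translated document is mostly ASCII before uploading it. `looks_english` is a hypothetical helper and the 0.9 threshold is an arbitrary assumption, not something Kendra or Translate prescribes:

```python
def looks_english(text, threshold=0.9):
    # Heuristic: treat text as English if the share of ASCII characters
    # meets the threshold. Crude, but it catches documents that slipped
    # through translation still mostly in Japanese.
    if not text:
        return False
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count / len(text) >= threshold
```

Calling this on the translated text in `translate_and_upload`, and skipping (or logging) documents that fail the check, would keep the index clean at the cost of occasionally dropping a legitimate document with many non-ASCII symbols.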