Hello, this is Day 8 of the ABEJA Advent Calendar.
An interesting service was announced at re:Invent 2019 the other day: the enterprise search service [Amazon Kendra](https://aws.amazon.com/jp/about-aws/whats-new/2019/12/announcing-amazon-kendra-reinventing-enterprise-search-with-machine-learning/). We use various communication tools in the company, but most of what flows through them is flow information that never gets organized, so knowledge gaps arise easily. This is exactly the kind of task Kendra was built for. When it was announced, I thought, "I have to try this!" So I built a Slack command that can search our documents with natural-language queries. Below, I will walk through how I built it while trying Kendra out.
Kendra is in preview as of December 8, 2019, so the services that can be used as data sources are limited to the following.
So I came up with a plan: collect in-house documents in S3 and enable natural-language search over them with Kendra. There is one catch, though. As I suspected, Kendra does not support Japanese. I work around this by combining it with Amazon Translate, AWS's translation service. By running both the documents going into S3 and the incoming queries through Translate, the interface can be in any language while everything is processed internally in English. (I really hope Kendra supports Japanese soon.) For the interface I chose Slack, since it is our everyday communication tool and slash commands are easy to implement.
In short, the architecture looks like this.
This covers the upper box of the system diagram. Basically, all it does is download the specified document, translate it, and upload it to S3. However, Translate limits the amount of text it accepts per request, so the document is translated line by line. As a result, some lines may come out as slightly odd English, but since this is a proof of concept I will not worry about it. Also, since I wanted the Slack command to return a link to the original document, the original URL is attached as metadata on the S3 object.
```python
import io
import os

import boto3


def translate_and_upload(url, title, content):
    client = boto3.client('translate')
    bucket = boto3.resource('s3').Bucket('ynaka-kendra')
    # Translate content to English line by line to stay within
    # Translate's per-request size limit.
    translated = ''
    for line in content.splitlines():
        if len(line) > 0:
            response = client.translate_text(
                Text=line,
                SourceLanguageCode='ja',
                TargetLanguageCode='en'
            )
            translated += (response['TranslatedText'] + os.linesep)
        else:
            translated += os.linesep
    # Upload to S3 with the original URL attached as object metadata.
    file_obj = io.BytesIO(translated.encode('utf-8'))
    bucket.upload_fileobj(file_obj, title, ExtraArgs={'Metadata': {'url': url}})
```
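Line-by-line translation can still fail when a single line exceeds Translate's per-request size limit (the `TranslateText` API documents a limit of 10,000 bytes per request; check the current quotas). As a sketch, a hypothetical helper (`chunk_text` is my name, not part of any SDK) that splits an overlong line into byte-bounded chunks might look like this:

```python
def chunk_text(text, max_bytes=9000):
    """Split text into pieces whose UTF-8 size stays under max_bytes.

    max_bytes is kept below Translate's per-request limit to leave headroom.
    Splitting is per character, which also works for Japanese text that
    has no spaces to break on.
    """
    chunks = []
    current = ''
    for ch in text:
        if len((current + ch).encode('utf-8')) > max_bytes and current:
            chunks.append(current)
            current = ch
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `translate_text` separately and the results concatenated, at the cost of possibly splitting a sentence across requests.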
This covers the lower box of the system diagram. This is where Kendra, today's protagonist, appears.
First, we will build the Kendra index that this search is based on. You can basically build it by following Getting Started in the console.
First, generate the index. Creating a new IAM role for Kendra takes about 30 seconds, and generating the index itself takes about 30 minutes, so settle in for a long coffee break. Next comes data source selection. This time, choose S3 and specify the path to be indexed. You can also configure how often to sync, but since this is a one-off experiment I chose "Run on demand". After the settings review screen and another slightly longer coffee break, the build is complete. The documents are already in S3, so I start syncing with Kendra. Left as-is, however, the sync fails with an IAM permissions error, so I add permissions for Kendra itself and for S3 to the role. With that, the sync finally succeeds.
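The same setup can also be scripted with boto3 instead of the console. The index name, bucket, and role ARN below are placeholders, but `create_index`, `create_data_source`, and `start_data_source_sync_job` are the actual Kendra SDK operations; a rough sketch:

```python
def s3_source_configuration(bucket_name):
    # Configuration shape for an S3-backed Kendra data source.
    return {'S3Configuration': {'BucketName': bucket_name}}


if __name__ == '__main__':
    import boto3

    kendra = boto3.client('kendra')
    # Placeholder role ARN: the role needs the same Kendra and S3
    # permissions as the one created in the console flow above.
    role_arn = 'arn:aws:iam::123456789012:role/KendraIndexRole'
    index = kendra.create_index(Name='in-house-docs', RoleArn=role_arn)
    # Index creation is asynchronous; in practice, wait until the index
    # status becomes ACTIVE before attaching a data source.
    source = kendra.create_data_source(
        IndexId=index['Id'],
        Name='in-house-docs-s3',
        Type='S3',
        Configuration=s3_source_configuration('ynaka-kendra'),
        RoleArn=role_arn,
    )
    # Equivalent to pressing "Sync now" in the console.
    kendra.start_data_source_sync_job(Id=source['Id'], IndexId=index['Id'])
```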
Slack integration is implemented with Slash Commands, using the common API Gateway + Lambda configuration. Don't forget to add Kendra permissions to the role that runs the Lambda. Also note that Kendra is only available in the latest boto3.
```python
import base64
import json
import logging
import os
import urllib.parse

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': str(err) if err else json.dumps({
            'response_type': 'in_channel',
            'text': res
        }),
        'headers': {
            'Content-Type': 'application/json',
        },
    }


def handler(event, context):
    # Prepare clients.
    translate = boto3.client('translate')
    kendra = boto3.client('kendra')
    s3 = boto3.resource('s3')
    # Parse the Slack slash-command request.
    body = event['body']
    params = urllib.parse.parse_qs(base64.b64decode(body).decode('utf-8'))
    token = params['token'][0]
    if token != os.environ['VERIFY_TOKEN']:
        logger.error(f'Request token ({token}) does not match expected.')
        return respond(Exception('Invalid request token.'))
    if 'text' not in params:
        logger.error('The text should be included in the command.')
        return respond(Exception('Need text.'))
    query = params['text'][0]
    # Translate the query to English.
    response = translate.translate_text(
        Text=query,
        SourceLanguageCode='ja',
        TargetLanguageCode='en'
    )
    query = response['TranslatedText']
    # Find the most relevant documents.
    response = kendra.query(IndexId=os.environ['INDEX_ID'], QueryText=query)
    # Build the response message from the top three hits, resolving each
    # hit's original URL from the S3 object metadata set at upload time.
    message = ''
    for i, item in enumerate(response['ResultItems'][:3]):
        title = item['DocumentTitle']['Text']
        excerpt = item['DocumentExcerpt']['Text']
        s3_uri = item['DocumentId'].replace('s3://', '').split('/')
        url = s3.Object(s3_uri[0], '/'.join(s3_uri[1:])).metadata['url']
        message_i = f'{i + 1}. *{title}*{os.linesep}url: {url}{os.linesep}```{excerpt}```{os.linesep}{os.linesep}'
        message += message_i
    # Return the result.
    return respond(None, message)
```
The Kendra SDK requires an `IndexId`, so set it in an environment variable. Three points are worth noting:

- Kendra only supports English, so the query is translated as well.
- Since there were situations where a document I searched for was one that others also wanted to find, results are posted with `in_channel` so everyone can see them.
- Since we have many English-speaking members, the English excerpt of each translated document hit by the search is also displayed.
You can now search with natural-language queries like this using Kendra + Translate. For the question "What is ABEJA's skill map?", the first hit is the skill map we are organizing, and the second is the recently published Technology Stack article (/tech-stack-201911). Part of each document is blurred in the screenshot, but you can get a feel for how it works.
By the way, you can of course also search in English. Note that Kendra may return hits at several different places within the same document, so depending on what you want the command to do, you may need to decide how to handle such duplicates.
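One simple option is to keep only the highest-ranked hit per `DocumentId` before building the message. A minimal sketch (the `dedupe_results` helper name is mine, not part of the Kendra SDK):

```python
def dedupe_results(result_items, limit=3):
    # Keep only the first (highest-ranked) hit per document, since
    # Kendra result items are already ordered by relevance.
    seen = set()
    unique = []
    for item in result_items:
        doc_id = item['DocumentId']
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(item)
        if len(unique) >= limit:
            break
    return unique
```

In the handler, `response['ResultItems'][:3]` could then be replaced by `dedupe_results(response['ResultItems'])` to avoid showing the same document twice.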
Also, the reaction within the company has been good, which makes me very happy.
I tried Amazon Kendra, announced at re:Invent 2019, right away and built an in-house document search Slack command. Although the waiting times were long and a mysterious error took a while to resolve, Kendra is generally easy to use, and being able to search with natural-language queries makes it a very good service. Due to time and cost constraints I only indexed some of our in-house documents this time, but I can imagine a world where all kinds of information is collected and searchable from Slack, where people learn from seeing each other's results, and where search history visualizes how information flows and gives it structure; the possibilities are exciting. I hope Japanese support comes soon. By the way, note that Kendra may return an Internal Server Error if you feed it an unexpected document, such as one in Japanese. Even so, I got up the courage to release it despite the occasional error.
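One crude guard against feeding Kendra such unexpected documents is to check that a translated document is mostly ASCII before uploading it. `looks_english` is a hypothetical helper and the 0.9 threshold is an arbitrary assumption, not something Kendra or Translate prescribes:

```python
def looks_english(text, threshold=0.9):
    # Heuristic: treat text as English if the share of ASCII characters
    # meets the threshold. Crude, but it catches documents that slipped
    # through translation still mostly in Japanese.
    if not text:
        return False
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    return ascii_count / len(text) >= threshold
```

Calling this on the translated text in `translate_and_upload`, and skipping (or logging) documents that fail the check, would keep the index clean at the cost of occasionally dropping a legitimate document with many non-ASCII symbols.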