[JAVA] Create a restaurant search app with IBM Watson + Gurunavi API (with source)

Introduction

I developed a web application using Watson Developer Cloud as material for an in-house presentation, so this post is a summary and a memorandum.

About the app

Features (not saying they actually work well)

If you type or speak a question, Watson recommends restaurants from among those within a radius of about 1 km of a given point. You can ask things like "I want to eat a karaage set meal" or "A fish izakaya with a budget of up to 4,000 yen".

search.png

The shops are displayed as a list. The more ★, the more strongly recommended the restaurant. You can pin shops you are interested in, so you can decide which one to visit while checking the information on the Gurunavi website and the distance and directions from your current location.

map.png

If you actually visit a shop (it doesn't matter if it was a long time ago), please leave a comment with your impressions. Search results are improved by analyzing and learning from those comments: a shop described as "very delicious" can be recommended to other people, while a shop where "the clerks were having a quarrel" cannot.

comment.png

classify.png

Architecture

At first I thought about building everything on Bluemix, but the free tier of ClearDB is tiny (5 MB), and I also wanted a development environment on my personal PC anyway...

The REST APIs built on Node-RED are called from the client-side JavaScript as needed to display search results and to evaluate user comments.

kousei.png

The intended operation flow is as follows.

  1. Create and load the R&R and NLC data locally
  2. Logs accumulate in ClearDB as users use the system
  3. Once enough logs have accumulated, export them locally and create / load the data again
  4. Search results improve (not saying they actually do)

Impressions

Retrieve and Rank

- I had only just started using it and didn't really know what I was doing, so I accidentally created a high availability cluster (i.e. one created with a size specified at creation time) and developed against it.
- I noticed a few days into the work and hurriedly recreated the cluster, but was still charged about 8,000 yen.
- Create the cluster as follows (with cluster_size left empty):

curl -k -X POST -u "username:password" "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters" -d "{\"cluster_size\":\"\",\"cluster_name\":\"WatsonRestaurantCluster\"}"

- The collection configuration is as follows.
- I was just imitating my predecessors here; when I looked at the tutorial (https://www.ibm.com/watson/developercloud/doc/retrieve-rank/configure.shtml) while writing this article, it describes the configuration differently, so this may be wrong...

schema.xml


   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

   <field name="shop_id" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="vote_id" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="shop_name" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="shop_name_kana" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="menu_name" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="menu_name_kana" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="latitude" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="longitude" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="shop_url" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="image_url" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 
   <field name="pr_text" type="string" indexed="false" stored="true" required="true" multiValued="false" /> 

   <field name="shop_text" type="watson_text_ja" indexed="false" stored="true" required="true" multiValued="false" /> 

   <field name="budget" type="int" indexed="true" stored="true" required="true" multiValued="false" />

schema.xml


  <fieldType name="watson_text_ja" indexed="true" stored="true" class="com.ibm.watson.hector.plugins.fieldtype.WatsonTextField">
      <analyzer type="index">
          <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt" />
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.JapaneseTokenizerFactory" userDictionary="lang/userdict_ja.txt"/>
          <filter class="solr.JapaneseBaseFormFilterFactory"/>
          <filter class="solr.CJKWidthFilterFactory"/>
          <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
          <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
          <tokenizer class="solr.JapaneseTokenizerFactory" mode="search" userDictionary="lang/userdict_ja.txt" />
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.JapaneseTokenizerFactory" userDictionary="lang/userdict_ja.txt"/>
          <filter class="solr.JapaneseBaseFormFilterFactory"/>
          <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
          <filter class="solr.CJKWidthFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
          <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
          <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
  </fieldType>

How the document data was generated (Retrieve)

- About 1,500 documents were generated from the Gurunavi API

  1. Documents generated from the PR text of Gurunavi shops: one document per shop ID obtained from the API
  2. Documents generated from Gurunavi's cheering reviews: one document per shop ID + cheering review ID
  3. Documents generated from comments by users of this system: one document per shop ID + user comment (only comments judged positive by NLC)

rar_documents.json


{
  "id" : "5497472",
  "shop_id" : "5497472",
  "vote_id" : "",
  "shop_name" : "Tenkaippin Gotanda store",
  "shop_name_kana" : "Tenkaippinn Gotandaten",
  "menu_name" : "",
  "menu_name_kana" : "",
  "latitude" : "35.624377",
  "longitude" : "139.723394",
  "shop_url" : "http://r.gnavi.co.jp/b5tzzw2g0000/",
  "image_url" : "",
  "pr_text" : "It is a soup made by boiling chicken and several kinds of ingredients over time. It is full of collagen that is good for beauty and health.",
  "shop_text" : "Tenkaippin Gotanda store. Tenkaippinn Gotandaten. Ramen noodle dishes and others. Rich in collagen, which is good for beauty and health, it is a soup that you can never taste anywhere else. In addition, a set meal that includes half fried rice and Chinese soba. Gyoza set meal that includes gyoza, half rice, and Chinese soba. We also have a wide variety of menus such as half-fried rice, gyoza, and service set meals that include Chinese soba.",
  "budget" : -1
}
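To load the generated documents into the collection, I believe they can simply be posted to the cluster's standard Solr update handler. Below is a minimal sketch under that assumption; the credentials, cluster ID, and collection name (restaurant_collection) are placeholders, and rar_documents.json is assumed to hold a JSON array of documents like the one above.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class DocumentLoader {
    public static void main(String[] args) throws Exception {
        // Placeholders: service credentials, cluster ID, and collection name.
        String credentials = "username:password";
        String clusterId = "YOUR_CLUSTER_ID";
        String endpoint = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"
                + "/solr_clusters/" + clusterId
                + "/solr/restaurant_collection/update?commit=true";

        // Assumed to be a JSON array of documents in the format shown above.
        byte[] body = Files.readAllBytes(Paths.get("rar_documents.json"));

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Authorization", "Basic "
                + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8)));
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}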

How the training data was generated (Rank)

- About 750 rows were created for the initial load (at that point no training data had yet been generated from actual user activity)

  1. Generate a query from a category name and add relevance to matching documents: documents containing "Italian" get points for the query "Italian"
  2. Generate a query from a menu name and add relevance to matching documents: documents containing "karaage set meal" get points for the query "karaage set meal"
  3. Add relevance to documents whose details were viewed: documents whose Google Maps link or Gurunavi page link was clicked in the results of a user's query get points for that query
  4. Add or subtract relevance for commented documents: points are added if NLC judges the comment positive, and subtracted if it judges it negative
  5. Add relevance to documents with high user ratings: among the results of all queries users have searched, shops with many cheering reviews on Gurunavi get points, as do shops rated positively by NLC

rar_training.csv


"%E3%83%AF%E3%83%83%E3%83%91%E3%83%BC","6364602.681550","2"
"%E3%82%BF%E3%82%A4%E3%82%AC%E3%83%91%E3%82%AA","7255599","1","7255599.4618610","3"
"%E5%A1%A9%E3%83%AC%E3%83%A2%E3%83%B3%E3%82%AC%E3%83%91%E3%82%AA","e584801.1192601","2","6408790.4601796","2","geyc200.4614278","4","g044108.4609358","4","6085706.1451291","1"

Dictionary data (Solr function)

- I used Kuromoji with the NEologd dictionary to generate the dictionaries; compared to plain Kuromoji it recognizes far more words, and I was surprised this is available for free.
- Shop names can be obtained from the Gurunavi API together with their readings, so I used those as-is.

  1. Generate a user dictionary from shop names, menu names, and category names: e.g. I want Tenkaippin to appear near the top of the results for the query "rich ramen"
  2. Generate a synonym dictionary so that sub-categories are also matched when a major category is searched: documents containing "pizza" should be found when searching for "Italian"
  3. Edit the stopword dictionary as appropriate: I think it is better to remove words such as "delicious" and "shop" (though I can't say for sure, since I never compared results with and without this...)

userdict_ja.txt


Tenkaippin,Tenkaippin,Tenkaippin,Custom noun
Chicken taste bird,Chicken taste bird,Toridori Midori,Custom noun
Fishmaker Gotanda store,Fishmaker Gotanda store,Wosho Gotandaten,Custom noun
〆 Mackerel,〆 Mackerel,〆 Mackerel,Custom noun
Rafute,Rafute,Rafute,Custom noun

synonyms.txt


organic=>Medicinal food Organic food Vegetable food Organic
Creative creative cuisine=>Creative Japanese cuisine Creative cuisine Fusion cuisine
Italian=> Italian イタリア料理 パスタ ピザ
French=> French フランス料理 ビストロ

Natural Language Classifier

- It was much easier to use than R&R (I'm not saying it always classifies correctly).

How the training data for positive / negative judgment was created

- A total of about 1,600 training examples were generated for the two classes, positive and negative.
- I had the impression that R&R does not handle negation well (e.g. a search for "not delicious" still matches "delicious"), but NLC classifies even fairly convoluted phrasing correctly (as long as it is covered by the training data), which I found interesting.

  1. Positive ("yummy"): cheering reviews from the Gurunavi API with high scores or many likes, split into phrases of a certain length and registered
  2. Negative ("yacky"): I expected to generate these as the opposite of the positives (low-rated cheering reviews), but hardly any negative reviews are actually posted, which is puzzling... censorship, maybe? In the end I registered them by hand, copying from 2channel and restaurant review blogs

nlc_training.csv


"Satisfied with the stomach","yummy"
"My son's favorite food","yummy"
"The dignity of the fresh craftsmen is also pleasant","yummy"
"Big tummy","yummy"
"Nostalgic taste","yummy"
"It was a terrible store","yacky"
"The freshness of the sashimi was bad","yacky"
"The quality is poor. subtle","yacky"
"Halfway and there is no good point","yacky"
"Not tasty. Cheesy","yacky"

Around the client

- Sorting out fonts and icons, and understanding Bootstrap and the Google Maps API that I introduced out of curiosity, took more time than R&R and NLC did.
- The search is narrowed down by budget: the amount is extracted from the search text and applied with Solr's filter query (a small sketch follows the function below).

Budget judgment


function getBudgets(query) {
    var B_MIN = 0;
    var B_MAX = 100000;
    var B_RANGE = 500;
    var budgets = null;
    // Convert full-width digits (０-９) to half-width before matching.
    query = query.replace(/[０-９]/g, function(s) {
        return String.fromCharCode(s.charCodeAt(0) - 0xFEE0);
    });
    // Pick up amounts followed by 円 (yen) and the surrounding qualifiers:
    // から/以上 mark a lower bound, まで/以内/以下/未満 mark an upper bound.
    var matches = query.match(/\d+(?=円)|(から|以上)|(まで|以内|以下|未満)/g);
    if (matches) {
        var yens = matches.filter(function(y) {
            if (isFinite(y)) {
                return y > B_MIN && y < B_MAX
            } else {
                return false;
            }
        });
        if (yens.length == 1) {
            var condition;
            try {
                condition = matches[matches.indexOf(yens[0]) + 1];
            } catch (e) {
                condition = null;
            }
            yens = yens.map(function(y) {
                return Number(y.replace(/^0+/g, ""));
            });

            if (condition) {
                // Lower-bound qualifier (から/以上): the amount is the minimum budget.
                if (/から|以上/g.test(condition)) {
                    budgets = {
                        budget_min: yens[0],
                        budget_max: B_MAX
                    };
                } else {
                    budgets = {
                        budget_min: B_MIN,
                        budget_max: yens[0]
                    };
                }
            } else {
                budgets = {
                    budget_min: yens[0] - B_RANGE,
                    budget_max: yens[0] + B_RANGE
                };
            }
        } else if (yens.length >= 2) {
            budgets = {
                budget_min: Math.min(yens[0], yens[1]),
                budget_max: Math.max(yens[0], yens[1])
            };
        }
    }
    if (budgets) {
        budgets.budget_min = Math.max(B_MIN, budgets.budget_min);
        budgets.budget_max = Math.min(B_MAX, budgets.budget_max);

        return budgets;
    } else {
        return null;
    }
}
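Wherever the range ends up being applied (the budget field is indexed as an int in schema.xml), it maps to a Solr fq parameter such as fq=budget:[0 TO 4000]. A small Java sketch of that string construction, with the method name chosen for illustration only:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class BudgetFilter {
    // Turn the extracted budget range into a Solr filter query on the "budget" field.
    static String buildFilterQuery(int budgetMin, int budgetMax) throws UnsupportedEncodingException {
        String fq = "budget:[" + budgetMin + " TO " + budgetMax + "]";
        return "fq=" + URLEncoder.encode(fq, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // e.g. "budget up to 4000 yen" -> budget_min = 0, budget_max = 4000
        System.out.println(buildFilterQuery(0, 4000));
        // prints: fq=budget%3A%5B0+TO+4000%5D
    }
}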

Around the server and development environment

- The Node-RED flow looks like the following.
- ~~R&R only worked properly through an HTTP request node, while NLC worked with its dedicated node...~~ I had written the credentials directly into the node form; checking now, if the service connection is configured on the Bluemix side, it probably works as-is with the dedicated node.

nodered.png

- I don't get to touch Java much at work, so it was fun to do all sorts of things: REST calls, JSON → DB, DB → JSON or CSV.
- Maven in Eclipse throws an error every time a dependency is added or removed, and just when I start investigating it suddenly fixes itself; I still don't understand why.
- I feel like about 3/5 of the stress in this project came from Maven...

- When generating training data for the Ranker, train.py raises an error if you add or subtract points for a document that is not returned in the search results for that query, so the Java code first queries Solr and selects only documents that are actually returned (as sketched below).
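A minimal sketch of that selection step, assuming placeholder credentials, cluster ID, and collection name: query the collection's select handler for the candidate query text and keep only the document IDs that actually come back.

import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

public class CandidateSelector {
    public static void main(String[] args) throws Exception {
        // Placeholders: service credentials, cluster ID, and collection name.
        String credentials = "username:password";
        String clusterId = "YOUR_CLUSTER_ID";
        String query = "からあげ定食"; // candidate training query

        String endpoint = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"
                + "/solr_clusters/" + clusterId + "/solr/restaurant_collection/select"
                + "?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=id&rows=50&wt=json";

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestProperty("Authorization", "Basic "
                + Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8)));

        // The response lists the document IDs Solr actually returns for this query;
        // only those IDs are written into rar_training.csv with their relevance scores.
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8").useDelimiter("\\A")) {
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}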

Miscellaneous notes

- It took about 4 to 5 person-days over the year-end and New Year holidays to build.
- Ignoring the free tiers, the running cost is roughly 5,000 yen/month (one Ranker ~1,000 yen + one NLC ~2,000 yen + Node-RED ~1,500 yen + the learning-related API calls).
- Within the free tiers, and if this is the only app you are running, the running cost should come down to 0 yen.
- For production operation you are told to use a high availability cluster, [which would cost about this much](https://www.google.co.jp/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=(0.3+*+24+*+30)%E3%83%89%E3%83%AB+%E6%97%A5%E6%9C%AC%E5%86%86).
- I still don't really understand the essential learning part (how the Ranker and NLC work and how to make them better).
- With R&R in particular, the search results often didn't feel convincing; I suspect my Ranker training data is poor...
- I would like to improve it eventually, while keeping an eye on the charges.

Source code

I published it on GitHub.
