This series covers building chatbots with the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare the data needed to train a chatbot with the Microsoft Cognitive Toolkit.
I will introduce the steps in the following order.
It is assumed that you have a Twitter account and a Linux environment available.
Go to the page above and click Apply in the upper right corner.
Click Apply for a developer account and select the Twitter account you want to use.
Then fill in a few basic items. Along the way you will be asked to write roughly 100 to 200 characters each on questions such as what you will use the API for and how you will use the Twitter API and its data. Simple English appears to be perfectly acceptable, so just describe your intended use honestly.
Finally, confirm the registration details and submit the application to Twitter Developer. An acceptance email is sent to the address linked to the Twitter account you used for registration, so wait for the screening result; it arrives in about a day at the earliest.
At this stage you may also receive a follow-up email from Twitter Developer asking you to describe your Twitter API usage in Japanese; simply translating what you wrote in English into Japanese seems to be sufficient.
Next, obtain the API key, API secret key, Access token, and Access token secret needed to fetch tweets and replies.
Click Apps -> Create an app; you will be asked for the name of the app you want to create, a website, and how you intend to use it.
Once you've created your app, you'll see your API key and API secret key in Keys and tokens on the details page for your app.
Make a note of the Access token and Access token secret: they are displayed only once, after you press the Generate button.
Once you have registered with Twitter Developer and obtained the API key, API secret key, Access token, and Access token secret, you will be able to collect tweets and replies.
Install tweepy to work with the Twitter API from Python. The Twitter API offers many functions; this time I collect a posted tweet and one reply to it as a single conversation, and save up to 100 conversations at a time to a text file.
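The actual collection logic lives in stsa_twitter.py, published below; as a rough sketch of the idea, a tweepy stream listener can catch replies and then fetch the tweet each reply points to. The credential names and the track keyword here are placeholders, not values from the published script.

import tweepy

# Placeholder credentials -- fill in the keys and tokens obtained above.
auth = tweepy.OAuthHandler(API_KEY, API_SECRET_KEY)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

pairs = []  # collected (tweet, reply) conversations

class ReplyListener(tweepy.StreamListener):
    def on_status(self, status):
        if status.in_reply_to_status_id is not None:  # this status is a reply
            try:
                source = api.get_status(status.in_reply_to_status_id)
            except tweepy.TweepError:
                return  # source tweet deleted, protected, and so on
            pairs.append((source.text, status.text))

    def on_error(self, status_code):
        return False  # disconnect on errors such as 420 (rate limiting)

stream = tweepy.Stream(auth=api.auth, listener=ReplyListener())
stream.filter(languages=["ja"], track=["。"])  # rough keyword filter for Japanese tweets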
The directory structure this time is as follows.
Doc2Vec
STSA
 |―twitter
 |   |―20200401_000000.txt
 |   |―...
 |―stsa_corpus.py
 |―stsa_twitter.py
Word2Vec
After collecting the tweet and reply conversation dataset, preprocess the text data.
This time I cleaned the text using the Python standard module re together with the third-party emoji package.
For Word2Vec and Doc2Vec, I split the text into words with MeCab using the NEologd dictionary and built a word dictionary; this time, however, I will create a subword model using sentencepiece.
After converting the text to word IDs with a SentencePiece model trained on the training data, we are ready to create the text file that CTFDeserializer will read when training the chatbot.
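For reference, once the SentencePiece model trained later in this article (twitter.model) exists, the conversion between text and word IDs looks roughly like this; the sample sentence is arbitrary.

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("./twitter.model")  # the model trained later in this article

ids = sp.EncodeAsIds("おはようございます")  # text -> list of subword IDs
text = sp.DecodeIds(ids)  # subword IDs -> text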
CTFDeserializer is introduced in Computer Vision : Image Caption Part1 - STAIR Captions.
・CPU Intel(R) Core(TM) i7-9750H 2.60GHz
・Windows 10 Pro 1909
・Python 3.6.6
・emoji 0.5.4
・sentencepiece 0.1.86
・tweepy 3.8.0
The implemented program is published on GitHub.
stsa_twitter.py
stsa_corpus.py
I will pick out and explain the key parts of the programs here.
Text cleaning is important because the tweets and replies we collect contain a lot of noise.
text_cleaning
s = re.sub(r"http\S+", "", s) # remove https
s = re.sub(r"\@[a-z0-9-_][a-z0-9-_]*", "", s) # remove @tweet
s = re.sub(r"\#.+", "", s) # remove #tag
text_cleaning
s = re.sub("[˗֊‐‑‒–⁃⁻₋−]+", "-", s) # normalize hyphens
s = re.sub("[﹣ ----─━ -]+", "-", s) # normalize choonpus
s = re.sub("[~∼∾〜〰~]", "", s) # remove tildes
s = s.lower() # normalize alphabet to lowercase
s = s.translate({ord(x): ord(y) for x, y in zip(  # normalize half-width symbols to full-width
    "!\"#$%&'()*+,-./:;<=>?@[¥]^_`{|}~｡､･｢｣",
    "！”＃＄％＆’（）＊＋，－．／：；＜＝＞？＠［￥］＾＿｀｛｜｝〜。、・「」")})
text_cleaning
s = re.sub(r"!+", "!", s)
s = re.sub(r"??+", "?", s)
s = re.sub(r"…+", "…", s)
s = re.sub(r"w+w", "。", s)
text_cleaning
s = "".join(["" if c in emoji.UNICODE_EMOJI else c for c in s]) # remove emoji
s = re.sub(r"(.*).*", "", s) # remove kaomoji
This isn't perfect, but the results are fair.
First, clone SentencePiece from GitHub and build it. The following commands were run on a Linux distribution.
$ git clone https://github.com/google/sentencepiece.git
$ sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
$ cd sentencepiece
$ mkdir build
$ cd build
$ cmake ..
$ make -j $(nproc)
$ sudo make install
$ sudo ldconfig -v
By default, a SentencePiece model assigns the IDs 0, 1, and 2 to the unknown token, beginning of sentence, and end of sentence, respectively.
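Once training later in this article has produced twitter.model, you can confirm these defaults from Python:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("./twitter.model")
print(sp.IdToPiece(0), sp.IdToPiece(1), sp.IdToPiece(2))  # <unk> <s> </s>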
In the saved text files, each conversation occupies two lines: a tweet followed by its reply.
Text cleaning can leave a line with zero words, so we need to ensure that at least one word remains in both the tweet and the reply.
stsa_preprocessing
for idx in range(len(corpus) // 2):
    tweet = text_cleaning(corpus[2 * idx])
    reply = text_cleaning(corpus[2 * idx + 1])

    if len(tweet) == 0 or len(reply) == 0:
        continue
    else:
        f.write("%s\n" % tweet)
        f.write("%s\n" % reply)
Also, since a tweet and its reply can differ in sequence length, zip_longest from the Python standard module itertools is used to align their lengths when writing the text file to be read by the built-in reader.
stsa_sentencepiece
with open("./tweet_reply_corpus.txt", "w") as ctf_file:
for i, (tweet, reply, target) in enumerate(tweet_reply):
for (twt, rep) in zip_longest(tweet, reply, fillvalue=""):
if twt == "":
ctf_file.write("{} |reply {}:1\n".format(i, rep))
elif rep == "":
ctf_file.write("{} |tweet {}:1\n".format(i, twt))
else:
ctf_file.write("{} |tweet {}:1\t|reply {}:1\n".format(i, twt, rep))
First, run stsa_twitter.py to collect tweets and replies.
1 : @user_id ...
2 : @user_id ...
...
The code published on GitHub does not handle every exception, so if you run it as it is, it stops roughly every 24 hours; check on it about every half day.
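Until proper exception handling is added, one stopgap is to wrap the stream in a retry loop; this is a sketch that assumes a stream object like the one in the tweepy example above.

import time

while True:
    try:
        stream.filter(languages=["ja"], track=["。"])  # the stream set up for collection
    except Exception:
        time.sleep(60)  # back off for a minute before reconnecting
        continue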
After collecting the required amount of data, run the function stsa_preprocessing to generate the Twitter conversation corpus twitter.txt, made up of text-cleaned tweet-reply pairs.
Number of tweet and reply: 1068294
Next, train the SentencePiece model. Training starts by setting the arguments as shown below; I set the vocabulary size to 32,000.
$ spm_train --input=/mnt/c/.../twitter.txt --model_prefix=twitter --vocab_size=32000
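The same training can also be started from Python instead of the command line (the input path is shortened here):

import sentencepiece as spm

# Equivalent to the spm_train command above.
spm.SentencePieceTrainer.Train(
    "--input=./twitter.txt --model_prefix=twitter --vocab_size=32000")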
At the end of the training, twitter.model and twitter.vocab will be created.
Finally, execute the function stsa_sentencepiece to create a text file to be read by CTFDeserializer.
Now 10000 samples...
Now 20000 samples...
...
Now 970000 samples...
Number of samples 973124
It took about a month to collect roughly one million conversations, of which 973,124 conversations were actually usable.
Now that you're ready to train, Part 2 will use CNTK to train your chatbot.
Twitter Developer
sentencepiece
Computer Vision : Image Caption Part1 - STAIR Captions
Natural Language : Word2Vec Part1 - Japanese Corpus