Natural Language: ChatBot Part1 - Twitter API Corpus

Target

This series summarizes building a chatbot using the Microsoft Cognitive Toolkit (CNTK).

Part 1 prepares you to train your chatbot using the Microsoft Cognitive Toolkit.

I will introduce the topics in the following order.

  1. Register for Twitter Developer
  2. Collecting conversation datasets using the Twitter API
  3. Text data preprocessing and Sentence Piece model creation
  4. Creating a file to be read by the built-in reader provided by CNTK

Introduction

It is assumed that you have a Twitter account and that you have Linux available.

Twitter Developer Registration

Apply for Twitter Developer

Twitter - Developer

Go to the page above and click Apply in the upper right corner.

Click Apply for a developer account and select the Twitter account you want to use.

Then fill in a few basic items. As you go through the process, you will be asked to write roughly 100 to 200 characters each on questions such as what you intend to use the Twitter API and its data for. Even simple English seems to be no problem, so just describe your purpose honestly.

Finally, confirm the registration details and submit the application to Twitter Developer. An acceptance email is sent to the address linked to the Twitter account you registered with, so wait for the review result. The result arrives by email in about a day at the earliest.

Also, at this stage you may receive another email from Twitter Developer asking you to describe your use of the Twitter API in Japanese; it seems to be enough to simply translate what you already wrote in English into Japanese.

Get API key and Access token

Next, get the API key, API secret key, Access token, and Access token secret to get the tweets and replies.

Click Apps -> Create an app, and you will be asked for the name of the app you want to create, its website, and how you intend to use it.

Once you've created your app, you'll see your API key and API secret key in Keys and tokens on the details page for your app.

Make a note of the Access token and Access token secret, as they are displayed only once when you press the Generate button.

Collecting conversation datasets using the Twitter API

Once you have registered with Twitter Developer and obtained the API key, API secret key, Access token, and Access token secret, you will be able to collect tweets and replies.

Install tweepy to work with the Twitter API from Python. The Twitter API offers many functions; this time I collect a posted tweet together with one reply to it as a single conversation, and save the conversations to text files, up to 100 conversations per file, as sketched below.
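As a rough illustration only, here is a minimal sketch of one way to do this with tweepy 3.8.0: listen to the public sample stream, and whenever a status is a reply, fetch the tweet it replies to and write the pair out, starting a new file every 100 conversations. The credential values, class name, and file naming are placeholders; the published stsa_twitter.py below is the actual implementation.

import datetime
import tweepy

# Placeholder credentials from the Twitter Developer dashboard.
API_KEY, API_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_SECRET = "...", "..."

auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)


class TweetReplyListener(tweepy.StreamListener):
    """Collects (tweet, reply) pairs and writes 100 conversations per file."""

    def __init__(self, api):
        super().__init__(api)
        self.pairs = []

    def on_status(self, status):
        if status.in_reply_to_status_id is None:
            return  # not a reply, skip
        try:
            source = self.api.get_status(status.in_reply_to_status_id)
        except tweepy.TweepError:
            return  # original tweet deleted or protected
        self.pairs.append((source.text, status.text))
        if len(self.pairs) == 100:  # flush every 100 conversations
            filename = datetime.datetime.now().strftime("./twitter/%Y%m%d_%H%M%S.txt")
            with open(filename, "w", encoding="utf-8") as f:
                for tweet, reply in self.pairs:
                    f.write(tweet.replace("\n", " ") + "\n")
                    f.write(reply.replace("\n", " ") + "\n")
            self.pairs = []


stream = tweepy.Stream(auth=api.auth, listener=TweetReplyListener(api))
stream.sample(languages=["ja"])  # blocks and processes incoming statuses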

The directory structure this time is as follows.

Doc2Vec
STSA
  |―twitter
  |   |―20200401_000000.txt
  |   |―...
  stsa_corpus.py
  stsa_twitter.py
Word2Vec

Text data preprocessing and Sentence Piece model creation

After collecting the tweet and reply conversation dataset, preprocess the text data.

This time I did text cleaning using the Python standard module re and the third-party emoji package.

In Word2Vec and Doc2Vec, I split the text into words with MeCab and the NEologd dictionary to build a word dictionary, but this time I will create a subword model using sentencepiece.

Creating a file to be read by the built-in reader provided by CNTK

After converting the text into word IDs with the Sentence Piece model trained on the training data, we are ready to create the text file that the CTFDeserializer will read during chatbot training.

CTFDeserializer is introduced in Computer Vision: Image Caption Part1 - STAIR Captions.
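For reference, once this text file exists it can be handed to CNTK roughly as follows. This is a minimal sketch: the stream names tweet and reply match the labels written by stsa_sentencepiece, while the file path and the vocabulary size of 32,000 (the Sentence Piece setting used later) are assumptions here.

from cntk.io import CTFDeserializer, MinibatchSource, StreamDef, StreamDefs, INFINITELY_REPEAT

num_word = 32000  # vocabulary size of the Sentence Piece model

# Map the |tweet and |reply fields of the CTF file to sparse one-hot streams.
deserializer = CTFDeserializer(
    "./tweet_reply_corpus.txt",
    StreamDefs(
        tweet=StreamDef(field="tweet", shape=num_word, is_sparse=True),
        reply=StreamDef(field="reply", shape=num_word, is_sparse=True),
    ))

minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=INFINITELY_REPEAT)

# One minibatch of sequences, keyed by the stream information defined above.
data = minibatch_source.next_minibatch(128)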

Implementation

Execution environment

Hardware

・CPU Intel(R) Core(TM) i7-9750H 2.60GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・emoji 0.5.4
・sentencepiece 0.1.86
・tweepy 3.8.0

Program to run

The implemented program is published on GitHub.

stsa_twitter.py


stsa_corpus.py


Commentary

I will pick out and explain some parts of the program.

Text cleaning

Text cleaning is important because the tweets and replies we collect contain a lot of noise.

Removal of Twitter-specific http links, @user IDs, and #tags

text_cleaning


s = re.sub(r"http\S+", "", s)  # remove https
s = re.sub(r"\@[a-z0-9-_][a-z0-9-_]*", "", s)  # remove @tweet
s = re.sub(r"\#.+", "", s)  # remove #tag

Normalization of various symbols

text_cleaning


s = re.sub("[˗֊‐‑‒–⁃⁻₋−]+", "-", s)  # normalize hyphens
s = re.sub("[﹣ ----─━ -]+", "-", s)  # normalize choonpus
s = re.sub("[~∼∾〜〰~]", "", s)  # remove tildes

s = s.lower()  # normalize alphabet to lowercase
s = s.translate({ord(x): ord(y) for x, y in zip(  # normalize half-width symbols to full-width symbols
    "!\"#$%&'()*+,-./:;<=>?@[¥]^_`{|}~¡, ・ """,
    "!! "# $% &'() * +,-./ :; <=>? @ [\] ^ _` {|} ~., ・ """)})

Redundancy reduction

text_cleaning


s = re.sub(r"!+", "!", s)
s = re.sub(r"??+", "?", s)
s = re.sub(r"…+", "…", s)
s = re.sub(r"w+w", "。", s)

Remove emojis and emoticons

text_cleaning


s = "".join(["" if c in emoji.UNICODE_EMOJI else c for c in s])  # remove emoji
s = re.sub(r"(.*).*", "", s)  # remove kaomoji

This isn't perfect, but the results are fair.

Sentence Piece model training

First, clone SentencePiece from GitHub and build it. The following is running on a Linux distribution.

$ git clone https://github.com/google/sentencepiece.git

$ sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

$ cd sentencepiece
$ mkdir build
$ cd build
$ cmake ..
$ make -j $(nproc)
$ sudo make install
$ sudo ldconfig -v

After training, twitter.model and twitter.vocab are created. By default, the Sentence Piece model assigns the IDs 0, 1, and 2 to the unknown token, the beginning-of-sentence token, and the end-of-sentence token, respectively.
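Once twitter.model has been created, you can load it in Python to confirm these IDs and to convert sentences to and from subword IDs. A small sketch (the example sentence is arbitrary):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("twitter.model")

# Default special-token IDs: <unk>=0, <s>=1, </s>=2
print(sp.unk_id(), sp.bos_id(), sp.eos_id())

ids = sp.EncodeAsIds("おはようございます")  # sentence -> subword IDs
print(ids)
print(sp.DecodeIds(ids))  # subword IDs -> sentence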

Tweet and reply conversation dataset

In the saved text files, each conversation occupies two lines: the tweet followed by the reply.

Text cleaning can leave a tweet or reply completely empty, so we keep only pairs in which both sides still contain at least one character.

stsa_preprocessing


for idx in range(len(corpus) // 2):
    tweet = text_cleaning(corpus[2 * idx])
    reply = text_cleaning(corpus[2 * idx + 1])

    if len(tweet) == 0 or len(reply) == 0:
        continue
    else:
        f.write("%s\n" % tweet)
        f.write("%s\n" % reply)

Also, since a tweet and its reply can have different sequence lengths, zip_longest from the Python standard module itertools is used to pad the shorter of the two when writing the text file for the built-in reader.

stsa_sentencepiece


with open("./tweet_reply_corpus.txt", "w") as ctf_file:
    for i, (tweet, reply, target) in enumerate(tweet_reply):
        for (twt, rep) in zip_longest(tweet, reply, fillvalue=""):
            if twt == "":
                ctf_file.write("{} |reply {}:1\n".format(i, rep))
            elif rep == "":
                ctf_file.write("{} |tweet {}:1\n".format(i, twt))
            else:
                ctf_file.write("{} |tweet {}:1\t|reply {}:1\n".format(i, twt, rep))

Result

First, run stsa_twitter.py to collect tweets and replies.

  1 : @user_id ...
  2 : @user_id ...
...

The code published on GitHub does not handle every exception, so if you run it as is it will stop roughly every 24 hours; please check on it about every half day.

After collecting the required amount of data, the function stsa_preprocessing generates the Twitter conversation corpus twitter.txt, which consists of text-cleaned tweet-reply pairs.

Number of tweet and reply: 1068294

Then train the Sentence Piece model. Training starts with the arguments set as shown below; I set the vocabulary size to 32,000.

$ spm_train --input=/mnt/c/.../twitter.txt --model_prefix=twitter --vocab_size=32000

At the end of the training, twitter.model and twitter.vocab will be created.
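As an aside, the same training can presumably also be run from the sentencepiece Python package listed in the software requirements, without the command-line build. A minimal sketch with a placeholder input path:

import sentencepiece as spm

# Equivalent of the spm_train command above, run from Python.
spm.SentencePieceTrainer.Train(
    "--input=./twitter.txt --model_prefix=twitter --vocab_size=32000")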

Finally, execute the function stsa_sentencepiece to create a text file to be read by CTFDeserializer.

Now 10000 samples...
Now 20000 samples...
...
Now 970000 samples...

Number of samples 973124

We were able to collect about 1 million conversations in about a month, but the actual data available was 973,124 conversations.

Now that you're ready to train, Part 2 will use CNTK to train your chatbot.

Reference

Twitter - Developer
sentencepiece

Computer Vision: Image Caption Part1 - STAIR Captions
Natural Language: Word2Vec Part1 - Japanese Corpus
