This is a quick write-up of what I did, so the explanations are kept brief.
First, download your tweet history. On Twitter, go to "Settings & Privacy" → "Account" → "Twitter data" → "Download Twitter data" and request the archive. After a while a download link is sent to your email address, so download it from there.
Since the summer of 2019 the format of the downloaded data has changed, and you now get tweets.js instead of tweets.csv. Dealing with that directly is a hassle, so I use a tool someone else wrote to convert it back into tweets.csv (https://17number.github.io/tweet-js-loader/).
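If you would rather not rely on the web tool, a rough sketch of doing the conversion yourself in Python is shown below. It assumes tweets.js is a JavaScript assignment wrapping a JSON array and that each entry exposes the tweet text as full_text; the exact structure and column order are assumptions on my part, so check them against your own archive (the main script later expects the tweet text in the third column).

import csv
import json

# Hypothetical converter: tweets.js -> text/tweets.csv.
# Assumes tweets.js looks like "window.YTD.tweet.part0 = [ {...}, ... ]"
# and that each entry holds the text under "full_text"; verify against your archive.
with open("tweets.js", encoding="utf-8") as f:
    raw = f.read()

# Drop the JavaScript prefix (and any trailing semicolon), keep the JSON array.
entries = json.loads(raw[raw.index('['):].strip().rstrip(';'))

with open("text/tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for e in entries:
        tweet = e.get("tweet", e)   # newer archives nest each record under "tweet"
        writer.writerow([tweet.get("id_str", ""),
                         tweet.get("created_at", ""),
                         tweet.get("full_text", "")])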
Create a text folder in your working directory, put tweets.csv into it, and you're ready to go.
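As a small sanity check of the setup (a hypothetical sketch, not part of the original script), you can create the folder and confirm the file is in place from Python:

from pathlib import Path

# Make sure the working directory has a "text" folder containing tweets.csv.
Path("text").mkdir(exist_ok=True)
assert Path("text/tweets.csv").exists(), "put tweets.csv into the text folder first"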
Next, the contents of tweet.py.
The first part below creates tweets.txt: the tweet body is extracted from tweets.csv and written out as a plain text file.
tweet.py
import csv
import re

rawfile = "text/tweets.csv"          # CSV produced by tweet-js-loader
infile = "text/tweets.txt"           # plain-text tweet bodies
outfile = "text/tweets_wakati.txt"   # space-separated (wakati) tokens

# Extract the tweet body (third column) from the CSV and write one tweet per line.
with open(rawfile, 'r') as fin, open(infile, 'w') as fout:
    reader = csv.reader(fin)
    for d in reader:
        if len(d) > 2:
            fout.write(d[2])
            fout.write('\n')
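Note that d[2] assumes the tweet text sits in the third column of tweets.csv; the column layout depends on the converter you used, so it may be worth printing the first row to confirm (a quick sketch, not part of the original script):

import csv

# Quick check: print the first row of tweets.csv to see which column holds the tweet text.
with open("text/tweets.csv", 'r') as f:
    first_row = next(csv.reader(f))
print(list(enumerate(first_row)))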
Next, use janome to tokenize the text (wakati-gaki) and train the model. If you don't have janome, run pip install janome first.
Incidentally, to keep the generated Japanese as natural as possible, I drop lines and tokens containing alphabetic characters, certain symbols, garbled "?" boxes, and the like.
tweet.py
from janome.tokenizer import Tokenizer

t = Tokenizer()

with open(infile, 'r') as f:
    data = f.readlines()

# Skip tokens containing alphabetic characters, and skip whole lines containing
# certain symbols (":", "/", ".", "@", "#", full-width "？", "●").
p = re.compile('[a-z]+')
p2 = re.compile('[:/.@#？●]+')

with open(outfile, 'w') as f:
    for line in data:
        if p2.search(line):
            continue
        for token in t.tokenize(line):
            surface = str(token.surface)
            if p.search(surface):
                continue
            f.write(surface)
            f.write(' ')
        f.write('\n')

# Read the wakati file back in and wrap each sentence with start/end markers.
words = []
for l in open(outfile, 'r', encoding='utf-8').readlines():
    if len(l) > 1:
        words.append(('<BOP> <BOP> ' + l + ' <EOP>').split())

from nltk.lm import Vocabulary
from nltk.lm.models import MLE
from nltk.util import ngrams

# Build the vocabulary from all tokens, then fit a trigram MLE model.
vocab = Vocabulary([item for sublist in words for item in sublist])
print('Vocabulary size: ' + str(len(vocab)))

text_trigrams = [ngrams(sent, 3) for sent in words]

n = 3
lm = MLE(order=n, vocabulary=vocab)
lm.fit(text_trigrams)
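Before generating full sentences, it can help to sanity-check the trained model, for example by counting how many words were seen after the start-of-sentence context and sampling a few of them (a minimal check using the nltk.lm API; the seed value is arbitrary):

# How many distinct words follow the two <BOP> markers?
print(len(lm.counts[['<BOP>', '<BOP>']]))

# Sample a few words starting from the start-of-sentence context.
print(lm.generate(5, text_seed=['<BOP>', '<BOP>'], random_seed=0))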
Finally, generate random sentences.
tweet.py
for j in range(10):
    # context = ['<BOP>']
    context = ['<BOP>', '<BOP>']
    sentence = ''
    for i in range(0, 100):
        # Randomly pick a word that has non-zero probability of following
        # the last two words in the context.
        w = lm.generate(text_seed=context)
        if '<EOP>' == w or '\n' == w:
            break
        context.append(w)
        sentence += w
    print(sentence + '\n')
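If you want to generate more sentences later without retraining, one option is to save the fitted model with pickle and reload it next time; a rough sketch (the file name is just an example, and this is not part of the original script):

import pickle

# Save the fitted model so it does not have to be retrained every time.
with open('text/tweet_lm.pkl', 'wb') as f:
    pickle.dump(lm, f)

# Later: reload it and generate as before.
with open('text/tweet_lm.pkl', 'rb') as f:
    lm = pickle.load(f)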
Ten sentences are output at random. When copying and pasting, either combine the code above into a single file, or split it into cells in a Jupyter notebook and run them; the latter is recommended because training the model can take a while.
You should see ten generated sentences like the ones above. It's surprisingly fun and you can keep generating endlessly, so please give it a try.