This is a quick write-up of what I did, so the explanations are kept brief.
First, download your tweet history. On Twitter, go to "Settings & Privacy" → "Account" → "Twitter data" → "Download Twitter data" and request the archive. After a while a download link is sent to your email address, so download it from there.
Since the summer of 2019 the format of the downloaded data has changed, and you now get tweets.js instead of tweets.csv. Dealing with that directly is a hassle, so I use a tool someone else wrote to convert it back into tweets.csv (https://17number.github.io/tweet-js-loader/).
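If you would rather not rely on the web tool, a rough sketch of doing the conversion yourself in Python is shown below. It assumes tweets.js is a JavaScript assignment wrapping a JSON array and that each entry exposes the tweet text as full_text; the exact structure and column order are assumptions on my part, so check them against your own archive (the main script later expects the tweet text in the third column).

import csv
import json

# Hypothetical converter: tweets.js -> text/tweets.csv.
# Assumes tweets.js looks like "window.YTD.tweet.part0 = [ {...}, ... ]"
# and that each entry holds the text under "full_text"; verify against your archive.
with open("tweets.js", encoding="utf-8") as f:
    raw = f.read()

# Drop the JavaScript prefix (and any trailing semicolon), keep the JSON array.
entries = json.loads(raw[raw.index('['):].strip().rstrip(';'))

with open("text/tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for e in entries:
        tweet = e.get("tweet", e)   # newer archives nest each record under "tweet"
        writer.writerow([tweet.get("id_str", ""),
                         tweet.get("created_at", ""),
                         tweet.get("full_text", "")])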
Create a text folder in your working directory, put tweets.csv into it, and you're ready to go.
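As a small sanity check of the setup (a hypothetical sketch, not part of the original script), you can create the folder and confirm the file is in place from Python:

from pathlib import Path

# Make sure the working directory has a "text" folder containing tweets.csv.
Path("text").mkdir(exist_ok=True)
assert Path("text/tweets.csv").exists(), "put tweets.csv into the text folder first"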
Next, the contents of tweet.py.
The first part below creates tweets.txt: the tweet body is extracted from tweets.csv and written out as a plain text file.
tweet.py
import csv
import re

rawfile = "text/tweets.csv"          # CSV produced by tweet-js-loader
infile = "text/tweets.txt"           # plain-text tweet bodies
outfile = "text/tweets_wakati.txt"   # space-separated (wakati) tokens

# Extract the tweet body (third column) from the CSV and write one tweet per line.
with open(rawfile, 'r') as fin, open(infile, 'w') as fout:
    reader = csv.reader(fin)
    for d in reader:
        if len(d) > 2:
            fout.write(d[2])
            fout.write('\n')
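Note that d[2] assumes the tweet text sits in the third column of tweets.csv; the column layout depends on the converter you used, so it may be worth printing the first row to confirm (a quick sketch, not part of the original script):

import csv

# Quick check: print the first row of tweets.csv to see which column holds the tweet text.
with open("text/tweets.csv", 'r') as f:
    first_row = next(csv.reader(f))
print(list(enumerate(first_row)))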
Next, use janome to tokenize the text (wakati-gaki) and train the model. If you don't have janome, run pip install janome first.
Incidentally, to keep the generated Japanese as natural as possible, I drop lines and tokens containing alphabetic characters, certain symbols, garbled "?" boxes, and the like.
tweet.py
from janome.tokenizer import Tokenizer

t = Tokenizer()

with open(infile, 'r') as f:
    data = f.readlines()

# Skip tokens containing alphabetic characters, and skip whole lines containing
# certain symbols (":", "/", ".", "@", "#", full-width "？", "●").
p = re.compile('[a-z]+')
p2 = re.compile('[:/.@#？●]+')

with open(outfile, 'w') as f:
    for line in data:
        if p2.search(line):
            continue
        for token in t.tokenize(line):
            surface = str(token.surface)
            if p.search(surface):
                continue
            f.write(surface)
            f.write(' ')
        f.write('\n')

# Read the wakati file back in and wrap each sentence with start/end markers.
words = []
for l in open(outfile, 'r', encoding='utf-8').readlines():
    if len(l) > 1:
        words.append(('<BOP> <BOP> ' + l + ' <EOP>').split())

from nltk.lm import Vocabulary
from nltk.lm.models import MLE
from nltk.util import ngrams

# Build the vocabulary from all tokens, then fit a trigram MLE model.
vocab = Vocabulary([item for sublist in words for item in sublist])
print('Vocabulary size: ' + str(len(vocab)))

text_trigrams = [ngrams(sent, 3) for sent in words]

n = 3
lm = MLE(order=n, vocabulary=vocab)
lm.fit(text_trigrams)
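Before generating full sentences, it can help to sanity-check the trained model, for example by counting how many words were seen after the start-of-sentence context and sampling a few of them (a minimal check using the nltk.lm API; the seed value is arbitrary):

# How many distinct words follow the two <BOP> markers?
print(len(lm.counts[['<BOP>', '<BOP>']]))

# Sample a few words starting from the start-of-sentence context.
print(lm.generate(5, text_seed=['<BOP>', '<BOP>'], random_seed=0))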
Finally, generate random sentences.
tweet.py
for j in range(10):
    # context = ['<BOP>']
    context = ['<BOP>', '<BOP>']
    sentence = ''
    for i in range(0, 100):
        # Randomly pick a word that has non-zero probability of following
        # the last two words in the context.
        w = lm.generate(text_seed=context)
        if '<EOP>' == w or '\n' == w:
            break
        context.append(w)
        sentence += w
    print(sentence + '\n')
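If you want to generate more sentences later without retraining, one option is to save the fitted model with pickle and reload it next time; a rough sketch (the file name is just an example, and this is not part of the original script):

import pickle

# Save the fitted model so it does not have to be retrained every time.
with open('text/tweet_lm.pkl', 'wb') as f:
    pickle.dump(lm, f)

# Later: reload it and generate as before.
with open('text/tweet_lm.pkl', 'rb') as f:
    lm = pickle.load(f)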
Ten sentences are output at random. When copying and pasting, either combine the code above into a single file, or split it into cells in a Jupyter notebook and run them; the latter is recommended because training the model can take a while.
You should see ten generated sentences like the ones above. It's surprisingly fun and you can keep generating endlessly, so please give it a try.