I had been putting off studying crawling and scraping during my free spring break as a college student, but I wanted to actually build something with what I had learned, so I decided to make a Twitter bot. With the help of Friend B, I created a bot that analyzes Friend B's tweets and posts Friend B-like tweets.
The development environment was Ubuntu 18.04.4 LTS running in a VirtualBox virtual machine on Windows 10.
- Markov chain
I will briefly explain each of the technologies I used.
When I looked into questions like "How do you make generated sentences sound like a particular person? How do you generate sentences automatically in the first place?", I came across Markov chains.
For an easy-to-understand explanation of Markov chains, please refer to the article below. I also used its Python code as a reference. You will come away thinking **Markov chains are amazing!**
[[Python] Generate sentences with an N-th order Markov chain](https://qiita.com/k-jimon/items/f02fae75e853a9c02127)
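As a rough sketch of the idea (this is not the code from the article above, and the names `build_model`, `generate`, `BEGIN`, and `END` are my own for illustration), an N-th order Markov chain can be built by recording which word follows each run of N words and then walking that table at random:

```python
import random
from collections import defaultdict

# Hypothetical marker tokens for sentence start and end.
BEGIN, END = "__BEGIN__", "__END__"


def build_model(sentences, order=2):
    """Map each run of `order` words to the list of words seen right after it."""
    model = defaultdict(list)
    for words in sentences:
        padded = [BEGIN] * order + list(words) + [END]
        for i in range(len(padded) - order):
            state = tuple(padded[i:i + order])
            model[state].append(padded[i + order])
    return model


def generate(model, order=2, max_words=50, rng=random):
    """Walk the chain from the start state until END or max_words is reached."""
    state = (BEGIN,) * order
    out = []
    while len(out) < max_words:
        choices = model.get(state)
        if not choices:  # empty model or unknown state
            break
        word = rng.choice(choices)
        if word == END:
            break
        out.append(word)
        state = state[1:] + (word,)
    return out


if __name__ == "__main__":
    corpus = [["I", "like", "tea"], ["I", "like", "cats"]]
    model = build_model(corpus, order=1)
    print(" ".join(generate(model, order=1)))
```

Because a word that appears more often after a given state is stored more often in the list, `rng.choice` reproduces the observed frequencies without any explicit probability table. For Japanese text the generated words would be joined with an empty string instead of a space.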
MeCab
I use MeCab to analyze the retrieved tweets. The basic flow of sentence generation is: run morphological analysis on the collected tweets with MeCab to build a list of words, then arrange the words according to a Markov chain.
Click here for a detailed explanation of MeCab.
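The word-splitting step can be sketched roughly as follows. This assumes the `mecab-python3` package and a dictionary are installed; the helper names `tokenize` and `tweets_to_word_lists` are hypothetical. `-Owakati` asks MeCab for space-separated surface forms only.

```python
def tokenize(text, tagger=None):
    """Split text into words using MeCab's wakati (space-separated) output."""
    if tagger is None:
        import MeCab  # assumption: provided by the mecab-python3 package
        tagger = MeCab.Tagger("-Owakati")
    return tagger.parse(text).split()


def tweets_to_word_lists(tweets, tagger=None):
    """Tokenize every tweet and drop the ones that produce no words."""
    if tagger is None:
        import MeCab
        tagger = MeCab.Tagger("-Owakati")  # build once and reuse for all tweets
    word_lists = (tokenize(t, tagger) for t in tweets)
    return [words for words in word_lists if words]
```

The resulting list of word lists is the kind of input a Markov chain model is built from.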
tweepy
To operate Twitter automatically, I use the Twitter API officially provided by Twitter. The Twitter API lets you operate your account programmatically.
When using the Twitter API from Python, the tweepy library is convenient, so I learned how to use it from this book and from searching the web.
[Python Crawling & Scraping: A Practical Development Guide for Data Collection and Analysis](https://www.amazon.co.jp/Python%E3%82%AF%E3%83%AD%E3%83%BC%E3%83%AA%E3%83%B3%E3%82%B0-%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0-%E3%83%87%E3%83%BC%E3%82%BF%E5%8F%8E%E9%9B%86%E3%83%BB%E8%A7%A3%E6%9E%90%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E5%AE%9F%E8%B7%B5%E9%96%8B%E7%99%BA%E3%82%AC%E3%82%A4%E3%83%89-%E5%8A%A0%E8%97%A4-%E8%80%95%E5%A4%AA/dp/4774183679/ref=tmm_other_meta_binding_title_0?_encoding=UTF8&qid=&sr=)
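A minimal sketch of how tweepy might be used for this, assuming tweepy's wrapper around the v1.1 API and that your API access level permits these endpoints. The function names and the four credential parameters are placeholders of my own; the real values come from your Twitter developer account.

```python
def build_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Create an authenticated tweepy API object (credentials are placeholders)."""
    import tweepy  # imported here so the helpers below work with any API-like object
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)


def fetch_tweet_texts(api, screen_name, count=200):
    """Return the text of a user's recent tweets, excluding retweets."""
    statuses = api.user_timeline(
        screen_name=screen_name,
        count=count,
        include_rts=False,
        tweet_mode="extended",  # makes the untruncated text available as full_text
    )
    return [status.full_text for status in statuses]


def post_tweet(api, text):
    """Post a tweet from the authenticated account."""
    return api.update_status(status=text)
```

Passing `api` in as a parameter keeps the collection and posting logic separate from authentication, which also makes it easy to try the functions with a stand-in object before touching a real account.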
MySQL
I wanted to study databases anyway, so I decided to use MySQL to store the data retrieved through the Twitter API. Using MySQL also has the advantage of making the application easier to extend later. In the prototype I stored the data in a text file, but that seemed a little slow.
If you want to know more about MySQL, please refer to the link below. It should give you the general picture.
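Storing the collected tweets could look roughly like this. It is only a sketch under assumptions: the table name `tweets`, its columns, and the function names are hypothetical, and `conn` is any Python DB-API connection. MySQL drivers such as mysql-connector-python or PyMySQL use `%s` as the parameter marker, which is why it is the default here.

```python
def create_table(conn):
    """Create a hypothetical tweets table keyed by tweet id."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id BIGINT PRIMARY KEY, "
        "body TEXT NOT NULL)"
    )
    conn.commit()


def save_tweets(conn, rows, placeholder="%s"):
    """Insert (tweet_id, body) pairs, skipping ids that are already stored."""
    cur = conn.cursor()
    cur.execute("SELECT id FROM tweets")
    known = {row[0] for row in cur.fetchall()}
    new_rows = []
    for tweet_id, body in rows:
        if tweet_id not in known:
            known.add(tweet_id)
            new_rows.append((tweet_id, body))
    if new_rows:
        # Only the parameter marker is formatted in; the data itself is bound safely.
        sql = "INSERT INTO tweets (id, body) VALUES ({0}, {0})".format(placeholder)
        cur.executemany(sql, new_rows)
        conn.commit()
    return len(new_rows)


def load_texts(conn):
    """Return all stored tweet bodies, oldest id first."""
    cur = conn.cursor()
    cur.execute("SELECT body FROM tweets ORDER BY id")
    return [row[0] for row in cur.fetchall()]
```

Skipping known ids means the collector can be run repeatedly over overlapping timelines without hitting duplicate-key errors.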
VPS
I rented a VPS because I needed a server to run the bot. Think of it as having one computer that is always on.
Setting up the environment required Linux knowledge, which was fairly hard for me as a server beginner, but I managed to install the necessary software such as MySQL and MeCab.
I was able to rent one easily from ConoHa VPS. I plan to use it as a web server soon as well. Being able to play around on a server for 880 yen a month is cheap...
Now, at last, the main topic. The overall structure looks like the figure.
The role of each script is as follows.
- twitter_collector.py (script for data collection)
- bot.py (the bot itself)
I think it is easier to manage when the part that collects data is separated from the bot itself, which does the tweeting.
On the server, I set up CentOS 8 and installed the necessary tools such as MeCab, MySQL, and Python.
The bot runs by having systemd automatically execute a shell script that launches the Python scripts. systemd is explained in an easy-to-understand manner in this article.
Just building sentences with a Markov chain already produced quite interesting tweets, which was fun, but I wanted the bot to resemble the person more closely, so I am currently working on the following two points.
- Examine the distribution of the person's tweet intervals for each time of day, and have the bot tweet accordingly.
- Examine the distribution of the character counts of the person's tweets, and decide the character count (range) of each tweet accordingly.
Both are achieved by taking the values from tweets Friend B actually made, putting them in a list, and randomly selecting a value from that list.
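The two ideas above can be sketched as follows. This is only an illustration under assumptions: the function names, the grouping by hour of day, and the one-hour fallback are my own choices, not necessarily what the bot actually does.

```python
import random


def intervals_by_hour(timestamps):
    """Group gaps between consecutive tweets (in seconds) by the hour of the earlier tweet."""
    ordered = sorted(timestamps)
    table = {}
    for prev, nxt in zip(ordered, ordered[1:]):
        table.setdefault(prev.hour, []).append((nxt - prev).total_seconds())
    return table


def next_interval(table, hour, rng=random, default=3600.0):
    """Pick a waiting time observed at this hour; fall back to `default` if none exists."""
    choices = table.get(hour)
    return rng.choice(choices) if choices else default


def pick_length(lengths, rng=random):
    """Pick a target character count from the lengths of real tweets."""
    return rng.choice(lengths)
```

Sampling directly from the list of observed values follows the real distribution without fitting any model: a value that occurs often in the data is simply drawn more often. One way to use `pick_length` would be to generate several candidate sentences and keep the one whose length is closest to the drawn value.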
As expected, the longer a sentence gets, the less sense it makes, but short tweets sometimes sound as if the person really said them. Since the words are randomly chosen from what the person tweeted in the past, some combinations produced tweets that made me laugh.
If the icon were exactly the same, I think it might be impossible to tell the bot from the real person, so I would like to try a Turing test.
Friend B, thank you for your cooperation.
I wondered whether I could do this with machine learning, but I will leave that as the next task... Spring break is over... and classes are online.
Thank you for reading to the end. If you have any advice, I would appreciate your comments.