This is my first Qiita post.
I made a shiritori (Japanese word-chain game) AI in Python. The source code is available here: https://github.com/takumi13/SiritoriAI
The outline is as follows.
- You can play interactive shiritori against the AI on the CUI
- The AI improves its vocabulary by learning words that are not yet in its word dictionary (hereinafter, unknown words)
- It can also fetch suitable tweets from Twitter and learn unknown words from them
- An **Internet connection** and the **Twitter API** are required for scraping
At my university's open campus last year, I had visitors demo-play this as an exhibit from our laboratory. It went over better than I expected, so I decided to write it up as a Qiita article (although quite a while has passed since then).
In this article, I will explain how to play against the shiritori AI and give an overview of the code.
Environment:
```
> ver
Microsoft Windows [Version 10.0.18362.900]
> python -V
Python 3.7.5
> pip -V
pip 20.1.1
```
To run the program, you need:
2.1 Python3
Python can be installed from the link below.
https://www.python.org/downloads/
For the installation procedure, see, for example: How to install Python (Windows)
2.2 Twitter API & tweepy
To use the **Twitter API**, you need to apply for access on the Twitter Developer page (https://developer.twitter.com/en).
The application often takes several days; in my case (April 2019), it took 4 to 5 days to complete the procedure.
Please refer to the link below for the acquisition procedure: Summary of procedures from Twitter API registration (account application method) to approval (*information as of August 2019)
After completing the above, install **tweepy**, a convenient library for calling the Twitter API from Python. tweepy can be installed easily with the `pip` command.
```
> pip install tweepy
> pip show tweepy
Name: tweepy
Version: 3.8.0
```
Once your Twitter API application is approved, you can obtain the access tokens needed to operate your Twitter account through the API. You will get the following four tokens.
Copy them as strings into the appropriate places in config.py.
When this is done, you will be able to call the Twitter API from Python and operate Twitter.
config.py
```python
CONSUMER_KEY = "##############################"
CONSUMER_SECRET = "##############################"
ACCESS_TOKEN = "##############################"
ACCESS_TOKEN_SECRET = "##############################"
```
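Once the tokens are in place, a quick sanity check like the following should confirm they work (a minimal sketch of mine, assuming tweepy 3.x as used in this article; in tweepy 3.x, `verify_credentials()` returns your account on success and `False` if the tokens are rejected):
```python
# Hypothetical sanity check -- not part of the repository
import tweepy
import config

auth = tweepy.OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
auth.set_access_token(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

me = api.verify_credentials()  # your account on success; False on bad tokens
if me:
    print('Authenticated as @' + me.screen_name)
else:
    print('Authentication failed: check the tokens in config.py')
```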
By the way, obtaining Twitter API access requires exchanging emails with Twitter once or twice (**you must explain your intended use of the API in English, in 200 characters or more**), which is honestly a bit tedious. In my program, shiritori against the AI works without the Twitter API, so applying for it is not a requirement. If you choose not to obtain it, however, you will need to comment out the tweepy-related parts of the program.
2.3 janome
**janome** is one of many morphological analyzers and is provided as a Python library.
Like tweepy, it can be installed with the `pip` command.
```
> pip install janome
> pip show janome
Name: Janome
Version: 0.3.10
```
How to use janome is described later.
The following modules need to be installed, mainly for HTTP requests and scraping.
First, about **requests**. According to How to use Requests (Python Library):
Requests is a modern HTTP library for Python. You can send a GET request with `requests.get('URL')` and obtain the response body as text via the response's `.text` attribute.
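For instance, a minimal request looks like this (a small sketch using the Kotobank URL that appears later in this article):
```python
import requests

response = requests.get('https://kotobank.jp/word/%E5%8D%98%E8%AA%9E')
print(response.status_code)  # 200 if the page exists
print(response.text[:100])   # the response body (HTML) as text
```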
Next, about **beautifulsoup4**. From its documentation:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. Using your favorite parser, it lets you navigate, search, and modify parse trees, which can save programmers a lot of working time.
Also, since **lxml** is used inside beautifulsoup4, it needs to be installed separately as well.
Simple usage of these three is described below. First, install each of them with the `pip` command.
```
> pip install requests
> pip install beautifulsoup4
> pip install lxml
> pip show requests
Name: requests
Version: 2.23.0
> pip show beautifulsoup4
Name: beautifulsoup4
Version: 4.9.1
> pip show lxml
Name: lxml
Version: 4.3.3
```
Let's actually write some code. The code below fetches HTML as text via a request and scrapes the required information from it. As an example, let's scrape the information for 単語 (tango, "word") from its explanation page on Kotobank (https://kotobank.jp/word/%E5%8D%98%E8%AA%9E).
sample_scraping.py
```python
import requests
from bs4 import BeautifulSoup

url = 'https://kotobank.jp/word/%E5%8D%98%E8%AA%9E'
html = requests.get(url).text              # get the HTML as text via the requests module
soup = BeautifulSoup(html, 'html.parser')  # initialize BeautifulSoup
real_page_tag = soup.find("title")         # find and store the <title> tag
title_read_tmp = real_page_tag.string      # convert to a string (drops <title> and </title>)
title_read = title_read_tmp[:-10]          # remove the unneeded trailing part (normalization)
print(title_read)
'''
> python sample_scraping.py
単語(タンゴ)
'''
```
I'm getting slightly ahead of the main subject here, but in the program, the AI generates the Kotobank page URL from the word a human enters and, through the operations above, judges whether the word actually exists. At the same time, it memorizes the first and last kana of the word's reading.
In addition, Kotobank manages its word explanation pages with URLs of the format:
https://kotobank.jp/word/[the word converted to UTF-8 percent-encoding]
Therefore, the whole conversion to UTF-8 can be coded as follows.
sample_make_url.py
```python
import binascii

word = '単語'
url_top = 'https://kotobank.jp/word/'

word_byte = word.encode('utf-8')                        # encode the word as UTF-8 bytes
hex_string = str(binascii.hexlify(word_byte), 'utf-8')  # convert the byte string to a hex string
hex_string = hex_string.upper()                         # convert to uppercase

words = []                               # holds the hex string two characters at a time
for i in range(len(hex_string)//2):      # repeat (length of hex string / 2) times
    words.append(hex_string[i*2:i*2+2])  # cut out two characters and store them in words[i]

url_latter = ""                          # latter half of the URL: the word converted to UTF-8
for i in range(len(words)):
    words[i] = '%' + words[i]            # prepend % to every two characters (URL spec)
    url_latter = url_latter + words[i]   # concatenate to the end

url = url_top + url_latter               # the completed URL
print("URL : " + url)
'''
> python sample_make_url.py
URL : https://kotobank.jp/word/%E5%8D%98%E8%AA%9E
'''
```
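Incidentally, the standard library's `urllib.parse.quote` performs the same percent-encoding in a single call, so the loops above could be replaced with something like this (an alternative sketch of mine, not what the repository actually uses; the file name is hypothetical):
```python
# sample_quote.py (hypothetical alternative)
from urllib.parse import quote

word = '単語'
url = 'https://kotobank.jp/word/' + quote(word)  # quote() percent-encodes UTF-8 by default
print("URL : " + url)
'''
> python sample_quote.py
URL : https://kotobank.jp/word/%E5%8D%98%E8%AA%9E
'''
```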
It's finally the main subject. Let's start with the demo play.
```
> python main.py
[menu]
I want to play shiritori with the AI: 1
I want to play shiritori with the AI (debug mode): 2
I want to learn unknown words from tweets: 3
I want to get tweets from Twitter: 4
Selection: 1
When you want to end shiritori, press Enter 3 times in a row (this works even mid-game).
Tell me your name: human
Nice to meet you, human.
Shiritori starts with you. Begin with any word you like.
----------------------------------------------------------------------------------------------
Human: Shiritori
AI   : Ryugu Castle (Ryugujo)
Human: eel
AI   : Gilbert
Human: Urban
AI   : Issue
Human: Yukiguni
I'm sorry. That word cannot be used for this shiritori.
If the word can be written in kanji, sorry for the trouble, but please enter it again in kanji.
Human: Snow country
AI   : Patience (Nintai)
Human: vocabulary
That's not shiritori.
Please enter a word that starts with "い"
Human: Hospital
It ended with "ん"
The rally lasted 4 turns.
I win!
----------------------------------------------------------------------------------------------
```
As mentioned in the previous section, the AI determines whether a word exists through a request to Kotobank. As a result, URL creation fails for words such as "sour" whose common notation uses kanji, when they are entered in kana (which is why the kana input was rejected in the demo above).
In addition, the number of rallies is displayed at the end of the game.
Finally, any unknown words that appeared during the game are added to the AI's dictionary as new vocabulary.
At this point, the AI knows 5169 words.
The diagram below shows the flow.
For simplicity, the figure omits the check for a human win, but it is implemented in the program.
As explained in the previous section, the AI obtains an existence flag and the reading for each word a human enters through a request to Kotobank. Meanwhile, the AI keeps the list of words it knows (hereinafter, known words) in dictionary.txt.
The format of the dictionary file is as follows.
dictionary.txt
```
あ:あさひ,あみだくじ,あいあいがさ,アカウント,あそび,あさ,あんない,アニメ,あいこ,アド,あっとう,あおやま,あさぎり,あらきだ,あいしょう,あか,...
い:いるま,いこう,いじょう,いっしょ,いがい,いない,いいづか,いのうえ,いちじ,いずみ,いのり,いや,いじゅう,いの,イベント,いか,いんしょう,いんりょう,いろがみ,いま,...
...
ペ:ペア,ペット,ペースト,ペイ,ペンチ,ペリー,ペンネーム,ページ,ペンシル,ペーパー
ポ:ポテトチップス,ポスター,ポケット,ポシェット,ポルノ,ポリス,ポイント,ポーズ
```
The word dictionary can have such a simple structure because judging whether a word exists and determining its reading are entirely delegated to Kotobank. By reading dictionary.txt into a dict in the program, a word can be accessed like `d['あ'][0]` (= あさひ). In practice, the index for a given key is chosen at random.
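For reference, here is a minimal sketch of loading such a file, assuming the colon-and-comma format shown above (the repository's actual loading code may differ):
```python
import random

# Load dictionary.txt into a dict of the form {first kana: [known words]}
d = {}
with open('dictionary.txt', encoding='utf-8') as f:
    for line in f:
        kana, _, words = line.rstrip('\n').partition(':')
        d[kana] = words.split(',')

print(d['あ'][0])              # -> あさひ (the first word in the あ row)
print(random.choice(d['あ']))  # in practice, the AI picks the index at random
```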
Playing a round of shiritori every time just to teach the AI unknown words is quite a chore. So instead, we use tweepy to fetch some suitable tweets, extract only the words from them, and learn the unknown ones.
```
> python .\main.py
[menu]
I want to play shiritori with the AI: 1
I want to play shiritori with the AI (debug mode): 2
I want to learn unknown words from tweets: 3
I want to get tweets from Twitter: 4
Selection: 3
Number of tweets to get (up to 10): 2
----------------------------------------------------------------------------------------------
(part of the acquired tweets is displayed here; hidden in this article for privacy)
----------------------------------------------------------------------------------------------
[ 0 ] Everyone (Mi)
[ 1 ] fan (U)
[ NG ] No good
[ 2 ] Writing (Chi)
Number of words in the original dictionary: 5179
Number of acquired words: 3
Number of words actually increased: 2
Number of words in the updated dictionary: 5181
Word learning rate: 66.66666666666667 %
----------------------------------------------------------------------------------------------
```
In this way, janome extracts only the nouns from tweets acquired with tweepy, and only the unknown words among them are learned.
Here, the reading of every candidate word is obtained through a request to Kotobank. Making a large number of requests to a single website in a short period puts a heavy load on it, so to be on the safe side, the program limits the number of tweets that can be fetched at once to **10**.
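If you ever raise that limit, it would be prudent to also pause between requests, for example like this (a sketch; the one-second delay and the variable names are my own choices, not something the repository prescribes):
```python
import time
import requests

# Hypothetical example: fetch several Kotobank pages politely, one per second
urls_to_check = [
    'https://kotobank.jp/word/%E5%8D%98%E8%AA%9E',  # 単語
]
for url in urls_to_check:
    html = requests.get(url).text
    # ... scrape the page here ...
    time.sleep(1)  # wait one second between requests
```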
The following code shows an example of using tweepy.
sample_tweepy.py
```python
import tweepy
import config

CK  = config.CONSUMER_KEY
CS  = config.CONSUMER_SECRET
AT  = config.ACCESS_TOKEN
ATK = config.ACCESS_TOKEN_SECRET

#--------------------------------------------------
# Build a Twitter API object
#--------------------------------------------------
def get_twitter_api(CK, CS, AT, ATK):
    try:
        auth = tweepy.OAuthHandler(CK, CS)  # pass CONSUMER_KEY and CONSUMER_SECRET
        auth.set_access_token(AT, ATK)      # set the access token
        api = tweepy.API(auth)
    except tweepy.TweepError as e:          # a TweepError is raised on failure
        print(e.reason)                     # print the error details
        raise
    return api

api = get_twitter_api(CK, CS, AT, ATK)

#-----------------------------------------------------
# Get `count` tweets matching query `q` as a list of text
#-----------------------------------------------------
def get_text(q, count=100):
    text_list = []
    search_results = api.search(q=q, count=count)
    for tweet in search_results:
        # drop characters that cannot be displayed on a cp932 (Shift-JIS)
        # Windows console by round-tripping through cp932
        text = tweet.text.encode('cp932', 'ignore').decode('cp932')
        text_list.append(text)
    return text_list

text_list = get_text(q='楽しい', count=3)  # '楽しい' = "fun/pleasant"
for text in text_list:
    print('-------------------------------------------------')
    print(text)
```
With the above code, you can get tweets as text.
Next, the following code shows an example of using janome.
sample_janome.py
```python
from janome.tokenizer import Tokenizer

t = Tokenizer()
# The following texts are fictitious tweets
text_list = ['ありがとう!大事に使っていくね!', 'いや、もうダメな気がする…これ以上は…うーん']
for text in text_list:
    for token in t.tokenize(text):
        # keep only nouns (名詞) whose surface form is at least two
        # characters long and whose reading is known
        if token.part_of_speech.split(',')[0] == '名詞' and len(token.surface) >= 2 and token.reading != '*':
            print(token.surface)
'''
> python sample_janome.py
大事
ダメ
これ
以上
'''
```
In this way, nouns whose surface form is at least two characters long can be extracted from text.
By temporarily storing each `token.surface` in a separate list and then comparing it against the dict loaded from dictionary.txt, only the words that are not yet in the dictionary are stored as new entries. This is how unknown words are learned from tweets, as sketched below.
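Putting the pieces together, the unknown-word check itself can be as simple as the following sketch, where `d` is the dict loaded from dictionary.txt (see the earlier sketch) and `nouns` holds the words extracted with janome; both variable names are my own, not the repository's:
```python
nouns = ['大事', 'ダメ', 'これ', '以上']  # nouns extracted from tweets with janome

known = {w for words in d.values() for w in words}  # flatten all known words into a set
unknown = [w for w in nouns if w not in known]      # keep only the unknown ones
print('unknown words:', unknown)

# Each unknown word is then filed under the first kana of its reading
# (obtained via Kotobank, as described earlier) and dictionary.txt is updated.
```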
That's all. Describing the system that actually plays shiritori and processes such as identifying unknown words would make this article very long, so I will omit them this time. If you are interested, please check the source code on GitHub.
I may write another article if I get the chance; in particular, if I implement new features, I will cover them in a separate post.
Thank you for reading this far.
I will attach the URLs of the main websites I referred to when creating the program. Thank you very much.
- Japanese string manipulation, hiragana judgment, etc.: Remrin's python capture diary
- [Preserved version] A thorough explanation of how to scrape with Python, for beginners!
- Summary of scraping basics with Beautiful Soup [for beginners]
- [Python] Convert a byte array to a string
- The story of having a hard time opening files encoded in something other than CP932 (Shift-JIS) on Windows