I wanted to look at the trends at the Japanese Society for Artificial Intelligence conference (JSAI2020), which I attended the other day, so I created a Word Cloud from the presentation titles in the program.
A Word Cloud counts how often each word appears in a text and sizes each word according to its frequency. You may have seen Word Clouds built from tweets, which visualize a large number of tweeted words at once.
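To make the idea concrete, here is a minimal sketch of the counting step that a Word Cloud builds on, using only the standard library. The sample text and the name `word_frequencies` are made up for illustration. Splitting on whitespace and scaling the font size linearly are simplifying assumptions, not a description of how the wordcloud library lays out words.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word appears (case-insensitive)."""
    words = text.lower().split()
    return Counter(words)


# Hypothetical sample text, not taken from the JSAI2020 program.
sample = "deep learning for language and deep learning for vision"
freq = word_frequencies(sample)

# Scale a font size linearly with frequency: the most frequent word gets max_size.
max_size = 48
top = max(freq.values())
sizes = {word: max_size * count // top for word, count in freq.items()}

for word, count in freq.most_common(3):
    print(word, count, sizes[word])
```

Words that appear twice in the sample get the full size, and words that appear once get half of it.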
I downloaded the JSAI2020 conference proceedings from here, opened the file index.html, and copied and pasted the session list into Notepad.
If the program is published as a PDF, you can extract the text as follows.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

input_path = './Program.pdf'
output_path = 'Program.txt'

manager = PDFResourceManager()
with open(output_path, "wb") as output:
    with open(input_path, 'rb') as input:
        with TextConverter(manager, output, codec='utf-8', laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            for page in PDFPage.get_pages(input):
                interpreter.process_page(page)
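If you use pdfminer.six, the same extraction can probably be written more briefly with its `pdfminer.high_level.extract_text` function. The sketch below assumes pdfminer.six is installed. The helper name `extract_program_text` is hypothetical and not part of the original script.

```python
def extract_program_text(pdf_path, txt_path):
    """Extract all text from pdf_path and save it to txt_path as UTF-8."""
    # Imported here so the helper can be defined even where
    # pdfminer.six is not installed (an assumption of this sketch).
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text


# Example call, using the same file names as above:
# extract_program_text("./Program.pdf", "Program.txt")
```

The lower-level version above gives more control over layout parameters, so it is worth keeping if the extracted text comes out in the wrong order.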
Next, remove everything other than the presentation titles from the text, such as dates and presenters. Presenter names were enclosed in (), so the text inside () is deleted. Regular expressions are used to match the text in () and the times and dates.
import re

with open('Program.txt', mode='rt', encoding='utf-8') as fo:
    Program = fo.read()

# Convert [] to ()
Program = Program.replace("[", "(")
Program = Program.replace("]", ")")
# Delete the text enclosed in () (the conversion above is needed because this pattern does not match [])
Program = re.sub(r'\([^)]*\)', '', Program)
# Delete times and dates
Program = re.sub(r'((0?|1)[0-9]|2[0-3])[:][0-5][0-9]?', '', Program)
Program = re.sub(r'2020([0-1]?[0-9])Month([0-3]?[0-9])Day?', '', Program)
# Delete headings that are not presentation titles
Program = re.sub('Time / venue', '', Program)
Program = re.sub('session', '', Program)
Program = re.sub('Announcement list', '', Program)
Program = re.sub('Venue', '', Program)

# Write as UTF-8 so the file can be read back with the same encoding
with open('Program_new.txt', 'w', encoding='utf-8') as f:
    print(Program, file=f)
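The bracket and time patterns above can be tried on a single line to see what each step removes. This is a small sketch: the helper name `clean_line` and the sample line (title, presenter name, room) are invented for illustration and do not come from the JSAI2020 program.

```python
import re


def clean_line(line):
    """Strip bracketed text and HH:MM times from one line of the program."""
    # Treat [] like () so a single pattern removes both.
    line = line.replace("[", "(").replace("]", ")")
    # Remove everything from "(" up to the next ")".
    line = re.sub(r'\([^)]*\)', '', line)
    # Remove times such as 9:00 or 13:20.
    line = re.sub(r'((0?|1)[0-9]|2[0-3])[:][0-5][0-9]?', '', line)
    # Collapse the whitespace left behind by the deletions.
    return ' '.join(line.split())


# Hypothetical sample line for illustration only.
sample = "13:20 Graph Neural Networks for Traffic Forecasting (Taro Yamada) [Room A]"
print(clean_line(sample))
```

Note that `[^)]*` stops at the first closing parenthesis, so nested parentheses would leave a stray `)` behind. That is acceptable here only if the program text does not nest them.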
Finally, create the Word Cloud from the cleaned text.
from matplotlib import pyplot as plt
from wordcloud import WordCloud

with open('Program_new.txt', mode='rt', encoding='utf-8') as fo:
    cloud_text = fo.read()

# font_path specifies a Japanese font on your device
word_cloud = WordCloud(width=640, height=480, font_path="/System/Library/AssetsV2/com_apple_MobileAsset_Font6/c7c8e5cb889b80fff0175bf138a7b66c6f027f21.asset/AssetData/ToppanBunkyuMidashiGothicStdN-ExtraBold.otf").generate(cloud_text)
word_cloud.to_file('wordcloud4.png')
plt.imshow(word_cloud)
plt.axis('off')
plt.show()