Introduction

I wanted to look into the trends at the annual conference of the Japanese Society for Artificial Intelligence (JSAI2020), which I attended the other day, so I created a word cloud from the presentation titles in the program.

What is a Word Cloud?

A word cloud counts how often each word appears in a text and draws each word at a size that reflects its frequency. You may have seen word clouds built from tweets, which visualize the words that many people are posting.
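The idea of sizing words by frequency can be sketched in a few lines. This is a simplified illustration, not how the wordcloud library computes sizes internally; the function name `font_sizes`, the linear scaling, and the size limits are assumptions made for this example.

```python
from collections import Counter


def font_sizes(words, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count.

    Hypothetical helper for illustration only; the most frequent word
    gets max_size and other words are scaled relative to it.
    """
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


# A tiny made-up sample: "learning" appears most often, "data" least.
sample = "learning model learning data learning model".split()
sizes = font_sizes(sample)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word}: {size:.1f}")
```

Words that appear more often get larger sizes, which is the property a word cloud makes visible at a glance.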

Preparation of text

I downloaded the JSAI2020 conference proceedings from here, opened the file index.html, and copied and pasted the session list into Notepad.

If the program is published as a PDF, you can extract the text with pdfminer as follows.


from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

input_path = './Program.pdf'
output_path = 'Program.txt'

manager = PDFResourceManager()

# Write the extracted text to output_path as UTF-8
with open(output_path, 'wb') as out_file:
    with open(input_path, 'rb') as in_file:
        with TextConverter(manager, out_file, codec='utf-8', laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            # Process the PDF one page at a time
            for page in PDFPage.get_pages(in_file):
                interpreter.process_page(page)

Delete unnecessary information

Remove everything other than the presentation titles, such as dates and presenter names, from the text. The presenter names were enclosed in (), so I deleted everything inside (). Regular expressions are used to match the text inside () and the times and dates.

import re

with open('Program.txt', mode='rt', encoding='utf-8') as fo:
    Program = fo.read()

# Convert [] to ()
Program = Program.replace("[", "(")
Program = Program.replace("]", ")")
# Delete the text enclosed in () (the conversion above is needed because text in [] could not be deleted directly)
Program = re.sub(r'\([^)]*\)', '', Program)
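# Delete times and dates, then fixed labels
# (the literal strings below must match the wording used in your own program text)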
Program = re.sub(r'((0?|1)[0-9]|2[0-3])[:][0-5][0-9]?', '', Program)
Program = re.sub(r'2020([0-1]?[0-9])Month([0-3]?[0-9])Day?', '', Program)
Program = re.sub('Time / venue', '', Program)
Program = re.sub('session', '', Program)
Program = re.sub('Announcement list', '', Program)
Program = re.sub('Venue', '', Program)

# Save as UTF-8 so the file can be read back with the same encoding
with open('Program_new.txt', 'w', encoding='utf-8') as f:
    print(Program, file=f)
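Before running the cleanup on the whole file, it can help to try the patterns on a single line. The following is a minimal sketch: the function name `strip_noise` and the sample line are made up for illustration, and the time pattern is a slightly stricter variation (it requires two minute digits) of the one used above.

```python
import re


def strip_noise(text):
    """Remove bracketed text and HH:MM times from a program listing.

    Hypothetical helper for illustration; it does not handle nested
    parentheses.
    """
    # Treat [] like () so one pattern handles both kinds of brackets.
    text = text.replace("[", "(").replace("]", ")")
    # Drop anything enclosed in parentheses.
    text = re.sub(r"\([^)]*\)", "", text)
    # Drop times such as 9:00 or 13:20.
    text = re.sub(r"\b([01]?[0-9]|2[0-3]):[0-5][0-9]\b", "", text)
    # Collapse the spaces left behind by the deletions.
    return re.sub(r"[ \t]+", " ", text).strip()


# A made-up line in the style of a session list.
sample = "13:20 [1A1-01] Deep learning for dialogue (Taro Yamada)"
print(strip_noise(sample))
```

Checking a few representative lines this way makes it easier to notice when a pattern removes too much or too little.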

Creating a Word Cloud

from matplotlib import pyplot as plt
from wordcloud import WordCloud

with open('Program_new.txt', mode='rt', encoding='utf-8') as fo:
    cloud_text = fo.read()

# font_path must point to a Japanese font installed on your machine
word_cloud = WordCloud(width=640, height=480, font_path="/System/Library/AssetsV2/com_apple_MobileAsset_Font6/c7c8e5cb889b80fff0175bf138a7b66c6f027f21.asset/AssetData/ToppanBunkyuMidashiGothicStdN-ExtraBold.otf").generate(cloud_text)
word_cloud.to_file('wordcloud4.png')

plt.imshow(word_cloud)
plt.axis('off')
plt.show()
