3. Natural language processing with Python 1-2. How to create a corpus: Aozora Bunko

Experimenting with natural language processing requires a corpus, that is, a large body of text.
"Aozora Bunko", which is often used for this purpose, is an online library that publishes the texts of works whose copyright has expired, such as modern literature.
Here we obtain a work from "Aozora Bunko" as material for natural language processing and walk through the steps for turning it into a corpus.

1. Get the file and extract only the text

⑴ Import of various modules

import re
import zipfile
import urllib.request
import os.path
import glob

re: Short for "regular expression"; a module for working with regular expressions
zipfile: Module for reading and writing zip files
urllib.request: Module for retrieving resources on the internet
os.path: Module for manipulating pathnames
glob: Module for finding file paths that match a pattern

⑵ Get the file URL

Here, Kenji Miyazawa's "Night on the Galactic Railroad" is used as the material.

Enter "Kenji Miyazawa" in the search box at the top right of the "Aozora Bunko" top page and search.
At the top of the search results, open "List of works by artist: Kenji Miyazawa".
Select "59. Night on the Galactic Railroad (new kanji, new kana; work ID: 43737)" from the list.
On the page that opens, "Book Card: No.43737", scroll down to "File Download".
Right-click the zip file name in the file name (link) column and select "Copy link address".

URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'

⑶ Function to download and extract the zip file

def download(URL):
    zip_file = re.split(r'/', URL)[-1] #➀
    urllib.request.urlretrieve(URL, zip_file) #➁
    dir = os.path.splitext(zip_file)[0] #➂

    with zipfile.ZipFile(zip_file) as zip_object: #➃
        zip_object.extractall(dir) #➄

    os.remove(zip_file) #➅

    path = os.path.join(dir,'*.txt') #➆
    list = glob.glob(path) #➇
    return list[0] #➈

**1) Download the zip file**

➀ re.split(): Split the URL string on / and take the last element, the zip file name "43737_ruby_19028.zip".
➁ urllib.request.urlretrieve(URL, save name): Download the file from the site and save it under the zip file name "43737_ruby_19028.zip".
➂ os.path.splitext(): Split the zip file name at the dot "." before the extension and keep the name without the extension as dir.
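
Steps ➀ and ➂ can be checked without downloading anything. The following is a minimal sketch that applies the same two calls to the URL string; the names zip_name, dir_name, and ext are illustrative and do not appear in download().

```python
import os.path
import re

URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'

# Split on "/" and keep the last element: the zip file name.
zip_name = re.split(r'/', URL)[-1]

# splitext() returns a (root, extension) pair; the root is used as the directory name.
dir_name, ext = os.path.splitext(zip_name)

print(zip_name)
print(dir_name, ext)
```

Because splitext() only removes the final extension, the directory ends up with the same base name as the archive.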

**2) Extract the zip file and save its contents**

➃ zipfile.ZipFile(): Open the zip file saved in the previous step as a zip object.
➄ extractall(): Extract all the contents of the zip object into the directory dir.
➅ os.remove(): Delete the original zip file, which is no longer needed after extraction.
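
The open, extract, delete sequence can be tried offline. This sketch builds a tiny zip file in a temporary directory first, so it does not touch the network; the file names sample.zip and sample.txt are made up for the example.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work_dir:
    zip_path = os.path.join(work_dir, 'sample.zip')

    # Build a tiny zip file so the sketch runs without network access.
    with zipfile.ZipFile(zip_path, 'w') as zip_object:
        zip_object.writestr('sample.txt', 'hello')

    # Same three steps as in download(): open, extract everything, delete the archive.
    out_dir = os.path.splitext(zip_path)[0]
    with zipfile.ZipFile(zip_path) as zip_object:
        zip_object.extractall(out_dir)
    os.remove(zip_path)

    extracted = sorted(os.listdir(out_dir))
    zip_still_exists = os.path.exists(zip_path)

print(extracted)
print(zip_still_exists)
```

extractall() creates the target directory if it does not exist yet, which is why download() never calls os.mkdir().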

**3) Get the path of the saved file**

➆ os.path.join(): Build the search pattern by joining dir and '*.txt'.
➇ glob.glob(): Collect the paths of all text files in the directory into a list.
➈ list[0]: Return the path of the first file in the list.
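
Indexing with [0] raises IndexError when the directory contains no .txt file. One possible guard is sketched below; first_text_file is a hypothetical helper, not part of the original code, and it sorts the matches because glob.glob() does not promise a particular order.

```python
import glob
import os
import tempfile


def first_text_file(directory):
    """Return the first .txt path in directory, or None if there is none.

    Hypothetical helper for illustration only.
    """
    paths = sorted(glob.glob(os.path.join(directory, '*.txt')))
    return paths[0] if paths else None


with tempfile.TemporaryDirectory() as work_dir:
    # Nothing has been created yet, so no path is found.
    empty_result = first_text_file(work_dir)

    # Create one text file and one unrelated file (made-up names).
    for name in ('body.txt', 'image.png'):
        with open(os.path.join(work_dir, name), 'w') as f:
            f.write('')
    found = os.path.basename(first_text_file(work_dir))

print(empty_result)
print(found)
```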

⑷ Function to read the file and extract the body text

def convert(download_text):
    data = open(download_text, 'rb').read() #➀
    text = data.decode('shift_jis') #➁

    #Text extraction
    text = re.split(r'\-{5,}', text)[2] #➂  
    text = re.split(r'底本：', text)[0] #➃
    text = re.split(r'［＃改ページ］', text)[0] #➄

    #Noise removal
    text = re.sub(r'《.+?》', '', text) #➅
    text = re.sub(r'［＃.+?］', '', text) #➆
    text = re.sub(r'｜', '', text) #➇
    text = re.sub(r'\r\n', '', text) #➈
    text = re.sub(r'\u3000', '', text) #➉   

    return text

**1) Read the file**

➀ open(file name, 'rb').read(): Open the file in 'rb' (binary read) mode and read its contents as bytes.
➁ decode('shift_jis'): Decode the bytes as Shift_JIS to obtain the text.
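
The reason for naming the codec explicitly can be seen in a small round trip. This sketch encodes a short string as Shift_JIS, decodes it again, and shows that the same bytes are not valid UTF-8; the sample string is chosen for the example only.

```python
# Encode a short Japanese string as Shift_JIS bytes, then decode it back.
original = '銀河鉄道の夜'
data = original.encode('shift_jis')
text = data.decode('shift_jis')

print(text == original)

# The same bytes are not valid UTF-8, so the codec must match the file.
try:
    data.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)
```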

**2) Extract the body text with re.split()**

➂ re.split(r'\-{5,}', text)[2]: Split the text wherever the hyphen "-" is repeated 5 or more times and take the third element. The hyphen lines themselves are discarded, and the first two elements (the title block and the notes on symbols enclosed between the hyphen lines) are dropped.
➃ Split on 底本： (the "source edition" heading that follows the body in Aozora Bunko files) and take the first element, which drops the bibliographic notes at the end.
➄ Split on the page-break annotation ［＃改ページ］ and take the first element.
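
The three splits can be traced on a miniature input. The sample below is a made-up imitation of an Aozora Bunko file (header, notes block between hyphen lines, body, colophon), and it assumes the markers are the Japanese strings 底本： and ［＃改ページ］ as they appear in the downloaded text.

```python
import re

# Made-up miniature of an Aozora Bunko file: header, notes block, body, colophon.
sample = (
    'タイトル\r\n著者\r\n'
    '-------------------------------------------------------\r\n'
    '【テキスト中に現れる記号について】\r\n'
    '-------------------------------------------------------\r\n'
    '本文です。\r\n'
    '底本：「作品集」出版社\r\n'
)

parts = re.split(r'\-{5,}', sample)

# Third element: everything after the notes block.
body = parts[2]
# Drop the colophon that starts at the source-edition heading.
body = re.split(r'底本：', body)[0]
# No page-break annotation in this sample, so this split changes nothing.
body = re.split(r'［＃改ページ］', body)[0]

print(len(parts))
print(repr(body))
```

When a delimiter does not occur, re.split() returns a one-element list, so taking [0] is safe even for works without a page break.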

**3) Noise removal (replacement) with re.sub()**

➅ '《.+?》': Ruby (reading) text enclosed in 《 》
➆ '［＃.+?］': Annotations enclosed in ［＃ ］
➇ '｜': Marker for the start of a string that carries ruby
➈ '\r\n': Line break code
➉ '\u3000': Full-width space
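
The effect of the five substitutions is easiest to see on a single line. The input below is a made-up line in Aozora Bunko notation, not a quotation from the work.

```python
import re

# Made-up line: full-width space, ruby start marker, ruby, an annotation, CRLF.
line = '\u3000｜銀河《ぎんが》の［＃「の」に傍点］お祭《まつり》\r\n'

line = re.sub(r'《.+?》', '', line)    # ruby readings
line = re.sub(r'［＃.+?］', '', line)  # annotations
line = re.sub(r'｜', '', line)         # ruby start marker
line = re.sub(r'\r\n', '', line)       # line breaks
line = re.sub(r'\u3000', '', line)     # full-width spaces

print(line)
```

The non-greedy .+? matters here: a greedy .+ would match from the first 《 to the last 》 on the line and delete the text in between.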

⑸ Download the file and extract the text

download_file = download(URL)
text = convert(download_file)

print(text)

2. Word segmentation ("wakati-gaki") with MeCab

⑹ Install MeCab and segment the text into words

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

Create an instance of the class MeCab.Tagger() with the argument -Owakati, then call its parse() method to get the word-segmented result as a single string.

import MeCab
mecab = MeCab.Tagger("-Owakati")
text = mecab.parse(text)

print(text)

In addition, split() splits the string on whitespace and returns a list of words.

separated_text = text.split()
print(separated_text)
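
The behavior of split() can be checked without installing MeCab. In this sketch the string wakati is a hand-written stand-in for space-separated output ending in a newline; it is an assumption for illustration, not actual MeCab output.

```python
# Hand-written stand-in for a space-separated, newline-terminated string.
wakati = 'ジョバンニ は 手 を あげ まし た \n'

# With no argument, split() treats any run of whitespace as one delimiter
# and ignores leading and trailing whitespace, including the final newline.
words = wakati.split()

# Splitting on a single space instead leaves the newline as a last element.
naive = wakati.split(' ')

print(words)
print(repr(naive[-1]))
```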

3. Downloading to your local PC

⑺ Save to a file and download to the local PC

Save the word-segmented text to a file and download it to your local PC.

with open('output.txt', 'w') as f:
    f.write(text)

This writes text to a file named 'output.txt'. The argument 'w' specifies write mode.
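
When no encoding is given, open() falls back to a platform-dependent default. A variant that names the encoding explicitly is sketched below; choosing UTF-8 is an assumption made for this example, and the file is written to a temporary directory so the sketch leaves nothing behind.

```python
import os
import tempfile

text = '銀河 鉄道 の 夜\n'

with tempfile.TemporaryDirectory() as work_dir:
    path = os.path.join(work_dir, 'output.txt')

    # Name the encoding explicitly so the result does not depend on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Read the file back with the same encoding to confirm the round trip.
    with open(path, encoding='utf-8') as f:
        read_back = f.read()

print(read_back == text)
```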

from google.colab import files

files.download('output.txt')

files is a module for uploading and downloading files between Colaboratory and your local PC.
In the downloaded text file, unnecessary parts such as ruby and annotations have been removed, leaving only the body text separated into words.