Aozora Bunko is a website where volunteers digitize and publish literary works whose copyright has expired. For engineers who want to work with Aozora Bunko data, **the text-format and HTML-format versions of all works are now mirrored on Aozora Bunko's GitHub repository, updated daily, and can be downloaded in bulk**.
If you want to use Aozora Bunko data for natural language processing, the text files follow a fixed format, so going from data collection to preprocessing for just one particular author takes a bit of work. So I wrote **a way to bulk-download the works of a specific author plus a Python script that handles the basic preprocessing**, and decided to leave it here on Qiita.
- Bulk-download only the data of a specific author from Aozora Bunko
- Preprocess the downloaded text files and save them as UTF-8 TSV files
  - Save as TSV with the body text in the first column and the work title in the second column
  - Text formatting is as follows
| Item | Before formatting (example) | After formatting (example) |
|---|---|---|
| Notes and bibliographic information | "【About the symbols that appear in the text】", "底本：" (source-book info), "※［＃「勹＜夕」、第3水準1-14-76］" | (deleted) |
| Section delimiters | 「―――」「×××」「***」 | (deleted) |
| Typographic symbols | dash 「―」, ellipsis 「…」, reference mark 「※」 | (deleted) |
| Lines of one character or less, blank lines | 「三」 (numbers used as section breaks, etc.) | (deleted) |
| Ruby notation | 「東京｜行《ゆ》きの急行」 (ruby marked with ｜ and 《 》) | 「東京行きの急行」 |
| Indentation (full-width space at the start of a line) | 「　In the study,」 (leading full-width space) | 「In the study,」 |
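As a minimal illustration of the kind of substitutions involved, here is a sketch on a made-up line (the sentence and the transcriber's note are invented examples; the patterns mirror the ones used in the script later):

```python
import re

# A made-up line with ruby, a ruby start marker, a transcriber's note and leading indentation
line = '　東京｜行《ゆ》きの急行［＃「行」に傍点］'

line = re.sub('《.*?》', '', line)   # remove ruby
line = re.sub('［.*?］', '', line)   # remove transcriber's notes
line = line.replace('｜', '')        # remove the ruby start marker
line = line.lstrip('　')             # remove the leading full-width space
print(line)  # 東京行きの急行
```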
The easiest way to bulk-download the data of a particular author is to use the **svn command**.

The URL is `https://github.com/aozorabunko/aozorabunko/trunk/cards/{author ID}`. The `{author ID}` is the six-digit number that appears in the URL when you open a specific author's page on the Aozora Bunko website. For Ryunosuke Akutagawa, for example, it is `000879`. You can download all files of a specific author (here, Ryunosuke Akutagawa) by passing that URL to `svn export` as shown below.

Incidentally, `svn` should be available out of the box on Linux and macOS. I am not sure about Windows, but even Windows users can easily **start Ubuntu with WSL and type the commands**.
```bash:Download all data of Ryunosuke Akutagawa
svn export https://github.com/aozorabunko/aozorabunko/trunk/cards/000879/
```

This creates a local `./000879/` directory.
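As a quick check, you can count the ZIP archives that were exported (a minimal sketch; the `files` subdirectory is where Aozora Bunko keeps each author's ZIP files in the repository layout above):

```python
from pathlib import Path

# Count the ZIP archives under the exported directory
zip_files = sorted(Path('./000879/files/').glob('*.zip'))
print(f'{len(zip_files)} zip files downloaded')
print(zip_files[:3])  # show the first few paths
```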
The following Python script formats and preprocesses the downloaded ZIP files in one go and saves them as TSV. The outline of the processing is as follows; each step is described in the comments. It could have been written as a class, but for now it is implemented with plain functions.
- Find all ZIP files under a specific directory and store them in a list
- Create the output directories
- Loop over the files in the list (`for`)
  - Read the ZIP-compressed txt as a Pandas DataFrame: `save_cleanse_text()`
  - Convert the original data to UTF-8 and save it as a text file
  - Format the text: `text_cleanse_df()`
  - Save as TSV with the work title in the second column
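For reference, assuming `author_id = '000879'`, the script reads the svn export from `./000879/files/` and writes its output roughly like this:

```
./000879/files/        # ZIP files downloaded with svn export (input)
./out_000879/org/      # original text converted to UTF-8 (only if save_utf8_org is True)
./out_000879/edit/     # formatted text saved as TSV
```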
```python:aozora_preprocess.py
import pandas as pd
from pathlib import Path

author_id = '000879'  # Aozora Bunko author ID
author_name = '芥川龍之介'  # Author name in Aozora Bunko notation (Ryunosuke Akutagawa)
write_title = True  # Whether to put the work title in the second column
write_header = True  # Whether the first line holds the column names ("text", "title")
save_utf8_org = True  # Whether to also save the original data converted to UTF-8 as a text file

out_dir = Path(f'./out_{author_id}/')  # Output destination
tx_org_dir = out_dir / 'org'  # Where to save the UTF-8 converted original text
tx_edit_dir = out_dir / 'edit'  # Where to save the files after text formatting


def text_cleanse_df(df):
    # Find the start of the body text (assumes the body starts right after the '---…' delimiter)
    head_tx = list(df[df['text'].str.contains(
        '-------------------------------------------------------')].index)
    # Find the end of the body text (assumes the body ends right before the '底本：' source-book info)
    atx = list(df[df['text'].str.contains('底本：')].index)
    if head_tx == []:
        # If there is no '---…' delimiter, assume the body starts right after the author name
        head_tx = list(df[df['text'].str.contains(author_name)].index)
        head_tx_num = head_tx[0] + 1
    else:
        # The body starts right after the second '---…' delimiter
        head_tx_num = head_tx[1] + 1
    df_e = df[head_tx_num:atx[0]].copy()

    # Remove Aozora Bunko markup
    df_e = df_e.replace({'text': {'《.*?》': ''}}, regex=True)   # ruby
    df_e = df_e.replace({'text': {'［.*?］': ''}}, regex=True)   # transcriber's notes
    df_e = df_e.replace({'text': {'｜': ''}}, regex=True)        # ruby start marker
    # Remove indentation (full-width space at the beginning of the line)
    df_e = df_e.replace({'text': {'　': ''}}, regex=True)
    # Remove section delimiters
    df_e = df_e.replace({'text': {'^.$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^―――.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {r'^\*\*\*.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^×××.*$': ''}}, regex=True)
    # Remove symbols and the empty brackets left behind by the deletions
    df_e = df_e.replace({'text': {'―': ''}}, regex=True)
    df_e = df_e.replace({'text': {'…': ''}}, regex=True)
    df_e = df_e.replace({'text': {'※': ''}}, regex=True)
    df_e = df_e.replace({'text': {'「」': ''}}, regex=True)
    # Remove lines consisting of one character or less
    df_e['length'] = df_e['text'].map(lambda x: len(x))
    df_e = df_e[df_e['length'] > 1]
    # The index now has gaps, so reset it
    df_e = df_e.reset_index().drop(['index'], axis=1)
    # Remove blank lines (just in case)
    df_e = df_e[~(df_e['text'] == '')]
    # Reset the index again and drop the character-length column
    df_e = df_e.reset_index().drop(['index', 'length'], axis=1)
    return df_e

def save_cleanse_text(target_file):
    try:
        # Read the file
        print(target_file)
        # Read as a Pandas DataFrame (some variant characters cannot be read unless cp932 is used)
        df_tmp = pd.read_csv(target_file, encoding='cp932', names=['text'])
        # Convert the original data to UTF-8 and save it as a text file
        if save_utf8_org:
            out_org_file_nm = Path(target_file.stem + '_org_utf-8.txt')
            df_tmp.to_csv(Path(tx_org_dir / out_org_file_nm), sep='\t',
                          encoding='utf-8', index=None)
        # Format the text
        df_tmp_e = text_cleanse_df(df_tmp)
        if write_title:
            # Add a title column (the first line of the original file is the work title)
            df_tmp_e['title'] = df_tmp['text'][0]
        out_edit_file_nm = Path(target_file.stem + '_clns_utf-8.tsv')
        df_tmp_e.to_csv(Path(tx_edit_dir / out_edit_file_nm), sep='\t',
                        encoding='utf-8', index=None, header=write_header)
    except Exception:
        print(f'ERROR: {target_file}')

def main():
    tx_dir = Path(f'./{author_id}/files/')  # svn export puts the ZIP files here
    # Create the list of ZIP files
    zip_list = list(tx_dir.glob('*.zip'))
    # Create the save directories
    tx_edit_dir.mkdir(exist_ok=True, parents=True)
    if save_utf8_org:
        tx_org_dir.mkdir(exist_ok=True, parents=True)
    for target_file in zip_list:
        save_cleanse_text(target_file)


if __name__ == '__main__':
    main()
```
100_ruby_1154_org_utf-8.txt (original data)

```text
Momotaro
Ryunosuke Akutagawa

-------------------------------------------------------
【About the symbols that appear in the text】

《》: ruby
(Example) Peach《momo》

｜: marks the start of a string that ruby is attached to
(Example) Around the time heaven and earth｜began《...》

［＃］: transcriber's note; mainly explanations of gaiji and emphasis marks
(the numbers are the JIS X 0213 plane-row-cell number or Unicode, and the page and line in the source book)
(Example) ※［＃"word + making of mound", level 4 2-88-74］
-------------------------------------------------------
［＃8-character indent］1［＃"1" is a middle heading］

　Once upon a time, deep, deep in the mountains, there was a large peach tree. (...)
(…)
What kind of person picked up this baby《akago》 after it left the depths of the mountains? ――There is no need to tell it any more. At the end of a mountain stream an old woman was, as every child in Japan knows, washing the kimono or something of an old man who had gone to cut brushwood. ……
(…)
```
100_ruby_1154_clns_utf-8.tsv (preprocessed data)

```text
text	title
Once upon a time, deep, deep in the mountains, there was a large peach tree. (...)	Momotaro
(…)
What kind of person picked up this baby after it left the depths of the mountains? There is no need to tell it any more. At the end of a mountain stream an old woman was, as every child in Japan knows, washing the kimono or something of an old man who had gone to cut brushwood.	Momotaro
(…)
```
The parts that are not needed for natural language processing have been removed as intended. I hope you will adapt it, for example by turning it into a class or adding any other formatting steps you need.
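If you want to use the output right away, the preprocessed TSV can be read back with pandas like this (a minimal sketch; the file name is taken from the example above):

```python
import pandas as pd

# Load one of the preprocessed TSV files produced by the script
df = pd.read_csv('./out_000879/edit/100_ruby_1154_clns_utf-8.tsv',
                 sep='\t', encoding='utf-8')
print(df.columns.tolist())  # ['text', 'title'] when write_title and write_header are True
print(df.head())
```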